Rahul Huilgol rahul003 Amazon Web Services Palo Alto, CA AWS Deep Learning

aws-samples/deep-learning-models 98

Natural language processing & computer vision models optimized for AWS

ctcyang/horovod 4

Distributed training framework for TensorFlow, Keras, and PyTorch.

rahul003/hudl 4

Bash utility to help ease the task of managing a cluster of machines

rahul003/bayou 1

C++ implementation of Bayou protocol for weakly replicated, eventually consistent distributed database for playlist of songs

rahul003/distributed_assignments 1

Implementation of causal broadcast and unicast using vector clocks in C++
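The delivery rule behind this repo can be sketched in a few lines (Python here rather than C++; this is the textbook causal-delivery check, not necessarily this repo's exact variant):

```python
# Toy vector clocks: a message is causally deliverable at a process when the
# sender's clock shows exactly the next event from the sender and no events
# the receiver has not yet seen from anyone else.

def deliverable(msg_clock, local_clock, sender):
    for p, count in enumerate(msg_clock):
        if p == sender:
            if count != local_clock[p] + 1:  # must be the very next message from sender
                return False
        elif count > local_clock[p]:         # a causal dependency we haven't seen yet
            return False
    return True

local = [1, 0, 0]                                   # we have seen one event from process 0
assert deliverable([2, 0, 0], local, sender=0)      # next event from process 0: deliver
assert not deliverable([2, 1, 0], local, sender=0)  # depends on an unseen event: buffer it
```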

rahul003/hmm 1

Hidden Markov Model for isolated word recognition

rahul003/http-server-client 1

A simple HTTP server and client supporting GET and POST

rahul003/kalaha 1

The ancient board game Kalaha implemented as a Human vs Computer game in Java. The AI uses a modified version of Minimax with Alpha-Beta Pruning.
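The pruning idea can be sketched on a toy game tree (plain Python, not the Java/Kalaha-specific version from the repo):

```python
# Minimal minimax with alpha-beta pruning over a toy game tree:
# nested lists are internal nodes, ints are leaf scores.

def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    if isinstance(node, int):            # leaf: static evaluation
        return node
    best = float("-inf") if maximizing else float("inf")
    for child in node:
        score = alphabeta(child, alpha, beta, not maximizing)
        if maximizing:
            best = max(best, score)
            alpha = max(alpha, best)
        else:
            best = min(best, score)
            beta = min(beta, best)
        if beta <= alpha:                # remaining siblings cannot change the result
            break
    return best

tree = [[3, 5], [2, 9]]                  # depth-2 tree, root is the maximizer
assert alphabeta(tree) == 3              # the 9 leaf is never even visited
```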

andreaolgiati/montecarlo_tf 0

Runtime System for Montecarlo/TF

pull request comment awslabs/sagemaker-debugger

Add ability to only save shapes of tensors

I don't have access to the CI account. What's the error?

rahul003

comment created time in 17 hours

push event awslabs/sagemaker-debugger

Rahul Huilgol

commit sha 5dc47ffa2da577f2066fa90808547ea78ad47c52

Add s3 and json tests

view details

push time in 17 hours

PR opened awslabs/sagemaker-debugger

Add ability to only save shapes of tensors

Description of changes:

  • Added a config to ReductionConfig to save shape.
  • Created a class ShapeWriter which uses the same index writer as (Event)FileWriter and adds the shapes to the index file.
  • Added an API shape to tensor class.
  • Added tests for mxnet, tensorflow, pytorch
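The shape-only idea in this PR can be sketched without smdebug (names below are hypothetical illustrations, not the actual ShapeWriter API): store only a name-to-shape index instead of serializing tensor values.

```python
import json

# Toy shape-only "writer": instead of serializing tensor values, keep just a
# name -> shape mapping, mirroring the idea of writing shapes to an index
# file. Shapes are plain tuples; no framework dependency.

class ToyShapeWriter:
    def __init__(self):
        self.index = {}

    def write_shape(self, name, shape):
        self.index[name] = list(shape)   # tiny compared to the full tensor

    def dump(self):
        return json.dumps(self.index, sort_keys=True)

w = ToyShapeWriter()
w.write_shape("conv1/weights", (32, 3, 3, 3))
w.write_shape("conv1/bias", (32,))
assert json.loads(w.dump())["conv1/bias"] == [32]
```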

Style and formatting:

I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

+554 -128

0 comment

19 changed files

pr created time in 18 hours

push event awslabs/sagemaker-debugger

Rahul Huilgol

commit sha 681e35c21b5cb4f2108e3d59dcb1d7024f55fb95

Add mxnet test

view details

push time in 18 hours

issue opened awslabs/sagemaker-debugger

MXNet hook saving more tensors than specified


def test_save_shapes(out_dir, hook=None):
    hook_created = False
    if hook is None:
        hook_created = True
        global_reduce_config = ReductionConfig(save_raw_tensor=True)
        global_save_config = SaveConfig(save_steps=[0, 1])

        hook = t_hook(
            out_dir=out_dir,
            save_config=global_save_config,
            include_collections=[
                "weights",
                "biases",
                "gradients",
                "default",
                "ReluActivation",
                "flatten",
            ],
            reduction_config=global_reduce_config,
        )
        hook.get_collection("ReluActivation").include(["relu*"])
        hook.get_collection("ReluActivation").save_config = SaveConfig(save_steps=[1])
        hook.get_collection("flatten").include(["flatten*"])
        hook.get_collection("ReluActivation").save_config = SaveConfig(save_steps=[1])
    
    run_mnist_gluon_model(hook=hook, num_steps_train=10, num_steps_eval=10)
    
    tr = create_trial(out_dir)
    print(0, len(tr.tensor_names(step=0)))
    print(1, len(tr.tensor_names(step=1)))
    if hook_created:
        shutil.rmtree(out_dir)

In step 0 it should only save 21 tensors, and 31 in step 1. But both steps save 31 tensors.
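The expected semantics can be modeled with a toy scheduler (a sketch of the behavior described above, not smdebug code): a collection with its own save_steps should contribute nothing at steps outside that list.

```python
# Toy model of per-collection save_steps: a collection's own SaveConfig
# overrides the hook-level one, so a collection restricted to step 1
# contributes no tensors at step 0.

def tensors_saved_at(step, collections):
    saved = []
    for name, (tensors, save_steps) in collections.items():
        if step in save_steps:
            saved.extend(tensors)
    return saved

collections = {
    "weights":        (["w1", "w2"], [0, 1]),   # hook-level save_steps
    "ReluActivation": (["relu0"],    [1]),      # overridden to step 1 only
}
assert tensors_saved_at(0, collections) == ["w1", "w2"]
assert sorted(tensors_saved_at(1, collections)) == ["relu0", "w1", "w2"]
```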

created time in 18 hours

pull request comment awslabs/sagemaker-debugger

Refactor: Move SagemakerSimulator to test utils

Did this get merged? I see two copies of the SagemakerSimulator class in the codebase.

NihalHarish

comment created time in 20 hours

pull request comment awslabs/sagemaker-debugger

Tf version compare

Looks like the title has nothing to do with the changes

NihalHarish

comment created time in 20 hours

Pull request review comment awslabs/sagemaker-debugger

Adding filtering node logic issue:321

 def set_tensor_ref(self, tensor, tensor_name: tf.Tensor = None):
         else:
             name = tensor.name
             export_name = tensor.export_name
+        get_logger().debug(f"In set_tensor_ref: tensor_name:{name} export_name:{export_name}")
         self._tensors[name] = tensor
         self.add_tensor_name(export_name)
-    def has_tensor(self, name):
-        # tf object name
+        if name != export_name:
+            get_logger().debug(
+                f"Export_name:{export_name} != name:{name} . Adding export_name:{export_name} to include in "
+                f"collection collection_name:{self.name} selfId:{id(self)} "
+            )
+        self.include("^" + export_name + "$")

This doesn't look correct. name is supposed to be the actual name of the tensor from TF's point of view. export_name is a more meaningful name for that tensor, for the user's benefit. Including the export name is not useful, and in fact might be wrong in some situations if it refers to a different tensor.

Vikas-kum

comment created time in 20 hours

Pull request review comment awslabs/sagemaker-debugger

Adding filtering node logic issue:321

 def add_for_mode(self, arg, mode=None):
         Adds tensors to the collection from a given Operation, Tensor, Variable or MirroredVariable
         :param arg: the argument to add to collection
         """
+        get_logger().debug(f"type:{type(arg)} arg:{arg} self:{id(self)}")

Did you intend to remove this before the PR? If not, you might want to add a description for it.

Vikas-kum

comment created time in 20 hours

Pull request review comment awslabs/sagemaker-debugger

Adding filtering node logic issue:321

 def get_export_names_of_tensors(self):
         return self.tensor_names

     def get_tensor(self, name):
-        return self._tensors[name]
+        if name in self._tensors:
+            return self._tensors[name]
+        else:
+            return None

Could you check if returning None causes issues wherever this function is used? I don't think None is expected. Is it actually possible? Did you see it happening?
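A small sketch of the concern (a hypothetical class, not the smdebug Collection): returning None silently moves the failure away from the lookup and onto every caller.

```python
# If get_tensor silently returns None, every caller must now guard against
# it; a KeyError at the lookup keeps the failure close to its cause.

class Collection:
    def __init__(self):
        self._tensors = {"loss": 0.5}

    def get_tensor(self, name):
        if name in self._tensors:
            return self._tensors[name]
        return None  # callers doing get_tensor(n).some_attr now crash later

c = Collection()
assert c.get_tensor("loss") == 0.5
assert c.get_tensor("missing") is None  # an AttributeError waits at the use site
```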

Vikas-kum

comment created time in 20 hours

push event awslabs/sagemaker-debugger

Rahul Huilgol

commit sha f146c77bcdc0f0020caf3b56f8c197a699ed4098

Simplify read code

view details

Rahul Huilgol

commit sha 5906e5aeaff65323080129806efcf9af2bfced79

Add read API and tests

view details

push time in 20 hours

push event awslabs/sagemaker-debugger

Rahul Huilgol

commit sha 44358ee6f7aeebc82e9aa3707884dd3208da0891

Add tests for TF

view details

push time in 2 days

push event awslabs/sagemaker-debugger

Rahul Huilgol

commit sha 1357f5d025018829262bc9952f0f0a3179700a3c

Import

view details

push time in 4 days

push event awslabs/sagemaker-debugger

Rahul Huilgol

commit sha fc25940a8b1e2fdb898c55cf5ba248dc38acf075

Import

view details

push time in 4 days

push event awslabs/sagemaker-debugger

Rahul Huilgol

commit sha 651c4408394a4adff6bc75589a8f0f5e42fc1718

fix syntax

view details

push time in 4 days

push event awslabs/sagemaker-debugger

Rahul Huilgol

commit sha 86842e6f57be54a9ea686a9edf0a6f9f27d0e552

fix syntax

view details

push time in 4 days

create branch awslabs/sagemaker-debugger

branch : shapes

created branch time in 4 days

issue opened zhuwenxi/pytorch-profiling-tool

How does this differ from using torch.autograd.profiler?

Hi, I'm trying to understand how this tool differs from using torch.autograd.profiler as in this gist: https://gist.github.com/XinDongol/fe066cb76e1c5238ecbc0cb729806410

created time in a month

issue opened msr-fiddle/pipedream

Can the profiler handle dynamic graphs?

Can the profiler which generates the graph handle conditionals and loops?

created time in a month

started KarypisLab/METIS

started time in a month

issue opened microsoft/vscode-python

Creating a new terminal does not automatically activate the chosen conda interpreter

<!-- Please search existing issues to avoid creating duplicates. -->

Environment data

  • VS Code version: 1.46.1 (commit cd9ea6488829f560dc949a8b2fb789f3cdc05f5d, 2020-06-17T21:17:14.222Z; Electron 7.3.1, Chrome 78.0.3904.130, Node.js 12.8.1, V8 7.8.279.23-electron.0; OS: Darwin x64 18.7.0)

  • Extension version (available under the Extensions sidebar): XXX

  • OS and version: MacOS Mojave

  • Python version (& distribution if applicable, e.g. Anaconda): conda 4.8.3 python 3.7

  • Type of virtual environment used (N/A | venv | virtualenv | conda | ...): conda

  • Relevant/affected Python packages and their versions: XXX

  • Relevant/affected Python-related VS Code extensions and their versions: XXX

  • Value of the python.languageServer setting: XXX

Expected behaviour

My understanding is that when creating a new terminal, it should automatically activate the chosen interpreter.

Actual behaviour

It starts the terminal with the base conda environment.

Steps to reproduce:

Not sure what's relevant here, but: I installed conda and set up an env in an outside terminal. Then I restarted VSCode and chose the correct interpreter. But terminals still open with the base conda env.

created time in a month

started timothycrosley/isort

started time in 2 months

issue comment horovod/horovod

RecursionError with Pytorch

The error does not show up when using an older version of Horovod with an older version of Torch. I couldn't use an older version of Horovod with the current version of Torch due to build failures.

rahul003

comment created time in 2 months

issue opened horovod/horovod

RecursionError with Pytorch

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet) Pytorch
  2. Framework version: 1.15.1
  3. Horovod version: 0.19.4
  4. MPI version: 4.0.1
  5. CUDA version: 10.2
  6. NCCL version: ?
  7. Python version: 3.6
  8. Spark / PySpark version: NA
  9. OS and version: Ubuntu 16.04
  10. GCC version: 5.4(?)

Checklist:

  1. Did you search issues to find if somebody asked this question before? Yes
  2. If your question is about hang, did you read this doc? N/A
  3. If your question is about docker, did you read this doc? N/A
  4. Did you check if you question is answered in the troubleshooting guide? Yes

Bug report: When I start my training job, each process crashes with this exception:

File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/horovod/torch/__init__.py", line 533, in _get_types
    return type(x), [_get_types(xi) for xi in x]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/horovod/torch/__init__.py", line 533, in <listcomp>
    return type(x), [_get_types(xi) for xi in x]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/horovod/torch/__init__.py", line 533, in _get_types
    return type(x), [_get_types(xi) for xi in x]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/horovod/torch/__init__.py", line 533, in <listcomp>
    return type(x), [_get_types(xi) for xi in x]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/horovod/torch/__init__.py", line 533, in _get_types
    return type(x), [_get_types(xi) for xi in x]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/horovod/torch/__init__.py", line 533, in <listcomp>
    return type(x), [_get_types(xi) for xi in x]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/horovod/torch/__init__.py", line 533, in _get_types
    return type(x), [_get_types(xi) for xi in x]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/horovod/torch/__init__.py", line 533, in <listcomp>
    return type(x), [_get_types(xi) for xi in x]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/horovod/torch/__init__.py", line 533, in _get_types
    return type(x), [_get_types(xi) for xi in x]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/horovod/torch/__init__.py", line 533, in <listcomp>
    return type(x), [_get_types(xi) for xi in x]
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/horovod/torch/__init__.py", line 532, in _get_types
    if isinstance(x, Iterable):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/abc.py", line 184, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/_weakrefset.py", line 72, in __contains__
    wr = ref(item)
RecursionError: maximum recursion depth exceeded while calling a Python object
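The alternating _get_types/listcomp frames above are the signature of recursing into values whose elements are themselves iterable, with no base case. A minimal stdlib-only reproduction of that failure mode (not Horovod's actual input) is a plain string:

```python
from collections.abc import Iterable

# A str is an Iterable whose elements are length-1 strs, so naively
# type-walking one recurses forever and hits RecursionError.
def get_types(x):
    if isinstance(x, Iterable):
        return type(x), [get_types(xi) for xi in x]
    return type(x)

try:
    get_types("hi")          # "h" -> iterable of "h" -> iterable of "h" -> ...
    crashed = False
except RecursionError:
    crashed = True
assert crashed
```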

created time in 2 months

issue opened NVIDIA/apex

Recent commits raise this error

TypeError: multi_tensor_lamb_stage1_cuda(): incompatible function arguments. The following argument types are supported:
    1. (arg0: int, arg1: at::Tensor, arg2: List[List[at::Tensor]], arg3: at::Tensor, arg4: int, arg5: float, arg6: float, arg7: float, arg8: at::Tensor, arg9: float) -> None

I see this error on current HEAD: c3fad1ad120b23055f6630da0b029c8b626db78f I don't see this error on commit from May 14 : 3bae8c83494184673f01f3867fa051518e930895

created time in 2 months

issue closed tensorflow/models

What's the plan for this repository in terms of version compatibility?

What's the plan for this repository in terms of version compatibility? There are a lot of API calls which don't work in tf2.0. Should I create a PR to fix some of them? Or is this code supposed to work as-is for tf1.x?

closed time in 2 months

rahul003

pull request comment awslabs/sagemaker-debugger

Revert "TF 2.x: Support for keras to estimator"

Why? Did it introduce some problem?

NihalHarish

comment created time in 2 months

create branch rahul003/deep-learning-containers

branch : tf2.2

created branch time in 2 months

push event rahul003/deep-learning-containers

satish Kumar Gollaprolu

commit sha 94f8ea08f4ab66b057e6818b2e86dafdeb7c06c7

use symlinking instead of env variables for ssl_certs_dir (#207) * use symlinking instead of env variables for ssl_certs_dir * skip the test_smbdebug for sagemaker remote tests Co-authored-by: Satish Gollaprolu <sgollapr@amazon.com>

view details

Arjuna

commit sha 50f387c029c26531c67103673e6b2b052ae9435f

[mxnet, tensorflow, pytorch] | [build, test] | [sagemaker] Update Pillow version (#208)

view details

Zhuo Weng

commit sha 6f0b0cd68a8cadcb3ad71b9969f6b215918e4a4b

MXNet benchmark (#204)

view details

Tushar Dey

commit sha 0d287aeb3686263e706c146512ba4712e349eca2

[MXNet][Sanity Test]Updated packages for MXNet (#217) * Updated packages * Minor syntax error fix * Minor syntax error fix Co-authored-by: Tushar Dey <tshdy@amazon.com>

view details

Zhuo Weng

commit sha b9929da2bc61d88c1f3ea92e12636d1121ebe7c1

[pytorch] | [build, test] | [ec2, ecs, eks, sagemaker] PT 1.4 patch release (#216) * PT 1.4 patch release * Apply example image change to PT1.4 * fix dependency issue Co-authored-by: Tushar Dey <dey.tushar@yahoo.com>

view details

Sai Parthasarathy Miduthuri

commit sha 1551238b591d8fb6a1a1f2de9df95aba7200b8e7

[tensorflow] | [test] change: Skip standalone keras smdebug test (#213)

view details

Tushar Dey

commit sha 4a3748ff90732880b2be6167307d3dad54237d3b

[Build]Adding buildspec for TF1.x (#215) * New buildspec for TF1.x * disabling the test * Minor Correction * Making sure that correct buildspec is been picked * Minor Correction * Few more minor changes * Testing of using correct buildspec * Mionor syntax correction * Changing the condition * Enabling all the test * Removing of debug statements * Adding of pattern to include buildspec-tf1 * Adding Py2 and Inference for TF1.x buildspec Co-authored-by: Tushar Dey <tshdy@amazon.com>

view details

Zhuo Weng

commit sha 4746522177f2b09453cf1c8932d9ddf40b8bb7a4

Remove the temp patch for PT example tests (#223)

view details

Sai Parthasarathy Miduthuri

commit sha 1207ce7fb60430e234c1c64ecc8d7496be0ec6e2

[test] | [eks] change: Create a new EKS cluster for each PR test set (#218)

view details

Nihal Harish

commit sha 4143da2098cdf15c338c7e386b4089a76237e416

[mxnet, tensorflow, pytorch] | [build, test] | [ec2, sagemaker]|Smdebug Version Bump (#225) * smdebug version bump * Correct SMDebug test file name * rename tests * switch to pytest runner * change smdebug output path * add to 2.0.1 * revert test changes * Update testSmdebug Co-authored-by: Zhuo Weng <wenzhuo@amazon.com>

view details

Zhuo Weng

commit sha f55370587097430db3af3deb9d7277397e7c1833

nit for mxnet benchmark helper script (#231)

view details

Tushar Dey

commit sha 753b9c2ec529f48edbf0bd2986509173ea9b10da

Release Images Changes for PyTorch 1.4.0 (#226) Co-authored-by: Tushar Dey <tshdy@amazon.com> Co-authored-by: Zhuo Weng <wenzhuo@amazon.com>

view details

Zhuo Weng

commit sha 6d5935d853612ac3f4cf541d7b8cf5354f95b876

[pytorch] | [build, test] | [ec2, ecs, eks, sagemaker] Update py36 version to python3.6.10 for PT1.4 Images (#234) * Update py36 version to python3.6.10 for PT1.4 Images * Include Inference Images for this release

view details

Arjuna

commit sha 8bf671cbd94a40a2f535cd75adb14ac4c17d7805

[mxnet, tensorflow, pytorch] | [test] | [ec2] feature: Add function to conditionally determine EC2 instance type fr… (#219)

view details

Lauren Yu

commit sha 80665bbaaaa6248bbf6915b47d0c806867974e75

update sagemaker-tensorflow-training versions for TF 1.15.2 and 2.2 (#211)

view details

Arjuna

commit sha 18777dd5f4ff3ce80beab14a42d0227ae5829b6a

[test]|[ec2] Add instance type to ec2 key pair (#238) * Add instance type to ec2 key pair * Address review comments Co-authored-by: Arjuna Keshavan <arjunake@amazon.com>

view details

Zhuo Weng

commit sha 99c5778743bb68ead1899798f5f8c991f0d78574

[pytorch] [build, test] | [ec2, ecs, eks, sagemaker] Disable single-node pytorch training EKS for mainline pipeline (#237) * Disable single-node pytorch training EKS for mainline pipeline * Also disable PT dgl EKS single node training for mainline pipeline

view details

Zhuo Weng

commit sha 9c6a724ce23fd49e9cc3fe5388a390980337b542

Release logic change for PT14 patch release (#244)

view details

Zhuo Weng

commit sha 57b53dc88e16822e6117035a88ff7dab36e60aaf

Update python36 version for PT1.5 Images and change buildspec.yml in favor of PT1.5 patch release (#242) Co-authored-by: Tushar Dey <dey.tushar@yahoo.com>

view details

Sai Parthasarathy Miduthuri

commit sha 28b89613ad8f7232ded3a4b756a15c964c5de765

[tensorflow] | [test] | [benchmark] | [sagemaker] feature: Add Tensorflow sagemaker benchmark tests (#241)

view details

push time in 2 months

delete branch rahul003/deep-learning-containers

delete branch : tf2train2

delete time in 2 months

PR closed aws/deep-learning-containers

Updating TF binaries for version 2.2.0

This is an automated PR by the TF deployment pipeline r2-2-aws-tensorflow

+250 -4

0 comment

4 changed files

rahul003

pr closed time in 2 months

pull request comment pytorch/pytorch

[autograd] enable graph level thread parallelism on CPU

Hi @mruberry and @albanD thanks for your replies.

@mruberry I think the analogous behavior would be to have different backward calls happen in parallel (maybe with multiple streams) instead of using multiple streams inside a backward call. Did you mean that the former is possible today? I tried making the two backward calls from two threads, but the second one seemed to wait for the first to finish. Is there a way around that?

The motivation here is that an op I added makes an IO call during backward, and I am trying to overlap backward passes while accumulating the gradients, so as not to be serialized on that IO during the backward pass. Is this possible today with GPU? I see that this PR allows it for CPU workloads if the calls are made from different threads.
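The kind of overlap being asked about can be sketched framework-free: two IO-bound "backward" calls launched from separate threads take roughly the duration of one call, not two (assuming the blocking work releases the GIL, as IO does).

```python
import threading
import time

# Framework-free sketch of graph-level parallelism: two "backward" calls
# that each block on IO for 0.2s overlap when launched from two threads,
# so total wall time approaches the slower call, not the sum.
def fake_backward(results, idx):
    time.sleep(0.2)              # stands in for the IO call during backward
    results[idx] = True

results = [False, False]
start = time.perf_counter()
threads = [threading.Thread(target=fake_backward, args=(results, i)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

assert all(results)
assert elapsed < 0.38            # ~0.2s overlapped, not 0.4s serialized
```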

wanchaol

comment created time in 2 months

PR opened aws/deep-learning-containers

Updating TF binaries for version 2.2.0

This is an automated PR by the TF deployment pipeline r2-2-aws-tensorflow

+250 -4

0 comment

4 changed files

pr created time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha a0241cb9385e9dedadf1b022fb9ba3c3cbd64c65

Update binaries from new build

view details

push time in 2 months

PR closed aws/deep-learning-containers

Updating TF binaries for version 2.2.0

This is an automated PR by the TF deployment pipeline tf-pr-195-tensorflow

+250 -4

0 comment

4 changed files

rahul003

pr closed time in 2 months

PR opened aws/deep-learning-containers

Updating TF binaries for version 2.2.0

This is an automated PR by the TF deployment pipeline tf-pr-195-tensorflow

+250 -4

0 comment

4 changed files

pr created time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha fe32e77e48f47b042bd68afa45ecb95cb18dc2a5

Update binaries from new build

view details

push time in 2 months

pull request comment aws/deep-learning-containers

Resolve missing libcuda.so issue

Yes we should ensure 10.2 also has this

Elizaaaaa

comment created time in 2 months

PR closed aws/deep-learning-containers

[tensorflow][build] Update TF 2.2 binaries with some performance optimization

Issue #, if available:

Checklist

  • [x] I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [build] | [test] | [build, test] | [ec2, ecs, eks, sagemaker]
  • [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [x] (If applicable) I've documented below the tests I've run on the DLC image
  • [x] (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
  • [x] (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Description: Binaries are now optimized for the CPU architectures we have on AWS. Fixes the warning that says binaries were not built with AVX2, etc.

Tests run: Tests from TF pipeline

DLC image/dockerfile: TF2.2 py37 cpu and gpu

Additional context:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

+250 -4

0 comment

4 changed files

rahul003

pr closed time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha 5a321e3696f1069c19cd63b5d20c3fe8808ca086

Update binaries from new build

view details

push time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha 74fdaae85d7943a1cedb2a9547eb639677ad9964

Update binaries from new build

view details

push time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha c5adec16816f308498350bad22c36e52613ba987

Update binaries from new build

view details

push time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha e4c34cef7d6fc326ed89c1b544cfc78c240860a2

Update binaries from new build

view details

push time in 2 months

pull request comment pytorch/pytorch

[autograd] enable graph level thread parallelism on CPU

Is there any plan to extend the same to execution on GPU?

wanchaol

comment created time in 2 months

pull request comment aws/deep-learning-containers

Resolve missing libcuda.so issue

Don't we need to do this for previous releases too?

Elizaaaaa

comment created time in 2 months

MemberEvent

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha b664e4a1fb91cf4d8e5ce4ef55413e1e2915558b

Update binaries from new build

view details

push time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha 5662977df72beb3d1363b8386ea9f5e35ac3da8c

Update binaries from new build

view details

push time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha 137cb4d9b367785b013448e78d715e5e6f957726

Update binaries from new build

view details

push time in 2 months

Pull request review comment awslabs/sagemaker-debugger

TF 2.x: Support for keras to estimator

 def __init__(self, collections=None, create_default=True):
                 self.create_collection(n)
             if is_tf_version_2x() and tf.executing_eagerly():
                 self.get(CollectionKeys.BIASES).include("^(?!gradient).*bias")
-                self.get(CollectionKeys.WEIGHTS).include("^weights/.*/((?!bias).)*$")
-                self.get(CollectionKeys.LOSSES).include(".*loss.*")
-                self.get(CollectionKeys.GRADIENTS).include("^gradient")
             else:
                 self.get(CollectionKeys.BIASES).include("bias")

Why are the above being removed btw?

vandanavk

comment created time in 2 months

Pull request review comment awslabs/sagemaker-debugger

TF 2.x: Support for keras to estimator

 def __init__(self, collections=None, create_default=True):
                 self.create_collection(n)
             if is_tf_version_2x() and tf.executing_eagerly():
                 self.get(CollectionKeys.BIASES).include("^(?!gradient).*bias")
-                self.get(CollectionKeys.WEIGHTS).include("^weights/.*/((?!bias).)*$")
-                self.get(CollectionKeys.LOSSES).include(".*loss.*")
-                self.get(CollectionKeys.GRADIENTS).include("^gradient")
             else:
                 self.get(CollectionKeys.BIASES).include("bias")

Because SessionHook doesn't identify these tensors based on regex patterns. They come from global collections maintained by TF (not gradients, where we add them manually).

vandanavk

comment created time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha 9db609f8168fa29a42c600e44be64ec2c77fc89e

Update binaries from new build

view details

push time in 2 months

pull request comment awslabs/sagemaker-debugger

Avoiding Basehook object pickling

Nice, didn't know about this function!

Vikas-kum

comment created time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha 5d42a0e96b4b840149a80cc806cc6b742bc2650e

Update binaries from new build

view details

push time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha 3481c0f9014c02a2ca8def6b6b255e0c15897fc9

Update binaries from new build

view details

push time in 2 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha b3eda0825a79a94f74325bb2cbd3e5280b716233

Update binaries from new build

view details

push time in 3 months

push event rahul003/deep-learning-containers

Rahul Huilgols bot

commit sha 0a644af839f0ae4dbd1498a3205c4b8d0a3b2b92

Update binaries from new build

view details

push time in 3 months

push event rahul003/deep-learning-containers

satish Kumar Gollaprolu

commit sha 94f8ea08f4ab66b057e6818b2e86dafdeb7c06c7

use symlinking instead of env variables for ssl_certs_dir (#207) * use symlinking instead of env variables for ssl_certs_dir * skip the test_smbdebug for sagemaker remote tests Co-authored-by: Satish Gollaprolu <sgollapr@amazon.com>

view details

Rahul Huilgol

commit sha ff6d5527085bd7f5a45b963ef86c7a078294d989

Merge branch 'master' into tf2train2

view details

push time in 3 months

push event rahul003/tensorflow

Rahul Huilgol

commit sha ae6061da8311d40faf9ee91a4466a15811c7efbe

Whitespace changes

view details

push time in 3 months

Pull request review comment tensorflow/tensorflow

Improving the performance of reads from S3 by parallelizing downloads

 def repo():
     third_party_http_archive(
         name = "aws",
         urls = [
-            "https://mirror.bazel.build/github.com/aws/aws-sdk-cpp/archive/1.7.266.tar.gz",
-            "https://github.com/aws/aws-sdk-cpp/archive/1.7.266.tar.gz",
-        ],
-        sha256 = "39fd8a2999260d2b8fcbc8187f1ed5299972c2b8bd14adb7850fd674fea67fb7",
-        strip_prefix = "aws-sdk-cpp-1.7.266",
+             "https://mirror.bazel.build/github.com/aws/aws-sdk-cpp/archive/1.7.336.tar.gz",
+             "https://github.com/aws/aws-sdk-cpp/archive/1.7.336.tar.gz",
+         ],
+        sha256 = "758174f9788fed6cc1e266bcecb20bf738bd5ef1c3d646131c9ed15c2d6c5720",
+        strip_prefix = "aws-sdk-cpp-1.7.336",

Actually, the rest of the PR depends on this new version, as I added a change in the SDK to enable downloading a given range of a file instead of only the whole file.

rahul003

comment created time in 3 months

push event rahul003/tensorflow

Rahul Huilgol

commit sha 2662f079df1fcbc6995c26443d5c362c20d905be

Use new transfer manager to improve read performance
Finish multi part download implementation except error near end of file
Fix bug in get
Fix build error
Add test
Fix build error
Fix test
Fix test
modify test
Reenable test
Add override for tfrecord dataset buffer, Recognize the error when end of file is reached as a special case of error
Fix build error
Cleanup
Allow testing to compare old and new function behaviors
Remove logs in the test
Fix build error
Update test to improve time log
Remove new lines
Fix uploads due to them being too small. Made chunk size 5MB
Use separate transfer managers for upload and download, with different chunk sizes

view details

Rahul Huilgol

commit sha f5ece2f397172edc1d9dc6f529c459682cafcafc

Use new release of aws sdk

view details

push time in 3 months

PR opened tensorflow/tensorflow

Improving the performance of reads from S3 using Transfer Manager's Multi Part Download from AWS SDK

  • Updates the S3 File system in TensorFlow to read a file from S3 in parallel using the AWS SDK's transfer manager.
  • To support this PR, I made a PR to AWS SDK https://github.com/aws/aws-sdk-cpp/pull/1349, and have updated the version of SDK to match the release after that PR.
  • This PR also adds an override of the buffer size for TFRecordDataset similar to what's done for TPU GCS file system. This helps reduce overhead of network calls and thus improves performance, by reading larger parts of a file at once.
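The core multi-part idea can be sketched without S3 or the SDK (the "download" below just slices an in-memory blob standing in for a ranged GET; the chunk size and helper names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the transfer-manager idea: split a byte range into fixed-size
# parts, fetch each part concurrently, then reassemble in order.
CHUNK = 4

def fetch_range(blob, start, end):
    return blob[start:end]       # stands in for an S3 ranged GET

def parallel_read(blob):
    ranges = [(i, min(i + CHUNK, len(blob))) for i in range(0, len(blob), CHUNK)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = pool.map(lambda r: fetch_range(blob, *r), ranges)  # order preserved
    return b"".join(parts)

data = bytes(range(10))
assert parallel_read(data) == data
```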

Benchmarking

The following code snippet, when run with TFRecords prepared from the ImageNet dataset with images resized to 480px (prepared using the scripts here), gives ~2600 images/sec when the data is on S3. The same script run with vanilla TF gives ~200 images/sec, i.e. a 13x speedup.

import tensorflow as tf
import pathlib
import time
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--buffer-size', type=int, default=None)
parser.add_argument('--path-type', type=str, default='s3')
parser.add_argument('--path', type=str, default='s3://aws-tensorflow-benchmarking/imagenet-armand/train-480px/')
parser.add_argument('--num-batches', type=int, default=1000)
parser.add_argument('--batch-size', type=int, default=128)
parser.add_argument('--cache-processing', type=str, default='true')
args = parser.parse_args()

if args.path_type == 'local':
    data_dir = pathlib.Path('/home/ubuntu/imagenet-480px/')
    train_tdf = tf.data.TFRecordDataset([i.as_posix() for i in data_dir.glob('train*')], buffer_size=args.buffer_size)
elif args.path_type == 's3':
    files = tf.data.Dataset.list_files(os.path.join(args.path, 'train*'))
    train_tdf = tf.data.TFRecordDataset(files, buffer_size=args.buffer_size)
else:
    raise NotImplementedError()

features = {'image/encoded': tf.io.FixedLenFeature((), tf.string, ""),
            'image/class/label': tf.io.FixedLenFeature([1], tf.int64, -1)}
CACHE = {'image': None, 'label': None}

def parse(record):
    record = tf.io.parse_single_example(record, features)
    if args.cache_processing != 'true' or CACHE['image'] is None:
        image = record['image/encoded']
        image = tf.image.decode_jpeg(image)
        image = tf.image.resize(image, (224, 224))/255.
        CACHE['image'] = image
    else:
        image = CACHE['image']
    if args.cache_processing != 'true' or CACHE['label'] is None:
        label = record['image/class/label']
        label -= 1
        CACHE['label'] = label
    else:
        label = CACHE['label']
    return image, label


def benchmark(iterator):
    start_time = time.perf_counter()
    i = 0
    for sample in iterator:
        # Performing a training step
        i += 1
        if i % 10 == 0:
            print(f"Batch {i}: Time from start {time.perf_counter() - start_time}, Speed (imgs/sec): {i * args.batch_size / (time.perf_counter() - start_time)}")
        if i == args.num_batches:
            break
    cur_time = time.perf_counter()
    print(f"Total execution time: {cur_time - start_time}, Speed (imgs/sec): {i * args.batch_size / (cur_time - start_time)}")


iterator = train_tdf.map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(args.batch_size) \
                 .prefetch(tf.data.experimental.AUTOTUNE)

benchmark(iterator)
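As a sanity check on the numbers quoted above, the throughput math that benchmark() prints can be reproduced standalone (plain Python, no TensorFlow needed; the function names here are illustrative, not part of the script):

```python
# Throughput formula used by benchmark() above:
# images/sec = batches_seen * batch_size / elapsed_seconds
def images_per_sec(num_batches, batch_size, elapsed_seconds):
    return num_batches * batch_size / elapsed_seconds

# Relative speedup between two measured rates.
def speedup(fast_rate, slow_rate):
    return fast_rate / slow_rate
```

With the quoted ~2600 vs ~200 imgs/sec, speedup(2600, 200) gives the 13x mentioned in the description.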
+302 -47

0 comment

5 changed files

pr created time in 3 months

push event rahul003/tensorflow

Qiao Zhang

commit sha 75681ff151ed0b285cfd3bda28550668569f9166

Re-arrange TFRT SavedModel directory. PiperOrigin-RevId: 310238268 Change-Id: I10a638d1e1726c2bad6065029d254c8bd48501b8

view details

Rick Chao

commit sha 5f56e8201d083e9a8e580bfb5debddba04cbd8de

Add test to cover the code path run by the public multi-worker with Keras tutorial. PiperOrigin-RevId: 310239734 Change-Id: I17ce8e5217b776cb182ff2955a1ebefd37455966

view details

Alexandre Passos

commit sha faaffa91e516b2aba0d98a4b89f77fbbea30c97b

Don't put iterator deleters as host memory. PiperOrigin-RevId: 310240511 Change-Id: I8a24b5af7cc39c5ef1c2072ee43ba8450060b9d7

view details

A. Unique TensorFlower

commit sha 79f0af0c905c0d7d2ad6cc1269ec2993d2c92687

Update index_lookup to return a UTF-8 string in vocabulary lookup instead of a bytestring. PiperOrigin-RevId: 310243815 Change-Id: If61dc5158b3960acc28d555bc6e0fb87b2439318

view details

TensorFlower Gardener

commit sha ea89cdefcd6865b16b33d1a8c6a5e2ab3ca92dd4

Merge pull request #39181 from yair-ehrenwald:master PiperOrigin-RevId: 310245725 Change-Id: I25e17974b06f239de17638097917eda678ec161a

view details

Ruoxin Sang

commit sha d5cefa4813c1d10fba95b7981fbb610abe2eb4ab

Don't put iterator deleters as host memory. PiperOrigin-RevId: 310246960 Change-Id: I92a03be74e3e82346672d51d9caf38dd9d9b1b3e

view details

Xinyi Wang

commit sha f95ae3fef0e6c0f06fd73c2fd34aab1002dbbcb2

In AggregatingVariable.assign, when aggregating values across replicas in a merge_call, also check if the "name" kwarg is a PerReplica object and if so, change to a single value. PiperOrigin-RevId: 310248002 Change-Id: If1cebef2ce9522e2f27e34f1f9212053ed5ff7d1

view details

A. Unique TensorFlower

commit sha 63c7b63acf857d09941369ad3f00ef0c79d1b9c9

Fixes: GitHub Pull Request #37017 PiperOrigin-RevId: 310248064 Change-Id: Ia79b707acc4ae9e6db4be7468585264dd9dfb5f5

view details

Skye Wanderman-Milne

commit sha 237864e2bbff608c9267a66b6f1437cb43fdaafd

[XLA:Python] Remove xla_client.Buffer class. The only remaining method was Buffer.from_pyval. Callers should use LocalClient.buffer_from_pyval instead. PiperOrigin-RevId: 310248281 Change-Id: I3cca4e5ea85b7632ac5ef2f40fec488e50fe0fc8

view details

A. Unique TensorFlower

commit sha 87a9c6491da162381b755c8c4da40a68c1f41cca

Disable broken Windows test PiperOrigin-RevId: 310248521 Change-Id: I00d434491399ca5983e7bd34725dc305f159765a

view details

Smit Hinsu

commit sha 1ed118724051a0b652d39c481471ddd4e5937ed1

Fix ConvertHalfTensor bug with elements attribute of size one PiperOrigin-RevId: 310250092 Change-Id: I3d8a22ef859e2ee2c187fd4894efbb74f7314e60

view details

Smit Hinsu

commit sha 9debaffa6994e77a50b53c7f2c74543071254d74

Auto-generate some of the TensorFlow ops Added DataFormatDimMapOp, EluGradOp, LeakyReluGradOp, Relu6GradOp, SeluOp, SeluGradOp and SqrtGradOp ops PiperOrigin-RevId: 310250948 Change-Id: Iac770687704faf9c856ef214cac036a5eb857348

view details

A. Unique TensorFlower

commit sha 6a116a3a1dcd79f61be28c6969d5b7ddee763bc6

Update ops-related pbtxt files. PiperOrigin-RevId: 310252218 Change-Id: I02ca09adbc63759251c9b7fd98527f6422db0cfb

view details

Robert Suderman

commit sha 477f12233774b623fe27d4c9bdeef81081362f27

Fixed int64 vs int32 issue in empty op folder. PiperOrigin-RevId: 310252768 Change-Id: Ibc9d75f0bb19d1a39d6d3abbc9641e9c3f0b340b

view details

Yujing Zhang

commit sha b5b150f79c1e009baa6b7e0a67762d00af4cf297

Fix an issue of out of order execution. Don't serialize a remote input handle for function execution until it's ready on a remote device. Otherwise, on a remote worker, a remote function execution request could be enqueued before a request for producing a function input. PiperOrigin-RevId: 310253012 Change-Id: I20e649494ec27f4bd581798d2ed458453f75d30f

view details

A. Unique TensorFlower

commit sha fe972004ab02ff454749bea5780e70d4a4633c3a

Fixes: GitHub Issue #39222 PiperOrigin-RevId: 310254359 Change-Id: Ibc5879288859552ba58d4fb8591de3825d694dba

view details

A. Unique TensorFlower

commit sha cff8cf4fa106d75b5c6ddd22ceb7980df0a6b9b8

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 310257373 Change-Id: I5638601b8e0017a958ab4c9b1c7c469f4b30943a

view details

Lucy Fox

commit sha 469de83a9c485563bfda0006f0dbc0673ee75f14

Emit error messages for all missing legalizations in TF to XLA full legalization pass. A full legalization conversion stops after the first failed conversion encountered. For building the TF to XLA bridge, it is useful for this pass to continue through and emit information about all of the missing ops. Instead, use the Partial conversion mode to get the full set of operations that are not legalizable. The "full" conversion succeeds if this set is empty. This does not change the behavior when the full legalization pass succeeds. However, if the conversion fails, the outputted error message is now much more useful. For the sake of demonstrating what this might look like with a large model, I've run this on Transformer with the Unary op lowerings removed. Resulting error message output:

Before this change:
```
Compilation failure: MLIR TF to XLA legalization failed
-:64:11: error: failed to legalize operation 'tf.Rsqrt'
-:64:11: note: see current operation: %37 = "tf.Rsqrt"(%33) : (tensor<f32>) -> tensor<f32>
```

After this change (default case):
```
Compilation failure: MLIR TF to XLA legalization failed
-:4:3: error: The following operations cannot be legalized: tf.Rsqrt (count: 217); tf.SoftmaxCrossEntropyWithLogits (count: 1); tf.Sqrt (count: 370). These legalization failure(s) may be due to missing TF to HLO lowerings and/or unsupported attributes, etc.
-:4:3: error: Emitting more detail about one op that failed to legalize...
-:251:12: error: 'tf.Rsqrt' op is not legalizable
-:251:12: note: see current operation: %224 = "tf.Rsqrt"(%220) : (tensor<f32>) -> tensor<f32>
```

After this change (verbose case, with logging set to 1):
```
Compilation failure: MLIR TF to XLA legalization failed
-:4:3: error: The following operations cannot be legalized: tf.Rsqrt (count: 217); tf.SoftmaxCrossEntropyWithLogits (count: 1); tf.Sqrt (count: 370). These legalization failure(s) may be due to missing TF to HLO lowerings and/or unsupported attributes, etc.
-:4:3: error: Emitting more detail about one of each type of op that failed to legalize...
-:1769:13: error: 'tf.Rsqrt' op is not legalizable
-:1769:13: note: see current operation: %1742 = "tf.Rsqrt"(%1738) : (tensor<f32>) -> tensor<f32>
-:3308:24: error: 'tf.SoftmaxCrossEntropyWithLogits' op is not legalizable
-:3308:24: note: see current operation: %loss, %backprop = "tf.SoftmaxCrossEntropyWithLogits"(%3495, %3503) : (tensor<768x33708xf32>, tensor<768x33708xf32>) -> (tensor<768xf32>, tensor<768x33708xf32>)
-:6944:13: error: 'tf.Sqrt' op is not legalizable
-:6944:13: note: see current operation: %7319 = "tf.Sqrt"(%7318) : (tensor<f32>) -> tensor<f32>
```

PiperOrigin-RevId: 310258485 Change-Id: Id6f8709c2548e7ded9fb6fe690c9d17e6c6d394f

view details

Rachel Lim

commit sha a967cad22b06fd24a400c7b3c27d4a573ee9f68f

[tf.data] Add map and batch fusion rewrite in MLIR PiperOrigin-RevId: 310260964 Change-Id: I6505bcd35f21a3f9ff520f1900038c3c4be15536

view details

Jared Duke

commit sha 154044d0f23665fc8904be08233f48057deb5cf9

Internal test infra change PiperOrigin-RevId: 310261779 Change-Id: Iac550ffeb52a444c6ae58dbc85bb67bf80f50dd8

view details

push time in 3 months

PR opened aws/deep-learning-containers

[tensorflow][build] Update TF 2.2 binaries with some performance optimization

Issue #, if available:

Checklist

  • [x] I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [build] | [test] | [build, test] | [ec2, ecs, eks, sagemaker]
  • [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [x] (If applicable) I've documented below the tests I've run on the DLC image
  • [x] (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
  • [x] (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Description: Binaries are now optimized for the CPU architectures we have on AWS. Fixes the warning that says the binaries were not built with AVX2, etc.

Tests run: Tests from TF pipeline

DLC image/dockerfile: TF2.2 py37 cpu and gpu

Additional context:
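As a rough illustration of the warning mentioned above (TF warns when the binary was not compiled for instruction sets like AVX2/FMA): a hypothetical helper, not part of this PR, that checks a cpuinfo-style flags string (e.g. the "flags" line from /proc/cpuinfo on Linux) for the instruction sets these optimized builds target:

```python
# Illustrative only: report which of the wanted instruction-set flags are
# missing from a space-separated CPU flags string. Flag names are standard
# x86 cpuinfo tokens ("avx2", "fma").
def missing_isa_flags(cpuinfo_flags, wanted=("avx2", "fma")):
    present = set(cpuinfo_flags.split())
    return [f for f in wanted if f not in present]
```

On Linux one might feed in the flags line parsed out of /proc/cpuinfo; an empty result means the CPU supports everything the optimized binary uses.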

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

+4 -4

0 comment

2 changed files

pr created time in 3 months

PR closed aws/deep-learning-containers

Updating TF binaries for version 2.2.0

This is an automated PR by the TF deployment pipeline r2-2-aws-tensorflow

+4 -4

0 comment

2 changed files

rahul003

pr closed time in 3 months

create branch rahul003/deep-learning-containers

branch : tf2train2

created branch time in 3 months

fork rahul003/deep-learning-containers

AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.

https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html

fork in 3 months

fork rahul003/deep-learning-containers-1

AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.

https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html

fork in 3 months

PR closed aws/deep-learning-containers

Updating TF binaries for version 2.1.0

This is an automated PR by the TF deployment pipeline aws-tensorflow-beta-pipeline-r2-1-aws

+9 -9

0 comment

7 changed files

rahul003

pr closed time in 3 months
