If you are wondering where the data on this site comes from, please visit https://api.github.com/users/andfoy/events. GitMemory does not store any data; it only uses NGINX to cache data for a period of time. The idea behind GitMemory is simply to give users a better reading experience.
Edgar Andrés Margffoy Tuay andfoy @quansight Bogotá, Colombia http://margffoy-tuay.com Systems and Computing Engineering MSc (Uniandes - 2020). Spyder and torchvision developer. Former Computer Vision and Natural Language Understanding researcher

andfoy/extorch 4

Elixir/Erlang bindings for libtorch (PyTorch)

andfoy/3d-cv-exercises 1

Repository for 3D Computer Vision exercises at TU Kaiserslautern

andfoy/deep-learning-models 1

Collection of Supervised, Semisupervised and Unsupervised models suitable for deep learning applications. (MATLAB/Python)

andfoy/distributed-gpu-monitor 1

ZMQ Distributed GPU monitor for multiple machines based on nvtop

andfoy/action-tmate 0

Debug your GitHub Actions via SSH by using tmate to get access to the runner system itself.

andfoy/adventofcode2020 0

Solutions for Advent of Code 2020

andfoy/andfoy.github.io 0

Test Repository

andfoy/AntennaPod 0

A podcast manager for Android

andfoy/bsc-dissertation 0

Dynamic Multimodal Object Segmentation based on natural language referring expressions and its applications

andfoy/builder 0

Continuous builder and binary build scripts for pytorch

pull request comment pytorch/pytorch

Fix typo in ChainDataset docs

:pill: CI failures summary and remediations

As of commit 982d83e262 (more details on the Dr. CI page and at hud.pytorch.org/pr/60336):


Commit 982d83e262 was recently pushed. Waiting for builds...


This comment was automatically generated by Dr. CI (https://code.intern.facebook.com/ci/dr-ci-info/). Follow this link to opt out of these comments for your pull requests: https://code.intern.facebook.com/ci/settings/.

Please report bugs/suggestions to the (internal) Dr. CI Users group (https://fburl.com/ujo0mikv). Click https://our.intern.facebook.com/intern/opensource/ci/regenerate_comment/502992687677433/ to manually regenerate this comment.

simonseo

comment created time in 12 minutes

pull request comment pytorch/pytorch

Fix typo in ChainDataset docs

Hi @simonseo!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

simonseo

comment created time in 12 minutes

PR opened pytorch/pytorch

Fix typo in ChainDataset docs
  • chainning -> chaining
+2 -2

0 comment

1 changed file

pr created time in 12 minutes

Pull request review comment pytorch/pytorch

[Model Averaging] Periodic model averager

 def test_average_parameters(self):
                for p in model.parameters():
                    self.assertEqual(p.data, torch.ones_like(p.data) * 0.5)
+        @unittest.skipIf(
+            BACKEND != "nccl" and BACKEND != "gloo",
+            "MPI backend does not support creating subgroups on CUDA devices",
+        )
+        @skip_if_lt_x_gpu(2)
+        @skip_if_odd_num_of_gpus()
+        def test_periodic_model_averager(self):
+            rank = dist.get_rank()
+            rank_to_GPU = self._init_multigpu_helper()
+            device_id = rank_to_GPU[rank][0]
+            world_size = dist.get_world_size()
+
+            model = nn.Linear(1, 5, bias=False).cuda(device_id)
+            param = next(model.parameters())

This is because there is only a single linear layer, so there is only a single parameter tensor.
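A minimal sketch of this point (using a bias-free linear layer like the one in the test above):

import torch.nn as nn

# A Linear layer without bias owns exactly one parameter tensor (its weight),
# so next(model.parameters()) captures the whole model state.
model = nn.Linear(1, 5, bias=False)
params = list(model.parameters())
assert len(params) == 1
print(params[0].shape)  # torch.Size([5, 1])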

SciPioneer

comment created time in 19 minutes

pull request comment pytorch/pytorch

Copy Tensor for tests to avoid in-place transform modifying the original tensor

I agree this change fixes the test, so that's cool, but it doesn't resolve the NaN propagation issue. Should the original issue be left open and modified to explain what you discovered while debugging, @Flamefire?

Flamefire

comment created time in 19 minutes

Pull request review comment pytorch/pytorch

[Model Averaging] Periodic model averager

 def test_average_parameters(self):
                for p in model.parameters():
                    self.assertEqual(p.data, torch.ones_like(p.data) * 0.5)
+        @unittest.skipIf(
+            BACKEND != "nccl" and BACKEND != "gloo",
+            "MPI backend does not support creating subgroups on CUDA devices",
+        )
+        @skip_if_lt_x_gpu(2)
+        @skip_if_odd_num_of_gpus()

Yeah, actually I forgot to label this PR [WIP]. I plan to remove the skip_if_odd_num_of_gpus decorator and simplify the test code.

SciPioneer

comment created time in 20 minutes

Pull request review comment pytorch/pytorch

[Model Averaging] Periodic model averager

+import torch.distributed as dist
+import torch.distributed.algorithms.model_averaging.utils as utils
+
+
+class PeriodicModelAverager:
+    r"""
+    Averages parameters periodically or during the warm-up stage.
+
+    This can be used for running `post-local SDG <https://arxiv.org/abs/1808.07217>`_,
+    by running :class:`~torch.nn.DistributedDataParallel` (DDP)
+    using the subgroups created by :meth:`~torch.distributed.new_subgroups`.
+
+    Args:
+        module (torch.nn.Module): The module where its parameters will be averaged.
+        period (int): The number of steps per model averaging.
+                      Usually the period should be greater than ``1`` to reduce the communication cost.
+                      Otherwise, only DDP needs to be used.
+        warmup_steps (int): The number of warm-up steps. During this stage,
+                            ``period`` is viewed as 1, and the parameters are averaged at every step.
+        process_group: The process group to be used for all-reduce.
+                       If ``None``, the default process group, which
+                       is created by :func:`torch.distributed.init_process_group`,
+                       will be used. (default: ``None``)
+
+    Example::
+
+        >>>  import torch
+        >>>  import torch.distributed as dist
+        >>>  import torch.distributed .algorithms.model_averaging.averagers as averagers
+        >>>  import torch.nn as nn
+        >>>
+        >>>  dist.init_process_group("nccl", rank=rank, world_size=16)
+        >>>  torch.cuda.set_device(rank)
+        >>>  module = nn.Linear(1, 1, bias=False).to(rank)
+        >>>  subgroup, subgroups = dist.new_subgroups()
+        >>>  # Gradients are averaged by each intra-node subgroup during the backward pass.
+        >>>  model = nn.parallel.DistributedDataParallel(
+        >>>     module, device_ids=[rank], output_device=rank, process_group=subgroup
+        >>>  )
+        >>>
+        >>>  # In the first 100 steps, run model averaging every step.
+        >>>  # After 100 steps, run model averaging every 4 steps.
+        >>>  averager = averagers.PeriodicModelAverager(model, warmup_steps=100, period=4)
+        >>>  for step in range(0, 20):
+        >>>     optimizer.zero_grad()
+        >>>     loss = loss_fn(output, labels)
+        >>>     loss.backward()
+        >>>     optimizer.step()
+        >>>     # Average parameters globally after ``optimizer.step()``.
+        >>>     # Thus, the inter-node communication only occurs periodically after ``warmup_steps``.
+        >>>     averager.average_parameters(step)
+
+    .. warning ::
+        `PeriodicModelAverager` is experimental and subject to change.
+    """
+
+    def __init__(
+        self,
+        module,
+        period,
+        warmup_steps=0,
+        process_group=None,
+    ):
+        self.module = module
+        if warmup_steps < 0:
+            raise ValueError("Arg ``warmup_steps`` must be a non-negative number.")
+        self.warmup_steps = warmup_steps
+        if period < 1:
+            raise ValueError("Arg ``period`` must be a positive value.")
+        elif period == 1:
+            warnings.warn(
+                "When period is 1, no need to use model averaging because the communication cost "
+                "of all-reducing parameters will be no less than the cost of all-reducing gradients "
+                "by DistributedDataParall in the backward pass. Therefore, only "
+                "DistributedDataParallel should be used for this case."
+            )
+        self.period = period
+        self.process_group = (
+            process_group if process_group is not None else dist.group.WORLD
+        )
+
+    def average_parameters(self, step: int):

Are you suggesting adding an internal counter that has exactly the same increment as the step in the training loop?

I think this is a good suggestion. I have seen some training loops implemented across multiple files, so the step variable may not be easy to trace -- passing step to the averager could require some unnecessary user code changes.

Let me update the API in a separate PR.
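A rough sketch of what the suggested counter-based API might look like (an assumption about the follow-up PR, not its actual contents; it also assumes the average_parameters helper from the companion utils PR is available, and the class name here is hypothetical):

import torch.distributed as dist
import torch.distributed.algorithms.model_averaging.utils as utils


class CounterBasedAverager:
    """Hypothetical variant of PeriodicModelAverager that tracks the step
    internally instead of requiring the training loop to pass it in."""

    def __init__(self, module, period, warmup_steps=0, process_group=None):
        self.module = module
        self.period = period
        self.warmup_steps = warmup_steps
        self.process_group = (
            process_group if process_group is not None else dist.group.WORLD
        )
        self.step = 0  # internal counter, incremented on every call

    def average_parameters(self):
        # During warm-up, average on every step; afterwards, every `period` steps.
        if self.step < self.warmup_steps or self.step % self.period == 0:
            utils.average_parameters(self.module, self.process_group)
        self.step += 1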

SciPioneer

comment created time in 21 minutes

pull request comment pytorch/pytorch

Types for torch.jit.script and friends

Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale. Feel free to remove the Stale label if you feel this was a mistake. Stale pull requests will automatically be closed 30 days after being marked Stale.

r-barnes

comment created time in 22 minutes

Pull request review comment pytorch/pytorch

[Model Averaging] Periodic model averager

 def test_average_parameters(self):
                for p in model.parameters():
                    self.assertEqual(p.data, torch.ones_like(p.data) * 0.5)
+        @unittest.skipIf(
+            BACKEND != "nccl" and BACKEND != "gloo",
+            "MPI backend does not support creating subgroups on CUDA devices",
+        )
+        @skip_if_lt_x_gpu(2)
+        @skip_if_odd_num_of_gpus()
+        def test_periodic_model_averager(self):
+            rank = dist.get_rank()
+            rank_to_GPU = self._init_multigpu_helper()
+            device_id = rank_to_GPU[rank][0]
+            world_size = dist.get_world_size()
+
+            model = nn.Linear(1, 5, bias=False).cuda(device_id)
+            param = next(model.parameters())
+            tensor = torch.ones_like(param.data) * ((rank + 1) // 2)
+            averager = averagers.PeriodicModelAverager(model, warmup_steps=10, period=4)
+            for step in range(0, 20):
+                # Reset the parameters at every step.
+                param.data = copy.deepcopy(tensor)
+                averager.average_parameters(step)

It's hard, unless we move the optimizer inside DDP. This is because model averaging must occur after optimizer.step(), which is outside DDP's scope. It's a hard boundary at this time.
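A schematic, single-process stand-in for the call order being described (the averager call is left commented out because it needs a real process group; the point is only that averaging sits after optimizer.step()):

import torch
import torch.nn as nn

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for step in range(20):
    inputs, labels = torch.rand(8, 1), torch.rand(8, 1)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()    # with DDP, gradients are all-reduced within each subgroup here
    optimizer.step()   # parameters are updated locally first
    # averager.average_parameters(step)  # global averaging can only happen after the update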

SciPioneer

comment created time in 23 minutes

Pull request review comment pytorch/pytorch

[Model Averaging] Periodic model averager

+import torch.distributed as dist
+import torch.distributed.algorithms.model_averaging.utils as utils
+
+
+class PeriodicModelAverager:
+    r"""
+    Averages parameters periodically or during the warm-up stage.
+
+    This can be used for running `post-local SDG <https://arxiv.org/abs/1808.07217>`_,
+    by running :class:`~torch.nn.DistributedDataParallel` (DDP)
+    using the subgroups created by :meth:`~torch.distributed.new_subgroups`.
+
+    Args:
+        module (torch.nn.Module): The module where its parameters will be averaged.
+        period (int): The number of steps per model averaging.
+                      Usually the period should be greater than ``1`` to reduce the communication cost.
+                      Otherwise, only DDP needs to be used.
+        warmup_steps (int): The number of warm-up steps. During this stage,
+                            ``period`` is viewed as 1, and the parameters are averaged at every step.
+        process_group: The process group to be used for all-reduce.
+                       If ``None``, the default process group, which
+                       is created by :func:`torch.distributed.init_process_group`,
+                       will be used. (default: ``None``)
+
+    Example::
+
+        >>>  import torch
+        >>>  import torch.distributed as dist
+        >>>  import torch.distributed .algorithms.model_averaging.averagers as averagers
+        >>>  import torch.nn as nn
+        >>>
+        >>>  dist.init_process_group("nccl", rank=rank, world_size=16)
+        >>>  torch.cuda.set_device(rank)
+        >>>  module = nn.Linear(1, 1, bias=False).to(rank)
+        >>>  subgroup, subgroups = dist.new_subgroups()
+        >>>  # Gradients are averaged by each intra-node subgroup during the backward pass.
+        >>>  model = nn.parallel.DistributedDataParallel(
+        >>>     module, device_ids=[rank], output_device=rank, process_group=subgroup

Not really. DDP only runs allreduce within each node via the subgroup, so there is no global allreduce. In contrast, the averager runs a global allreduce across nodes.

SciPioneer

comment created time in 28 minutes

issue comment pytorch/pytorch

[Mkldnn] Support Prelu operator

Hi, thanks for the info. May I also know your batch size?

xsacha

comment created time in 35 minutes

push event pytorch/pytorch

wayi

commit sha 58377fa5fa16a809807ec82e0436a16ad0c63e8d

[Model Averaging] Provide a util function for model averaging Pull Request resolved: https://github.com/pytorch/pytorch/pull/60303 The util function can be used for averaging parameters. More optimizations can be done in the future. ghstack-source-id: 131907564 Differential Revision: [D29242806](https://our.internmc.facebook.com/intern/diff/D29242806/)

view details

push time in 38 minutes

push event pytorch/pytorch

wayi

commit sha 74d652f896787631046562f7fb9fc61d7d5d5e4f

Update on "[Model Averaging] Provide a util function for model averaging" The util function can be used for averaging parameters. More optimizations can be done in the future. Differential Revision: [D29242806](https://our.internmc.facebook.com/intern/diff/D29242806/) [ghstack-poisoned]

view details

push time in 38 minutes

Pull request review comment pytorch/pytorch

[Model Averaging] Provide a util function for model averaging

+# flake8: noqa C101
+import torch
+import torch.distributed as dist
+
+
+def average_parameters(module: torch.nn.Module, process_group: dist.ProcessGroup):
+    """
+    Averages all the parameters of a given module.
+    For allreduce efficiency, all the parameters are flattened into a contiguous buffer.
+    Thus, it requires extra memory of the same size as the module's parameters.
+    """
+    group_to_use = process_group if process_group is not None else dist.group.WORLD
+
+    flat_params = torch.cat([p.data.view(-1) for p in module.parameters()])
+    flat_params /= dist.get_world_size(group_to_use)

Good question. The input of any rank outside the subgroup will be divided by -1, so the result on such a rank will be -rank.
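A hedged sketch of the guard being discussed (the all-reduce and unflatten steps below are assumptions on my part, since the helper is only partially shown in the diff above):

import torch
import torch.distributed as dist


def average_parameters(module: torch.nn.Module, process_group: dist.ProcessGroup):
    # Sketch: skip ranks that are not members of the group, where
    # dist.get_world_size() reports -1 and would flip the sign of the parameters.
    group_to_use = process_group if process_group is not None else dist.group.WORLD
    if dist._rank_not_in_group(group_to_use):
        return

    flat_params = torch.cat([p.data.view(-1) for p in module.parameters()])
    flat_params /= dist.get_world_size(group_to_use)
    dist.all_reduce(flat_params, group=group_to_use)

    # Copy the averaged values back into the module's parameters.
    offset = 0
    for p in module.parameters():
        p.data = flat_params[offset : offset + p.numel()].view_as(p)
        offset += p.numel()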

SciPioneer

comment created time in 40 minutes

Pull request review comment pytorch/pytorch

[Model Averaging] Provide a util function for model averaging

 def test_new_subgroups_overlap_not_allowed(self):
                    ranks_per_subgroup_list=[[0], [1, 2], [1, 3]]
                )
+        @unittest.skipIf(
+            BACKEND != "nccl" and BACKEND != "gloo",
+            "MPI backend does not support creating subgroups on CUDA devices",
+        )
+        @skip_if_lt_x_gpu(2)
+        def test_average_parameters(self):
+            rank = dist.get_rank()
+            rank_to_GPU = self._init_multigpu_helper()
+            device_id = rank_to_GPU[rank][0]
+
+            model = (
+                nn.Sequential(
+                    nn.Conv2d(3, 3, kernel_size=3, padding=1),
+                    nn.ReLU(),
+                    nn.Linear(1, 5, bias=False)
+                ).cuda(device_id)
+            )
+
+            # Test global model averaging
+            for p in model.parameters():
+                p.data = torch.ones_like(p.data)
+            model_averaging_utils.average_parameters(module=model, process_group=None)
+            # Every element will be the same as the input.
+            for p in model.parameters():
+                self.assertEqual(p.data, torch.ones_like(p.data))
+
+            # Test partial model averaging
+            for p in model.parameters():
+                p.data = torch.ones_like(p.data) * rank
+            group_nccl = dist.new_group(ranks=[0, 1], backend="nccl")
+            model_averaging_utils.average_parameters(module=model, process_group=group_nccl)
+            if not dist._rank_not_in_group(group_nccl):

Added

SciPioneer

comment created time in 42 minutes

issue comment pytorch/pytorch

[CI stats] sharded test skew in test1/2

Also, a question: do we really need to run all 6 combinations during CI for a PR? cc @mruberry

We should ask someone from the distributed team. I'd actually like to consider moving distributed and RPC tests to their own jobs. They're very flaky.

walterddr

comment created time in 44 minutes

pull request comment pytorch/pytorch

[WIP][ONNX]Enhance shape inference

:pill: CI failures summary and remediations

As of commit 606bdba82a (more details on the Dr. CI page and at hud.pytorch.org/pr/60335):


Commit 606bdba82a was recently pushed. Waiting for builds...


This comment was automatically generated by Dr. CI (https://code.intern.facebook.com/ci/dr-ci-info/). Follow this link to opt out of these comments for your pull requests: https://code.intern.facebook.com/ci/settings/.

Please report bugs/suggestions to the (internal) Dr. CI Users group (https://fburl.com/ujo0mikv). Click https://our.intern.facebook.com/intern/opensource/ci/regenerate_comment/482110232859626/ to manually regenerate this comment.

jiafatom

comment created time in an hour

PR opened pytorch/pytorch

[WIP][ONNX]Enhance shape inference

Fixes #{issue number}

+426 -30

0 comment

13 changed files

pr created time in an hour

issue comment pytorch/pytorch

test_dataloader.py fails to pass test with error: Can't get attribute 'RandomDataset'... on MacOS

Created a conda env with Python 3.7, followed the same steps to install PyTorch, and tried to reproduce the issue.

test_dataloader.py passes the test.

It seems to be an issue related to Python 3.8 on macOS.
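One plausible explanation consistent with this report (an assumption, not a confirmed diagnosis): Python 3.8 changed the default multiprocessing start method on macOS from fork to spawn, so a Dataset class that worker processes cannot re-import by name (for example, one defined inside a test function) fails to unpickle with "Can't get attribute 'RandomDataset'". A minimal sketch of the importable pattern:

import torch
from torch.utils.data import Dataset, DataLoader


class RandomDataset(Dataset):
    """Defined at module level so that spawn-based workers can pickle it."""

    def __len__(self):
        return 16

    def __getitem__(self, idx):
        return torch.rand(3)


if __name__ == "__main__":
    # With spawn (the macOS default on Python 3.8+), workers re-import this
    # module, so RandomDataset must be reachable by name.
    loader = DataLoader(RandomDataset(), batch_size=4, num_workers=2)
    for batch in loader:
        print(batch.shape)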

DamonDeng

comment created time in an hour

issue opened pytorch/pytorch

USE_SYSTEM_SLEEF: undefined reference to symbol 'Sleef_expd4_u10'

🐛 Bug

I'm trying to build PyTorch 1.7.1 with USE_SYSTEM_SLEEF and I'm seeing the following error:

[4267/4704] Linking CXX executable bin/vec256_test_all_types_AVX2
FAILED: bin/vec256_test_all_types_AVX2 
: && /home/t-astewart/spack/lib/spack/env/gcc/g++ -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -rdynamic -Wl,-rpath -Wl,/home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/hwloc-2.5.0-e7ygv5b2kvnvzl3bxbft2b72xzaqkciu/lib -Wl,-rpath -Wl,/home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/libevent-2.1.12-p2dxdtdk7dzoe3hifzobmiesuhssfjto/lib -Wl,-rpath -Wl,/home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/zlib-1.2.11-fegtmskef3daxdtyp3x3m6zz2fjfpxpq/lib -Wl,-rpath -Wl,/home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/openmpi-4.0.5-7qbqtxj5elk634wyigfpl4uuecfknflk/lib -L/home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/hwloc-2.5.0-e7ygv5b2kvnvzl3bxbft2b72xzaqkciu/lib -L/home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/libevent-2.1.12-p2dxdtdk7dzoe3hifzobmiesuhssfjto/lib -L/home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/zlib-1.2.11-fegtmskef3daxdtyp3x3m6zz2fjfpxpq/lib -pthread caffe2/CMakeFiles/vec256_test_all_types_AVX2.dir/__/aten/src/ATen/test/vec256_test_all_types.cpp.o -o bin/vec256_test_all_types_AVX2 -L/opt/intel/mkl/lib/intel64 -Wl,-rpath,/home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/protobuf-3.12.2-kk7635xzwbwci5fs25kan4gnlu6rqepl/lib:/home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/zlib-1.2.11-fegtmskef3daxdtyp3x3m6zz2fjfpxpq/lib:/home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/openblas-0.3.15-oqbdvprnws5d63wrdu54x2gv6l3rbbpc/lib:/opt/intel/mkl/lib/intel64:/tmp/t-astewart/spack-stage/spack-stage-py-torch-1.7.1-yufiftqx3amdddpg23zwtys6izz2ukjq/spack-src/build/lib:/usr/local/cuda/lib64:/home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/cudnn-8.0.5.39-10.1-gzf5kd5mcxqyrojkrpj3gdijiu2kvybi/lib64  lib/libgtest_main.a  -Wl,--no-as-needed,"/tmp/t-astewart/spack-stage/spack-stage-py-torch-1.7.1-yufiftqx3amdddpg23zwtys6izz2ukjq/spack-src/build/lib/libtorch.so" -Wl,--as-needed  -Wl,--no-as-needed,"/tmp/t-astewart/spack-stage/spack-stage-py-torch-1.7.1-yufiftqx3amdddpg23zwtys6izz2ukjq/spack-src/build/lib/libtorch_cpu.so" -Wl,--as-needed  /home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/protobuf-3.12.2-kk7635xzwbwci5fs25kan4gnlu6rqepl/lib/libprotobuf.so.3.12.2.0  -lpthread  /home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/zlib-1.2.11-fegtmskef3daxdtyp3x3m6zz2fjfpxpq/lib/libz.so  /home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/openblas-0.3.15-oqbdvprnws5d63wrdu54x2gv6l3rbbpc/lib/libopenblas.so  -lmkl_intel_lp64  -lmkl_gnu_thread  -lmkl_core  -fopenmp  -lpthread  -lm  /usr/lib/x86_64-linux-gnu/libdl.so  lib/libdnnl.a  -ldl  
-Wl,--no-as-needed,"/tmp/t-astewart/spack-stage/spack-stage-py-torch-1.7.1-yufiftqx3amdddpg23zwtys6izz2ukjq/spack-src/build/lib/libtorch_cuda.so" -Wl,--as-needed  lib/libc10_cuda.so  lib/libc10.so  /usr/local/cuda/lib64/libcudart.so  /usr/local/cuda/lib64/libnvToolsExt.so  /usr/local/cuda/lib64/libcufft.so  /usr/local/cuda/lib64/libcurand.so  /usr/local/cuda/lib64/libcublas.so  /home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/cudnn-8.0.5.39-10.1-gzf5kd5mcxqyrojkrpj3gdijiu2kvybi/lib64/libcudnn.so  lib/libgtest.a  -pthread && :
/usr/bin/ld: caffe2/CMakeFiles/vec256_test_all_types_AVX2.dir/__/aten/src/ATen/test/vec256_test_all_types.cpp.o: undefined reference to symbol 'Sleef_expd4_u10'
//home/t-astewart/spack/opt/spack/linux-ubuntu18.04-haswell/gcc-7.5.0/sleef-2019-07-30-iihgvx52ioaaxk5ndxrlb6as6c2i5h7j/lib/libsleef.so.3: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status

To Reproduce

Steps to reproduce the behavior:

  1. Set USE_SYSTEM_SLEEF
  2. python setup.py install

Expected behavior

I would expect to be able to build with a system installation of sleef.

Environment

  • PyTorch Version: 1.9.0
  • OS: Linux (Ubuntu 18.04)
  • How you installed PyTorch: spack
  • Build command you used: python setup.py install
  • Python version: 3.8.10
  • CUDA/cuDNN version: 10.1.243/8.0.5.39
  • GPU models and configuration: K80 (sm_37)

Additional context

created time in an hour

issue opened spyder-ide/spyder

how to install Spyder-Notebook and Spyder-Terminal in Spyder 5 standalone installation?

I cannot find instructions for installing Spyder-Notebook and Spyder-Terminal in my Spyder 5 standalone installation.

First of all, do these two plugins work in the Spyder 5 standalone installation?

If so, can someone please provide these instructions on the Spyder 5 Home Page?

How do I point conda to the Spyder 5 standalone directory to install these plugins? What is the command line syntax to do this?

Do I have to use setup.py or some other tool to install these packages into the pkgs directory?


This is a suggestion for the next release of the documentation.

Thank you.

Rich Lysakowski

created time in an hour

issue comment pytorch/pytorch

Subtensor operations like `tril/triu` to have an option to respect the strides of the input

Functionally, fixing it is as easy as replacing empty here with empty_like: https://github.com/pytorch/pytorch/blob/5824a866b72c251ad47a9c16dc652e49cfd7e234/aten/src/ATen/native/TriangularOps.cpp#L88. However, beware that at least the CUDA kernel is written in such a way that perf will be pretty bad for discontiguous tensors.
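A quick way to see the behavior in question (a small sketch; the exact strides shown assume the current allocation via empty):

import torch

x = torch.arange(16.0).reshape(4, 4).t()  # transposed view, strides (1, 4)
print(x.is_contiguous())                  # False
out = torch.tril(x)
print(out.is_contiguous())                # True: the input's strides are not respected
print(x.stride(), out.stride())           # (1, 4) vs. (4, 1)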

nikitaved

comment created time in 2 hours

issue opened spyder-ide/spyder

Visual Plugin Manager needed for Spyder

A simpler visual GUI for managing plugins is needed in Spyder... a GUI plugin for managing plugins (8^)).

Right now I have to hack the command line with conda to get information about installed plugins and it is tedious.

Perhaps add another entry under the "Tools > Preferences" menu, and/or another panel inside the Preferences dialog, as a "Plugins" item in the left-hand navigator.

This will let people see what their current configurations are for Spyder plugins. It will help for reporting on installed plugins too, because a "send to Spyder team" button could be added right there to do context-sensitive reporting.

It will also help to extend the Spyder architecture to accommodate more plugins more easily.

Please consider enhancing the plugin interface to include a better GUI for managing plugins.

Thank you.

created time in 2 hours

issue opened pytorch/pytorch

Two-element ModuleList results in error in inference_mode when jit'ed

🐛 Bug

A ModuleList with more than one module results in an error under inference_mode when jit'ed. A ModuleList with only one module works fine with inference_mode and JIT. inference_mode + ModuleList also works fine when not jit'ed.

To Reproduce

import torch

class MLP(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()

        # works with one module
        # self.layers = torch.nn.ModuleList([torch.nn.Linear(1, 1)])

        # fails with two modules
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(1, 1), torch.nn.Linear(1, 1)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer.forward(x)
        return x

model = MLP()
model = torch.jit.script(model)

# training
optimizer = torch.optim.Adam(model.parameters())
optimizer.zero_grad()
model(torch.rand(100, 1)).mean().backward()
optimizer.step()
optimizer.zero_grad()

# testing
with torch.inference_mode():
    model(torch.rand(100, 1))
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/python-tools-XWCSm5JO-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Inference tensors cannot be saved for backward. To work around you can make a clone to get a normal tensor and use it in autograd

Expected behavior

No error is thrown.
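A hedged workaround for the testing step above (an assumption, not a confirmed fix): running the scripted model under torch.no_grad() instead of torch.inference_mode() avoids creating inference tensors, so the TorchScript error does not trigger.

# continuing the repro script above
with torch.no_grad():
    model(torch.rand(100, 1))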

Environment

PyTorch version: 1.9.0+cu102 Is debug build: False CUDA used to build PyTorch: 10.2 ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.7 LTS (x86_64) GCC version: (Ubuntu 9.3.0-10ubuntu2~16.04) 9.3.0 Clang version: 3.8.0-2ubuntu4 (tags/RELEASE_380/final) CMake version: version 3.15.1 Libc version: glibc-2.23

Python version: 3.9 (64-bit runtime) Python platform: Linux-4.13.0-36-generic-x86_64-with-glibc2.23 Is CUDA available: True CUDA runtime version: 10.2.89 GPU models and configuration: GPU 0: GeForce RTX 2080 Ti Nvidia driver version: 440.64 cuDNN version: /usr0/local/cuda-9.0/lib64/libcudnn.so.7.0.5 HIP runtime version: N/A MIOpen runtime version: N/A

Versions of relevant libraries: [pip3] numpy==1.20.3 [pip3] torch==1.9.0+cu102 [pip3] torchvision==0.10.0+cu102 [conda] Could not collect

created time in 2 hours

issue closed pytorch/pytorch

Mutable add with indexing does not act as the scatter add.

🐛 Bug


To Reproduce

import torch
q = torch.zeros(5)
w = torch.rand(4)
e = torch.tensor([1,1,1,1])
q[e] += w
>>> import torch
>>> w = torch.rand(4)
>>> e = torch.tensor([1,1,1,1])
>>> q[e] += w
>>> w
tensor([0.2019, 0.6139, 0.7754, 0.0676])
>>> q
tensor([0.0000, 0.0676, 0.0000, 0.0000, 0.0000])

Expected behavior


>>> torch.scatter_add(q,0,e,w)
tensor([0.0000, 1.7265, 0.0000, 0.0000, 0.0000])
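For comparison, the accumulating behavior the report expects can also be expressed with index_put_(accumulate=True) or index_add_, both of which sum contributions for repeated indices (a small sketch, not a statement about what q[e] += w should do):

import torch

q = torch.zeros(5)
w = torch.rand(4)
e = torch.tensor([1, 1, 1, 1])

# Unlike q[e] += w, these sum all four contributions into element 1.
q.index_put_((e,), w, accumulate=True)
# equivalently: q.index_add_(0, e, w)
print(q)  # element 1 holds w.sum(); all other elements remain 0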

Environment

PyTorch version: 1.7.1 Is debug build: False CUDA used to build PyTorch: 11.0 ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Home GCC version: Could not collect Clang version: Could not collect CMake version: Could not collect Libc version: N/A

Python version: 3.8 (64-bit runtime) Python platform: Windows-10-10.0.19041-SP0 Is CUDA available: True CUDA runtime version: Could not collect GPU models and configuration: GPU 0: GeForce GTX 970 Nvidia driver version: 456.71 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A

Versions of relevant libraries: [pip3] numpy==1.19.2 [pip3] numpydoc==1.1.0 [pip3] torch==1.7.1 [pip3] torchaudio==0.7.2 [pip3] torchvision==0.2.1 [conda] blas 1.0 mkl [conda] cudatoolkit 11.0.221 h74a9793_0 [conda] mkl 2020.2 256 [conda] mkl-service 2.3.0 py38hb782905_0 [conda] mkl_fft 1.2.0 py38h45dec08_0 [conda] mkl_random 1.1.1 py38h47e9c7a_0 [conda] numpy 1.19.2 py38hadc3359_0 [conda] numpy-base 1.19.2 py38ha3acd2a_0 [conda] numpydoc 1.1.0 pyhd3eb1b0_1 [conda] pytorch 1.7.1 py3.8_cuda110_cudnn8_0 pytorch [conda] torchaudio 0.7.2 py38 pytorch [conda] torchvision 0.2.1 py_2 soumith

closed time in 2 hours

mrakgr

push event pytorch/pytorch

Yukio Siraichi

commit sha 7809494c68dd885392871e7dbc82c27ae0de3727

Port `all` kernel to structured kernels. (#59371) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59371 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104399 Pulled By: ezyang fbshipit-source-id: 18bb747b7a19d873427d52c1145ef7cede333a0e

view details

Yukio Siraichi

commit sha 519698362dd23808a093480986b0a4ba0b1044a8

Port `any` kernel to structured kernels. (#59372) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59372 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104395 Pulled By: ezyang fbshipit-source-id: 0cfde57c22ba88607945c98f28b18df7709becd0

view details

Yukio Siraichi

commit sha c078cefa7d90357bfb871096efd2685163181723

Using meta checks for unary `torch.all` and `torch.any`. (#59373) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59373 This PR makes use of the newly implemented unified `at::meta::check_reduction` for validating the inputs and configuring its `TensorIterator`. Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104398 Pulled By: ezyang fbshipit-source-id: 6771b80130c91c2f1360853127de0acebcfff183

view details

Yukio Siraichi

commit sha 6f3da4f4bf0ddecdb13b006a1bb4b7ee9cf473a4

Port `argmax` to structured kernels. (#59937) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59937 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104397 Pulled By: ezyang fbshipit-source-id: 580355cf3b4e9e5c934b4e51a16196087bcb3459

view details

Yukio Siraichi

commit sha 226d745a0bf6ba174a08b92659613f4174aa393a

Port `argmin` kernel to structured kernels. (#59938) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59938 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104396 Pulled By: ezyang fbshipit-source-id: 39c59bcc044649c1ec9c9685366c4dda87f76aa7

view details

Sam Estep

commit sha 010f4b6f2d37f46e48b6422e353dbfe6bfea3a1e

Add .isort.cfg (#60119) Summary: This adds the `.isort.cfg` file from https://github.com/pytorch/pytorch/issues/55928, but doesn't try to enforce it in CI because as that PR showed, that is currently difficult to do. We could use this to gradually sort the codebase according to this configuration (enforcing bits and pieces in CI) but I don't do that here. The advantage of including this file (even if we don't enforce it) is that it affects how certain tools work, thus encouraging a specific import style for people who happen to use those tools. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60119 Test Plan: Open `test/run_test.py` in VS Code and run the **Python Refactor: Sort Imports** command. Compare with and without this PR. Reviewed By: 1ntEgr8 Differential Revision: D29199504 Pulled By: samestep fbshipit-source-id: 83e937b0f517c60e3e7dedb6c0306173908fbbb0

view details

Alexander Golynski

commit sha ed1da5be210c31cc07b033ac0f19f3dd6366feac

PG NCCL cleanup: remove usage of completed_ in WorkNCCL copies (#59899) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59899 Test Plan: Imported from OSS Reviewed By: cbalioglu, osalpekar Differential Revision: D29080299 Pulled By: agolynski fbshipit-source-id: 9ae368f91e81f19471e0a20fc913d8e9df1b9dec

view details

Bin Bao

commit sha 96b3537e71ed1c5a2aa5af183c83dc6497ce6174

[NNC] Add a dtypeToCppString virtual method in IRPrinter (#59449) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59449 Make dtypeToCppString as a virtual method so that a child class can easily override the dtype string generation rule. This is needed as a preparation to make loop and tensor index as int64_t. Test Plan: ``` build/bin/test_tensorexpr ``` Reviewed By: H-Huang Differential Revision: D29173969 Pulled By: desertfire fbshipit-source-id: a447badba76788354da1c79f80c834c99f105776

view details

Bin Bao

commit sha 3dc8112187c5a4162581b9725695455ca959e752

[NNC] Handle int64 indices and loop bounds (#59769) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59769 Allow loop bound and tensor indice to be either int32 or int64, and avoid unnecessary cast op. Test Plan: ``` build/bin/test_tensorexpr ``` Reviewed By: H-Huang Differential Revision: D29173970 Pulled By: desertfire fbshipit-source-id: 859a876ddb1b41535b2266089aa1222884295c78

view details

Brian Hirsh

commit sha 6b5e77904f8d2477cbbff4a9c59a3479f3a0b770

Revert D29104396: Port `argmin` kernel to structured kernels. Test Plan: revert-hammer Differential Revision: D29104396 (https://github.com/pytorch/pytorch/commit/226d745a0bf6ba174a08b92659613f4174aa393a) Original commit changeset: 39c59bcc0446 fbshipit-source-id: 82de26f925a885f65572a785fa45a9980d3a974b

view details

Brian Hirsh

commit sha 873dac4b5a11ec82904a5dfc6fba6f169280e93f

Revert D29104397: Port `argmax` to structured kernels. Test Plan: revert-hammer Differential Revision: D29104397 (https://github.com/pytorch/pytorch/commit/6f3da4f4bf0ddecdb13b006a1bb4b7ee9cf473a4) Original commit changeset: 580355cf3b4e fbshipit-source-id: e51fb79329066bc1a6364cfa44a8732908a684ed

view details

Brian Hirsh

commit sha 81baa7fb0d346d0f87c3f1935019193a1025ac71

Revert D29104398: Using meta checks for unary `torch.all` and `torch.any`. Test Plan: revert-hammer Differential Revision: D29104398 (https://github.com/pytorch/pytorch/commit/c078cefa7d90357bfb871096efd2685163181723) Original commit changeset: 6771b80130c9 fbshipit-source-id: 10e5a34370113fcd2f87aea2c2e76108fa9328d8

view details

Brian Hirsh

commit sha 3ff5507fb037e489487adcc6026520c3be29f3b1

Revert D29104395: Port `any` kernel to structured kernels. Test Plan: revert-hammer Differential Revision: D29104395 (https://github.com/pytorch/pytorch/commit/519698362dd23808a093480986b0a4ba0b1044a8) Original commit changeset: 0cfde57c22ba fbshipit-source-id: ac5ebdc4b9d3aeb4c5eeab55c92ac931599d39d1

view details

Brian Hirsh

commit sha ef09428804d9b2b580f988c723b3e4cc479d03ec

Revert D29104399: Port `all` kernel to structured kernels. Test Plan: revert-hammer Differential Revision: D29104399 (https://github.com/pytorch/pytorch/commit/7809494c68dd885392871e7dbc82c27ae0de3727) Original commit changeset: 18bb747b7a19 fbshipit-source-id: f57043df5646f1e675e8a555cb4fa0e436953751

view details

Richard Zou

commit sha ebafd2aadfcf04c0918197598a063e80aa7580f7

Stop warning on .names() access in max_pool2d and max_pool2d_backward (#60059) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60059 Fixes #60053. The problem is that `.names()` always triggers the named tensor warning. To not trigger it, one has to guard it with has_names: `x.has_names() ? x.names() : DimnameList{}` This is not the first time this has happened; we should probably make it so that .names() doesn't raise a warning unless it is actually populated with names. That's a little tricky to implement so I'm leaving it for the future. Test Plan: - New test, also run `python test/test_nn.py -v -k "max_pool"` and confirm there are no warnings. Reviewed By: gchanan Differential Revision: D29152737 Pulled By: zou3519 fbshipit-source-id: 89a2fdbe6a6064a7044b5b75f7d0c58e51e57509

view details

Shen Li

commit sha bbedfd913d53d677f9128caf3b8b6ea6311fe3b3

Run an dummy rpc._all_gather in init_rpc to avoid shutdown timeout (#59801) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59801 Fixes https://github.com/pytorch/pytorch/issues/59795. The RPC calls in shutdown no longer able to finish within 5s if there is no other RPCs before `rpc.shutdown()` in that process, because agent initialization can take longer than 5s. We don't have this problem previously, because TensorPipe's backend registry used to use RPC to communicate CUDA devices in `init_rpc`. However, after #58753, `init_rpc` uses ProcessGroup to communicate devices, and hence the channels/transport could be uninitialized after `init_rpc`. Differential Revision: D29039238 D29039238 Test Plan: Imported from OSS Reviewed By: rohan-varma Pulled By: mrshenli fbshipit-source-id: 46f89b01a058a51d271ddef9084a67b220a067b7

view details

Jane Xu

commit sha 462448f07ab9f2f2909e062185832e33843431fa

Enable GHA sharding on linux (#60124) Summary: This is branch off of https://github.com/pytorch/pytorch/issues/59970 to only shard on linux so far (we're running in issues with windows gflags). This would enable sharding of tests on a few Linux jobs on GHA, allowing tts to be essentially halved. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60124 Reviewed By: zou3519 Differential Revision: D29204211 Pulled By: janeyx99 fbshipit-source-id: 1cc31d1eccd564d96e2aef14c0acae96a3f0fcd0

view details

Brian Hirsh

commit sha e2129d1c067326efba4eac53255b94af05a45b1b

beef up at::_ops API (#59115) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59115 This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. 
For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. 
**Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D28833086 Pulled By: bdhirsh fbshipit-source-id: 55f322a8378cb9a3cb6642f72aa291be381dd95b

view details

Tao Xu

commit sha 2062cafaa5ede56d63ecfc8b9edc2b69494f2247

[iOS GPU][MaskRCNN] Implement RoIAlign in Metal shaders using Sampler (#56075) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56075 Inspired by the CUDA implementation - https://fburl.com/diffusion/e90tabkj. The main difference is the way we implement bilinear interpolation. CUDA does this manually by iterating every point in each bin box. Whereas, Metal does this by calling sampler's sample function, which is a bit easier and faster. The result is almost identical to the result from CPU - P365102522. We'll do another round of refactor once we have figured out how to support custom ops on GPU. ghstack-source-id: 131720620 Test Plan: 1. Circle CI 2. Sandcastle Reviewed By: ajtulloch Differential Revision: D27485068 fbshipit-source-id: 31e831aead9d3799a3fde96e99dd677d96bd3da1

view details

Rohan Varma

commit sha acd914f03909a70631ecadde121f8a771876cd9f

Fix Pipe + DDP for unused parameters, static graph (#60118) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60118 Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since https://github.com/pytorch/pytorch/pull/55248 2) when find_unused_parameters=True, also does not results in gradient synchronization. does not work since https://github.com/pytorch/pytorch/pull/57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in https://github.com/pytorch/pytorch/pull/49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. ghstack-source-id: 131688187 Test Plan: CI Reviewed By: pritamdamania87 Differential Revision: D29167283 fbshipit-source-id: fe62310db2dc6de8519eb361b1df8ae4dfce3ab8

view details

push time in 2 hours

push eventpytorch/pytorch

Yukio Siraichi

commit sha 7809494c68dd885392871e7dbc82c27ae0de3727

Port `all` kernel to structured kernels. (#59371) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59371 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104399 Pulled By: ezyang fbshipit-source-id: 18bb747b7a19d873427d52c1145ef7cede333a0e

view details

Yukio Siraichi

commit sha 519698362dd23808a093480986b0a4ba0b1044a8

Port `any` kernel to structured kernels. (#59372) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59372 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104395 Pulled By: ezyang fbshipit-source-id: 0cfde57c22ba88607945c98f28b18df7709becd0

view details

Yukio Siraichi

commit sha c078cefa7d90357bfb871096efd2685163181723

Using meta checks for unary `torch.all` and `torch.any`. (#59373) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59373 This PR makes use of the newly implemented unified `at::meta::check_reduction` for validating the inputs and configuring its `TensorIterator`. Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104398 Pulled By: ezyang fbshipit-source-id: 6771b80130c91c2f1360853127de0acebcfff183

view details

Yukio Siraichi

commit sha 6f3da4f4bf0ddecdb13b006a1bb4b7ee9cf473a4

Port `argmax` to structured kernels. (#59937) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59937 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104397 Pulled By: ezyang fbshipit-source-id: 580355cf3b4e9e5c934b4e51a16196087bcb3459

view details

Yukio Siraichi

commit sha 226d745a0bf6ba174a08b92659613f4174aa393a

Port `argmin` kernel to structured kernels. (#59938) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59938 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104396 Pulled By: ezyang fbshipit-source-id: 39c59bcc044649c1ec9c9685366c4dda87f76aa7

view details

Sam Estep

commit sha 010f4b6f2d37f46e48b6422e353dbfe6bfea3a1e

Add .isort.cfg (#60119) Summary: This adds the `.isort.cfg` file from https://github.com/pytorch/pytorch/issues/55928, but doesn't try to enforce it in CI because as that PR showed, that is currently difficult to do. We could use this to gradually sort the codebase according to this configuration (enforcing bits and pieces in CI) but I don't do that here. The advantage of including this file (even if we don't enforce it) is that it affects how certain tools work, thus encouraging a specific import style for people who happen to use those tools. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60119 Test Plan: Open `test/run_test.py` in VS Code and run the **Python Refactor: Sort Imports** command. Compare with and without this PR. Reviewed By: 1ntEgr8 Differential Revision: D29199504 Pulled By: samestep fbshipit-source-id: 83e937b0f517c60e3e7dedb6c0306173908fbbb0

view details

Alexander Golynski

commit sha ed1da5be210c31cc07b033ac0f19f3dd6366feac

PG NCCL cleanup: remove usage of completed_ in WorkNCCL copies (#59899) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59899 Test Plan: Imported from OSS Reviewed By: cbalioglu, osalpekar Differential Revision: D29080299 Pulled By: agolynski fbshipit-source-id: 9ae368f91e81f19471e0a20fc913d8e9df1b9dec

view details

Bin Bao

commit sha 96b3537e71ed1c5a2aa5af183c83dc6497ce6174

[NNC] Add a dtypeToCppString virtual method in IRPrinter (#59449) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59449 Make dtypeToCppString as a virtual method so that a child class can easily override the dtype string generation rule. This is needed as a preparation to make loop and tensor index as int64_t. Test Plan: ``` build/bin/test_tensorexpr ``` Reviewed By: H-Huang Differential Revision: D29173969 Pulled By: desertfire fbshipit-source-id: a447badba76788354da1c79f80c834c99f105776

view details

Bin Bao

commit sha 3dc8112187c5a4162581b9725695455ca959e752

[NNC] Handle int64 indices and loop bounds (#59769) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59769 Allow loop bound and tensor indice to be either int32 or int64, and avoid unnecessary cast op. Test Plan: ``` build/bin/test_tensorexpr ``` Reviewed By: H-Huang Differential Revision: D29173970 Pulled By: desertfire fbshipit-source-id: 859a876ddb1b41535b2266089aa1222884295c78

view details

Brian Hirsh

commit sha 6b5e77904f8d2477cbbff4a9c59a3479f3a0b770

Revert D29104396: Port `argmin` kernel to structured kernels. Test Plan: revert-hammer Differential Revision: D29104396 (https://github.com/pytorch/pytorch/commit/226d745a0bf6ba174a08b92659613f4174aa393a) Original commit changeset: 39c59bcc0446 fbshipit-source-id: 82de26f925a885f65572a785fa45a9980d3a974b

view details

Brian Hirsh

commit sha 873dac4b5a11ec82904a5dfc6fba6f169280e93f

Revert D29104397: Port `argmax` to structured kernels. Test Plan: revert-hammer Differential Revision: D29104397 (https://github.com/pytorch/pytorch/commit/6f3da4f4bf0ddecdb13b006a1bb4b7ee9cf473a4) Original commit changeset: 580355cf3b4e fbshipit-source-id: e51fb79329066bc1a6364cfa44a8732908a684ed

view details

Brian Hirsh

commit sha 81baa7fb0d346d0f87c3f1935019193a1025ac71

Revert D29104398: Using meta checks for unary `torch.all` and `torch.any`. Test Plan: revert-hammer Differential Revision: D29104398 (https://github.com/pytorch/pytorch/commit/c078cefa7d90357bfb871096efd2685163181723) Original commit changeset: 6771b80130c9 fbshipit-source-id: 10e5a34370113fcd2f87aea2c2e76108fa9328d8

view details

Brian Hirsh

commit sha 3ff5507fb037e489487adcc6026520c3be29f3b1

Revert D29104395: Port `any` kernel to structured kernels. Test Plan: revert-hammer Differential Revision: D29104395 (https://github.com/pytorch/pytorch/commit/519698362dd23808a093480986b0a4ba0b1044a8) Original commit changeset: 0cfde57c22ba fbshipit-source-id: ac5ebdc4b9d3aeb4c5eeab55c92ac931599d39d1

view details

Brian Hirsh

commit sha ef09428804d9b2b580f988c723b3e4cc479d03ec

Revert D29104399: Port `all` kernel to structured kernels. Test Plan: revert-hammer Differential Revision: D29104399 (https://github.com/pytorch/pytorch/commit/7809494c68dd885392871e7dbc82c27ae0de3727) Original commit changeset: 18bb747b7a19 fbshipit-source-id: f57043df5646f1e675e8a555cb4fa0e436953751

view details

Richard Zou

commit sha ebafd2aadfcf04c0918197598a063e80aa7580f7

Stop warning on .names() access in max_pool2d and max_pool2d_backward (#60059) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60059 Fixes #60053. The problem is that `.names()` always triggers the named tensor warning. To not trigger it, one has to guard it with has_names: `x.has_names() ? x.names() : DimnameList{}` This is not the first time this has happened; we should probably make it so that .names() doesn't raise a warning unless it is actually populated with names. That's a little tricky to implement so I'm leaving it for the future. Test Plan: - New test, also run `python test/test_nn.py -v -k "max_pool"` and confirm there are no warnings. Reviewed By: gchanan Differential Revision: D29152737 Pulled By: zou3519 fbshipit-source-id: 89a2fdbe6a6064a7044b5b75f7d0c58e51e57509

view details

Shen Li

commit sha bbedfd913d53d677f9128caf3b8b6ea6311fe3b3

Run a dummy rpc._all_gather in init_rpc to avoid shutdown timeout (#59801) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59801 Fixes https://github.com/pytorch/pytorch/issues/59795. The RPC calls in shutdown are no longer able to finish within 5s if there are no other RPCs before `rpc.shutdown()` in that process, because agent initialization can take longer than 5s. We did not have this problem previously, because TensorPipe's backend registry used to use RPC to communicate CUDA devices in `init_rpc`. However, after #58753, `init_rpc` uses ProcessGroup to communicate devices, and hence the channels/transports could be uninitialized after `init_rpc`. Differential Revision: D29039238 Test Plan: Imported from OSS Reviewed By: rohan-varma Pulled By: mrshenli fbshipit-source-id: 46f89b01a058a51d271ddef9084a67b220a067b7
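As a rough, minimal sketch of the situation that used to hit the timeout (a single process that initializes RPC and immediately shuts down without issuing any RPCs of its own; the worker name, port, and world size are placeholders):

```python
import os

import torch.distributed.rpc as rpc

# Placeholder rendezvous settings for a single-process run.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

# init_rpc followed immediately by shutdown: before this change, shutdown's
# internal RPCs could hit the 5s timeout if the channels/transports were still
# uninitialized, since no user RPC had forced their setup.
rpc.init_rpc("worker0", rank=0, world_size=1)
rpc.shutdown()
```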

view details

Jane Xu

commit sha 462448f07ab9f2f2909e062185832e33843431fa

Enable GHA sharding on linux (#60124) Summary: This is a branch off of https://github.com/pytorch/pytorch/issues/59970 to only shard on Linux so far (we're running into issues with Windows gflags). This would enable sharding of tests on a few Linux jobs on GHA, allowing tts to be essentially halved. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60124 Reviewed By: zou3519 Differential Revision: D29204211 Pulled By: janeyx99 fbshipit-source-id: 1cc31d1eccd564d96e2aef14c0acae96a3f0fcd0

view details

Brian Hirsh

commit sha e2129d1c067326efba4eac53255b94af05a45b1b

beef up at::_ops API (#59115) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59115 This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. 
For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. 
**Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inlining means that we aren't generating separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D28833086 Pulled By: bdhirsh fbshipit-source-id: 55f322a8378cb9a3cb6642f72aa291be381dd95b

view details

Tao Xu

commit sha 2062cafaa5ede56d63ecfc8b9edc2b69494f2247

[iOS GPU][MaskRCNN] Implement RoIAlign in Metal shaders using Sampler (#56075) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56075 Inspired by the CUDA implementation - https://fburl.com/diffusion/e90tabkj. The main difference is the way we implement bilinear interpolation. CUDA does this manually by iterating every point in each bin box. Whereas, Metal does this by calling sampler's sample function, which is a bit easier and faster. The result is almost identical to the result from CPU - P365102522. We'll do another round of refactor once we have figured out how to support custom ops on GPU. ghstack-source-id: 131720620 Test Plan: 1. Circle CI 2. Sandcastle Reviewed By: ajtulloch Differential Revision: D27485068 fbshipit-source-id: 31e831aead9d3799a3fde96e99dd677d96bd3da1

view details

Rohan Varma

commit sha acd914f03909a70631ecadde121f8a771876cd9f

Fix Pipe + DDP for unused parameters, static graph (#60118) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60118 Pipe + DDP has a few issues: 1) with static graph, gradients are not synchronized on the first backward pass (i.e. the delay allreduce is not run); broken since https://github.com/pytorch/pytorch/pull/55248 2) when find_unused_parameters=True, gradient synchronization also does not happen; broken since https://github.com/pytorch/pytorch/pull/57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()`, which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in https://github.com/pytorch/pytorch/pull/49908. To test: all tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is that all ranks received the same model inputs, so if gradient synchronization did not occur, the grads would still be the same because the model is the same on all ranks (guaranteed by DDP). Fixed the tests to use different inputs across ranks. ghstack-source-id: 131688187 Test Plan: CI Reviewed By: pritamdamania87 Differential Revision: D29167283 fbshipit-source-id: fe62310db2dc6de8519eb361b1df8ae4dfce3ab8

view details

push time in 2 hours

push event pytorch/pytorch

Yukio Siraichi

commit sha 7809494c68dd885392871e7dbc82c27ae0de3727

Port `all` kernel to structured kernels. (#59371) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59371 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104399 Pulled By: ezyang fbshipit-source-id: 18bb747b7a19d873427d52c1145ef7cede333a0e

view details

Yukio Siraichi

commit sha 519698362dd23808a093480986b0a4ba0b1044a8

Port `any` kernel to structured kernels. (#59372) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59372 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104395 Pulled By: ezyang fbshipit-source-id: 0cfde57c22ba88607945c98f28b18df7709becd0

view details

Yukio Siraichi

commit sha c078cefa7d90357bfb871096efd2685163181723

Using meta checks for unary `torch.all` and `torch.any`. (#59373) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59373 This PR makes use of the newly implemented unified `at::meta::check_reduction` for validating the inputs and configuring its `TensorIterator`. Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104398 Pulled By: ezyang fbshipit-source-id: 6771b80130c91c2f1360853127de0acebcfff183

view details

Yukio Siraichi

commit sha 6f3da4f4bf0ddecdb13b006a1bb4b7ee9cf473a4

Port `argmax` to structured kernels. (#59937) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59937 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104397 Pulled By: ezyang fbshipit-source-id: 580355cf3b4e9e5c934b4e51a16196087bcb3459

view details

Yukio Siraichi

commit sha 226d745a0bf6ba174a08b92659613f4174aa393a

Port `argmin` kernel to structured kernels. (#59938) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59938 Tracking issue: #55070 Test Plan: Imported from OSS Reviewed By: soulitzer Differential Revision: D29104396 Pulled By: ezyang fbshipit-source-id: 39c59bcc044649c1ec9c9685366c4dda87f76aa7

view details

Sam Estep

commit sha 010f4b6f2d37f46e48b6422e353dbfe6bfea3a1e

Add .isort.cfg (#60119) Summary: This adds the `.isort.cfg` file from https://github.com/pytorch/pytorch/issues/55928, but doesn't try to enforce it in CI because as that PR showed, that is currently difficult to do. We could use this to gradually sort the codebase according to this configuration (enforcing bits and pieces in CI) but I don't do that here. The advantage of including this file (even if we don't enforce it) is that it affects how certain tools work, thus encouraging a specific import style for people who happen to use those tools. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60119 Test Plan: Open `test/run_test.py` in VS Code and run the **Python Refactor: Sort Imports** command. Compare with and without this PR. Reviewed By: 1ntEgr8 Differential Revision: D29199504 Pulled By: samestep fbshipit-source-id: 83e937b0f517c60e3e7dedb6c0306173908fbbb0
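As a rough illustration of the import grouping isort encourages (the actual settings in `.isort.cfg` are not reproduced here), an import block before and after sorting might look like this:

```python
# Before sorting: standard library and third-party imports interleaved.
#   import torch
#   import sys
#   import os
#   from collections import OrderedDict

# After isort: standard library first, third-party second, each group alphabetized.
import os
import sys
from collections import OrderedDict

import torch
```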

view details

Alexander Golynski

commit sha ed1da5be210c31cc07b033ac0f19f3dd6366feac

PG NCCL cleanup: remove usage of completed_ in WorkNCCL copies (#59899) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59899 Test Plan: Imported from OSS Reviewed By: cbalioglu, osalpekar Differential Revision: D29080299 Pulled By: agolynski fbshipit-source-id: 9ae368f91e81f19471e0a20fc913d8e9df1b9dec

view details


push time in 2 hours

Pull request review comment pytorch/pytorch

[Model Averaging] Periodic model averager

 def test_average_parameters(self):
                for p in model.parameters():
                    self.assertEqual(p.data, torch.ones_like(p.data) * 0.5)
+        @unittest.skipIf(
+            BACKEND != "nccl" and BACKEND != "gloo",
+            "MPI backend does not support creating subgroups on CUDA devices",
+        )
+        @skip_if_lt_x_gpu(2)
+        @skip_if_odd_num_of_gpus()
+        def test_periodic_model_averager(self):
+            rank = dist.get_rank()
+            rank_to_GPU = self._init_multigpu_helper()
+            device_id = rank_to_GPU[rank][0]
+            world_size = dist.get_world_size()
+
+            model = nn.Linear(1, 5, bias=False).cuda(device_id)
+            param = next(model.parameters())
+            tensor = torch.ones_like(param.data) * ((rank + 1) // 2)
+            averager = averagers.PeriodicModelAverager(model, warmup_steps=10, period=4)
+            for step in range(0, 20):
+                # Reset the parameters at every step.
+                param.data = copy.deepcopy(tensor)
+                averager.average_parameters(step)
+                if step < 10 or step % 4 == 0:
+                    # After the model averaging is 0.5
+                    self.assertEqual(param.data, torch.ones_like(param.data) * 0.5)

Does it work for larger groups?

If world_size = 4, then tensor will be 0.5, 1, 1.5, 2 on respective ranks and the average would be 1.25?
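To make the arithmetic concrete, here is a quick sketch using the initializer as written in the diff, `(rank + 1) // 2` (integer division); the figures in the comment correspond to reading it as true division, but either way the expected value changes with the group size:

```python
# Per-rank starting values and their group average, per the diff's initializer.
for world_size in (2, 4):
    values = [(rank + 1) // 2 for rank in range(world_size)]
    print(world_size, values, sum(values) / world_size)

# world_size=2 -> [0, 1], average 0.5 (matches the asserted 0.5)
# world_size=4 -> [0, 1, 1, 2], average 1.0 (no longer 0.5)
```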

SciPioneer

comment created time in 2 hours

Pull request review comment pytorch/pytorch

[Model Averaging] Periodic model averager

 def test_average_parameters(self):
                for p in model.parameters():
                    self.assertEqual(p.data, torch.ones_like(p.data) * 0.5)
+        @unittest.skipIf(
+            BACKEND != "nccl" and BACKEND != "gloo",
+            "MPI backend does not support creating subgroups on CUDA devices",
+        )
+        @skip_if_lt_x_gpu(2)
+        @skip_if_odd_num_of_gpus()
+        def test_periodic_model_averager(self):
+            rank = dist.get_rank()
+            rank_to_GPU = self._init_multigpu_helper()
+            device_id = rank_to_GPU[rank][0]
+            world_size = dist.get_world_size()
+
+            model = nn.Linear(1, 5, bias=False).cuda(device_id)
+            param = next(model.parameters())

any reason we check only one param vs the entire model.parameters?
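A minimal, self-contained sketch of what the comment seems to suggest, iterating over every parameter instead of only the first one (the values here are illustrative, not taken from the PR):

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 5, bias=False)
with torch.no_grad():
    for p in model.parameters():
        p.fill_(0.5)  # stand-in for the expected post-averaging value

# Check every parameter, not just next(model.parameters()).
for p in model.parameters():
    assert torch.equal(p.data, torch.ones_like(p.data) * 0.5)
```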

SciPioneer

comment created time in 2 hours