Script freezes with no output when using DistributedDataParallel

🐛 Bug


I was trying to evaluate the performance of the system with static data but different models, batch sizes, and AMP optimization levels. However, when using DDP, the script freezes at a random point: GPU usage is stuck at 100% and the process is also using 100% CPU. I tried this on three different system configurations and the error persists. I even tried the nightly version of PyTorch, but that didn't help.

If I switch from the NCCL backend to the Gloo backend, the code works, but it is very slow. I suspect the problem lies with NCCL somehow.
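For reference, here is roughly how the process group is set up (a minimal sketch; the --backend switch is only illustrative, the real code is in the gist linked under "To Reproduce"):

import argparse
import torch
import torch.distributed as dist

# Sketch of the DDP setup: --local_rank is supplied by torch.distributed.launch,
# --backend only illustrates the NCCL/Gloo switch described above.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
parser.add_argument("--backend", choices=["nccl", "gloo"], default="nccl")
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
# With "nccl" the run eventually hangs; with "gloo" it completes, only much more slowly.
dist.init_process_group(backend=args.backend, init_method="env://")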

Here is the NCCL log I captured (with NCCL_DEBUG=INFO):

sas-desktop:26887:26887 [1] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
sas-desktop:26887:26910 [1] NCCL INFO Setting affinity for GPU 1 to 0fff
sas-desktop:26887:26910 [1] NCCL INFO comm 0x7fa4080022b0 rank 1 nranks 2 cudaDev 1 nvmlDev 1
sas-desktop:26886:26902 [0] NCCL INFO Channel 00 :    0   1
sas-desktop:26887:26910 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
sas-desktop:26886:26902 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
sas-desktop:26886:26902 [0] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
sas-desktop:26886:26902 [0] NCCL INFO comm 0x7fbb380022b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
sas-desktop:26886:26886 [0] NCCL INFO Launch mode Parallel
sas-desktop:26887:26910 [1] NCCL INFO comm 0x7fa4080022b0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 - Init COMPLETE
Time taken: 5.01 secs
sas-desktop:26886:26886 [0] NCCL INFO Destroyed comm 0x7fbb380022b0 rank 0
sas-desktop:26887:26887 [1] NCCL INFO Destroyed comm 0x7fa4080022b0 rank 1
Using batch size: 3
sas-desktop:26921:26921 [0] NCCL INFO NET/Socket : Using [0]eno1:10.110.39.89<0>
sas-desktop:26921:26921 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

sas-desktop:26921:26921 [0] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
NCCL version 2.4.2+cuda10.0
sas-desktop:26921:26949 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
sas-desktop:26921:26949 [0] NCCL INFO comm 0x7f23100022b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0
sas-desktop:26922:26922 [1] NCCL INFO NET/Socket : Using [0]eno1:10.110.39.89<0>
sas-desktop:26922:26922 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

sas-desktop:26922:26922 [1] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
sas-desktop:26922:26952 [1] NCCL INFO Setting affinity for GPU 1 to 0fff
sas-desktop:26922:26952 [1] NCCL INFO comm 0x7fc3940022b0 rank 1 nranks 2 cudaDev 1 nvmlDev 1
sas-desktop:26921:26949 [0] NCCL INFO Channel 00 :    0   1
sas-desktop:26922:26952 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
sas-desktop:26921:26949 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
sas-desktop:26921:26949 [0] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
sas-desktop:26921:26949 [0] NCCL INFO comm 0x7f23100022b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
sas-desktop:26921:26921 [0] NCCL INFO Launch mode Parallel
sas-desktop:26922:26952 [1] NCCL INFO comm 0x7fc3940022b0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 - Init COMPLETE
Time taken: 5.01 secs
sas-desktop:26921:26921 [0] NCCL INFO Destroyed comm 0x7f23100022b0 rank 0
sas-desktop:26922:26922 [1] NCCL INFO Destroyed comm 0x7fc3940022b0 rank 1
Using batch size: 4
sas-desktop:26967:26967 [0] NCCL INFO NET/Socket : Using [0]eno1:10.110.39.89<0>
sas-desktop:26967:26967 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

sas-desktop:26967:26967 [0] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
NCCL version 2.4.2+cuda10.0
sas-desktop:26967:26984 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
sas-desktop:26967:26984 [0] NCCL INFO comm 0x7f3e4c0022b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0
sas-desktop:26968:26968 [1] NCCL INFO NET/Socket : Using [0]eno1:10.110.39.89<0>
sas-desktop:26968:26968 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

sas-desktop:26968:26968 [1] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
sas-desktop:26968:26985 [1] NCCL INFO Setting affinity for GPU 1 to 0fff
sas-desktop:26968:26985 [1] NCCL INFO comm 0x7f28dc0022b0 rank 1 nranks 2 cudaDev 1 nvmlDev 1
sas-desktop:26967:26984 [0] NCCL INFO Channel 00 :    0   1
sas-desktop:26967:26984 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
sas-desktop:26968:26985 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
sas-desktop:26967:26984 [0] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
sas-desktop:26967:26984 [0] NCCL INFO comm 0x7f3e4c0022b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
sas-desktop:26967:26967 [0] NCCL INFO Launch mode Parallel
sas-desktop:26968:26985 [1] NCCL INFO comm 0x7f28dc0022b0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 - Init COMPLETE
Time taken: 5.01 secs
sas-desktop:26967:26967 [0] NCCL INFO Destroyed comm 0x7f3e4c0022b0 rank 0

I killed the script with Ctrl+C after an hour, as it was stuck at this point. I also captured a backtrace with GDB (a Python-level alternative is sketched right after the backtrace):

#0  0x00007ffd20ec5b62 in clock_gettime ()
#1  0x00007f243f141ea6 in __GI___clock_gettime (clock_id=4, tp=0x7ffd20e9f1e0) at ../sysdeps/unix/clock_gettime.c:115
#2  0x00007f23e1d2aa2e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f23e1dce807 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f23e1d130ec in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007f23e1d13249 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007f23e1c1f051 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007f23e1d7dbc2 in cuStreamSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007f23e465eb60 in ?? () from /home/ssiddiqui/anaconda3/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0
#9  0x00007f23e469d61d in cudaStreamSynchronize () from /home/ssiddiqui/anaconda3/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0
#10 0x00007f23f7625e6b in commDestroy (comm=0x7f23940022b0) at init.cu:1141
#11 ncclCommDestroy (comm=0x7f23940022b0) at init.cu:1158
#12 0x00007f242f0bcadb in std::_Sp_counted_ptr_inplace<c10d::NCCLComm, std::allocator<c10d::NCCLComm>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /home/ssiddiqui/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#13 0x00007f242f0bd11e in std::_Hashtable<std::string, std::pair<std::string const, std::vector<std::shared_ptr<c10d::NCCLComm>, std::allocator<std::shared_ptr<c10d::NCCLComm> > > >, std::allocator<std::pair<std::string const, std::vector<std::shared_ptr<c10d::NCCLComm>, std::allocator<std::shared_ptr<c10d::NCCLComm> > > > >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() ()
   from /home/ssiddiqui/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#14 0x00007f242f0b5955 in c10d::ProcessGroupNCCL::~ProcessGroupNCCL() () from /home/ssiddiqui/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#15 0x00007f242f0b5a79 in c10d::ProcessGroupNCCL::~ProcessGroupNCCL() () from /home/ssiddiqui/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#16 0x00007f242ea94d82 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /home/ssiddiqui/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#17 0x00007f242f02b50b in pybind11::class_<c10d::ProcessGroupNCCL, std::shared_ptr<c10d::ProcessGroupNCCL> >::dealloc(pybind11::detail::value_and_holder&) ()
   from /home/ssiddiqui/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#18 0x00007f242eaa5e67 in pybind11::detail::clear_instance(_object*) () from /home/ssiddiqui/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#19 0x00007f242eaa60be in pybind11_object_dealloc () from /home/ssiddiqui/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#20 0x000056111beb50ab in free_keys_object (keys=<optimized out>) at /tmp/build/80754af9/python_1553721932202/work/Objects/dictobject.c:559
#21 PyDict_Clear () at /tmp/build/80754af9/python_1553721932202/work/Objects/dictobject.c:1644
#22 0x000056111beb517a in dict_tp_clear (op=<optimized out>) at /tmp/build/80754af9/python_1553721932202/work/Objects/dictobject.c:2997
#23 0x000056111bed44a8 in delete_garbage (old=<optimized out>, collectable=<optimized out>) at /tmp/build/80754af9/python_1553721932202/work/Modules/gcmodule.c:761
#24 collect () at /tmp/build/80754af9/python_1553721932202/work/Modules/gcmodule.c:916
#25 0x000056111bfa21da in _PyGC_CollectNoFail () at /tmp/build/80754af9/python_1553721932202/work/Modules/gcmodule.c:1605
#26 0x000056111bf65650 in PyImport_Cleanup () at /tmp/build/80754af9/python_1553721932202/work/Python/import.c:579
#27 0x000056111bfd88a7 in Py_FinalizeEx () at /tmp/build/80754af9/python_1553721932202/work/Python/pylifecycle.c:1195
#28 0x000056111bff0c63 in pymain_main () at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:3040
#29 0x000056111bff0f7c in _Py_UnixMain () at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:3073
#30 0x00007f243f032b97 in __libc_start_main (main=0x56111beaced0 <main>, argc=9, argv=0x7ffd20e9fcf8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd20e9fce8)
    at ../csu/libc-start.c:310
#31 0x000056111bf96122 in _start () at ../sysdeps/x86_64/elf/start.S:103
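For the record, the Python-side stacks can also be dumped without GDB by adding something like this near the top of the script (just a sketch using the standard faulthandler module, not part of the gist):

import faulthandler
import signal

# Once registered, sending SIGUSR1 to a hung rank (kill -USR1 <pid>) prints
# the Python traceback of every thread to stderr.
faulthandler.register(signal.SIGUSR1, all_threads=True)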

To Reproduce

Steps to reproduce the behavior:

  1. Download the script: https://gist.github.com/shoaibahmed/1b4bdaf5fdadfc80427b90a32eaa3053
  2. Evaluate it with a simple shell loop as follows:
for bs in {1..30}; do
    echo "Using batch size: $bs"
    NCCL_DEBUG=INFO python3 -m torch.distributed.launch --nproc_per_node=2 ./benchmark.py --mname=resnet18 --numgpus=2 --batchsize=$bs --amp=O4 --distributed
done

The error is non-deterministic, so it might not show up on the first run. Please run the loop repeatedly if you are unable to reproduce it.


Expected behavior

The script should run to completion. Instead, it gets stuck at a random point: one of the processes has been closed by the system, while the other keeps running, waiting for its peer. I tried both APEX DDP and PyTorch DDP and the error persists.
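For clarity, the two wrappers I tried look roughly like this (a sketch; the wrap_model helper and the use_apex_ddp switch are only illustrative, the real wrapping is in the gist):

import torch
from torch.nn.parallel import DistributedDataParallel as TorchDDP
from apex.parallel import DistributedDataParallel as ApexDDP

def wrap_model(model, local_rank, use_apex_ddp=False):
    # Both flavours were tried; the hang occurs with either one under NCCL.
    model = model.cuda(local_rank)
    if use_apex_ddp:
        # APEX DDP uses the current CUDA device set via torch.cuda.set_device().
        return ApexDDP(model)
    # PyTorch DDP pinned to one GPU per process.
    return TorchDDP(model, device_ids=[local_rank], output_device=local_rank)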

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: TITAN Xp

Nvidia driver version: 418.56
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.1

Versions of relevant libraries:
[pip] numpy==1.16.2
[pip] numpydoc==0.8.0
[pip] torch==1.1.0
[pip] torch-encoding==1.0.1
[pip] torchgpipe==0.0.2
[pip] torchsummary==1.5.1
[pip] torchtrainers==0.0
[pip] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2019.3 199
[conda] mkl-service 1.1.2 py37he904b0f_5
[conda] mkl_fft 1.0.10 py37ha843d7b_0
[conda] mkl_random 1.0.2 py37hd81dba3_0
[conda] torch 1.1.0 pypi_0 pypi
[conda] torch-encoding 1.0.1 pypi_0 pypi
[conda] torchgpipe 0.0.2 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchtrainers 0.0 pypi_0 pypi
[conda] torchvision 0.3.0 pypi_0 pypi


Answer from shoaibahmed (issue author):

@pietern I double-checked: there is no desynchronization between the processes. If there were any, the Gloo backend would also fail; however, it executes without any issues. Here is the main training loop from the code; I don't see any point where the ranks could desynchronize. If you are sure that this is the case, could you please look at the code and point it out?

# Imports used by this excerpt (they live at the top of benchmark.py).
import time

import torch
from apex import amp

# model, optimizer, criterion, xs, targets, start, maxtrials, mintrials,
# mintime, inference and amp_level are all set up before this loop.
for trial in range(maxtrials):
    if inference:
        # Forward pass only; no gradients needed for the inference benchmark.
        with torch.no_grad():
            ys = model(xs)
    else:
        optimizer.zero_grad()
        ys = model(xs)
        loss = criterion(ys, targets)
        if amp_level is not None:
            # APEX AMP: scale the loss before the backward pass.
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
        else:
            loss.backward()
        optimizer.step()

    finish = time.time()

    # Stop once both the minimum wall time and the minimum number of trials are reached.
    if finish - start >= mintime and trial >= mintrials:
        break
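Since the exit test is based on per-rank wall-clock time, one extra check that could rule out divergence there is to make the break decision collective (a hypothetical helper, not in the gist):

import torch
import torch.distributed as dist

def should_stop(start, finish, mintime, trial, mintrials):
    # Hypothetical helper: each rank evaluates its local exit condition and the
    # ranks then agree via all_reduce, so no rank leaves the loop while its
    # peer is still issuing collectives.
    local_stop = float(finish - start >= mintime and trial >= mintrials)
    flag = torch.tensor([local_stop], device="cuda")
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return flag.item() > 0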