train_tacotron.py: Random CUBLAS_STATUS_INTERNAL_ERROR
Occasionally when training Tacotron (train_tacotron.py), CUDA throws an error and kills the training.
| Epoch: 167/1630 (15/45) | Loss: 0.3459 | 1.1 steps/s | Step: 284k |
Traceback (most recent call last):
  File "train_tacotron.py", line 204, in <module>
    main()
  File "train_tacotron.py", line 100, in main
    tts_train_loop(paths, model, optimizer, train_set, lr, training_steps, attn_example)
  File "train_tacotron.py", line 144, in tts_train_loop
    loss.backward()
  File "C:\Python37\lib\site-packages\torch\tensor.py", line 227, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Python37\lib\site-packages\torch\autograd\__init__.py", line 138, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
I don't know why this happens; it seems almost random. Sometimes it happens 12 hours after starting, sometimes 15 minutes after starting.
serg06
I'm seeing the same thing. Did you find a fix?
I didn't find a fix, but I did find a workaround: automatically restarting after a crash.
train.bat:

@echo off
:loop
rem Run training; when the CUDA error kills it, fall through to the restart below
python train_tacotron.py
echo Crash detected, restarting...
rem Wait 5 seconds before restarting (/nobreak ignores key presses)
timeout /t 5 /nobreak
goto loop
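If you're not on Windows, here's a rough cross-platform equivalent as a Python wrapper. This is just a sketch: it assumes train_tacotron.py exits with a non-zero return code when the CUDA error kills it, and the script name restart_on_crash.py is only a placeholder.

restart_on_crash.py:

import subprocess
import sys
import time

while True:
    # Run the training script with the same Python interpreter
    result = subprocess.run([sys.executable, "train_tacotron.py"])
    if result.returncode == 0:
        break  # training finished normally, stop looping
    print(f"Crash detected (exit code {result.returncode}), restarting in 5 seconds...")
    time.sleep(5)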
Also, are you able to pick up training from where you left off?
Yep, it always restarts from the latest step for me.
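For anyone wondering why the restart picks up where it left off: the training script periodically saves a checkpoint and reloads the latest one on startup. A minimal sketch of that pattern is below; the checkpoint path and dictionary keys here are made up, and the actual restore logic in train_tacotron.py may differ.

import os
import torch

CHECKPOINT = "checkpoints/latest_weights.pyt"  # hypothetical path, not the repo's actual layout

def save_checkpoint(model, optimizer, step):
    # Called periodically inside the training loop
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, CHECKPOINT)

def load_checkpoint(model, optimizer):
    # On startup, resume from the latest checkpoint if one exists
    if os.path.exists(CHECKPOINT):
        state = torch.load(CHECKPOINT)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]
    return 0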