
Question: train_tacotron.py: Random CUBLAS_STATUS_INTERNAL_ERROR

Occasionally when training tacotron (train_tacotron.py), CUDA throws an error and kills the training.

| Epoch: 167/1630 (15/45) | Loss: 0.3459 | 1.1 steps/s | Step: 284k |
Traceback (most recent call last):
  File "train_tacotron.py", line 204, in <module>
    main()
  File "train_tacotron.py", line 100, in main
    tts_train_loop(paths, model, optimizer, train_set, lr, training_steps, attn_example)
  File "train_tacotron.py", line 144, in tts_train_loop
    loss.backward()
  File "C:\Python37\lib\site-packages\torch\tensor.py", line 227, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Python37\lib\site-packages\torch\autograd\__init__.py", line 138, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I don't know why this happens; it seems almost random. Sometimes it happens 12 hours after starting, sometimes only 15 minutes in.

Repo: fatchord/WaveRNN

Answers (serg06)

I'm seeing the same thing. Did you find a fix?

I didn't find a fix, but I did find a workaround: automatically restarting training after a crash.

train.bat:

@echo off
:loop
rem Run training; when it crashes (e.g. on the cuBLAS error), wait 5 seconds and relaunch
python train_tacotron.py
echo Crash detected, restarting...
timeout /t 5 /nobreak
goto loop
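
If you're not on Windows, a rough Python equivalent of the same loop would look like this (just a sketch, assuming train_tacotron.py is in the working directory and that a crash exits with a non-zero code):

import subprocess
import sys
import time

# Re-launch training until it exits cleanly (exit code 0); any crash triggers a restart.
while True:
    result = subprocess.run([sys.executable, "train_tacotron.py"])
    if result.returncode == 0:
        break
    print(f"Crash detected (exit code {result.returncode}), restarting in 5 s...")
    time.sleep(5)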

Also, are you able to pick up training from where you left off?

Yep, it always restarts from the latest step for me.
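
That fits the usual PyTorch pattern of saving a checkpoint during training and reloading the latest one on startup. I haven't dug into exactly how this repo does it, but in general it looks something like the sketch below (the path and dict keys are made up for illustration, not taken from the repo):

import os
import torch

CHECKPOINT_PATH = "checkpoints/latest.pt"  # hypothetical location, not the repo's actual layout

def save_checkpoint(model, optimizer, step):
    # Store model/optimizer state plus the step counter so training can resume mid-run.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, CHECKPOINT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the latest checkpoint if one exists; otherwise start from step 0.
    if not os.path.exists(CHECKPOINT_PATH):
        return 0
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]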

useful!