
train_tacotron.py: Random CUBLAS_STATUS_INTERNAL_ERROR

Occasionally when training Tacotron (train_tacotron.py), CUDA throws an error and kills the training run.

| Epoch: 167/1630 (15/45) | Loss: 0.3459 | 1.1 steps/s | Step: 284k |
Traceback (most recent call last):
  File "train_tacotron.py", line 204, in <module>
    main()
  File "train_tacotron.py", line 100, in main
    tts_train_loop(paths, model, optimizer, train_set, lr, training_steps, attn_example)
  File "train_tacotron.py", line 144, in tts_train_loop
    loss.backward()
  File "C:\Python37\lib\site-packages\torch\tensor.py", line 227, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Python37\lib\site-packages\torch\autograd\__init__.py", line 138, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I don't know why this happens; it seems almost random. Sometimes it happens 12 hours after starting, other times only 15 minutes in.
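
For what it's worth, cuBLAS errors like this are raised asynchronously, so the traceback often points at loss.backward() rather than the op that actually failed. A minimal diagnostic sketch (my own suggestion, not something from the repo) is to force synchronous kernel launches and run a quick cuBLAS sanity check before training:

# Sketch: CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the
# failing call is reported at its real call site. It must be set before the
# first CUDA operation (exporting it in the shell also works).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def cublas_sanity_check():
    # A trivial GPU matmul exercises the same cublasSgemm path that fails above.
    a = torch.randn(256, 256, device="cuda")
    b = torch.randn(256, 256, device="cuda")
    return (a @ b).sum().item()

if __name__ == "__main__":
    print(cublas_sanity_check())

Note that synchronous launches slow training down, so it is only worth leaving this on while trying to catch the error.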

fatchord/WaveRNN

Answer from danlyth:

Nice, that's a good solution. And yeah, I found out this morning that it picks up from where it left off very well. How many epochs did you leave it to train for? I'm on 100k so far and will probably let it run until close to a million I guess.
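
In case it helps anyone else landing here: since train_tacotron.py checkpoints regularly and resumes cleanly, one workaround is a supervisor loop that simply relaunches training whenever it crashes. This is only a sketch of the idea being discussed (the actual script isn't shown in this thread) and assumes it is run from the WaveRNN repo root:

# Sketch of a restart wrapper: rerun train_tacotron.py after a crash and
# rely on its checkpointing to pick up where it left off.
import subprocess
import sys
import time

while True:
    result = subprocess.run([sys.executable, "train_tacotron.py"])
    if result.returncode == 0:
        break  # training finished normally
    print(f"train_tacotron.py exited with code {result.returncode}; restarting in 30 s",
          file=sys.stderr)
    time.sleep(30)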
