Intermittent "runtimeerror: Cuda Out Of Memory" Error In Google Colab Fine Tuning Bert Base Cased With Transformers And Pytorch
Solution 1:
I was facing the same problem. Transformers are extremely memory intensive, so there is quite a high probability that we will run out of memory or hit the runtime limit while training larger models or training for more epochs.
There are some promising, well-known out-of-the-box strategies to solve these problems, and each strategy comes with its own benefits.
- Dynamic Padding and Uniform Length Batching (Smart Batching)
- Gradient Accumulation
- Freeze Embedding
- Numeric Precision Reduction
- Gradient Checkpointing (see the combined sketch after this list)
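Several of these strategies (gradient accumulation, numeric precision reduction via fp16, gradient checkpointing, and freezing the embeddings) can be switched on with only a few lines when training with the Hugging Face Trainer. The following is just a minimal sketch: the `train_dataset` variable and the `num_labels=2` setting are assumptions for illustration, and the `model.bert.embeddings` attribute path applies specifically to the BERT classification classes.

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2  # num_labels is an assumption for this sketch
)

# Freeze Embedding: exclude the large embedding matrices from the backward pass.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# Gradient Checkpointing: trade extra compute for lower activation memory.
model.gradient_checkpointing_enable()

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,   # small physical batch that fits in GPU memory
    gradient_accumulation_steps=4,   # Gradient Accumulation: effective batch size of 32
    fp16=True,                       # Numeric Precision Reduction: mixed-precision training
    num_train_epochs=3,
)

# `train_dataset` is assumed to be an already tokenized dataset.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```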
Training neural networks on a batch of sequences requires them to have exactly the same length so the batch can be built as a single matrix. Because real-life NLP datasets always contain texts of variable lengths, we often need to make some sequences shorter by truncating them and others longer by appending a repeated fake token called the “pad” token.
Because the pad token doesn’t represent a real word, we erase its signal before computing the loss by multiplying it by 0 through the “attention mask” matrix for each sample, which identifies the [PAD] tokens and tells the Transformer to ignore them.
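For example, the tokenizer returns the attention mask alongside the padded token ids. A minimal sketch using `bert-base-cased` (the sample sentences are arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

batch = tokenizer(
    ["A short sentence.",
     "A noticeably longer sentence that forces the first one to be padded."],
    padding=True,          # pad the shorter sequence with [PAD] tokens up to the longest one
    return_tensors="pt",
)

print(batch["input_ids"])       # [PAD] positions hold the pad token id (0 for bert-base-cased)
print(batch["attention_mask"])  # 1 for real tokens, 0 for [PAD] tokens the model should ignore
```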
Dynamic Padding: Here we limit the number of added pad tokens so that each mini batch is padded only to the length of its longest sequence, instead of to a fixed value set for the whole train set. Because the number of added tokens changes across mini batches, we call it "dynamic" padding.
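With the transformers library, dynamic padding is typically done through `DataCollatorWithPadding`, which pads each mini batch only up to its longest sequence. A minimal sketch; `tokenized_train` is an assumed, already tokenized dataset with no padding applied yet:

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Pads each mini batch only to the length of its longest sequence.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# `tokenized_train` is assumed to be a dataset of dicts ("input_ids", "attention_mask", ...)
# produced by the tokenizer without padding at tokenization time.
train_loader = DataLoader(
    tokenized_train,
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator,
)
```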
Uniform Length Batching:
We push the logic further by generating batches made of sequences of similar length, so we avoid the extreme case where most sequences in a mini batch are short but we must add lots of pad tokens to each of them because one sequence in the same mini batch is very long.
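A simple way to approximate this is to sort examples by token length, cut the sorted list into mini batches, and then shuffle the order of the batches (transformers also exposes this idea via `group_by_length=True` in `TrainingArguments`). The sketch below reuses the assumed `tokenized_train` dataset and its `input_ids` field from the previous sketch:

```python
import random

def uniform_length_batches(dataset, batch_size):
    """Group examples of similar length into the same mini batch (smart batching)."""
    # Sort example indices by sequence length so neighbours have similar lengths.
    indices = sorted(range(len(dataset)), key=lambda i: len(dataset[i]["input_ids"]))

    # Slice the sorted indices into mini batches of roughly uniform length.
    batches = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]

    # Shuffle the batch order (not the contents) so training still sees varied data.
    random.shuffle(batches)
    return batches

for batch_indices in uniform_length_batches(tokenized_train, batch_size=16):
    batch_examples = [tokenized_train[i] for i in batch_indices]
    # Each batch can now be padded only to its own longest sequence,
    # e.g. with the DataCollatorWithPadding from the previous sketch.
```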