HN.zip

Bugs in LLM Training – Gradient Accumulation Fix

43 points by apsec112 - 5 comments
imjonse 5 mins ago
Same issue described on HF: https://huggingface.co/blog/gradient_accumulation

It also highlights a main disadvantage of the Transformers codebase's copy-paste approach to models: this fix needs to be applied to every single model separately.

xcodevn 5 mins ago
Looked at from a different point of view, this is a feature, not a bug. Without the fix, every example has equal weight, while with the fix, every token has equal weight.
oergiR 5 mins ago
That makes it sound like it’s a choice, which it isn’t really. The way to look at it is from a probabilistic perspective: with the fix, you maximise the probability of the data. Without the fix, you fairly arbitrarily raise some probabilities to a power greater than one, and some to a power less than one.
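To spell the "powers" out, here is a rough derivation under notation that is assumed here and not taken from the thread: G accumulation steps, micro-batch i holding n_i tokens, N = n_1 + ... + n_G tokens in total, and p_{i,t} the model probability of token t in micro-batch i.

```latex
% With the fix (same as full-batch training): every token has weight 1/N
\mathcal{L}_{\text{fix}} = -\frac{1}{N} \sum_{i=1}^{G} \sum_{t=1}^{n_i} \log p_{i,t}
\quad\Longrightarrow\quad
\exp\!\left(-N \mathcal{L}_{\text{fix}}\right) = \prod_{i=1}^{G} \prod_{t=1}^{n_i} p_{i,t}

% Without the fix: per-micro-batch mean, then divide by G
\mathcal{L}_{\text{naive}} = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{n_i} \sum_{t=1}^{n_i} \log p_{i,t}
\quad\Longrightarrow\quad
\exp\!\left(-N \mathcal{L}_{\text{naive}}\right) = \prod_{i=1}^{G} \prod_{t=1}^{n_i} p_{i,t}^{\,N/(G\, n_i)}
```

The exponent N/(G n_i) is greater than one when a micro-batch is shorter than the average length N/G and less than one when it is longer, so only the fixed loss corresponds to the plain likelihood of the data.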
danielhanchen 5 mins ago
Yes, you're correct, but in normal full batch training without gradient accumulation, all tokens are weighted equally. Standard grad accum does not weight them equally, and so the "fix" makes grad accum and full batch training finally mathematically equivalent.
danielhanchen 5 mins ago
Oh hey! :) TLDR: naive gradient accumulation was over-weighting short sequences in LLM finetuning and training runs, and under-weighting long sequences.

E.g. for a batch with sequence lengths [1, 100], every token would be scaled by 1/(100+1) in full batch training, but grad accum of 2 would weight each token in [1] as 1/1 * 1/2 = 1/2, whilst each token in [100] gets 1/100 * 1/2 = 1/200. (The 1/2 is there since grad accum needs to divide by the # of grad accum steps.)
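The arithmetic above can be checked with a small sketch. This is only an illustration of the weighting, not the actual Unsloth or Transformers code; the helper name `token_weights` is hypothetical, and it assumes one sequence per micro-batch.

```python
from fractions import Fraction


def token_weights(seq_lens, fixed):
    """Per-token loss weight for each micro-batch (one sequence per micro-batch).

    fixed=False: naive grad accum, mean over each micro-batch's tokens,
                 then divide by the number of accumulation steps.
    fixed=True:  divide by the total token count across all micro-batches,
                 which matches full batch training.
    """
    steps = len(seq_lens)
    total_tokens = sum(seq_lens)
    if fixed:
        return [Fraction(1, total_tokens) for _ in seq_lens]
    return [Fraction(1, n) * Fraction(1, steps) for n in seq_lens]


seq_lens = [1, 100]
naive = token_weights(seq_lens, fixed=False)
fixed = token_weights(seq_lens, fixed=True)

print(naive)  # [Fraction(1, 2), Fraction(1, 200)]
print(fixed)  # [Fraction(1, 101), Fraction(1, 101)]
```

In both cases the weights sum to 1 over all 101 tokens; the difference is only in how that total is split between the short and the long sequence.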