I never did as much thinking or testing of dropout on transformers as the author, but it didn't seem to help with my "baby" (~10 million param) transformer models. IIRC the latest Llama models don't use dropout either.
> So you'd call the dropout function on the activations from each layer, zeroing out some at random so that they don't contribute to the "downstream" calculations. (As I understand it, this means that they are also not adjusted during back-propagation -- if nothing else, it would be terribly unfair to the poor ignored neurons to have their weights changed when they didn't contribute to the error.)
If the activations are effectively zeroed out by dropout, shouldn't the error propagated back through those units in the backward pass be zero too, automatically?
(I.e., as I understand it, OP's intuitive notion of "fairness" is literally how error propagation works: neurons are adjusted in proportion to how much they contributed to the output.)
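
You can check this directly with autograd. A minimal PyTorch sketch (my own, not from the article or the book) showing that positions zeroed by dropout also receive a zero gradient:

    import torch

    torch.manual_seed(0)

    x = torch.randn(1, 8, requires_grad=True)  # pretend these are one layer's activations
    drop = torch.nn.Dropout(p=0.5)
    drop.train()  # dropout is only active in training mode

    y = drop(x)        # roughly half the entries are zeroed; survivors are scaled by 1/(1-p)
    loss = y.sum()
    loss.backward()

    print(y)       # zeros mark the dropped positions
    print(x.grad)  # gradient is 0 exactly where the activation was dropped, 1/(1-p) elsewhere

So the "fairness" comes for free: the multiply-by-zero in the forward pass becomes a multiply-by-zero in the backward pass, and the dropped units contribute no gradient.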
https://www.manning.com/books/build-a-large-language-model-f...