In our new work on evaluating optimizers for LLM training, we run a series of experiments investigating the role that adaptivity in optimizers like Adam plays in achieving good performance and stability. A thread: 🧵
Which optimizer is opt? Our new work compares SGD, Adam, Adafactor (+ momentum), Lion, and even plain SignSGD on LLM training with respect to both performance _and_ hyperparameter stability. tl;dr: use anything but SGD; the rest are nearly identical:
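For concreteness, here's a minimal PyTorch sketch of a SignSGD update (this is one common momentum variant, not necessarily the exact formulation we benchmarked):

```python
import torch

class SignSGD(torch.optim.Optimizer):
    """Minimal SignSGD sketch: step in the direction of the gradient's
    sign, ignoring per-coordinate magnitude."""

    def __init__(self, params, lr=1e-3, momentum=0.9):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                # Momentum on the raw gradients (one common variant).
                buf = state.setdefault("momentum_buffer", torch.zeros_like(p))
                buf.mul_(group["momentum"]).add_(p.grad)
                # Only the sign of the momentum-averaged gradient is used.
                p.add_(buf.sign(), alpha=-group["lr"])
```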
We studied an optimizer we refer to as Adalayer, a “block-wise” variant of Adam that achieves similar performance and stability. Layer/block-wise variants of Adam have been studied previously in the literature; we introduce Adalayer solely for ease of analysis.
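To make the construction concrete, here's a rough PyTorch sketch of a block-wise Adam of this flavor, keeping a single second-moment scalar per parameter tensor; the exact blocking and bias-correction choices in the paper may differ:

```python
import torch

class Adalayer(torch.optim.Optimizer):
    """Sketch of a block-wise Adam variant: one shared second-moment
    scalar per parameter tensor (block) instead of one per coordinate."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            b1, b2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)
                    # Scalar second moment shared by the whole block.
                    state["v"] = torch.zeros((), device=p.device, dtype=p.dtype)
                state["step"] += 1
                m, v = state["m"], state["v"]
                m.mul_(b1).add_(p.grad, alpha=1 - b1)
                # Second moment tracks the *mean* squared gradient over the block.
                v.mul_(b2).add_(p.grad.pow(2).mean(), alpha=1 - b2)
                # Standard Adam-style bias correction.
                m_hat = m / (1 - b1 ** state["step"])
                v_hat = v / (1 - b2 ** state["step"])
                p.add_(m_hat / (v_hat.sqrt() + group["eps"]), alpha=-group["lr"])
```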
By analyzing Adalayer's effective learning rates, we can identify which parameters of the network actually need adaptivity. If we train only the *last layer and LayerNorm parameters* with Adalayer and the remaining parameters with SGD, we recover both stability and performance!
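A sketch of how this hybrid could be wired up in PyTorch, with standard Adam standing in for Adalayer; the parameter-name patterns below (`lm_head`, `ln_`) are hypothetical and depend on your architecture:

```python
import torch

def build_hybrid_optimizers(model, sgd_lr=0.1, adaptive_lr=1e-3):
    """Adaptive optimizer for the last layer + LayerNorm, SGD for the rest."""
    adaptive, rest = [], []
    for name, p in model.named_parameters():
        # Hypothetical name patterns; adjust for your model's naming scheme.
        if "lm_head" in name or "ln_" in name or "layernorm" in name.lower():
            adaptive.append(p)
        else:
            rest.append(p)
    sgd = torch.optim.SGD(rest, lr=sgd_lr, momentum=0.9)
    # torch.optim.Adam stands in for Adalayer (see the sketch above).
    adaptive_opt = torch.optim.Adam(adaptive, lr=adaptive_lr)
    return sgd, adaptive_opt

# Usage: step both optimizers every iteration.
#   loss.backward()
#   sgd.step(); adaptive_opt.step()
#   model.zero_grad(set_to_none=True)
```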
This is surprising, because it means the largest impact of Adam's preconditioning is restricted to the last layer and LayerNorm parameters, and *most language model parameters can be trained with SGD*.
We also tried training the network using Adalayer with *fixed* second-moment estimates after initialization for all blocks except the last layer and LayerNorm parameters (“Frozen Adalayer”). This nearly matches Adalayer's performance and stability!
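As a rough sketch, freezing amounts to a small change in the Adalayer step above; the `freeze_v` flag is a hypothetical per-parameter-group option, not something from the paper:

```python
# Inside the Adalayer step sketched earlier, replace the second-moment
# update with a one-shot estimate for frozen blocks:
if state["step"] == 1:
    # Estimate v once from the first gradient (i.e., "at initialization").
    state["v"] = p.grad.pow(2).mean()
elif not group.get("freeze_v", False):
    # Non-frozen blocks (last layer, LayerNorm) keep the usual EMA update.
    state["v"] = b2 * state["v"] + (1 - b2) * p.grad.pow(2).mean()
# The denominator then uses state["v"].sqrt(); once frozen, the estimate
# is no longer an exponential moving average, so no bias correction.
```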
@MrCatid We only used models with LayerNorm here, but as part of our initial optimizer ablations in the first part of the paper we did try replacing LayerNorm with RMSNorm (Llama-style architecture), and the performance of all optimizers was very comparable. I would guess the result still holds there.
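For reference, a minimal RMSNorm module of the kind used in Llama-style architectures (a standard formulation, not code from the paper):

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the features.
    Unlike LayerNorm, there is no mean subtraction and no bias term."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight
```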