Rosie Zhao

@rosieyzh

Followers: 278 · Following: 301 · Media: 4 · Statuses: 29

PhD student with @hseas ML Foundations Group. Previously @mcgillu.

Joined October 2019
@rosieyzh
Rosie Zhao
4 months
In our new work on evaluating optimizers for LLM training, we perform a series of experiments to investigate the role of adaptivity in optimizers like Adam in achieving good performance and stability. A thread: 🧵
@rosieyzh
Rosie Zhao
4 months
It’s an honor to be part of the 2024 cohort of Kempner Institute graduate fellows! Excited for what lies ahead :)
@KempnerInst
Kempner Institute at Harvard University
4 months
We're thrilled to introduce the 2024 cohort of #KempnerInstitute Graduate Fellows! This year’s recipients include seven incoming and eight continuing graduate students enrolled across six @Harvard Ph.D. programs. Read more: #AI #NeuroAI #ML
@rosieyzh
Rosie Zhao
4 months
This is surprising, because it means that the largest impact of Adam’s preconditioning is restricted to the last layer and LayerNorm parameters, and *most language model parameters can be trained with SGD*.
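Concretely, here is a minimal sketch of how that parameter split could be expressed in PyTorch; the helper name `split_params` and the `lm_head` module name are hypothetical, not taken from the paper's code:

```python
import torch.nn as nn

def split_params(model: nn.Module, last_layer_name: str = "lm_head"):
    """Separate the last (output) layer and all LayerNorm parameters
    from the rest of the model's parameters."""
    adaptive, rest = [], []
    for name, module in model.named_modules():
        needs_adaptivity = isinstance(module, nn.LayerNorm) or name == last_layer_name
        for p in module.parameters(recurse=False):
            (adaptive if needs_adaptivity else rest).append(p)
    return adaptive, rest
```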
@rosieyzh
Rosie Zhao
4 months
Work done with @depen_morwani @vyasnikhil96 @brandfonbrener @ShamKakade6! For more, see below: Paper: Blog post: Sham's thread about our main ablations:
@ShamKakade6
Sham Kakade
4 months
Which optimizer is opt? Our new work compares SGD, Adam, Adafactor (+ momentum), Lion, and, simply, SignSGD on LLM training wrt performance _and_ hyperparameter stability. tldr: Use anything but SGD, the rest are nearly identical:
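For reference, a minimal sketch of the SignSGD update mentioned in the quoted tweet: each parameter moves by a fixed step in the direction of the sign of its gradient. This is an illustration, not the paper's implementation; the function name and learning rate are placeholders.

```python
import torch

@torch.no_grad()
def signsgd_step(params, lr: float = 1e-4):
    """One SignSGD step: theta <- theta - lr * sign(grad)."""
    for p in params:
        if p.grad is not None:
            p.add_(p.grad.sign(), alpha=-lr)
```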
@rosieyzh
Rosie Zhao
4 months
By analyzing Adalayer's effective learning rates, we can identify which parameters of the network need adaptivity. If we train only the *last layer and LayerNorm parameters* with Adalayer and the remaining parameters with SGD, we can recover stability and performance!
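A hedged sketch of what such a hybrid setup could look like in PyTorch, assuming the parameters have already been split (e.g. with a helper like `split_params` sketched above); Adam stands in for Adalayer here, and the hyperparameters are placeholders rather than values from the paper:

```python
import torch

def make_hybrid_optimizers(adaptive_params, sgd_params):
    """Adaptive optimizer on the last layer + LayerNorm parameters,
    plain SGD with momentum on everything else."""
    opt_adaptive = torch.optim.Adam(adaptive_params, lr=1e-3)
    opt_sgd = torch.optim.SGD(sgd_params, lr=0.1, momentum=0.9)
    return opt_adaptive, opt_sgd

def hybrid_step(opt_adaptive, opt_sgd):
    """Call after loss.backward(): step both groups, then clear gradients.
    This sits where a single optimizer.step() would go in a training loop."""
    for opt in (opt_adaptive, opt_sgd):
        opt.step()
        opt.zero_grad(set_to_none=True)
```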
@rosieyzh
Rosie Zhao
4 months
We studied an optimizer that we refer to as Adalayer, which is a “block-wise” variant of Adam achieving similar performance and stability. Layer/block-wise variants of Adam have been studied previously in the literature, and we introduce Adalayer solely for ease of analysis.
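As an illustration of the idea (not the paper's implementation), here is a minimal sketch of a block-wise Adam variant in PyTorch, assuming one scalar second-moment estimate per parameter tensor; the paper's exact blocking and other details may differ:

```python
import torch

class AdalayerSketch(torch.optim.Optimizer):
    """Block-wise Adam sketch: per-parameter first moment, but a single
    scalar second-moment estimate per parameter tensor ("block")."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, (b1, b2), eps = group["lr"], group["betas"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)  # per-parameter first moment
                    state["v"] = torch.zeros((), device=p.device, dtype=p.dtype)  # one scalar per block
                state["step"] += 1
                t, m, v = state["step"], state["m"], state["v"]
                m.mul_(b1).add_(p.grad, alpha=1 - b1)
                v.mul_(b2).add_(p.grad.pow(2).mean(), alpha=1 - b2)
                # Bias-corrected estimates, then the Adam-style update.
                m_hat = m / (1 - b1 ** t)
                v_hat = v / (1 - b2 ** t)
                p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
```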
@rosieyzh
Rosie Zhao
4 months
We also tried training the network using Adalayer with *fixed* second moment estimates after initialization for all blocks except the last layer and LayerNorm parameters (Frozen Adalayer). This nearly matches Adalayer’s performance and stability!
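A standalone sketch of the second-moment rule in this "frozen" variant, under the same per-tensor blocking assumption as above: a block's scalar estimate is computed once and then held fixed, while the first-moment update proceeds as usual; the last layer and LayerNorm blocks would keep the running estimate (frozen=False). Names and defaults are illustrative, not the paper's code.

```python
import torch

@torch.no_grad()
def block_second_moment(state: dict, grad: torch.Tensor,
                        beta2: float = 0.999, frozen: bool = True) -> torch.Tensor:
    """Return the scalar second-moment estimate for one block of parameters."""
    if "v" not in state:
        # Initialization: estimate v once from the first gradient seen.
        state["v"] = grad.pow(2).mean()
    elif not frozen:
        # Non-frozen blocks (last layer, LayerNorm) keep the usual EMA update.
        state["v"].mul_(beta2).add_(grad.pow(2).mean(), alpha=1 - beta2)
    return state["v"]
```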
@rosieyzh
Rosie Zhao
3 months
@QuanquanGu @ShamKakade6 @depen_morwani @brandfonbrener Yes, we will be releasing the code soon! Our code uses the OLMo repository (), so our results should also be reproducible from there :)
@rosieyzh
Rosie Zhao
4 months
@QuanquanGu @ShamKakade6 @depen_morwani @brandfonbrener We don't use mu-P in this work; all experiments are in the standard setup!
@rosieyzh
Rosie Zhao
4 months
@MrCatid We only used models with LayerNorm here, but as part of our initial optimizer ablations in the first part of the paper we did try replacing it with RMSNorm (Llama-style architecture), and the performance of all optimizers was very comparable. I would guess the result still holds there.
@rosieyzh
Rosie Zhao
4 months
@boazbaraktcs Thank you Boaz :))