In our new work on evaluating optimizers for LLM training, we run a series of experiments investigating the role that adaptivity in optimizers like Adam plays in achieving good performance and stability. A thread: 🧵
Which optimizer is opt? Our new work compares SGD, Adam, Adafactor (+ momentum), Lion, and even plain SignSGD on LLM training with respect to both performance _and_ hyperparameter stability. tl;dr: use anything but SGD; the rest are nearly identical:
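For concreteness, here's a minimal PyTorch sketch of a SignSGD update (this is one common momentum variant, not necessarily the exact formulation we benchmarked):

```python
import torch

class SignSGD(torch.optim.Optimizer):
    """Minimal SignSGD sketch: step in the direction of the gradient's
    sign, ignoring per-coordinate magnitude."""

    def __init__(self, params, lr=1e-3, momentum=0.9):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                # Momentum on the raw gradients (one common variant).
                buf = state.setdefault("momentum_buffer", torch.zeros_like(p))
                buf.mul_(group["momentum"]).add_(p.grad)
                # Only the sign of the momentum-averaged gradient is used.
                p.add_(buf.sign(), alpha=-group["lr"])
```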
We studied an optimizer we refer to as Adalayer, a “block-wise” variant of Adam that achieves similar performance and stability. Layer/block-wise variants of Adam have been studied previously in the literature; we introduce Adalayer solely for ease of analysis.
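To make the construction concrete, here's a rough PyTorch sketch of a block-wise Adam of this flavor, keeping a single second-moment scalar per parameter tensor; the exact blocking and bias-correction choices in the paper may differ:

```python
import torch

class Adalayer(torch.optim.Optimizer):
    """Sketch of a block-wise Adam variant: one shared second-moment
    scalar per parameter tensor (block) instead of one per coordinate."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            b1, b2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)
                    # Scalar second moment shared by the whole block.
                    state["v"] = torch.zeros((), device=p.device, dtype=p.dtype)
                state["step"] += 1
                m, v = state["m"], state["v"]
                m.mul_(b1).add_(p.grad, alpha=1 - b1)
                # Second moment tracks the *mean* squared gradient over the block.
                v.mul_(b2).add_(p.grad.pow(2).mean(), alpha=1 - b2)
                # Standard Adam-style bias correction.
                m_hat = m / (1 - b1 ** state["step"])
                v_hat = v / (1 - b2 ** state["step"])
                p.add_(m_hat / (v_hat.sqrt() + group["eps"]), alpha=-group["lr"])
```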
By analyzing Adalayer's effective learning rates, we can identify which parameters of the network actually need adaptivity. If we train only the *last layer and LayerNorm parameters* with Adalayer and the remaining parameters with SGD, we recover both stability and performance!
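A sketch of how this hybrid could be wired up in PyTorch, with standard Adam standing in for Adalayer; the parameter-name patterns below (`lm_head`, `ln_`) are hypothetical and depend on your architecture:

```python
import torch

def build_hybrid_optimizers(model, sgd_lr=0.1, adaptive_lr=1e-3):
    """Adaptive optimizer for the last layer + LayerNorm, SGD for the rest."""
    adaptive, rest = [], []
    for name, p in model.named_parameters():
        # Hypothetical name patterns; adjust for your model's naming scheme.
        if "lm_head" in name or "ln_" in name or "layernorm" in name.lower():
            adaptive.append(p)
        else:
            rest.append(p)
    sgd = torch.optim.SGD(rest, lr=sgd_lr, momentum=0.9)
    # torch.optim.Adam stands in for Adalayer (see the sketch above).
    adaptive_opt = torch.optim.Adam(adaptive, lr=adaptive_lr)
    return sgd, adaptive_opt

# Usage: step both optimizers every iteration.
#   loss.backward()
#   sgd.step(); adaptive_opt.step()
#   model.zero_grad(set_to_none=True)
```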
This is surprising, because it means the largest impact of Adam's preconditioning is restricted to the last layer and LayerNorm parameters, and *most language model parameters can be trained with SGD*.
We also tried training the network using Adalayer with *fixed* second-moment estimates after initialization for all blocks except the last layer and LayerNorm parameters (“Frozen Adalayer”). This nearly matches Adalayer's performance and stability!
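As a rough sketch, freezing amounts to a small change in the Adalayer step above; the `freeze_v` flag is a hypothetical per-parameter-group option, not something from the paper:

```python
# Inside the Adalayer step sketched earlier, replace the second-moment
# update with a one-shot estimate for frozen blocks:
if state["step"] == 1:
    # Estimate v once from the first gradient (i.e., "at initialization").
    state["v"] = p.grad.pow(2).mean()
elif not group.get("freeze_v", False):
    # Non-frozen blocks (last layer, LayerNorm) keep the usual EMA update.
    state["v"] = b2 * state["v"] + (1 - b2) * p.grad.pow(2).mean()
# The denominator then uses state["v"].sqrt(); once frozen, the estimate
# is no longer an exponential moving average, so no bias correction.
```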
@MrCatid We only used models with LayerNorm here, but as part of our initial optimizer ablations in the first part of the paper we did try replacing LayerNorm with RMSNorm (Llama-style architecture), and the performance of all optimizers was very comparable. I would guess the result still holds there.
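For reference, a minimal RMSNorm module of the kind used in Llama-style architectures (a standard formulation, not code from the paper):

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the features.
    Unlike LayerNorm, there is no mean subtraction and no bias term."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight
```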