![Yu Bai Profile](https://pbs.twimg.com/profile_images/1118653380027539456/ZXGIt7X0_x96.jpg)
Yu Bai
@yubai01
Followers
4K
Following
2K
Statuses
237
Researcher @OpenAI. Previously @SFResearch, PhD @Stanford.
San Francisco, CA
Joined November 2010
We used RL to train a much stronger reasoning model. Excited to have been part of this journey, and way to go!!!
We're releasing a preview of OpenAI o1—a new series of AI models designed to spend more time thinking before they respond. These models can reason through complex tasks and solve harder problems than previous models in science, coding, and math.
9
13
293
Lol don't mind at all if deep research gets past me in 2025! @EdwardSun0909 it's on you :)
PhD experts? 🤣🤣 Unless they can perform at @yubai01’s level, they’re irrelevant to the machine learning theory community.
2
0
14
@_aidan_clark_ Bad local minima were studied a lot, e.g. Auer et al. 1995 "Exponentially many bad local minima for single neurons": though the bad example there is clearly contrived, and the authors did not explicitly draw an implication like "NNs are bad" from it
0
0
8
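A minimal, purely illustrative sketch of the idea referenced in the tweet above: a single sigmoid neuron trained with squared loss can settle at noticeably different final losses depending on initialization (multiple basins or saturated plateaus). The dataset, step sizes, and initialization scale here are my own hypothetical choices, not taken from Auer et al. 1995.

```python
# Toy sketch (assumed setup, not from Auer et al.): gradient descent on a
# single sigmoid neuron with squared loss, run from many random inits, can
# end at visibly different final losses on the same tiny dataset.
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1-D dataset with alternating targets, so no monotone sigmoid fits it.
X = np.array([-3.0, -1.0, 1.0, 3.0])
y = np.array([1.0, 0.0, 1.0, 0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b):
    # Squared loss of a single sigmoid neuron.
    p = sigmoid(w * X + b)
    return np.mean((p - y) ** 2)

def grad(w, b):
    p = sigmoid(w * X + b)
    dp = 2.0 * (p - y) * p * (1.0 - p) / len(X)
    return np.sum(dp * X), np.sum(dp)

final_losses = []
for _ in range(20):
    w, b = rng.normal(scale=3.0, size=2)   # random initialization
    for _ in range(5000):                   # plain gradient descent
        gw, gb = grad(w, b)
        w -= 0.5 * gw
        b -= 0.5 * gb
    final_losses.append(round(loss(w, b), 3))

# Several distinct clusters of final loss values suggest distinct basins
# (or saturated plateaus where the gradient is exponentially small).
print(sorted(set(final_losses)))
```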
Besides ~saturating AIME, o3-mini is also the first to consistently solve some of the hard math questions in my own "test set" -- have to update that as well 🤣 Congrats @ren_hongyu @shengjia_zhao @_kevinlu + co!
1
0
26
Great summer research intern opportunity!
We're hiring AI Research Interns for Summer 2025! Spend 3 months with us working on AI Agents, LLMs, Reasoning, Planning & more—with a focus on publishing high-quality academic papers. If you have a strong publication record, apply or DM me! #researchpaper #JobOpening #intern
0
0
9
TLDR: Attention sinks / massive tokens emerge in LLMs simply because most heads need to be
* active for some input sequences;
* "dormant" for others.
Started as a fun collab during my time @SFResearch, huge shoutout to @TianyuGuo0505 @druv_pai @Song__Mei + co for the amazing work!
Many LLMs, e.g., GPT-2 and Llama, exhibit a fascinating attention sink phenomenon: attention weights often concentrate on the first token. We studied the training dynamics of toy models to demystify the sink formation mechanisms in LLMs. With fantastic @TianyuGuo0505, @druv_pai, @yubai01, @JiantaoJ, and Mike Jordan! ArXiv link:

In detail: practitioners have consistently found three extreme-token phenomena in LLMs: attention sinks, value-state drains, and residual-state peaks. They often cause trouble in LLM inference and quantization. To understand them, we developed the Bigram-Backcopy task and analyzed a single-layer transformer, revealing two key mechanisms:
• Active-dormant mechanism: the attention sink represents the dormant phase of an attention head.
• Mutual-reinforcement mechanism: attention sinks and value-state drains mutually reinforce each other during training.

All results transfer to LLMs:
• Llama 2 has a "coding head" that is dormant given Wikipedia text.
• OLMo's training dynamics closely match the theory and the toy model.

We also found that replacing softmax attention with ReLU attention can mitigate the extreme-token phenomena.
0
1
32
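A minimal numpy sketch of the active-dormant picture described in the thread above (illustrative only, not the paper's code): with softmax, a head that has nothing useful to retrieve must still place its attention mass somewhere, and a token whose key strongly matches every query, often the first token, soaks it up; with ReLU attention the weights need not sum to one, so a dormant head can simply output near-zero weights. The `sink_bias` knob below is my own hypothetical stand-in for such a sink token.

```python
# Illustrative sketch (assumed setup) of attention sinks under softmax vs.
# ReLU attention for a single head and a single query position.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                      # sequence length, head dimension

q = rng.normal(size=d)           # query of the last position
K = rng.normal(size=(T, d))      # keys of all positions

def softmax_attn(q, K, sink_bias=0.0):
    # Standard softmax attention weights; sink_bias > 0 mimics a first token
    # whose key strongly matches every query (the hypothetical "sink" token).
    scores = K @ q / np.sqrt(d)
    scores[0] += sink_bias
    w = np.exp(scores - scores.max())
    return w / w.sum()

def relu_attn(q, K):
    # ReLU attention: nonnegative weights with no sum-to-one constraint,
    # so every weight can stay small when the head has nothing to attend to.
    scores = K @ q / np.sqrt(d)
    return np.maximum(scores, 0.0) / T

print("softmax, active head :", np.round(softmax_attn(q, K), 2))
print("softmax, dormant head:", np.round(softmax_attn(q, K, sink_bias=8.0), 2))
print("relu, dormant head   :", np.round(relu_attn(q * 0.01, K), 3))
```

With the sink bias turned on, nearly all softmax weight lands on token 0 (the sink), whereas the ReLU variant with a weak query simply produces small weights everywhere, which is the intuition behind the mitigation mentioned at the end of the thread.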
@johnschulman2 It's been an honor to have been your colleague, and I wish it could have been longer. Thank you and all the best!
0
0
7