Yu Bai

@yubai01

4K Followers · 2K Following · 237 Statuses

Researcher @OpenAI. Previously @SFResearch, PhD @Stanford.

San Francisco, CA
Joined November 2010
@yubai01
Yu Bai
5 months
We used RL to train a much stronger reasoning model. Excited to have been part of this journey, and way to go!!!
@OpenAI
OpenAI
5 months
We're releasing a preview of OpenAI o1—a new series of AI models designed to spend more time thinking before they respond. These models can reason through complex tasks and solve harder problems than previous models in science, coding, and math.
9
13
293
@yubai01
Yu Bai
7 days
Lol don't mind at all if deep research gets past me in 2025! @EdwardSun0909 it's on you :)
@QuanquanGu
Quanquan Gu
7 days
PhD experts? 🤣🤣 Unless they can perform at @yubai01’s level, they’re irrelevant to the machine learning theory community.
2
0
14
@yubai01
Yu Bai
7 days
@EdwardSun0909 Congrats!
0
0
2
@yubai01
Yu Bai
8 days
RT @sama: o3-mini is out! smart, fast model. available in ChatGPT and API. it can search the web, and it shows its thinking. available…
0
2K
0
@yubai01
Yu Bai
17 days
RT @sama: big news: the free tier of chatgpt is going to get o3-mini! (and the plus tier will get tons of o3-mini usage)
0
2K
0
@yubai01
Yu Bai
1 month
@_aidan_clark_ Bad local minima were studied a lot, e.g. Auer et al. 1995, "Exponentially many bad local minima for single neurons" -- though the bad example there is clearly contrived, and the authors did not explicitly draw an implication like "NNs are bad" from it
0
0
8
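For context, a minimal sketch of the phenomenon that tweet refers to. This is my own toy example, not the Auer et al. construction (which uses sigmoid units); it uses a single ReLU neuron with made-up data to show how even a one-neuron model can have a spurious local minimum when a data point falls into the ReLU's dead region.

```python
# Toy illustration (assumed example, not from the referenced paper):
# a single ReLU neuron f(x) = max(0, w*x + b) fit to two points by squared loss.
import numpy as np

X = np.array([-2.0, 1.0])   # inputs
Y = np.array([2.0, 1.0])    # targets

def loss(w, b):
    pred = np.maximum(0.0, w * X + b)
    return np.sum((pred - Y) ** 2)

# Global minimum: w = -1/3, b = 4/3 fits both points exactly (loss 0).
print("global fit   :", loss(-1/3, 4/3))   # -> 0.0

# Spurious flat local minimum: at (w, b) = (1, 0) the point x = -2 sits in the
# dead region (pre-activation -2 < 0), so its gradient is zero and the loss
# stays at 4 for all small perturbations -- gradient descent started here stalls.
print("stuck fit    :", loss(1.0, 0.0))    # -> 4.0
eps = 1e-3
print("nearby losses:", [round(loss(1.0 + dw, 0.0 + db), 6)
                         for dw in (-eps, 0, eps) for db in (-eps, 0, eps)])
```

Every nearby loss printed at the end is at least 4, while the global optimum is 0, which is the sense in which this stationary point is a "bad" local minimum.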
@yubai01
Yu Bai
2 months
Besides ~saturating AIME, o3-mini is also the first to consistently solve some of the hard math questions in my own "test set" -- have to update that as well 🤣 Congrats @ren_hongyu @shengjia_zhao @_kevinlu + co!
1
0
26
@yubai01
Yu Bai
2 months
That's the poster style we all need! 🤣
@simon_zhai
Simon Zhai
2 months
A huge thanks to everyone who came to the poster session. Posting this for whoever missed it; jokes, comments & suggestions are more than welcome.
1
0
7
@yubai01
Yu Bai
3 months
@max_simchowitz Congrats Max!
0
0
3
@yubai01
Yu Bai
3 months
0
0
2
@yubai01
Yu Bai
3 months
Great summer research intern opportunity!
@huan__wang
Huan Wang
3 months
We're hiring AI Research Interns for Summer 2025! Spend 3 months with us working on AI Agents, LLMs, Reasoning, Planning & more—with a focus on publishing high-quality academic papers. If you have a strong publication record, apply or DM me! #researchpaper #JobOpening #intern
0
0
9
@yubai01
Yu Bai
3 months
@lilianweng We will miss you! Good luck on your new journey 🩵
0
0
1
@yubai01
Yu Bai
3 months
@SebastienBubeck @OpenAI @sama Welcome! Let's rock!
0
0
1
@yubai01
Yu Bai
4 months
TLDR: Attention sink/massive tokens emerge in LLMs, simply because most heads need to be
* Active for some input sequences;
* "Dormant" for others.
Started as a fun collab during my time @SFResearch, huge shoutout to @TianyuGuo0505 @druv_pai @Song__Mei +co for the amazing work!
@Song__Mei
Song Mei
4 months
Many LLMs, e.g., GPT2 and Llama, exhibit a fascinating attention sink phenomenon: attention weights often concentrate on the first token. We studied the training dynamics of toy models to demystify the sink formation mechanisms in LLMs. With fantastic @TianyuGuo0505, @druv_pai, @yubai01, @JiantaoJ, and Mike Jordan! ArXiv link:
In detail: Practitioners have consistently found three extreme-token phenomena in LLMs: attention sinks, value-state drains, and residual-state peaks. They often cause trouble in LLM inference and quantization. To understand them, we developed the Bigram-Backcopy task and analyzed a single-layer transformer, revealing two key mechanisms:
• Active-dormant mechanism: The attention sink represents the dormant phase of an attention head.
• Mutual reinforcement mechanism: Attention sinks and value-state drains mutually reinforce during training.
All results transfer to LLMs!
• Llama 2 has a "coding head" that is dormant given Wikipedia texts.
• OLMo's training dynamics closely match the theory and the toy model.
We also found that replacing softmax attention with ReLU attention can mitigate the extreme-token phenomenon.
0
1
32
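To make the active-dormant picture concrete, here is a minimal numeric sketch (my own toy scores, not code from the thread or the paper): because softmax weights must sum to 1, a head with nothing to attend to still has to put its mass somewhere, and a low-score "rest" position like the first token absorbs it, while an unnormalized ReLU attention can simply output near-zero everywhere, in line with the mitigation mentioned above.

```python
# Minimal sketch (not the paper's code): why softmax attention piles weight on
# the first token ("attention sink") when a head is effectively dormant, and
# why ReLU attention avoids this. Query/key scores are made-up numbers.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Scores from one query against 6 keys; key 0 is the BOS-like first token.
# A "dormant" head gives all content tokens similarly low scores.
dormant_scores = np.array([0.0, -5.0, -5.2, -4.8, -5.1, -4.9])
active_scores  = np.array([0.0,  4.0, -5.0, -5.0,  3.5, -5.0])

print("softmax, dormant head :", np.round(softmax(dormant_scores), 3))
# -> nearly all mass on token 0: the attention sink.
print("softmax, active head  :", np.round(softmax(active_scores), 3))
# -> mass goes to the genuinely relevant tokens instead.

# Unnormalized ReLU attention can output ~0 everywhere when dormant,
# so no sink position is needed.
print("ReLU, dormant head    :", np.maximum(dormant_scores, 0.0))
```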
@yubai01
Yu Bai
4 months
@liuzhuang1234 @PrincetonCS Congrats Zhuang and Princeton!
0
0
1
@yubai01
Yu Bai
4 months
@jbhuang0604 Get better soon!!
0
0
0
@yubai01
Yu Bai
6 months
@johnschulman2 It's been an honor to have been a colleague of yours, and I wish it could have been longer. Thank you and all the best!
0
0
7
@yubai01
Yu Bai
6 months
0
0
1
@yubai01
Yu Bai
6 months
0
0
1
@yubai01
Yu Bai
7 months
GPT-4o mini is out!
@OpenAIDevs
OpenAI Developers
7 months
Introducing GPT-4o mini! It’s our most intelligent and affordable small model, available today in the API. GPT-4o mini is significantly smarter and cheaper than GPT-3.5 Turbo.
0
1
25