![Yu Bai Profile](https://pbs.twimg.com/profile_images/1118653380027539456/ZXGIt7X0_x96.jpg)
Yu Bai
@yubai01
Followers
4K
Following
2K
Statuses
237
Researcher @OpenAI. Previously @SFResearch, PhD @Stanford.
San Francisco, CA
Joined November 2010
We used RL to train a much stronger reasoning model. Excited to have been part of this journey, and way to go!!!
We're releasing a preview of OpenAI o1—a new series of AI models designed to spend more time thinking before they respond. These models can reason through complex tasks and solve harder problems than previous models in science, coding, and math.
9
13
293
Lol don't mind at all if deep research gets past me in 2025! @EdwardSun0909 it's on you :)
PhD experts? 🤣🤣 Unless they can perform at @yubai01’s level, they’re irrelevant to the machine learning theory community.
2
0
14
@_aidan_clark_ Bad local minima were studied a lot, e.g. Auer et al. 1995 "Exponentially many bad local minima for single neurons": though the bad example there is clearly contrived, and the authors did not explicitly draw an implication like "NNs are bad" from it
0
0
8
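A minimal, purely illustrative sketch of the idea referenced in the tweet above: a single sigmoid neuron trained with squared loss can settle at noticeably different final losses depending on initialization (multiple basins or saturated plateaus). The dataset, step sizes, and initialization scale here are my own hypothetical choices, not taken from Auer et al. 1995.

```python
# Toy sketch (assumed setup, not from Auer et al.): gradient descent on a
# single sigmoid neuron with squared loss, run from many random inits, can
# end at visibly different final losses on the same tiny dataset.
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1-D dataset with alternating targets, so no monotone sigmoid fits it.
X = np.array([-3.0, -1.0, 1.0, 3.0])
y = np.array([1.0, 0.0, 1.0, 0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b):
    # Squared loss of a single sigmoid neuron.
    p = sigmoid(w * X + b)
    return np.mean((p - y) ** 2)

def grad(w, b):
    p = sigmoid(w * X + b)
    dp = 2.0 * (p - y) * p * (1.0 - p) / len(X)
    return np.sum(dp * X), np.sum(dp)

final_losses = []
for _ in range(20):
    w, b = rng.normal(scale=3.0, size=2)   # random initialization
    for _ in range(5000):                   # plain gradient descent
        gw, gb = grad(w, b)
        w -= 0.5 * gw
        b -= 0.5 * gb
    final_losses.append(round(loss(w, b), 3))

# Several distinct clusters of final loss values suggest distinct basins
# (or saturated plateaus where the gradient is exponentially small).
print(sorted(set(final_losses)))
```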
Besides ~saturating AIME, o3-mini is also the first to consistently solve some of the hard math questions in my own "test set" -- have to update that as well 🤣 Congrats @ren_hongyu @shengjia_zhao @_kevinlu + co!
1
0
26
Great summer research intern opportunity!
We're hiring AI Research Interns for Summer 2025! Spend 3 months with us working on AI Agents, LLMs, Reasoning, Planning & more—with a focus on publishing high-quality academic papers. If you have a strong publication record, apply or DM me! #researchpaper #JobOpening #intern
0
0
9
TLDR: Attention sinks / massive tokens emerge in LLMs simply because most heads need to be
* active for some input sequences;
* "dormant" for others.
Started as a fun collab during my time @SFResearch, huge shoutout to @TianyuGuo0505 @druv_pai @Song__Mei + co for the amazing work!
Many LLMs, e.g., GPT-2 and Llama, exhibit a fascinating attention sink phenomenon: attention weights often concentrate on the first token. We studied the training dynamics of toy models to demystify the sink formation mechanisms in LLMs. With fantastic @TianyuGuo0505, @druv_pai, @yubai01, @JiantaoJ, and Mike Jordan! ArXiv link:

In detail: practitioners have consistently found three extreme-token phenomena in LLMs: attention sinks, value-state drains, and residual-state peaks. They often cause trouble in LLM inference and quantization. To understand them, we developed the Bigram-Backcopy task and analyzed a single-layer transformer, revealing two key mechanisms:
• Active-dormant mechanism: the attention sink represents the dormant phase of an attention head.
• Mutual-reinforcement mechanism: attention sinks and value-state drains mutually reinforce each other during training.

All results transfer to LLMs:
• Llama 2 has a "coding head" that is dormant given Wikipedia text.
• OLMo's training dynamics closely match the theory and the toy model.

We also found that replacing softmax attention with ReLU attention can mitigate the extreme-token phenomena.
0
1
32
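A minimal numpy sketch of the active-dormant picture described in the thread above (illustrative only, not the paper's code): with softmax, a head that has nothing useful to retrieve must still place its attention mass somewhere, and a token whose key strongly matches every query, often the first token, soaks it up; with ReLU attention the weights need not sum to one, so a dormant head can simply output near-zero weights. The `sink_bias` knob below is my own hypothetical stand-in for such a sink token.

```python
# Illustrative sketch (assumed setup) of attention sinks under softmax vs.
# ReLU attention for a single head and a single query position.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                      # sequence length, head dimension

q = rng.normal(size=d)           # query of the last position
K = rng.normal(size=(T, d))      # keys of all positions

def softmax_attn(q, K, sink_bias=0.0):
    # Standard softmax attention weights; sink_bias > 0 mimics a first token
    # whose key strongly matches every query (the hypothetical "sink" token).
    scores = K @ q / np.sqrt(d)
    scores[0] += sink_bias
    w = np.exp(scores - scores.max())
    return w / w.sum()

def relu_attn(q, K):
    # ReLU attention: nonnegative weights with no sum-to-one constraint,
    # so every weight can stay small when the head has nothing to attend to.
    scores = K @ q / np.sqrt(d)
    return np.maximum(scores, 0.0) / T

print("softmax, active head :", np.round(softmax_attn(q, K), 2))
print("softmax, dormant head:", np.round(softmax_attn(q, K, sink_bias=8.0), 2))
print("relu, dormant head   :", np.round(relu_attn(q * 0.01, K), 3))
```

With the sink bias turned on, nearly all softmax weight lands on token 0 (the sink), whereas the ReLU variant with a weak query simply produces small weights everywhere, which is the intuition behind the mitigation mentioned at the end of the thread.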
@johnschulman2 It's been an honor to have been your colleague, and I wish it could have been longer. Thank you and all the best!
0
0
7