MKhalifaaaa Profile Banner
Muhammad Khalifa Profile
Muhammad Khalifa

@MKhalifaaaa

Followers
515
Following
1K
Statuses
289

Ph.D. student at @Umich, previously @cohere, @allenai and @AmazonScience, and @NaverLabsEurope. Reasoning with LLMs, attribution and other stuff.

Ann Arbor, Michigan
Joined March 2019
@MKhalifaaaa
Muhammad Khalifa
2 months
📝When training an LLM, we typically end up with substandard models: they perform best👍on some tasks but worse☹️on others. Should we discard🗑️these models? Well... If you Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs 🧵👇 1/n
[attached image]
4
17
65
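The thread above is about optimizing linear merges of checkpoints so that the merged model mitigates per-task tradeoffs. As a rough illustration only, here is a minimal sketch of the basic operation being optimized, a weighted average of model parameters. The function name and the use of plain floats in place of tensors are my assumptions for a self-contained toy; the actual work optimizes the merging weights at scale.

```python
# Hedged sketch: linear (weighted) merging of model checkpoints.
# All names are illustrative; plain floats stand in for parameter tensors.
from collections import OrderedDict

def linear_merge(state_dicts, weights):
    """Merge parameter dicts as a weighted average.

    state_dicts: list of {param_name: value} dicts with identical keys
    weights: list of floats, assumed to sum to 1
    """
    assert len(state_dicts) == len(weights)
    merged = OrderedDict()
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for sd, w in zip(state_dicts, weights))
    return merged

# Toy usage: merge two "checkpoints" with equal weights
ckpt_a = {"w": 1.0, "b": 0.0}
ckpt_b = {"w": 3.0, "b": 2.0}
merged = linear_merge([ckpt_a, ckpt_b], [0.5, 0.5])
```

The same averaging works per-tensor on real `state_dict`s; the interesting part (which this sketch omits) is searching for the weight vector that best balances the tasks.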
@MKhalifaaaa
Muhammad Khalifa
4 days
RT @mbalunovic: We finally have an answer to the debate over whether LLMs generalize to new math problems or they merely memorized the answ…
0
165
0
@MKhalifaaaa
Muhammad Khalifa
8 days
@lateinteraction Agreed. Also, I’m starting to trust AIME evaluations less and less
0
0
0
@MKhalifaaaa
Muhammad Khalifa
8 days
RT @xiangyue96: Demystifying Long CoT Reasoning in LLMs Reasoning models like R1 / O1 / O3 have gained massive atte…
0
192
0
@MKhalifaaaa
Muhammad Khalifa
13 days
@prajdabre1 Great work. It was cool to see that different methods tend to converge to similar performance at large scales, which kinda motivated our recent work on linear merge optimization
0
0
1
@MKhalifaaaa
Muhammad Khalifa
2 months
@sh_reya I disagree; being hellbent on something can often be a temporary/precarious stance that can easily change with circumstances, mood, etc. You shouldn't base a long-term commitment such as doing a PhD on the feeling of "I must do a PhD no matter what people say"
0
0
4
@MKhalifaaaa
Muhammad Khalifa
2 months
Thinking models like o3 will keep getting better and better but ONLY at easily verifiable skills like code and math. Reward modeling for non-verifiable tasks that involve creativity, imagination, or social interaction will remain a challenge
@DrJimFan
Jim Fan
2 months
Thoughts about o3: I'll skip the obvious part (extraordinary reasoning, FrontierMath is insanely hard, etc). I think the essence of o3 is about *relaxing a single-point RL super intelligence* to cover more points in the space of useful problems.

The world of AI is no stranger to RL achieving god-level stunts. AlphaGo was a super intelligence. It beats the world champion in Go - well above 99.999% of regular players. AlphaStar was a super intelligence. It bests some of the greatest e-sport champion teams on StarCraft. Boston Dynamics e-Atlas was a super intelligence. It performs perfect backflips. Most human brains don't know how to send such sophisticated control signals to their limbs.

A similar statement can be made for AIME, SWE-Bench, and FrontierMath - they are like Go, which requires exceptional domain expertise above 99.99....% of average people. o3 is a super intelligence when operating in these domains.

The key difference is that AlphaGo uses RL to optimize for a simple, almost trivially defined reward function: winning the game gives 1, losing gives 0. Learning reward functions for sophisticated math and software engineering is much harder. o3 made a breakthrough in solving the reward problem, for the domains that OpenAI prioritizes. It is no longer an RL specialist for a single-point task, but an RL specialist for a bigger set of useful tasks.

Yet o3's reward engineering could not cover the full distribution of human cognition. This is why we are still cursed by Moravec's paradox. o3 can wow the Fields Medalists, but still fail to solve some 5-yr-old puzzles like the one below. I am not at all surprised by this cognitive dissonance, just like we wouldn't expect AlphaGo to win Poker games.

Huge milestone. Clear roadmap. More to do.
[attached image]
1
1
15
@MKhalifaaaa
Muhammad Khalifa
2 months
RT @_lewtun: We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥 How? By combining step-wise reward models w…
0
194
0
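The retweet above describes scaling test-time compute by sampling many candidate solutions from a small model and selecting among them with a step-wise reward model. As a rough illustration only, here is a minimal best-of-n sketch; the `score_fn` stand-in and the toy scoring rule are my assumptions, not the actual reward model from that work.

```python
# Hedged sketch: best-of-n selection at test time.
# A real system would sample n generations from an LLM and score each
# with a trained (step-wise) reward model; here both are toy stand-ins.
def best_of_n(candidates, score_fn):
    """Return the candidate solution with the highest reward score."""
    return max(candidates, key=score_fn)

# Toy usage: pretend the "reward" is simply the number of reasoning steps
candidates = [
    "step1 step2",
    "step1",
    "step1 step2 step3",
]
best = best_of_n(candidates, score_fn=lambda c: len(c.split()))
```

Step-wise variants score partial solutions during generation (e.g. to guide a beam search) rather than only ranking finished answers, which is where the gains over plain best-of-n come from.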
@MKhalifaaaa
Muhammad Khalifa
2 months
@vishaal_urao Interesting work! Unfortunately, I’m not at NeurIPS, but would be happy to chat regardless!
0
0
0
@MKhalifaaaa
Muhammad Khalifa
2 months
RT @CohereForAI: In our latest work, we ask “Can model merging help with task tradeoffs over models obtained from different training runs”?…
0
8
0
@MKhalifaaaa
Muhammad Khalifa
2 months
RT @mgalle: WAIT! Don't throw away that checkpoint you don't like. Give it a second life by ♻️ it. Who knows, it might be like our _waste…
0
2
0
@MKhalifaaaa
Muhammad Khalifa
2 months
Check out our paper here for more results and analysis! Work done with amazing people: @mgalle @yichern_tan @aahmadian_ @tomhosking @ahmetustun89 @tomsherborne @honglaklee @LuWang__
0
5
11
@MKhalifaaaa
Muhammad Khalifa
3 months
@_jasonwei It boils down to the level of abstraction. Yes, papers are more polished, but they are also written at a higher level of abstraction compared to the other formats you mentioned. Naturally, as you get closer to the details, things become messier.
0
0
1