MKhalifaaaa Profile Banner
Muhammad Khalifa Profile
Muhammad Khalifa

@MKhalifaaaa

Followers
515
Following
1K
Statuses
289

Ph.D. student at @Umich, previously @cohere, @allenai and @AmazonScience, and @NaverLabsEurope. Reasoning with LLMs, attribution and other stuff.

Ann Arbor, Michigan
Joined March 2019
@MKhalifaaaa
Muhammad Khalifa
2 months
📝When training an LLM, we typically end up with substandard models: they perform best👍on some tasks but worse☹️on others. Should we discard🗑️these models? Well... If you Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs 🧵👇 1/n
[attached image]
4
17
65
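The thread above is about optimizing linear merges of checkpoints so that the merged model mitigates per-task tradeoffs. As a rough illustration only, here is a minimal sketch of the basic operation being optimized, a weighted average of model parameters. The function name and the use of plain floats in place of tensors are my assumptions for a self-contained toy; the actual work optimizes the merging weights at scale.

```python
# Hedged sketch: linear (weighted) merging of model checkpoints.
# All names are illustrative; plain floats stand in for parameter tensors.
from collections import OrderedDict

def linear_merge(state_dicts, weights):
    """Merge parameter dicts as a weighted average.

    state_dicts: list of {param_name: value} dicts with identical keys
    weights: list of floats, assumed to sum to 1
    """
    assert len(state_dicts) == len(weights)
    merged = OrderedDict()
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for sd, w in zip(state_dicts, weights))
    return merged

# Toy usage: merge two "checkpoints" with equal weights
ckpt_a = {"w": 1.0, "b": 0.0}
ckpt_b = {"w": 3.0, "b": 2.0}
merged = linear_merge([ckpt_a, ckpt_b], [0.5, 0.5])
```

The same averaging works per-tensor on real `state_dict`s; the interesting part (which this sketch omits) is searching for the weight vector that best balances the tasks.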
@MKhalifaaaa
Muhammad Khalifa
4 days
RT @mbalunovic: We finally have an answer to the debate over whether LLMs generalize to new math problems or they merely memorized the answ…
0
165
0
@MKhalifaaaa
Muhammad Khalifa
8 days
@lateinteraction Agreed. Also, I’m starting to trust AIME evaluations less and less
0
0
0
@MKhalifaaaa
Muhammad Khalifa
8 days
RT @xiangyue96: Demystifying Long CoT Reasoning in LLMs Reasoning models like R1 / O1 / O3 have gained massive atte…
0
192
0
@MKhalifaaaa
Muhammad Khalifa
13 days
@prajdabre1 Great work. It was cool to see that different methods tend to converge to similar performance at large scales, which kinda motivated our recent work on linear merge optimization
0
0
1
@MKhalifaaaa
Muhammad Khalifa
2 months
@sh_reya I disagree; being hellbent on something can often be a temporary/precarious stance that can easily change with circumstances, mood, etc. You shouldn't base a long-term commitment such as doing a PhD on the feeling of "I must do a PhD no matter what people say"
0
0
4
@MKhalifaaaa
Muhammad Khalifa
2 months
Thinking models like o3 will keep getting better and better but ONLY at easily verifiable skills like code and math. Reward modeling for non-verifiable tasks that involve creativity, imagination, or social interaction will remain a challenge
@DrJimFan
Jim Fan
2 months
Thoughts about o3: I'll skip the obvious part (extraordinary reasoning, FrontierMath is insanely hard, etc). I think the essence of o3 is about *relaxing a single-point RL super intelligence* to cover more points in the space of useful problems.

The world of AI is no stranger to RL achieving god-level stunts. AlphaGo was a super intelligence. It beats the world champion in Go - well above 99.999% of regular players. AlphaStar was a super intelligence. It bests some of the greatest e-sport champion teams on StarCraft. Boston Dynamics e-Atlas was a super intelligence. It performs perfect backflips. Most human brains don't know how to send such sophisticated control signals to their limbs.

A similar statement can be made for AIME, SWE-Bench, and FrontierMath - they are like Go, which requires exceptional domain expertise above 99.99....% of average people. o3 is a super intelligence when operating in these domains.

The key difference is that AlphaGo uses RL to optimize for a simple, almost trivially defined reward function: winning the game gives 1, losing gives 0. Learning reward functions for sophisticated math and software engineering is much harder. o3 made a breakthrough in solving the reward problem, for the domains that OpenAI prioritizes. It is no longer an RL specialist for a single-point task, but an RL specialist for a bigger set of useful tasks.

Yet o3's reward engineering could not cover the full distribution of human cognition. This is why we are still cursed by Moravec's paradox. o3 can wow the Fields Medalists, but still fail to solve some 5-yr-old puzzles like the one below. I am not at all surprised by this cognitive dissonance, just like we wouldn't expect AlphaGo to win Poker games.

Huge milestone. Clear roadmap. More to do.
[attached image]
1
1
15
@MKhalifaaaa
Muhammad Khalifa
2 months
RT @_lewtun: We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥 How? By combining step-wise reward models w…
0
194
0
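The retweet above describes scaling test-time compute by sampling many candidate solutions from a small model and selecting among them with a step-wise reward model. As a rough illustration only, here is a minimal best-of-n sketch; the `score_fn` stand-in and the toy scoring rule are my assumptions, not the actual reward model from that work.

```python
# Hedged sketch: best-of-n selection at test time.
# A real system would sample n generations from an LLM and score each
# with a trained (step-wise) reward model; here both are toy stand-ins.
def best_of_n(candidates, score_fn):
    """Return the candidate solution with the highest reward score."""
    return max(candidates, key=score_fn)

# Toy usage: pretend the "reward" is simply the number of reasoning steps
candidates = [
    "step1 step2",
    "step1",
    "step1 step2 step3",
]
best = best_of_n(candidates, score_fn=lambda c: len(c.split()))
```

Step-wise variants score partial solutions during generation (e.g. to guide a beam search) rather than only ranking finished answers, which is where the gains over plain best-of-n come from.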
@MKhalifaaaa
Muhammad Khalifa
2 months
@vishaal_urao Interesting work! Unfortunately, I’m not at NeurIPS, but would be happy to chat regardless!
0
0
0
@MKhalifaaaa
Muhammad Khalifa
2 months
RT @CohereForAI: In our latest work, we ask “Can model merging help with task tradeoffs over models obtained from different training runs”?…
0
8
0
@MKhalifaaaa
Muhammad Khalifa
2 months
RT @mgalle: WAIT! Don't throw away that checkpoint you don't like. Give it a second life by ♻️ it. Who knows, it might be like our _waste…
0
2
0
@MKhalifaaaa
Muhammad Khalifa
2 months
Check out our paper here for more results and analysis! Work done with amazing people: @mgalle @yichern_tan @aahmadian_ @tomhosking @ahmetustun89 @tomsherborne @honglaklee @LuWang__
0
5
11
@MKhalifaaaa
Muhammad Khalifa
3 months
@_jasonwei It boils down to the level of abstraction. Yes, papers are more polished, but they are also written at a higher level of abstraction compared to the other formats you mentioned. Naturally, as you get closer to the details, things become messier.
0
0
1