Kunhao Zheng Profile
Kunhao Zheng

@KunhaoZ

Followers: 279 · Following: 271 · Statuses: 107

École Polytechnique X18, SJTU. Now in the amazing FAIR CodeGen team @AIatMeta. Alumni: @Huggingface, Sea AI Lab, intern @openai

Joined January 2019
@KunhaoZ
Kunhao Zheng
5 days
RT @KempeLab: PILAF (Policy-Interpolated Learning for Aligned Feedback): our response sampling scheme that provably aligns LLM preference l…
0
10
0
@KunhaoZ
Kunhao Zheng
5 days
This would not have been possible without the amazing teamwork and support from @feeelix_feng @ArielKwiatkowsk @KempeLab @YaqiDuanPKU and @syhw!
0
0
4
@KunhaoZ
Kunhao Zheng
5 days
RT @feeelix_feng: You think on-policy sampling gives the best reward models? Think again! 🔥 Our finding: Even with on-policy data, reward m…
0
39
0
@KunhaoZ
Kunhao Zheng
6 days
RT @zzlccc: 🚨There May Not be Aha Moment in R1-Zero-like Training: A common belief about the recent R1-Zero-like t…
0
68
0
@KunhaoZ
Kunhao Zheng
14 days
@shawnup Each problem comes with public and private tests. Yeah, this is to make sure that during training the code is actually correct, not just hacking the public tests with a bunch of if-else branches. For sure, we don’t expose any private test case information to the model when evaluating on the valid/test sets.
0
0
0
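A minimal sketch of the setup described in the reply above, assuming a simple stdin/stdout test format; `passes` and `reward` are illustrative names, not the team's actual harness:

```python
import subprocess
import sys

def passes(code: str, tests: list[tuple[str, str]]) -> bool:
    """Run a candidate program on (stdin, expected_stdout) pairs; pass iff all match."""
    for stdin, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin, capture_output=True, text=True, timeout=5,
            )
        except subprocess.TimeoutExpired:
            return False
        if proc.returncode != 0 or proc.stdout.strip() != expected.strip():
            return False
    return True

def reward(code: str, public_tests, private_tests) -> float:
    # The model's prompt only ever contains the public tests; grading also
    # runs the held-out private tests, so a "bunch of if-else" that merely
    # special-cases the visible inputs scores zero.
    return float(passes(code, public_tests) and passes(code, private_tests))
```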
@KunhaoZ
Kunhao Zheng
1 month
@natolambert It’s generated tests to differentiate code behavior, à la AlphaCode.
0
0
0
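A rough sketch of that AlphaCode-style idea: run every candidate program on a set of generated inputs and cluster the ones whose outputs agree; everything here is illustrative:

```python
from collections import defaultdict
import subprocess
import sys

def signature(code: str, gen_inputs: list[str]) -> tuple:
    """Fingerprint a program by its outputs on a set of generated inputs."""
    outs = []
    for stdin in gen_inputs:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin, capture_output=True, text=True, timeout=5,
            )
            outs.append(proc.stdout.strip() if proc.returncode == 0 else "<error>")
        except subprocess.TimeoutExpired:
            outs.append("<timeout>")
    return tuple(outs)

def cluster_by_behavior(candidates: list[str], gen_inputs: list[str]):
    """Two candidates land in the same cluster iff they agree on every input."""
    clusters = defaultdict(list)
    for code in candidates:
        clusters[signature(code, gen_inputs)].append(code)
    return clusters
```

AlphaCode then submits one program from each of the largest clusters, on the intuition that many independently sampled programs agreeing on all inputs is evidence of correct behavior.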
@KunhaoZ
Kunhao Zheng
2 months
@willccbb Some detours on offline methods like DPO. Also, codegen people (the single-turn codegen folks) and agent people (the SWE-Bench folks) were quite separated and didn't notice the role of multi-turn codegen until very recently, let alone bringing it to train time.
0
0
14
@KunhaoZ
Kunhao Zheng
2 months
RT @lae_teo: ✨Pretty stoked to announce we've been drafting an in-depth guide to PPO with LLMs! PPO looks like a mess from afar (with its 4…
0
6
0
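The truncated "4" presumably refers to the four models typically held in memory for PPO with LLMs: the policy, a frozen reference policy for the KL penalty, the reward model, and the value (critic) head. The core update itself is just the clipped surrogate; a bare-bones PyTorch sketch (names mine, not the guide's code):

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate over a batch of sampled tokens."""
    ratio = torch.exp(logp_new - logp_old)                 # pi_theta / pi_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Take the pessimistic (min) objective, negate to get a loss to minimize.
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()
```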
@KunhaoZ
Kunhao Zheng
2 months
I’ll be at #NeurIPS2024! Let’s chat about code generation, reasoning and RL, and life of course!
1
2
31
@KunhaoZ
Kunhao Zheng
4 months
@srush_nlp @justintchiu IIRC AlphaZero has a learned value function (not exactly a verifier, because a value function is tied to a policy, whereas a verifier should be independent of it).
0
0
2
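Spelling that distinction out (notation mine): a value function is defined relative to the specific policy that generated the data, while a verifier scores a finished solution with no policy in its definition.

```latex
V^{\pi}(s) = \mathbb{E}_{a_t \sim \pi}\!\left[ \sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s \right],
\qquad
v(x, y) \in \{0, 1\}
```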
@KunhaoZ
Kunhao Zheng
4 months
@srush_nlp Not really. ExIt and AlphaGo Zero were described at roughly the same time. ExIt uses MCTS as the policy improvement operator, but I think it's not restricted to that: you can train on any policy-in, policy-out operator that improves performance.
0
3
8
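A schematic of that loop; `improve` stands in for any policy-in, policy-out operator (MCTS in ExIt and AlphaGo Zero), and all names here are illustrative:

```python
from typing import Callable, TypeVar

Policy = TypeVar("Policy")

def expert_iteration(
    policy: Policy,
    improve: Callable[[Policy], Policy],          # e.g. wrap the policy in MCTS
    distill: Callable[[Policy, Policy], Policy],  # supervised step toward the expert
    rounds: int = 10,
) -> Policy:
    """Generic ExIt loop: alternate improvement and imitation."""
    for _ in range(rounds):
        expert = improve(policy)          # policy improvement operator
        policy = distill(policy, expert)  # imitate the stronger expert
    return policy
```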
@KunhaoZ
Kunhao Zheng
4 months
Work done with amazing @DecugisJuliette (joint first author), @jnsgehring, @TacoCohen, Benjamin Negrevergne, @syhw.
0
2
5