Kunhao Zheng Profile
Kunhao Zheng

@KunhaoZ

Followers: 279 · Following: 271 · Statuses: 107

École Polytechnique X18, SJTU. Now in the amazing FAIR CodeGen team @AIatMeta. Alumni: @Huggingface, Sea AI Lab, intern @openai

Joined January 2019
@KunhaoZ
Kunhao Zheng
5 days
RT @KempeLab: PILAF (Policy-Interpolated Learning for Aligned Feedback): our response sampling scheme that provably aligns LLM preference l…
0
10
0
@KunhaoZ
Kunhao Zheng
5 days
This would not have been possible without the amazing teamwork and support from @feeelix_feng @ArielKwiatkowsk @KempeLab @YaqiDuanPKU and @syhw!
0
0
4
@KunhaoZ
Kunhao Zheng
5 days
RT @feeelix_feng: You think on-policy sampling gives the best reward models? Think again! 🔥 Our finding: Even with on-policy data, reward m…
0
39
0
@KunhaoZ
Kunhao Zheng
6 days
RT @zzlccc: 🚨There May Not be Aha Moment in R1-Zero-like Training: A common belief about the recent R1-Zero-like t…
0
68
0
@KunhaoZ
Kunhao Zheng
14 days
@shawnup Each problem comes with public and private tests. Yeah, this is to make sure that during training the code is actually correct, not just hacking the public tests with a bunch of if-else branches. For sure, we don’t expose any private test case information to the model when evaluating on the valid/test sets.
0
0
0
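A minimal sketch of the setup described in the reply above, assuming a simple stdin/stdout test format; `passes` and `reward` are illustrative names, not the team's actual harness:

```python
import subprocess
import sys

def passes(code: str, tests: list[tuple[str, str]]) -> bool:
    """Run a candidate program on (stdin, expected_stdout) pairs; pass iff all match."""
    for stdin, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin, capture_output=True, text=True, timeout=5,
            )
        except subprocess.TimeoutExpired:
            return False
        if proc.returncode != 0 or proc.stdout.strip() != expected.strip():
            return False
    return True

def reward(code: str, public_tests, private_tests) -> float:
    # The model's prompt only ever contains the public tests; grading also
    # runs the held-out private tests, so a "bunch of if-else" that merely
    # special-cases the visible inputs scores zero.
    return float(passes(code, public_tests) and passes(code, private_tests))
```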
@KunhaoZ
Kunhao Zheng
1 month
@natolambert It’s generated tests to differentiate code behavior, à la AlphaCode.
0
0
0
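A rough sketch of that AlphaCode-style idea: run every candidate program on a set of generated inputs and cluster the ones whose outputs agree; everything here is illustrative:

```python
from collections import defaultdict
import subprocess
import sys

def signature(code: str, gen_inputs: list[str]) -> tuple:
    """Fingerprint a program by its outputs on a set of generated inputs."""
    outs = []
    for stdin in gen_inputs:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin, capture_output=True, text=True, timeout=5,
            )
            outs.append(proc.stdout.strip() if proc.returncode == 0 else "<error>")
        except subprocess.TimeoutExpired:
            outs.append("<timeout>")
    return tuple(outs)

def cluster_by_behavior(candidates: list[str], gen_inputs: list[str]):
    """Two candidates land in the same cluster iff they agree on every input."""
    clusters = defaultdict(list)
    for code in candidates:
        clusters[signature(code, gen_inputs)].append(code)
    return clusters
```

AlphaCode then submits one program from each of the largest clusters, on the intuition that many independently sampled programs agreeing on all inputs is evidence of correct behavior.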
@KunhaoZ
Kunhao Zheng
2 months
@willccbb Some detours on offline methods like DPO. Also, codegen people (the single-turn codegen folks) and agent people (the SWE-Bench folks) were quite separated and didn't notice the role of multi-turn codegen until very recently, let alone bringing it to train time.
0
0
14
@KunhaoZ
Kunhao Zheng
2 months
RT @lae_teo: ✨Pretty stoked to announce we've been drafting an in-depth guide to PPO with LLMs! PPO looks like a mess from afar (with its 4…
0
6
0
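The truncated "4" presumably refers to the four models typically held in memory for PPO with LLMs: the policy, a frozen reference policy for the KL penalty, the reward model, and the value (critic) head. The core update itself is just the clipped surrogate; a bare-bones PyTorch sketch (names mine, not the guide's code):

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate over a batch of sampled tokens."""
    ratio = torch.exp(logp_new - logp_old)                 # pi_theta / pi_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Take the pessimistic (min) objective, negate to get a loss to minimize.
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()
```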
@KunhaoZ
Kunhao Zheng
2 months
I’ll be at #NeurIPS2024! Let’s chat about code generation, reasoning and RL, and life of course!
1
2
31
@KunhaoZ
Kunhao Zheng
4 months
@srush_nlp @justintchiu IIRC AlphaZero has a learned value function (not exactly a verifier, because a value function is tied to a policy, whereas a verifier should be independent of it).
0
0
2
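Spelling that distinction out (notation mine): a value function is defined relative to the specific policy that generated the data, while a verifier scores a finished solution with no policy in its definition.

```latex
V^{\pi}(s) = \mathbb{E}_{a_t \sim \pi}\!\left[ \sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s \right],
\qquad
v(x, y) \in \{0, 1\}
```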
@KunhaoZ
Kunhao Zheng
4 months
@srush_nlp Not really. ExIt and AlphaGo Zero were described at roughly the same time. ExIt uses MCTS as the policy improvement operator, but I think it's not restricted to that: you can train on any policy-in, policy-out operator that improves performance.
0
3
8
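A schematic of that loop; `improve` stands in for any policy-in, policy-out operator (MCTS in ExIt and AlphaGo Zero), and all names here are illustrative:

```python
from typing import Callable, TypeVar

Policy = TypeVar("Policy")

def expert_iteration(
    policy: Policy,
    improve: Callable[[Policy], Policy],          # e.g. wrap the policy in MCTS
    distill: Callable[[Policy, Policy], Policy],  # supervised step toward the expert
    rounds: int = 10,
) -> Policy:
    """Generic ExIt loop: alternate improvement and imitation."""
    for _ in range(rounds):
        expert = improve(policy)          # policy improvement operator
        policy = distill(policy, expert)  # imitate the stronger expert
    return policy
```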
@KunhaoZ
Kunhao Zheng
4 months
Work done with amazing @DecugisJuliette (joint first author), @jnsgehring, @TacoCohen, Benjamin Negrevergne, @syhw.
0
2
5