![Weihao Zeng Profile](https://pbs.twimg.com/profile_images/1847565244010536960/yr0FdCnu_x96.jpg)
Weihao Zeng
@AndrewZeng17
Followers: 412 · Following: 1K · Statuses: 341
LLM Researcher | Incoming PhD @hkust @hkustNLP | Ex-intern @MSFTResearch @Meituan | Research on LLMs Reasoning
Hong Kong
Joined April 2021
@sybilhyz @Grad62304977 Very impressive! Is the step here referring to gradient step or rollout step?
RT @sivil_taram: 🚀 After 5 days of DeepSeek-R1, we’ve replicated its pure reinforcement learning magic on math reasoning — no reward models…
RT @junxian_he: We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on 7B model with only 8K examples, the results are surprisingly…
RT @rohanpaul_ai: B-STAR introduces dynamic balancing of exploration and exploitation during LLM self-improvement training, preventing perf…
@xpasky Thank you very much for your interpretation. We firmly believe that exploration and exploitation are key to achieving scalable RL, and we are researching more elegant methods to advance this!
🚀 Excited to share our latest research: B-STAR! 💡 Tackling the stagnation in self-improvement, we present a framework that dynamically balances exploration & exploitation, unlocking new potential in complex reasoning tasks.
RT @AndrewZeng17: 🚀 Excited to share our latest research: B-STAR! 💡 Tackling the stagnation in self-improvement, we present a framework th…
RT @WeiLiu99: 🔔🎄Christmas Gift for Multimodal Reasoning: Introducing M-STaR 🎁 (1/6) How can we dive deeper to help Large Multimodal Models…
RT @gm8xx8: B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
RT @xcjthu1: 1/4 🚀 Densing Law of LLMs 🚀 OpenAI's Scaling Law showed how model capabilities scale with size. But what about the trend towa…
RT @lilianweng: 🦃 At the end of Thanksgiving holidays, I finally finished the piece on reward hacking. Not an easy one to write, phew. Rew…
@SNAT02792153 Great job! Wondering if you have tried MIND to pretrain a larger model? Since Llama3-70B-Instruct, which is very powerful, is used to generate the conversations, it might act as a form of distillation. Have you considered using a less powerful model to generate the conversations instead?