![Lunjun Zhang Profile](https://pbs.twimg.com/profile_images/1841769601173995520/4EChVOXu_x96.jpg)
Lunjun Zhang
@LunjunZhang
850 Followers · 15K Following · 192 Statuses
CS PhD student @UofT. Ex-intern @GoogleDeepMind. Working on LLM self-improvement. Previously worked on self-driving.
Toronto, Canada
Joined September 2020
What if your reward model could "think" more and perform better? Even better, what if your LLM policy could also be used as a reward model? Introducing GenRM: reward models trained as next-token predictors rather than old-fashioned classifier RMs. This enables things that weren't possible before:
🔗 Chain-of-Thought reasoning for RMs
🚀 Leveraging test-time compute
🌐 A single policy + reward model
[1/N]
4 replies · 45 reposts · 276 likes
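A minimal sketch of the idea, assuming the reward is read off as the probability of a "Yes" verdict token under a verification prompt (the model, prompt wording, and verdict tokens below are illustrative placeholders, not the paper's exact setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model: GenRM fine-tunes an ordinary causal LM, so any one works here.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def genrm_score(question: str, answer: str) -> float:
    """Reward modeling as next-token prediction: score a (question, answer)
    pair by the probability the verifier LM assigns to a 'Yes' verdict."""
    prompt = f"{question}\n{answer}\nIs the answer correct?"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    # Normalize over the two verdict tokens so the reward lies in [0, 1].
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()

print(genrm_score("What is 2 + 2?", "4"))
```

Because the verifier is just a language model, extra test-time compute can be spent by sampling several chain-of-thought verifications and averaging their Yes-probabilities, rather than making a single classifier call.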
Optimizing LLM Test-Time Compute Involves Solving a Meta-RL Problem. Check out our new blog below:
How can we train LLMs to solve complex challenges beyond just data scaling? In a new blog post, @setlur_amrith, @QuYuxiao, Matthew Yang, @LunjunZhang, @gingsmith, and @aviral_kumar2 demonstrate that meta-RL can help LLMs better optimize test-time compute.
3 replies · 12 reposts · 81 likes
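The framing can be sketched in symbols (the notation below is mine, not lifted from the post): view a single output stream as $k$ intermediate "episodes" $z_1, \dots, z_k$ (attempts, verifications, revisions) with only the final answer $y$ rewarded,

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\; \mathbb{E}_{z_{1:k} \sim \pi_\theta(\cdot \mid x)} \big[\, r\big(x,\, y(z_{1:k})\big) \,\big].
$$

Because later episodes can condition on earlier ones, the optimal $\pi_\theta$ is an adaptation procedure executed at test time rather than a direct input-to-answer mapping, which is precisely the structure of a meta-RL problem.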
@alexcdot @GPTZeroAI Congrats on the launch of the new feature! Very interesting to test out its decision boundaries.
1 reply · 0 reposts · 1 like
When the thousand years are over, Claude will be released from his prison and will go out to deceive the nations in the four corners of the earth—Gog and Magog—and to gather them for battle. In number they are like the sand on the seashore
In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning.
0 replies · 0 reposts · 3 likes
Interested in inference-time compute scaling for language models? If you're at #NeurIPS2024, come to the MATH-AI workshop (West Meeting Room 118-120) at 11am today to check out our work on Generative Verifiers!
2 replies · 7 reposts · 56 likes
the GAN paper was written within one week (?!). incredible.
I was working on something similar, training with a non-neural discriminator, which was nowhere close to working. I decided to help Ian with GANs instead. The NeurIPS 2014 submission deadline was in a week. We decided that if we sprint, we could submit a paper.
0 replies · 1 repost · 7 likes
There is finally a blogpost showing that diffusion with the DDIM sampler is exactly the same as a flow matching sampler. Next, someone should write a blogpost about how generalized advantage estimation (GAE) is exactly the same as the TD(λ) return minus a value baseline, derived back in the 90s.
Blog post link: Despite seeming similar, there is some confusion in the community about the exact connection between the two frameworks. We aim to clear up the confusion by showing how to convert one framework to another, for both training and sampling. We hope this helps practitioners understand the true degrees of freedom when tuning the algorithm. For example, a flow matching sampler doesn’t have to be deterministic.
0 replies · 0 reposts · 3 likes
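For reference, the GAE identity alluded to above follows directly from the standard definitions (notation as in Schulman et al.'s GAE paper). With TD error

$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l},
$$

the λ-return satisfies $G_t^{\lambda} = V(s_t) + \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$, hence $\hat{A}_t^{\mathrm{GAE}} = G_t^{\lambda} - V(s_t)$: GAE is exactly the TD(λ) return with the value baseline subtracted.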
@pmarca “The curious task of economics is to demonstrate to men how little they really know about what they imagine they can design.” - Friedrich A. Hayek
0 replies · 0 reposts · 0 likes
RT @nabeelqu: Imagine telling the safety-concerned, effective altruist founders of Anthropic in 2021 that a mere three years after founding…
0 replies · 53 reposts · 0 likes