![will brown Profile](https://pbs.twimg.com/profile_images/1851112754439974912/CeTvIgQ4_x96.jpg)
will brown
@willccbb
Followers
12K
Following
18K
Statuses
3K
ai research @morganstanley | prev phd @columbia bs/ms @penn
nyc
Joined February 2015
@teortaxesTex my interpretation of the technique in this paper is that you basically want more momentum and/or higher effective batch size. a gradient only holds so many bits; increasing the # of examples it needs to learn about focuses those bits towards generalization
0
0
5
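As a rough illustration of the "higher effective batch size" point, here is a minimal gradient-accumulation sketch; the model, data, and accumulation factor are placeholder assumptions, not anything from the tweet or the paper it discusses.

```python
import torch
import torch.nn as nn

# toy placeholder setup: a linear model on random data
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # "more momentum"
loss_fn = nn.CrossEntropyLoss()

xs = torch.randn(512, 128)
ys = torch.randint(0, 10, (512,))
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(xs, ys), batch_size=8
)

accum_steps = 8  # effective batch size = 8 (micro-batch) * 8 (accumulation) = 64

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # average loss over the accumulated micro-batches
    loss.backward()                            # gradients sum across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # one update now reflects many more examples
        optimizer.zero_grad()
```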
@boazbaraktcs @jeremyphoward the exact opposite is true -- inference-time compute is provably sufficient to solve all problems solvable by any circuit, with steps scaling linearly in circuit size, given constant depth + log(size) embedding dim
2
0
2
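My paraphrase of the claim being referenced, stated loosely; the exact constants and construction depend on which result is meant, so treat this as an assumption about the intended statement rather than a quote:

```latex
% informal paraphrase, not a quote from any specific paper
\text{For every boolean circuit } C \text{ on } n \text{ inputs, there exists a transformer } T \text{ with} \\
\mathrm{depth}(T) = O(1), \qquad d_{\text{model}} = O(\log |C|), \\
\text{such that for all } x \in \{0,1\}^n,\; T \text{ run with } O(|C|) \text{ chain-of-thought steps outputs } C(x).
```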
@signalgaining the point isn't about using models to do rote calculations, it's about a more general paradigm of learning to solve increasingly hard problems without needing tons of solution data
1
0
3
@wordgrammer it would be pretty unusual if the CEO of a trillion dollar company was 20 years old
2
0
28
@Joanvelja that's a good point, if 10x-ing (?) RL lets you 0.1x test-time compute while still improving accuracy in general, that'd be awesome
1
0
3
RT @leonardtang_: i've been entirely consumed these past few weeks by the LLM-as-a-judge research agenda. there's lots of great work, but…
0
17
0
@Joanvelja they're still sampling 1K solutions for IOI and submitting 50. how good is o3 if you sample + submit 1 solution?
2
0
3
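For context on why the submission budget matters, here is a minimal sketch of the standard unbiased pass@k estimator; the sample counts below are purely illustrative and are not o3's actual numbers.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct, k submitted."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# illustrative only: 1000 samples drawn, 40 happen to be correct
print(pass_at_k(1000, 40, 50))  # generous budget: submit 50 of the 1000
print(pass_at_k(1000, 40, 1))   # strict budget: submit a single sample
```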
@StateSpeed_AB @casper_hansen_ different algorithms that do different things. GRPO is training on many distinct rollouts per prompt + computing relative advantages; sample count isn’t apples-to-apples for “training tokens”
1
0
0
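A minimal sketch of the group-relative advantage idea behind GRPO as described in the tweet: sample several rollouts per prompt, score them, and standardize each reward within its own group. The function and variable names are my own, assuming one scalar reward per rollout.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, rollouts_per_prompt) tensor of scalar rewards.
    Each rollout's advantage is its reward standardized against its prompt's group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 prompts, 4 rollouts each
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.5]])
advantages = group_relative_advantages(rewards)
```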