![Varun Gangal Profile](https://pbs.twimg.com/profile_images/993890716878278656/0St0pJ66_x96.jpg)
Varun Gangal
@VarunGangal
Followers: 1K
Following: 7K
Statuses: 1K
AI Researcher @amazon AGI; @asapp (22-24); PhD CMU LTI (2017-22); IIT-M CSE (2011-16). RT / bookmark ≠ endorsement. Views personal, not of employers.
New York City
Joined January 2012
Excited to see the Humanity's Last Exam dataset, paper & repo release! Was fun crafting some hard problems for this in collab w/ @stevenyfeng, @boson2photon & others [names in image] at the end of '24, 3 of which got into the final benchmark. Thanks @DanHendrycks and others at @scale_AI + for the effort & for creating the aegis and the chance to contribute!
We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning. State-of-the-art AIs get <10% accuracy and are highly overconfident. @ai_risk @scaleai
@sedrickkeh2 Congrats, this is insightful!! Thanks for releasing the 173K unverified traces (pre judge/code-exec filtering) too...
@himanshustwts I wish there were a way to do COCONUT (<litethinking>?) when you need some reasonable amount of thinking to be enabled but don't need to read the traces explicitly and don't want the added latency either.
RT @theandrewsiah: playing with @sksq96 @VarunGangal thanks to @willccbb @abacaj for their gists and help in setting up dm/reply if you w…
@Dorialexander @willccbb @theandrewsiah Yeah, not that doing it off-policy would make it invalid either; in fact it would be great to see whether this variation also works, and how much and how differently [if there is a distinction at all, of which I am not sure rn..]
Wondering (will try out) if it would have learnt [by virtue of poetry RL training] to do figurative language edits, e.g. personification, too (something @sedrickkeh2, @stevenyfeng & I had made a mini-corpus for & explored generating a long while ago w/ BART etc. at COLING'22).
But weren't there two of them? [AFAIK the old one and the new one were both nice, though ofc the new one is better.] (Though I guess even with that it's possible to upscale [MoE-ization etc.] or bootstrap in other ways a good model out of another good model, so it doesn't make the one-lucky-run hypothesis unreasonable.)
W.r.t. budgeting mem use to avoid OOMs, I found @Dorialexander's Colab version [based on the same original gist by @willccbb that @abacaj is using a variant of] very handy: it typically dodges any OOMs [without PEFT] on an A100 in Colab with Qwen 0.5B Instruct [of course, if you move up to higher param counts or increase the max_tokens from 200, you may hit them].
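For context, a minimal sketch of what a memory-conscious GRPO setup of this kind can look like with TRL's GRPOTrainer; this is not @Dorialexander's actual Colab or @willccbb's gist, and the placeholder reward function and dataset below are assumptions:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward: favors longer completions; a real run would use a task-specific reward.
def reward_len(completions, **kwargs):
    return [float(len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="qwen05b-grpo",
    per_device_train_batch_size=1,   # tiny rollout batch keeps activation memory low
    gradient_accumulation_steps=4,   # recover a usable effective batch size
    num_generations=4,               # fewer samples per prompt -> smaller generation footprint
    max_prompt_length=256,
    max_completion_length=200,       # the ~200 max_tokens knob mentioned above
    gradient_checkpointing=True,     # trade compute for activation memory
    bf16=True,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
```

Raising max_completion_length or moving to a larger base model is exactly where the memory headroom on a single A100 starts to disappear.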
RT @simonw: o3-mini is really good at writing internal documentation - feed it a codebase, get back a detailed explanation of how specific…
I think the answer also depends on [amongst other considerations] whether the reward function has any shared parameters / arch components with the policy - if it doesn't, the reward function itself is a form of supervision. If it does, e.g. like in the self-rewarding language models setup of Yuan et al. (where a shared LLM underlies both the reward function and the policy), the reward model is not a form of supervision [or at least is less so..]
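To make that distinction concrete, a minimal sketch of the two cases; the model names and the judge prompt are illustrative assumptions, not the exact setup from Yuan et al.:

```python
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

POLICY_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder policy being trained
SEPARATE_RM_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # independent reward model

tokenizer = AutoTokenizer.from_pretrained(POLICY_NAME)
policy = AutoModelForCausalLM.from_pretrained(POLICY_NAME)

# Case 1: a separate reward model -- its own trained parameters inject external supervision.
rm_tok = AutoTokenizer.from_pretrained(SEPARATE_RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(SEPARATE_RM_NAME)

def external_reward(prompt: str, completion: str) -> float:
    inputs = rm_tok(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

# Case 2: self-rewarding style -- the *same* policy LLM judges its own output,
# so no parameters beyond the policy's own are involved.
JUDGE_TEMPLATE = ("Rate the following answer to the question on a scale of 1-5.\n"
                  "Question: {q}\nAnswer: {a}\nScore:")

def self_reward(prompt: str, completion: str) -> float:
    judge_prompt = JUDGE_TEMPLATE.format(q=prompt, a=completion)
    ids = tokenizer(judge_prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = policy.generate(ids, max_new_tokens=4, do_sample=False)
    text = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    digits = [c for c in text if c.isdigit()]
    return float(digits[0]) if digits else 0.0
```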