![hr0nix Profile](https://pbs.twimg.com/profile_images/343187016/donkihot_x96.jpg)
hr0nix
@hr0nix
Followers: 619
Following: 3K
Statuses: 3K
🦾 Head of AI R&D at @nebiusai 💻 Ex @Yandex, @MSFTResearch, @CHARMTherapeutx 🧠 Interested in (M|D|R)L, AGI, rev. Bayes 🤤 Opinions stupid but my own
London, UK
Joined March 2009
I've added some characterizations for very practical cases, like using threshold T = 0 (i.e. any successful trajectory) or using binary terminal rewards but with a non-unit discount factor. If you have ideas about other simple cases of practical interest, let me know; maybe something can be said about them.
@chopwatercarry Btw, if the reward is sparse and binary but there is a discount factor < 1, the result still applies: RFT in general won't give you an improvement in discounted reward.
Someone should tell John about the ladder of causality and hidden confounders.
Offline reinforcement learning, where an agent tries to improve a behavior policy by observing another agent without actually playing, is a harder problem than it appears. The challenge isn't to mimic the provided play, but to learn something better than what you have seen. The difference between online (traditional) RL and offline RL is that online RL is constantly "testing" its model by taking new actions as the model changes, while offline training can bootstrap itself into a coherent fantasy of great returns, untested by reality. It may be just an artifact of value-based RL in particular, but I am inclined to believe it is a more fundamental truth about theoretical and observational science versus experimental science, and life in general.
Pretty cool, might be useful for value/reward prediction
Is it principled to train w/ next-token prediction over numeric strings, e.g. "1.23"? Yes! Decoding can be just as good as regular pointwise heads for regression, but you also get density estimation for free. This is our last paper closing the loop on investigating LLMs for regression: all evidence points to language models being just as capable as traditional methods on tabular data, but 10x more flexible, using text to represent general input formats. arXiv: w/ @dara_bahri Code:
Did you see ? We went the fully automated route that prioritises precision over recall: implement heuristics for the most common cases, drop everything where they fail. We plan on scaling this effort significantly: more languages, more test frameworks, more supported ways to declare dependencies. Would be cool to have 100k to 1M real-world SWE problems this year.
Congrats to STaR authors! Their opaque tweets demonstrate that they've (independently) discovered some of the ideas revealed to me in my sleep by the Divine.
Congrats to OAI on producing a reasoning model! Their opaque tweets demonstrate that they’ve (independently) found some of the core ideas that we did on our way to STaR.
@brad19brown You might also find this very relevant:
Can open-weight models match frontier LLM performance on SWE-bench? They can if you equip them with search! We've been studying how guided search can improve SWE agents, and built an SWE-agent-based system that scores 40.6% on SWE-bench Verified using only open-weight models. 🧵