hr0nix

@hr0nix

Followers
619
Following
3K
Statuses
3K

🦾 Head of AI R&D at @nebiusai 💻 Ex @Yandex, @MSFTResearch, @CHARMTherapeutx 🧠 Interested in (M|D|R)L, AGI, rev. Bayes 🤤 Opinions stupid but my own

London, UK
Joined March 2009
@hr0nix
hr0nix
2 days
Went to Bill Gates’s website for non-fiction recommendations, and it says there that his favourite book on AI is The Coming Wave. Kinda ruined it for me.
1
0
3
@hr0nix
hr0nix
2 days
I've added characterizations for some very practical cases, like using threshold T=0 (i.e. keeping any successful trajectory) or using binary terminal rewards with a non-unit discount factor. If you have ideas about other simple cases of practical interest, let me know; maybe something can be said about them.
0
0
1
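(For concreteness, the threshold-T filtering discussed above is just the following; a minimal sketch, with a trajectory format and names that are mine, not the note's.)

```python
# Minimal sketch of reward-thresholded fine-tuning data selection.
# The trajectory layout and names are hypothetical, not from the note.
trajectories = [
    {"tokens": "...", "return": 1.0},
    {"tokens": "...", "return": 0.0},
    {"tokens": "...", "return": 0.3},
]

T = 0.0  # with binary rewards, T = 0 means "keep any successful trajectory"
fine_tuning_set = [traj for traj in trajectories if traj["return"] > T]

# The kept trajectories are then used for ordinary supervised fine-tuning
# (maximum likelihood on their actions/tokens).
print(len(fine_tuning_set))  # -> 2
```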
@hr0nix
hr0nix
4 days
@chatgpt21 @GaryMarcus you know why
0
0
0
@hr0nix
hr0nix
6 days
@chopwatercarry Btw, if the reward is sparse binary, but there is a discount factor < 1, the result still applies: RFT in general won't give you an improvement in discounted reward.
0
0
0
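(To spell out the discounted case, in my notation rather than necessarily the note's: with a binary reward paid only at the final step T(τ),

$$G(\tau) \;=\; \sum_{t=0}^{T(\tau)} \gamma^{t} r_t \;=\; \gamma^{T(\tau)}\,\mathbf{1}[\tau\ \text{succeeds}],$$

so two successful trajectories can have very different discounted returns depending on how long they took, while success-only filtering treats them identically. That's the intuition for why no improvement in discounted reward is guaranteed.)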
@hr0nix
hr0nix
9 days
@how_uhh No, my understanding is that AWR/AWAC guarantees policy improvement in the KL-regularized MDP. I could write another note, but I think this has been covered sufficiently in the literature; e.g., every DPO paper restates the result.
0
0
3
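(For reference, the result I have in mind is the standard one, stated here from memory and in my notation: at every state, the policy

$$\pi^*(a \mid s) \;\propto\; \pi_{\text{ref}}(a \mid s)\,\exp\!\big(A^{\pi_{\text{ref}}}(s,a)/\beta\big)$$

maximizes $\mathbb{E}_{a\sim\pi}\big[A^{\pi_{\text{ref}}}(s,a)\big] - \beta\,\mathrm{KL}\big(\pi(\cdot \mid s)\,\|\,\pi_{\text{ref}}(\cdot \mid s)\big)$. AWR/AWAC fit this target by advantage-weighted regression, and the DPO line of work restates the same closed form with a reward in place of the advantage.)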
@hr0nix
hr0nix
9 days
If the formulas look broken, try refreshing the page. I really should move these notes to a better place; math formatting on GitHub Pages is so broken.
0
0
3
@hr0nix
hr0nix
9 days
Some time ago I tweeted that fine-tuning on positive trajectories should not be used in stochastic environments, as it can misattribute lucky successes to actions that had nothing to do with them. When asked for an explanation, I set out to write a note, and... ⬇️
1
0
0
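(A toy illustration of the failure mode, my own example rather than the one in the note: filtering on positive reward keeps only the lucky rollouts of a bad action, and cloning them makes the policy worse.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step environment (my own example, not from the note):
#   "safe"  -> reward 0 always
#   "risky" -> reward +1 with prob 0.05, reward -100 otherwise
def reward(action: str) -> float:
    if action == "safe":
        return 0.0
    return 1.0 if rng.random() < 0.05 else -100.0

# Roll out a uniform behavior policy.
actions = rng.choice(["safe", "risky"], size=100_000)
rewards = np.array([reward(a) for a in actions])

# "Fine-tune on positive trajectories": keep rollouts with reward > 0.
# Only the lucky "risky" rollouts survive the filter.
kept = actions[rewards > 0]
cloned_policy = {a: float(np.mean(kept == a)) for a in ["safe", "risky"]}

def expected_reward(p):  # true expected reward under a policy
    return p["safe"] * 0.0 + p["risky"] * (0.05 * 1.0 + 0.95 * -100.0)

print(cloned_policy)                               # ~{'safe': 0.0, 'risky': 1.0}
print(expected_reward({"safe": 0.5, "risky": 0.5}),
      expected_reward(cloned_policy))              # ~-47.5 vs ~-94.95: got worse
```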
@hr0nix
hr0nix
9 days
@gneubig I have recently taken a look at it, and it's an interesting one: it yields a guaranteed improvement in some settings and can fail miserably in one of the settings it was originally proposed for (thresholding rewards):
0
0
1
@hr0nix
hr0nix
9 days
@andersonbcdefg Are you changing your mind every batch depending on where the loss goes?
0
0
6
@hr0nix
hr0nix
10 days
Someone should tell John about the ladder of causality and hidden confounders.
@ID_AA_Carmack
John Carmack
10 days
Offline reinforcement learning, where an agent tries to improve a behavior policy by observing another agent without actually playing, is a harder problem than it appears. The challenge isn’t to mimic the provided play, but to learn something better than what you have seen. The difference between online (traditional) RL and offline RL is that online RL is constantly "testing" its model by taking new actions as a result of changes to the model, while the offline training can bootstrap itself off into a coherent fantasy of great returns untested by reality. It may be just an artifact of value based RL in particular, but I am inclined to believe that it is a more fundamental truth about theoretical and observational science versus experimental science, and life in general.
0
0
0
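(The "coherent fantasy" has a standard formal face; this is the textbook view of offline value-based RL, not a claim about what John had in mind. The bootstrapped backup

$$Q(s,a) \;\leftarrow\; r + \gamma \max_{a'} Q(s', a')$$

maximizes over actions the dataset may never contain, so overestimation errors at out-of-distribution actions feed back into the targets and are never corrected by fresh experience; online RL fixes them precisely because it gets to try those actions.)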
@hr0nix
hr0nix
10 days
Pretty cool, might be useful for value/reward prediction
@XingyouSong
Richard Song
11 days
Is it principled to train w/ next-token prediction over numeric strings, e.g. "1.23"? Yes! Decoding can be just as good as regular pointwise heads for regression, but you also get density estimation for free. This is our last paper, closing the loop on investigating LLMs for regression: all evidence points to language models being just as capable as traditional methods on tabular data, but 10x more flexible, using text to represent general input formats. arXiv: w/ @dara_bahri Code:
0
0
2
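(A minimal sketch of what decoding-based regression buys you; the "model" here is a fake sampler standing in for an LLM, just to show that sampled numeric strings give both a point estimate and a density estimate.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an LLM that, given a tabular-row prompt, decodes a numeric
# string token by token. Faked with noisy draws around a fixed value purely
# to illustrate the interface; not the paper's model or code.
def sample_numeric_strings(prompt: str, n_samples: int = 256) -> list[str]:
    learned_value = 1.23
    return [f"{rng.normal(learned_value, 0.1):.2f}" for _ in range(n_samples)]

samples = np.array([float(s) for s in sample_numeric_strings("x1=0.4, x2=7 -> y=")])
point_estimate = np.median(samples)                 # usable like a pointwise head
density, bin_edges = np.histogram(samples, bins=30, density=True)  # density "for free"
print(round(float(point_estimate), 2))
```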
@hr0nix
hr0nix
14 days
@giffmana Never lasts
0
0
2
@hr0nix
hr0nix
15 days
Did you see ? We went the fully automated route that prioritises precision over recall: implement heuristics for the most common cases, drop everything where they fail. We plan on scaling this effort significantly: more languages, more test frameworks, more supported ways to declare dependencies. Would be cool to have 100k to 1M real-world SWE problems this year.
1
0
7
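(The precision-over-recall idea in miniature; these heuristics and names are made up for illustration, not the actual pipeline.)

```python
# Apply narrow heuristics and silently drop any repository they don't cover,
# rather than guessing: precision over recall.
def detect_test_command(repo_files: set[str]) -> str | None:
    if "pytest.ini" in repo_files or "tox.ini" in repo_files:
        return "pytest"
    if "package.json" in repo_files:
        return "npm test"
    return None  # heuristic failed -> drop the repo

repos = [
    {"name": "a", "files": {"pytest.ini", "setup.py"}},
    {"name": "b", "files": {"Makefile"}},  # not covered -> dropped
]
kept = [r for r in repos if detect_test_command(r["files"]) is not None]
print([r["name"] for r in kept])  # -> ['a']
```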
@hr0nix
hr0nix
15 days
Congrats to STaR authors! Their opaque tweets demonstrate that they've (independently) discovered some of the ideas revealed to me in my sleep by the Divine.
@noahdgoodman
noahdgoodman
16 days
Congrats to OAI on producing a reasoning model! Their opaque tweets demonstrate that they’ve (independently) found some of the core ideas that we did on our way to STaR.
1
0
7
@hr0nix
hr0nix
16 days
@brad19brown You might also find this very relevant:
@hr0nix
hr0nix
3 months
Can open-weight models match frontier LLM performance on SWE-bench? They can if you equip them with search! We've been studying how guided search can improve SWE agents, and built an SWE-agent-based system that scores 40.6% on SWE-Bench Verified using only open-weight models. 🧵
0
0
0
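(The general shape of critic-guided search over agent steps, sketched with made-up stand-ins; the actual system, scaffolding, and search procedure are described in the thread, not here.)

```python
import random

random.seed(0)

# Hypothetical stand-ins: a proposer that samples candidate next actions for
# the agent, and a learned critic that scores partial trajectories.
def propose_actions(state: str, k: int = 4) -> list[str]:
    return [f"{state} -> cand{i}({random.random():.2f})" for i in range(k)]

def critic_score(candidate: str) -> float:
    # stand-in for a value model trained to predict eventual patch success
    return float(candidate.rsplit("(", 1)[-1].rstrip(")"))

state = "issue#123"
for _ in range(3):
    candidates = propose_actions(state)
    state = max(candidates, key=critic_score)  # keep the step the critic prefers
print(state)
```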
@hr0nix
hr0nix
19 days
@softwarevlogger A straight-A student would know how to spell the word «троечник» ("C-grade student").
1
0
14