hr0nix

@hr0nix

Followers
619
Following
3K
Statuses
3K

🦾 Head of AI R&D at @nebiusai 💻 Ex @Yandex, @MSFTResearch, @CHARMTherapeutx 🧠 Interested in (M|D|R)L, AGI, rev. Bayes 🤤 Opinions stupid but my own

London, UK
Joined March 2009
@hr0nix
hr0nix
2 days
Went to Bill Gates’s website for non-fiction recommendations, and it says there that his favourite book on AI is The Coming Wave. Kinda ruined it for me.
1
0
3
@hr0nix
hr0nix
2 days
I've added characterizations for some very practical cases, like using threshold T=0 (i.e. keeping any successful trajectory) or using binary terminal rewards with a non-unit discount factor. If you have ideas about other simple cases of practical interest, let me know; maybe something can be said about them.
0
0
1
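(For concreteness, the threshold-T filtering discussed above is just the following; a minimal sketch, with a trajectory format and names that are mine, not the note's.)

```python
# Minimal sketch of reward-thresholded fine-tuning data selection.
# The trajectory layout and names are hypothetical, not from the note.
trajectories = [
    {"tokens": "...", "return": 1.0},
    {"tokens": "...", "return": 0.0},
    {"tokens": "...", "return": 0.3},
]

T = 0.0  # with binary rewards, T = 0 means "keep any successful trajectory"
fine_tuning_set = [traj for traj in trajectories if traj["return"] > T]

# The kept trajectories are then used for ordinary supervised fine-tuning
# (maximum likelihood on their actions/tokens).
print(len(fine_tuning_set))  # -> 2
```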
@hr0nix
hr0nix
4 days
@chatgpt21 @GaryMarcus you know why
0
0
0
@hr0nix
hr0nix
6 days
@chopwatercarry Btw, if the reward is sparse binary, but there is a discount factor < 1, the result still applies: RFT in general won't give you an improvement in discounted reward.
0
0
0
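(To spell out the discounted case, in my notation rather than necessarily the note's: with a binary reward paid only at the final step T(τ),

$$G(\tau) \;=\; \sum_{t=0}^{T(\tau)} \gamma^{t} r_t \;=\; \gamma^{T(\tau)}\,\mathbf{1}[\tau\ \text{succeeds}],$$

so two successful trajectories can have very different discounted returns depending on how long they took, while success-only filtering treats them identically. That's the intuition for why no improvement in discounted reward is guaranteed.)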
@hr0nix
hr0nix
9 days
@how_uhh No, my understanding is that AWR/AWAC guarantees policy improvement in the KL-regularized MDP. I could write another note, but I think this has been covered sufficiently in the literature; e.g., every DPO paper restates the result.
0
0
3
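(For reference, the result I have in mind is the standard one, stated here from memory and in my notation: at every state, the policy

$$\pi^*(a \mid s) \;\propto\; \pi_{\text{ref}}(a \mid s)\,\exp\!\big(A^{\pi_{\text{ref}}}(s,a)/\beta\big)$$

maximizes $\mathbb{E}_{a\sim\pi}\big[A^{\pi_{\text{ref}}}(s,a)\big] - \beta\,\mathrm{KL}\big(\pi(\cdot \mid s)\,\|\,\pi_{\text{ref}}(\cdot \mid s)\big)$. AWR/AWAC fit this target by advantage-weighted regression, and the DPO line of work restates the same closed form with a reward in place of the advantage.)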
@hr0nix
hr0nix
9 days
If the formulas look broken, try refreshing the page. I really should move these notes to a better place; math formatting on GitHub Pages is so broken.
0
0
3
@hr0nix
hr0nix
9 days
Some time ago I tweeted that fine-tuning on positive trajectories should not be used in stochastic environments, as it can misattribute lucky successes to actions that had nothing to do with them. When asked for an explanation, I set out to write a note, and... ⬇️
1
0
0
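(A toy illustration of the failure mode, my own example rather than the one in the note: filtering on positive reward keeps only the lucky rollouts of a bad action, and cloning them makes the policy worse.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step environment (my own example, not from the note):
#   "safe"  -> reward 0 always
#   "risky" -> reward +1 with prob 0.05, reward -100 otherwise
def reward(action: str) -> float:
    if action == "safe":
        return 0.0
    return 1.0 if rng.random() < 0.05 else -100.0

# Roll out a uniform behavior policy.
actions = rng.choice(["safe", "risky"], size=100_000)
rewards = np.array([reward(a) for a in actions])

# "Fine-tune on positive trajectories": keep rollouts with reward > 0.
# Only the lucky "risky" rollouts survive the filter.
kept = actions[rewards > 0]
cloned_policy = {a: float(np.mean(kept == a)) for a in ["safe", "risky"]}

def expected_reward(p):  # true expected reward under a policy
    return p["safe"] * 0.0 + p["risky"] * (0.05 * 1.0 + 0.95 * -100.0)

print(cloned_policy)                               # ~{'safe': 0.0, 'risky': 1.0}
print(expected_reward({"safe": 0.5, "risky": 0.5}),
      expected_reward(cloned_policy))              # ~-47.5 vs ~-94.95: got worse
```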
@hr0nix
hr0nix
9 days
@gneubig I have recently taken a look at it, and it's an interesting one: it yields a guaranteed improvement in some settings and can fail miserably in one of the settings it was originally proposed for (thresholding rewards):
0
0
1
@hr0nix
hr0nix
9 days
@andersonbcdefg Are you changing your mind every batch depending on where the loss goes?
0
0
6
@hr0nix
hr0nix
10 days
Someone should tell John about the ladder of causality and hidden confounders.
@ID_AA_Carmack
John Carmack
10 days
Offline reinforcement learning, where an agent tries to improve a behavior policy by observing another agent without actually playing, is a harder problem than it appears. The challenge isn’t to mimic the provided play, but to learn something better than what you have seen. The difference between online (traditional) RL and offline RL is that online RL is constantly "testing" its model by taking new actions as a result of changes to the model, while the offline training can bootstrap itself off into a coherent fantasy of great returns untested by reality. It may be just an artifact of value based RL in particular, but I am inclined to believe that it is a more fundamental truth about theoretical and observational science versus experimental science, and life in general.
0
0
0
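(The "coherent fantasy" has a standard formal face; this is the textbook view of offline value-based RL, not a claim about what John had in mind. The bootstrapped backup

$$Q(s,a) \;\leftarrow\; r + \gamma \max_{a'} Q(s', a')$$

maximizes over actions the dataset may never contain, so overestimation errors at out-of-distribution actions feed back into the targets and are never corrected by fresh experience; online RL fixes them precisely because it gets to try those actions.)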
@hr0nix
hr0nix
10 days
Pretty cool, might be useful for value/reward prediction
@XingyouSong
Richard Song
11 days
Is it principled to train w/ next-token prediction over numeric strings, e.g. "1.23"? Yes! Decoding can be just as good as regular pointwise heads for regression, but you also get density estimation for free. This is our last paper, closing the loop on investigating LLMs for regression: all evidence points to language models being just as capable as traditional methods on tabular data, but 10x more flexible, using text to represent general input formats. arXiv: w/ @dara_bahri Code:
0
0
2
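(A minimal sketch of what decoding-based regression buys you; the "model" here is a fake sampler standing in for an LLM, just to show that sampled numeric strings give both a point estimate and a density estimate.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an LLM that, given a tabular-row prompt, decodes a numeric
# string token by token. Faked with noisy draws around a fixed value purely
# to illustrate the interface; not the paper's model or code.
def sample_numeric_strings(prompt: str, n_samples: int = 256) -> list[str]:
    learned_value = 1.23
    return [f"{rng.normal(learned_value, 0.1):.2f}" for _ in range(n_samples)]

samples = np.array([float(s) for s in sample_numeric_strings("x1=0.4, x2=7 -> y=")])
point_estimate = np.median(samples)                 # usable like a pointwise head
density, bin_edges = np.histogram(samples, bins=30, density=True)  # density "for free"
print(round(float(point_estimate), 2))
```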
@hr0nix
hr0nix
14 days
@giffmana Never lasts
0
0
2
@hr0nix
hr0nix
15 days
Did you see ? We went the fully automated route that prioritises precision over recall: implement heuristics for the most common cases, drop everything where they fail. We plan on scaling this effort significantly: more languages, more test frameworks, more supported ways to declare dependencies. Would be cool to have 100k to 1M real-world SWE problems this year.
1
0
7
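(The precision-over-recall idea in miniature; these heuristics and names are made up for illustration, not the actual pipeline.)

```python
# Apply narrow heuristics and silently drop any repository they don't cover,
# rather than guessing: precision over recall.
def detect_test_command(repo_files: set[str]) -> str | None:
    if "pytest.ini" in repo_files or "tox.ini" in repo_files:
        return "pytest"
    if "package.json" in repo_files:
        return "npm test"
    return None  # heuristic failed -> drop the repo

repos = [
    {"name": "a", "files": {"pytest.ini", "setup.py"}},
    {"name": "b", "files": {"Makefile"}},  # not covered -> dropped
]
kept = [r for r in repos if detect_test_command(r["files"]) is not None]
print([r["name"] for r in kept])  # -> ['a']
```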
@hr0nix
hr0nix
15 days
Congrats to STaR authors! Their opaque tweets demonstrate that they've (independently) discovered some of the ideas revealed to me in my sleep by the Divine.
@noahdgoodman
noahdgoodman
16 days
Congrats to OAI on producing a reasoning model! Their opaque tweets demonstrate that they’ve (independently) found some of the core ideas that we did on our way to STaR.
1
0
7
@hr0nix
hr0nix
16 days
@brad19brown You might also find this very relevant:
@hr0nix
hr0nix
3 months
Can open-weight models match frontier LLM performance on SWE-bench? They can if you equip them with search! We've been studying how guided search can improve SWE agents, and built an SWE-agent-based system that scores 40.6% on SWE-Bench Verified using only open-weight models. 🧵
0
0
0
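(The general shape of critic-guided search over agent steps, sketched with made-up stand-ins; the actual system, scaffolding, and search procedure are described in the thread, not here.)

```python
import random

random.seed(0)

# Hypothetical stand-ins: a proposer that samples candidate next actions for
# the agent, and a learned critic that scores partial trajectories.
def propose_actions(state: str, k: int = 4) -> list[str]:
    return [f"{state} -> cand{i}({random.random():.2f})" for i in range(k)]

def critic_score(candidate: str) -> float:
    # stand-in for a value model trained to predict eventual patch success
    return float(candidate.rsplit("(", 1)[-1].rstrip(")"))

state = "issue#123"
for _ in range(3):
    candidates = propose_actions(state)
    state = max(candidates, key=critic_score)  # keep the step the critic prefers
print(state)
```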
@hr0nix
hr0nix
19 days
@softwarevlogger A straight-A student would know how to spell the word «троечник» ("C-grade student").
1
0
14