![Engel Nyst Profile](https://pbs.twimg.com/profile_images/1843023409157955584/TFyZPZFf_x96.jpg)
Engel Nyst
@engelnyst
Followers
43
Following
459
Statuses
719
"The only way to deal with an unfree world is to become so absolutely free that your very existence is an act of rebellion." Maintainer of OpenHands.
Pre-Mars Humanity
Joined August 2024
Worth noting.
I looked at AIME problems and one thing strikes me: all the problems are about computing a number. This is a tiny part of math. I was trained as a mathematician in France, and I almost never had to solve a problem of that kind. All the math work was about proving mathematical properties of mathematical objects. For instance, prove that a given group is isomorphic to another given group. This is to say that getting good at computing numbers specified by some mathematical setting is not the same as getting good at math in general. It is definitely part of math, but only a tiny part.

It is no wonder AI focuses on number-finding math problems: checking the result is simple. Tackling the full spectrum of math requires much more complex result-checking machinery (a formal proof checker).

It is also interesting to note that AI math benchmarks only care about the final number. If that number was accidentally found via a flawed mathematical proof, it is still counted as a success.
0
0
1
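Not part of the post above, just a toy illustration of the distinction it draws. In Lean 4, an answer-only benchmark can grade the first definition by checking a single number, while the theorem below it is only accepted if every step of the proof passes the kernel; the group-isomorphism example from the post would look similar but needs Mathlib.

```lean
-- "Number-finding" problem: the deliverable is one value; an answer-only
-- grader just compares this against the expected number.
def answer : Nat := 2 + 3

#eval answer  -- 5

-- "Property-proving" problem: the deliverable is the proof itself; the
-- checker verifies every step, not just a final number.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

This is also why a flawed derivation that happens to land on the right number passes an answer-only grader but would not pass a proof checker.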
@AmandaAskell @renegadesilicon But that's not enough. It's still possible for it to output nonsensical things. Not false empirically, but fallacies or absurdities.
1
0
1
There can be things that, at least with a human in the loop, you can discover. Or with a verifiable goal. Other than that, I'll go with fundamental limitation: no concept of truth, so random similitude = error = discovery.
I still haven't heard a good answer to this question, on or off the podcast. AI researchers often tell me, "Don't worry bout it, scale solves this." But what is the rebuttal to someone who argues that this indicates a fundamental limitation?
0
0
0
@rakyll So true. Best PTO ever: fly to your favorite city, get a co-working spot, and build the things that matter!
0
0
36
RT @ericjmichaud_: @dwarkesh_sp tl;dr: Maybe learning simple things (basic knowledge, heuristics, etc) actually lowers the loss more than l…
0
52
0
@hkproj Yes, and we have seen nothing yet. The apps of frontier labs are still mostly bad, but it's coming.
0
0
0
RT @ludwigABAP: >CoT is now visible
>look inside
>it's another processed, summarized response larped as CoT
Deepseek showed raw CoT being…
0
87
0
They're lovely! And aww R1, hey it's relatable
LLMs are starting to have personalities.

User: How are you?

GPT-4o: Responds with 4 rocket emojis 🚀🚀🚀🚀
Deepseek-R1: Thinks for 25 seconds about how to respond without being socially awkward.
Claude: Codes up a React app of a hand waving back to you and posts it to artifacts.
0
0
1
Could this be the case: that they need more samples, or is a larger context window just not well used by some/smaller models?
Takeaway 6: Models might need more training samples to learn to utilize larger context window sizes. We found that the model with a context window size of 8K performed better than the model with 4K, as expected. However, we observed that performance was better under 8K than under 16K.
1
0
0
This sounds very interesting, although the details of what the prompting framework was doing matter.
Takeaway 2: SFT initialization matters: high-quality, emergent long CoT patterns from a large model (QwQ-32B) lead to significantly better generalization and RL gains compared with constructed long CoT patterns from an action prompting framework.
0
0
1
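Not from the paper being quoted, just a rough sketch of what "SFT initialization from emergent long CoT patterns" can look like in practice: sample long chains of thought from a stronger teacher model (the post names QwQ-32B) and keep them as supervised fine-tuning data, rather than constructing the chains with a hand-written prompting framework. The problem set, sampling settings, and filtering step below are placeholders.

```python
# Hedged sketch: distill long CoT traces from a teacher model into an SFT set.
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "Qwen/QwQ-32B-Preview"  # the teacher mentioned in the post; any long-CoT model works
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

problems = ["...", "..."]  # placeholder problem statements

sft_pairs = []
for prompt in problems:
    inputs = tokenizer(prompt, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
    cot = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # In practice you would filter: keep only traces whose final answer checks out.
    sft_pairs.append({"prompt": prompt, "response": cot})
```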
This reminds me of a Karpathy podcast riffing on the idea of rather little data, but very high-quality data = "our own CoTs"
Don't underestimate the benefits of high-quality curated data. Turns out this is also effective in achieving complex reasoning in LLMs. With just 817 curated training samples, LIMO achieves 57.1% accuracy on the highly challenging AIME benchmark and 94.8% on MATH.

Great quote from the paper: "In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes."

LIMO is based on two key factors: (1) leveraging rich mathematical knowledge already encoded in pre-trained models, and (2) using high-quality reasoning chains that demonstrate optimal problem-solving processes. This "Less-Is-More Reasoning Hypothesis" suggests that when models have strong foundational knowledge from pre-training, complex reasoning capabilities can emerge through minimal but precisely crafted demonstrations.

This is more evidence that a strong foundational pretrained model can lead to impressive results downstream. The results show significant improvements across 10 diverse benchmarks, with LIMO demonstrating exceptional out-of-distribution generalization and outperforming models trained on 100x more data.
0
0
0
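Not the LIMO authors' code: just a minimal sketch of the recipe the post describes, supervised fine-tuning a pretrained base model on a tiny, hand-curated set of full reasoning chains. The model name, data, and hyperparameters are placeholders.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder small base model, not the one LIMO builds on
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# A handful of curated "problem + full reasoning chain + answer" strings stand in
# for the ~817 samples the post describes.
curated = [
    "Problem: ...\nReasoning: ...\nAnswer: ...",
]

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=2048)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # don't compute loss on padding
    enc["labels"] = labels
    return enc

loader = DataLoader(curated, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for _ in range(3):  # a few epochs over the tiny set
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The point of the sketch is that the training loop is ordinary; on the "less is more" reading, all the leverage is in how carefully the `curated` reasoning chains are selected.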
@artilectium @yacineMTB Yes, they noticed it was too fun and wanted a piece of that attention. IMO the jury is still out on whether they will succeed -ish. I don't like it, personally. It introduces a foreign element, so you can't count on it for debugging. Boo.
0
0
4