Engel Nyst (@engelnyst)

Followers: 43 · Following: 459 · Statuses: 719

"The only way to deal with an unfree world is to become so absolutely free that your very existence is an act of rebellion." Maintainer of OpenHands.

Pre-Mars Humanity
Joined August 2024
Engel Nyst (@engelnyst) · 15 hours
@paulg Gemini, I can't decide if it has an optimistic or a pessimistic view.
[attached image]
0 replies · 0 retweets · 0 likes

Engel Nyst (@engelnyst) · 17 hours
Worth noting.
Quoting JFPuget 🇺🇦🇨🇦🇬🇱 (@JFPuget) · 2 days:
I looked at AIME problems and one thing strikes me: all the problems are about computing a number. This is a tiny part of math. I was trained as a mathematician in France, and I almost never had to solve a problem of that kind. All the math work was about proving mathematical properties of mathematical objects; for instance, proving that a given group is isomorphic to another given group.
This is to say that getting good at computing numbers specified by some mathematical setting is not the same as getting good at math in general. It is definitely part of math, but only a tiny part.
It is no wonder AI focuses on number-finding math problems: checking the result is simple. Tackling the full spectrum of math requires a much more complex result-checking machinery (a formal proof checker).
It is also interesting to note that AI math benchmarks only care about the final number. If that number was accidentally found via a flawed mathematical proof, it is still considered a success.
0 replies · 0 retweets · 1 like

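To make that last point concrete, here is a minimal sketch of answer-only grading (not from the thread; the grader and its toy input are hypothetical): the checker compares only the final number in a response against the ground truth, so a derivation with a wrong step that happens to land on the right number is still scored as a success.

```python
import re

# Minimal sketch of answer-only grading as used by AIME-style benchmarks.
# The function and the toy example below are hypothetical illustrations.

def grade(response: str, ground_truth: int) -> bool:
    """Score a response by its final number alone.

    Nothing between the question and the final number is inspected,
    so a flawed derivation that lands on the right number still passes.
    """
    numbers = re.findall(r"-?\d+", response)
    return bool(numbers) and int(numbers[-1]) == ground_truth

# A derivation with a wrong step ("3 x 4 = 13") that still ends at the
# right number is graded as a success:
flawed = "12 / 3 = 4 because 3 x 4 = 13, so the answer is 4."
print(grade(flawed, ground_truth=4))  # True
```
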
Engel Nyst (@engelnyst) · 17 hours
@ai_for_success I did like it, albeit only when I watched it a second time. 🤷‍♂️
0 replies · 0 retweets · 2 likes

Engel Nyst (@engelnyst) · 18 hours
😂
Quoting bayes (@bayeslord) · 7 days:
the government runs on elon standard time now
0 replies · 0 retweets · 0 likes

Engel Nyst (@engelnyst) · 22 hours
@AmandaAskell @renegadesilicon But that's not enough. It's still possible for it to output nonsensical things: not empirically false, but fallacies or absurdities.
1 reply · 0 retweets · 1 like

Engel Nyst (@engelnyst) · 24 hours
There can be things that, at least with a human in the loop, you can discover. Or a verifiable goal. Other than that, I'll go with fundamental limitation: no concept of truth, so random similitude = error = discovery.
Quoting Dwarkesh Patel (@dwarkesh_sp) · 1 year:
I still haven't heard a good answer to this question, on or off the podcast. AI researchers often tell me, "Don't worry bout it, scale solves this." But what is the rebuttal to someone who argues that this indicates a fundamental limitation?
[attached image]
0 replies · 0 retweets · 0 likes

Engel Nyst (@engelnyst) · 2 days
@emollick Nice. It did one for me too, with a slightly adapted prompt. Ouch.
[attached image]
0 replies · 0 retweets · 1 like

Engel Nyst (@engelnyst) · 2 days
@rakyll So true. Best PTO ever: fly to your favorite city, get a co-working spot, and build the things that matter!
0 replies · 0 retweets · 36 likes

Engel Nyst (@engelnyst) · 2 days
RT @ericjmichaud_: @dwarkesh_sp tl;dr: Maybe learning simple things (basic knowledge, heuristics, etc) actually lowers the loss more than l…
0 replies · 52 retweets · 0 likes

Engel Nyst (@engelnyst) · 3 days
👀
Quoting Alex Albert (@alexalbert__) · 3 days:
MCP has been on a roll the past two weeks. Support added to both Cursor and Windsurf. Block built their agent's extensions on top of MCP. And many other major partners are working on adding MCP to their apps right now. MCP is truly the community's integration protocol.
0 replies · 0 retweets · 0 likes

Engel Nyst (@engelnyst) · 3 days
The replies contain quite a bit more fun stuff. 😅
Quoting Peter (@peterthedecent) · 4 days:
Worst financial decision in history
[attached image]
0 replies · 0 retweets · 0 likes

Engel Nyst (@engelnyst) · 3 days
@hkproj Yes, and we have seen nothing yet. The apps of frontier labs are still mostly bad, but it's coming.
0 replies · 0 retweets · 0 likes

Engel Nyst (@engelnyst) · 3 days
😂 They kinda forgot MLE-bench though
Quoting Peter Welinder (@npew) · 4 days:
Existing models failing at your task? Just drop a benchmark and watch the LLM providers trip over themselves to beat it. Problem solved in months.
0 replies · 0 retweets · 0 likes

Engel Nyst (@engelnyst) · 3 days
RT @ludwigABAP:
>CoT is now visible
>look inside
>it's another processed, summarized response larped as CoT
Deepseek showed raw CoT being…
0 replies · 87 retweets · 0 likes

Engel Nyst (@engelnyst) · 3 days
🫠 They're lovely! And aww R1, hey it's relatable
Quoting Graham Neubig (@gneubig) · 3 days:
LLMs are starting to have personalities.
User: How are you?
GPT-4o: Responds with 4 rocket emojis 🚀
Deepseek-R1: Thinks for 25 seconds about how to respond without being socially awkward.
Claude: Codes up a react app of a hand waving back to you and posts it to artifacts.
0 replies · 0 retweets · 1 like

Engel Nyst (@engelnyst) · 3 days
RT @yacineMTB: so basically you could have won literally everything but safety stopped you
0 replies · 55 retweets · 0 likes

Engel Nyst (@engelnyst) · 4 days
👀 Could this be the case: that they need more samples, or is a larger context window not well used by some/small models?
Quoting Xiang Yue (@xiangyue96) · 5 days:
Takeaway 6: Models might need more training samples to learn to utilize larger context window sizes. We found that the model with a context window size of 8K performed better than the model with 4K, as expected. However, we observed that performance was better under 8K than under 16K.
[attached image]
1 reply · 0 retweets · 0 likes

Engel Nyst (@engelnyst) · 4 days
This sounds very interesting, although the details of what the prompting was doing matter.
Quoting Xiang Yue (@xiangyue96) · 5 days:
Takeaway 2: SFT initialization matters: high-quality, emergent long CoT patterns from a large model (QwQ-32B) lead to significantly better generalization and RL gains compared with constructed long CoT patterns from an action prompting framework.
[attached image]
0 replies · 0 retweets · 1 like

Engel Nyst (@engelnyst) · 4 days
This reminds me of a Karpathy podcast riffing on the idea of rather little data, but very high-quality data = "our own CoTs".
Quoting elvis (@omarsar0) · 4 days:
Don't underestimate the benefits of high-quality curated data. Turns out this is also effective in achieving complex reasoning in LLMs. With just 817 curated training samples, LIMO achieves 57.1% accuracy on the highly challenging AIME benchmark and 94.8% on MATH.
Great quote from the paper: "In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes."
LIMO is based on two key factors: (1) leveraging rich mathematical knowledge already encoded in pre-trained models, and (2) using high-quality reasoning chains that demonstrate optimal problem-solving processes. This "Less-Is-More Reasoning Hypothesis" suggests that when models have strong foundational knowledge from pre-training, complex reasoning capabilities can emerge through minimal but precisely crafted demonstrations.
This is more evidence that a strong foundational pretrained model can lead to impressive results downstream. The results show significant improvements across 10 diverse benchmarks, with LIMO demonstrating exceptional out-of-distribution generalization and outperforming models trained on 100x more data.
[attached image]
0 replies · 0 retweets · 0 likes

Engel Nyst (@engelnyst) · 4 days
@artilectium @yacineMTB Yes, they noticed it was too fun and wanted a piece of that attention. IMO the jury is still out on whether they will succeed(-ish). I don't like it, personally. It introduces a foreign element, so you can't count on it for debugging. Boo.
0 replies · 0 retweets · 4 likes