![Engel Nyst Profile](https://pbs.twimg.com/profile_images/1843023409157955584/TFyZPZFf_x96.jpg)
Engel Nyst
@engelnyst
Followers
43
Following
459
Statuses
719
"The only way to deal with an unfree world is to become so absolutely free that your very existence is an act of rebellion." Maintainer of OpenHands.
Pre-Mars Humanity
Joined August 2024
Worth noting.
I looked at AIME problems and one thing strikes me: all the problems are about computing a number. This is a tiny part of math. I was trained as a mathematician in France, and I almost never had to solve a problem of that kind. All the math work was about proving mathematical properties of mathematical objects. For instance, prove that a given group is isomorphic to another given group. This is to say that getting good at computing numbers specified by some mathematical setting is not the same as getting good at math in general. It is definitely part of math, but only a tiny part.

It is no wonder AI focuses on number-finding math problems: checking the result is simple. Tackling the full spectrum of math requires much more complex result-checking machinery (a formal proof checker).

It is also interesting to note that AI math benchmarks only care about the final number. If that number was accidentally found via a flawed mathematical proof, it is still counted as a success.
0
0
1
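Not part of the post above, just a toy illustration of the distinction it draws. In Lean 4, an answer-only benchmark can grade the first definition by checking a single number, while the theorem below it is only accepted if every step of the proof passes the kernel; the group-isomorphism example from the post would look similar but needs Mathlib.

```lean
-- "Number-finding" problem: the deliverable is one value; an answer-only
-- grader just compares this against the expected number.
def answer : Nat := 2 + 3

#eval answer  -- 5

-- "Property-proving" problem: the deliverable is the proof itself; the
-- checker verifies every step, not just a final number.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

This is also why a flawed derivation that happens to land on the right number passes an answer-only grader but would not pass a proof checker.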
@AmandaAskell @renegadesilicon But that's not enough. It's still possible for it to output nonsensical things. Not false empirically, but fallacies or absurdities.
1
0
1
There can be things that, at least with a human in the loop, you can discover. Or with a verifiable goal. Other than that, I'll go with fundamental limitation: no concept of truth, so random similitude = error = discovery.
I still haven't heard a good answer to this question, on or off the podcast. AI researchers often tell me, "Don't worry bout it, scale solves this." But what is the rebuttal to someone who argues that this indicates a fundamental limitation?
0
0
0
@rakyll So true. Best PTO ever: fly to your favorite city, get a co-working spot, and build the things that matter!
0
0
36
RT @ericjmichaud_: @dwarkesh_sp tl;dr: Maybe learning simple things (basic knowledge, heuristics, etc) actually lowers the loss more than l…
0
52
0
@hkproj Yes, and we have seen nothing yet. The apps of frontier labs are still mostly bad, but it's coming.
0
0
0
RT @ludwigABAP: >CoT is now visible
>look inside
>it's another processed, summarized response larped as CoT
Deepseek showed raw CoT being…
0
87
0
They're lovely! And aww R1, hey it's relatable
LLMs are starting to have personalities.

User: How are you?

GPT-4o: Responds with 4 rocket emojis 🚀🚀🚀🚀
Deepseek-R1: Thinks for 25 seconds about how to respond without being socially awkward.
Claude: Codes up a React app of a hand waving back to you and posts it to artifacts.
0
0
1
Could this be the case: that they need more samples, or is a larger context window just not well used by some/smaller models?
Takeaway 6: Models might need more training samples to learn to utilize larger context window sizes. We found that the model with a context window size of 8K performed better than the model with 4K, as expected. However, we observed that performance was better under 8K than under 16K.
1
0
0
This sounds very interesting, although the details of what the prompting framework was doing matter.
Takeaway 2: SFT initialization matters: high-quality, emergent long CoT patterns from a large model (QwQ-32B) lead to significantly better generalization and RL gains compared with constructed long CoT patterns from an action prompting framework.
0
0
1
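Not from the paper being quoted, just a rough sketch of what "SFT initialization from emergent long CoT patterns" can look like in practice: sample long chains of thought from a stronger teacher model (the post names QwQ-32B) and keep them as supervised fine-tuning data, rather than constructing the chains with a hand-written prompting framework. The problem set, sampling settings, and filtering step below are placeholders.

```python
# Hedged sketch: distill long CoT traces from a teacher model into an SFT set.
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "Qwen/QwQ-32B-Preview"  # the teacher mentioned in the post; any long-CoT model works
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

problems = ["...", "..."]  # placeholder problem statements

sft_pairs = []
for prompt in problems:
    inputs = tokenizer(prompt, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
    cot = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # In practice you would filter: keep only traces whose final answer checks out.
    sft_pairs.append({"prompt": prompt, "response": cot})
```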
This reminds me of a Karpathy podcast riffing on the idea of rather little data, but very high-quality data = "our own CoTs"
Don't underestimate the benefits of high-quality curated data. Turns out this is also effective in achieving complex reasoning in LLMs. With just 817 curated training samples, LIMO achieves 57.1% accuracy on the highly challenging AIME benchmark and 94.8% on MATH.

Great quote from the paper: "In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes."

LIMO is based on two key factors: (1) leveraging rich mathematical knowledge already encoded in pre-trained models, and (2) using high-quality reasoning chains that demonstrate optimal problem-solving processes. This "Less-Is-More Reasoning Hypothesis" suggests that when models have strong foundational knowledge from pre-training, complex reasoning capabilities can emerge through minimal but precisely crafted demonstrations.

This is more evidence that a strong foundational pretrained model can lead to impressive results downstream. The results show significant improvements across 10 diverse benchmarks, with LIMO demonstrating exceptional out-of-distribution generalization and outperforming models trained on 100x more data.
0
0
0
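Not the LIMO authors' code: just a minimal sketch of the recipe the post describes, supervised fine-tuning a pretrained base model on a tiny, hand-curated set of full reasoning chains. The model name, data, and hyperparameters are placeholders.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder small base model, not the one LIMO builds on
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# A handful of curated "problem + full reasoning chain + answer" strings stand in
# for the ~817 samples the post describes.
curated = [
    "Problem: ...\nReasoning: ...\nAnswer: ...",
]

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=2048)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # don't compute loss on padding
    enc["labels"] = labels
    return enc

loader = DataLoader(curated, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for _ in range(3):  # a few epochs over the tiny set
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The point of the sketch is that the training loop is ordinary; on the "less is more" reading, all the leverage is in how carefully the `curated` reasoning chains are selected.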
@artilectium @yacineMTB Yes, they noticed it was too fun and wanted a piece of that attention. IMO the jury is still out on whether they will succeed -ish. I don't like it, personally. It introduces a foreign element, so you can't count on it for debugging. Boo.
0
0
4