![Evan Walters Profile](https://pbs.twimg.com/profile_images/1637112056951476225/vees7V97_x96.jpg)
Evan Walters (@evaninwords)
Followers: 426 · Following: 14K · Statuses: 3K
ML/RL enthusiast, second-order optimization, plasticity, environmentalist
Denver, CO · Joined July 2016
RT @_arohan_: Today some of my ex and new colleagues are hosting an AlgoPerf workshop. I will drop by and participate…
Replies: 0 · Retweets: 15 · Likes: 0
RT @winglian: What's the trick? DoRA. I don't have a great hypothesis on why it works yet, but I've upstreamed the changes to TRL. The PR m…
Replies: 0 · Retweets: 29 · Likes: 0
The DreamerV3 repo by @danijarh is a lot of fun! To visualize the world model, it logs sets of original gameplay, reconstructed gameplay, and the loss between the two. Here, when the borders are green, the model is doing closed-loop prediction from real observations; when they're red, it's doing open-loop prediction using only its learned dynamics model.
Replies: 1 · Retweets: 1 · Likes: 14
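A minimal sketch of that style of logging, assuming video tensors of shape (T, H, W, 3); this is my own illustration, not the DreamerV3 repo's actual code:

```python
import numpy as np

def add_border(frames, color, width=2):
    """Paint a colored border onto every frame; frames is (T, H, W, 3) uint8."""
    framed = frames.copy()
    framed[:, :width], framed[:, -width:] = color, color
    framed[:, :, :width], framed[:, :, -width:] = color, color
    return framed

def world_model_video(obs, recon, context=5):
    """Stack real vs. model rollouts: green border while the model is
    conditioned on real observations (closed loop), red once it runs on
    its own imagined dynamics (open loop), plus an error row."""
    green, red = (0, 255, 0), (255, 0, 0)
    closed = add_border(recon[:context], green)
    open_ = add_border(recon[context:], red)
    pred = np.concatenate([closed, open_], axis=0)
    # Signed reconstruction error, remapped into displayable [0, 255].
    error = ((obs.astype(np.int16) - pred.astype(np.int16)) // 2 + 128).astype(np.uint8)
    return np.concatenate([obs, pred, error], axis=1)  # rows: real / model / error
```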
New AlphaGeometry paper: Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2. AlphaGeometry2 crushes Olympiad geometry, boasting an 84% solve rate on the last 25 years of geometry problems compared to AG1's 54%. They combine Gemini with a novel tree search algorithm called Shared Knowledge Ensemble of Search Trees (SKEST) to formulate proofs; SKEST unrolls multiple search trees in parallel with a shared knowledge bank. They sped up the symbolic engine 300x by rewriting the Gaussian elimination core in C++, handling double points, and improving the symbolic representations, and also sped up data generation by simplifying proofs in linear time with a greedy algorithm. Overall a very interesting paper! Paper:
Replies: 0 · Retweets: 0 · Likes: 3
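Purely as an illustration of the shared-knowledge idea (a toy, not the paper's actual algorithm), a sketch where `propose` and `deduce` are hypothetical stand-ins for the language model's auxiliary constructions and the symbolic engine's deductive closure:

```python
import concurrent.futures
import threading

class SharedFacts:
    """Toy shared knowledge bank: facts proved by any one search tree
    become visible to all of them."""
    def __init__(self):
        self._facts, self._lock = set(), threading.Lock()
    def add(self, fact):
        with self._lock:
            self._facts.add(fact)
    def snapshot(self):
        with self._lock:
            return frozenset(self._facts)

def run_tree(tree_id, bank, propose, deduce, goal, steps=100):
    """One search tree: propose a construction, run symbolic deduction,
    and publish any newly proved facts to the shared bank."""
    for _ in range(steps):
        known = bank.snapshot()
        if goal in known:
            return True
        aux = propose(tree_id, known)        # e.g. an LM-suggested point
        for fact in deduce(known | {aux}):   # symbolic closure over the facts
            bank.add(fact)
    return goal in bank.snapshot()

def skest(n_trees, propose, deduce, goal):
    """Unroll several trees in parallel over one shared knowledge bank."""
    bank = SharedFacts()
    with concurrent.futures.ThreadPoolExecutor(n_trees) as pool:
        results = pool.map(
            lambda i: run_tree(i, bank, propose, deduce, goal), range(n_trees))
    return any(results)
```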
RT @TheGradient: (1/2) Ever wondered why Sharpness-Aware Minimization (SAM) yields greater generalization gains in vision than in NLP? I'll…
Replies: 0 · Retweets: 10 · Likes: 0
@francoisfleuret It's probably not uncommon for that to still be true, haha, but nowadays it's not a rule: PSGD stays pretty light, as do some other second-order optimizers. If we count whitening as second order, I'm often beating Adam on wall clock in training runs (PSGD with the actual Hessian might as well; I just don't use it as often). There's also dev time: we all have a feel for Adam, and while we've worked to make PSGD easily usable, there's always a slight hump fitting in a new optimizer and getting a feel for it. But if you want to try out PSGD, feel free to ask about hypers :)
Replies: 0 · Retweets: 0 · Likes: 0
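For intuition on "whitening as second order", a minimal JAX sketch of the idea: precondition a gradient matrix `g` so its updates have roughly identity covariance. This is my own illustration, not PSGD's actual update, which fits the preconditioner iteratively rather than via an eigendecomposition:

```python
import jax.numpy as jnp

def whitened_update(g, cov, beta=0.999, eps=1e-8):
    """Whiten gradient matrix g with the inverse square root of a running
    covariance of its rows, so preconditioned gradients are roughly
    isotropic. cov starts as, e.g., jnp.eye(g.shape[0])."""
    cov = beta * cov + (1 - beta) * (g @ g.T)          # EMA estimate of E[g g^T]
    w, v = jnp.linalg.eigh(cov)                        # cov = v @ diag(w) @ v.T
    p = v @ jnp.diag(1.0 / jnp.sqrt(w + eps)) @ v.T    # cov^{-1/2}
    return p @ g, cov                                  # whitened grad, new state
```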
RT @gallabytes: it's kinda wild that deepseek trained v3 so cheaply with just Adam. if they'd known about second order optimizers might it…
Replies: 0 · Retweets: 3 · Likes: 0
RT @sirbayes: I'm happy to share our new paper on model-based RL, which achieves a new SOTA on Crafter (first time to beat human reward aft…
Replies: 0 · Retweets: 22 · Likes: 0
RT @LoubnaBenAllal1: The wait is over: our SmolLM2 paper is out—a detailed guide for building SOTA small LMs. While most LM papers skim ove…
Replies: 0 · Retweets: 104 · Likes: 0
@cargoshortdad64 Yeah, that'd be awesome. It could get hairy, haha: maybe not the worst for Muon and one-sided Kron, but full Kron might take a little bit.
Replies: 0 · Retweets: 0 · Likes: 3
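For context on what those updates involve, a sketch of Muon's core step, the Newton-Schulz orthogonalization from Keller Jordan's Muon write-up (coefficients as published there); treat it as an illustration, not a full optimizer:

```python
import jax.numpy as jnp

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a gradient matrix, the core of the
    Muon update. Iterates X <- aX + (b(XX^T) + c(XX^T)^2) X."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (jnp.linalg.norm(g) + eps)   # Frobenius-normalize for convergence
    transpose = x.shape[0] > x.shape[1]  # iterate on the wide orientation
    if transpose:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transpose else x
```

One-sided Kron keeps a preconditioner for just one side of each gradient matrix, while full Kron maintains one for each side, which is presumably why full Kron would take more work here.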
@tenderizzation @drisspg @tri_dao Awesome! JAX has something similar, like the cuDNN flash attention in its dot-product attention (DPA), or you can call your own CUDA or Triton kernels if you want. jax.experimental is getting pretty cool too
Replies: 0 · Retweets: 0 · Likes: 2
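For reference, the API in question is jax.nn.dot_product_attention, where implementation="cudnn" requests the fused cuDNN flash-attention kernel; shapes and dtypes below are just illustrative:

```python
import jax
import jax.numpy as jnp

# (batch, seq_len, num_heads, head_dim); the cuDNN kernel wants bf16/fp16.
q = jnp.zeros((2, 1024, 8, 64), dtype=jnp.bfloat16)
k = jnp.zeros((2, 1024, 8, 64), dtype=jnp.bfloat16)
v = jnp.zeros((2, 1024, 8, 64), dtype=jnp.bfloat16)

# implementation=None lets XLA choose; "cudnn" requests the fused kernel
# and raises if it isn't supported on the current device.
out = jax.nn.dot_product_attention(q, k, v, is_causal=True,
                                   implementation="cudnn")
```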