Evan Walters

@evaninwords

Followers
426
Following
14K
Statuses
3K

ML/RL enthusiast, second-order optimization, plasticity, environmentalist

Denver, CO
Joined July 2016
@evaninwords
Evan Walters
2 days
RT @_arohan_: Today some of my ex and new colleagues are hosting AlgoPerfy workshop I will drop by and participate…
0
15
0
@evaninwords
Evan Walters
2 days
RT @winglian: What's the trick? DoRA. I don't have a great hypothesis on why it works yet, but I've upstreamed the changes to TRL. The PR m…
0
29
0
@evaninwords
Evan Walters
3 days
The DreamerV3 repo by @danijarh is a lot of fun! To visualize the world model, it logs sets of original gameplay, reconstructed gameplay, and the loss between the two. Here, when the borders are green the model is doing closed-loop prediction grounded in real observations, and when they're red it is doing open-loop prediction using only its learned dynamics model.
1
1
14
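The closed- vs open-loop distinction described above can be sketched with a toy world model. All names here are illustrative and not from the DreamerV3 codebase; the point is only the feedback difference between the two rollout modes.

```python
# Toy linear world model: next_state = A @ state, observation = state.
# Closed-loop (green borders): re-ground on the real observation every step.
# Open-loop (red borders): feed the model's own prediction back in.

def rollout(A, observations, open_loop):
    state = observations[0]
    preds = []
    for t in range(1, len(observations)):
        # one step of the learned dynamics (plain-Python matrix-vector product)
        state = [sum(a * s for a, s in zip(row, state)) for row in A]
        preds.append(state)
        if not open_loop:
            state = observations[t]  # closed loop: reset to the real frame
    return preds
```

With open_loop=True, errors compound because each prediction is built on the previous prediction, which is exactly why the red-border segments are the interesting ones to watch.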
@evaninwords
Evan Walters
4 days
New AlphaGeometry paper: Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2. AlphaGeometry2 crushes Olympiad geometry, boasting an 84% solve rate on the last 25 years of Olympiad geometry problems compared to AG1's 54%. They combine Gemini with a novel tree search algorithm, Shared Knowledge Ensemble of Search Trees (SKEST), to formulate proofs: SKEST unrolls multiple search trees in parallel that all read from and write to a shared knowledge bank. They sped up the symbolic engine 300x by rewriting the Gaussian elimination core in C++, handling double points, and improving the symbolic representations, and they also sped up data generation by simplifying proofs in linear time with a greedy algorithm. Overall a very interesting paper! Paper:
[image]
0
0
3
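The paper's SKEST search is far more elaborate, but the core shared-knowledge idea — several deduction searches running in rounds and publishing proved facts to a common bank — can be caricatured in a few lines. Everything here (worker functions as deduction rules, facts as strings) is my simplification, not the paper's formulation.

```python
def skest_caricature(workers, axioms, goal, rounds=10):
    """Each worker maps a set of known facts to newly derived facts
    (standing in for one search tree's deduction step). All workers
    share one knowledge bank, so a fact proved by one tree is
    immediately usable by every other tree on the next round."""
    bank = set(axioms)
    for _ in range(rounds):
        new = set()
        for worker in workers:
            new |= worker(bank)   # expand each tree from the shared bank
        if goal in new | bank:
            return True
        if new <= bank:           # fixpoint: no tree found anything new
            return False
        bank |= new
    return goal in bank
```

The payoff of sharing is visible even in this toy: a worker that can only derive c from b succeeds once a different worker has deposited b in the bank, whereas either worker alone would stall at a fixpoint.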
@evaninwords
Evan Walters
5 days
RT @TheGradient: (1/2) Ever wondered why Sharpness-Aware Minimization (SAM) yields greater generalization gains in vision than in NLP? I'll…
0
10
0
@evaninwords
Evan Walters
5 days
@francoisfleuret It's probably not uncommon for that to still be true, haha, but nowadays it's not a rule: PSGD stays pretty light, as do some other second-order optimizers. If we count whitening as second order, I'm often beating Adam on wall clock in training runs (PSGD with the actual Hessian might too, I just don't use it as often). There's also dev time: we all have a feel for Adam, and while we've worked to make PSGD easy to use, there's always a slight hump fitting in a new optimizer and getting a feel for it. If you want to try PSGD, feel free to ask about hypers :)
0
0
0
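"Whitening as second order" can be made concrete with a small sketch: precondition the gradient by the inverse square root of the empirical gradient second moment, so all gradient directions are rescaled toward unit variance. This is an illustrative eigendecomposition-based sketch, not PSGD's actual update (PSGD fits its preconditioner incrementally and never forms this matrix explicitly).

```python
import numpy as np

def whitened_step(grads, eps=1e-8):
    """Whiten the latest gradient against a history of gradient vectors.
    grads: list of 1-D numpy arrays (the gradient history)."""
    G = np.stack(grads)                       # (n, d) gradient history
    cov = G.T @ G / len(grads)                # empirical second moment
    w, V = np.linalg.eigh(cov)
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(w + eps)) @ V.T
    return inv_sqrt @ grads[-1]               # whitened descent direction
```

Directions with consistently large gradients get shrunk and quiet directions get amplified, which is the same qualitative effect Adam's diagonal second moment provides, but with full cross-coordinate structure.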
@evaninwords
Evan Walters
6 days
RT @gallabytes: it's kinda wild that deepseek trained v3 so cheaply with just Adam. if they'd known about second order optimizers might it…
0
3
0
@evaninwords
Evan Walters
6 days
See some of you at the AlgoPerf workshop next week!
0
1
5
@evaninwords
Evan Walters
6 days
RT @sirbayes: I'm happy to share our new paper on model-based RL, which achieves a new SOTA on Crafter (first time to beat human reward aft…
0
22
0
@evaninwords
Evan Walters
6 days
[image]
0
0
1
@evaninwords
Evan Walters
7 days
@_xjdr congrats!
0
0
2
@evaninwords
Evan Walters
7 days
RT @LoubnaBenAllal1: The wait is over: our SmolLM2 paper is out—a detailed guide for building SOTA small LMs. While most LM papers skim ove…
0
104
0
@evaninwords
Evan Walters
7 days
@cargoshortdad64 yeah that'd be awesome, though it could get hairy haha; maybe not the worst for Muon and one-sided Kron, but full Kron might take a little while
0
0
3
@evaninwords
Evan Walters
7 days
@tenderizzation @drisspg @tri_dao Awesome! JAX has something similar, like the cuDNN flash attention backend in its DPA, or you can call your own CUDA or Triton kernels if you want. jax.experimental is getting pretty cool too
0
0
2
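The fused kernels being compared here (cuDNN flash attention, Triton implementations) all compute ordinary scaled dot-product attention; what differs is memory layout and fusion, not the math. A minimal single-head NumPy reference of that computation, with none of the fusion tricks:

```python
import numpy as np

def sdpa(q, k, v):
    """Reference scaled dot-product attention.
    q: (q_len, head_dim), k, v: (kv_len, head_dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # similarity logits
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ v                             # weighted value mix
```

Flash-attention-style kernels produce the same output but tile the softmax so the full (q_len, kv_len) score matrix never materializes in memory.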
@evaninwords
Evan Walters
7 days
Oh, I should mention: OneSidedKron should be used like Muon in the nanoGPT speedrun, with a separate Adam optimizer for 1D and heavily skewed tensors, until I handle that internally :) Full Kron is good to go as a standalone. Example:
0
0
3
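The linked example isn't preserved here, but the routing rule the tweet describes can be sketched: well-shaped matrices go to the Kron/Muon-style optimizer, while 1D and heavily skewed tensors go to Adam. The threshold and names below are illustrative choices of mine, not the library's defaults.

```python
def split_params(named_shapes, max_skew=8):
    """Partition parameter names into (kron_group, adam_group) by shape.
    named_shapes: dict mapping parameter name -> shape tuple."""
    kron, adam = [], []
    for name, shape in named_shapes.items():
        if len(shape) < 2:
            adam.append(name)                    # biases, norms, scalars
        elif max(shape) / min(shape) > max_skew:
            adam.append(name)                    # e.g. embedding tables
        else:
            kron.append(name)                    # well-shaped weight matrices
    return kron, adam
```

This mirrors how the nanoGPT speedrun uses Muon: the structured preconditioner handles the 2D weight matrices where it shines, and Adam covers the tensors whose shapes make a Kronecker-factored preconditioner awkward.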
@evaninwords
Evan Walters
7 days
Doubly Robust Monte Carlo Tree Search
0
1
4