![Evan Walters Profile](https://pbs.twimg.com/profile_images/1637112056951476225/vees7V97_x96.jpg)
Evan Walters (@evaninwords)
Followers: 426 · Following: 14K · Statuses: 3K
ML/RL enthusiast, second-order optimization, plasticity, environmentalist
Denver, CO · Joined July 2016
RT @_arohan_: Today some of my ex and new colleagues are hosting an AlgoPerf workshop. I will drop by and participate…
Replies: 0 · Retweets: 15 · Likes: 0
RT @winglian: What's the trick? DoRA. I don't have a great hypothesis on why it works yet, but I've upstreamed the changes to TRL. The PR m…
Replies: 0 · Retweets: 29 · Likes: 0
The DreamerV3 repo by @danijarh is a lot of fun! To visualize the world model, it logs sets of original gameplay, reconstructed gameplay, and the loss between the two. Here, when the borders are green, the model is doing closed-loop prediction from real observations; when they're red, it's doing open-loop prediction using only its learned dynamics model.
Replies: 1 · Retweets: 1 · Likes: 14
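A minimal sketch of that style of logging, assuming video tensors of shape (T, H, W, 3); this is my own illustration, not the DreamerV3 repo's actual code:

```python
import numpy as np

def add_border(frames, color, width=2):
    """Paint a colored border onto every frame; frames is (T, H, W, 3) uint8."""
    framed = frames.copy()
    framed[:, :width], framed[:, -width:] = color, color
    framed[:, :, :width], framed[:, :, -width:] = color, color
    return framed

def world_model_video(obs, recon, context=5):
    """Stack real vs. model rollouts: green border while the model is
    conditioned on real observations (closed loop), red once it runs on
    its own imagined dynamics (open loop), plus an error row."""
    green, red = (0, 255, 0), (255, 0, 0)
    closed = add_border(recon[:context], green)
    open_ = add_border(recon[context:], red)
    pred = np.concatenate([closed, open_], axis=0)
    # Signed reconstruction error, remapped into displayable [0, 255].
    error = ((obs.astype(np.int16) - pred.astype(np.int16)) // 2 + 128).astype(np.uint8)
    return np.concatenate([obs, pred, error], axis=1)  # rows: real / model / error
```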
New AlphaGeometry paper: Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2. AlphaGeometry2 crushes Olympiad geometry, boasting an 84% solve rate on the last 25 years of geometry problems compared to AG1's 54%. They combine Gemini with a novel tree search algorithm called Shared Knowledge Ensemble of Search Trees (SKEST) to formulate proofs; SKEST unrolls multiple search trees in parallel with a shared knowledge bank. They sped up the symbolic engine 300x by rewriting the Gaussian elimination core in C++, handling double points, and improving the symbolic representations, and also sped up data generation by simplifying proofs in linear time with a greedy algorithm. Overall a very interesting paper! Paper:
Replies: 0 · Retweets: 0 · Likes: 3
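Purely as an illustration of the shared-knowledge idea (a toy, not the paper's actual algorithm), a sketch where `propose` and `deduce` are hypothetical stand-ins for the language model's auxiliary constructions and the symbolic engine's deductive closure:

```python
import concurrent.futures
import threading

class SharedFacts:
    """Toy shared knowledge bank: facts proved by any one search tree
    become visible to all of them."""
    def __init__(self):
        self._facts, self._lock = set(), threading.Lock()
    def add(self, fact):
        with self._lock:
            self._facts.add(fact)
    def snapshot(self):
        with self._lock:
            return frozenset(self._facts)

def run_tree(tree_id, bank, propose, deduce, goal, steps=100):
    """One search tree: propose a construction, run symbolic deduction,
    and publish any newly proved facts to the shared bank."""
    for _ in range(steps):
        known = bank.snapshot()
        if goal in known:
            return True
        aux = propose(tree_id, known)        # e.g. an LM-suggested point
        for fact in deduce(known | {aux}):   # symbolic closure over the facts
            bank.add(fact)
    return goal in bank.snapshot()

def skest(n_trees, propose, deduce, goal):
    """Unroll several trees in parallel over one shared knowledge bank."""
    bank = SharedFacts()
    with concurrent.futures.ThreadPoolExecutor(n_trees) as pool:
        results = pool.map(
            lambda i: run_tree(i, bank, propose, deduce, goal), range(n_trees))
    return any(results)
```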
RT @TheGradient: (1/2) Ever wondered why Sharpness-Aware Minimization (SAM) yields greater generalization gains in vision than in NLP? I'll…
Replies: 0 · Retweets: 10 · Likes: 0
@francoisfleuret It's probably not uncommon for that to still be true, haha, but nowadays it's not a rule: PSGD stays pretty light, as do some other second-order optimizers. If we count whitening as second order, I'm often beating Adam on wall clock in training runs (PSGD with the actual Hessian might as well; I just don't use it as often). There's also dev time: we all have a feel for Adam, and while we've worked to make PSGD easily usable, there's always a slight hump fitting in a new optimizer and getting a feel for it. But if you want to try out PSGD, feel free to ask about hypers :)
Replies: 0 · Retweets: 0 · Likes: 0
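For intuition on "whitening as second order", a minimal JAX sketch of the idea: precondition a gradient matrix `g` so its updates have roughly identity covariance. This is my own illustration, not PSGD's actual update, which fits the preconditioner iteratively rather than via an eigendecomposition:

```python
import jax.numpy as jnp

def whitened_update(g, cov, beta=0.999, eps=1e-8):
    """Whiten gradient matrix g with the inverse square root of a running
    covariance of its rows, so preconditioned gradients are roughly
    isotropic. cov starts as, e.g., jnp.eye(g.shape[0])."""
    cov = beta * cov + (1 - beta) * (g @ g.T)          # EMA estimate of E[g g^T]
    w, v = jnp.linalg.eigh(cov)                        # cov = v @ diag(w) @ v.T
    p = v @ jnp.diag(1.0 / jnp.sqrt(w + eps)) @ v.T    # cov^{-1/2}
    return p @ g, cov                                  # whitened grad, new state
```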
RT @gallabytes: it's kinda wild that deepseek trained v3 so cheaply with just Adam. if they'd known about second order optimizers might it…
Replies: 0 · Retweets: 3 · Likes: 0
RT @sirbayes: I'm happy to share our new paper on model-based RL, which achieves a new SOTA on Crafter (first time to beat human reward aft…
Replies: 0 · Retweets: 22 · Likes: 0
RT @LoubnaBenAllal1: The wait is over: our SmolLM2 paper is out—a detailed guide for building SOTA small LMs. While most LM papers skim ove…
Replies: 0 · Retweets: 104 · Likes: 0
@cargoshortdad64 Yeah, that'd be awesome. It could get hairy, haha: maybe not the worst for Muon and one-sided Kron, but full Kron might take a little bit.
Replies: 0 · Retweets: 0 · Likes: 3
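For context on what those updates involve, a sketch of Muon's core step, the Newton-Schulz orthogonalization from Keller Jordan's Muon write-up (coefficients as published there); treat it as an illustration, not a full optimizer:

```python
import jax.numpy as jnp

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a gradient matrix, the core of the
    Muon update. Iterates X <- aX + (b(XX^T) + c(XX^T)^2) X."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (jnp.linalg.norm(g) + eps)   # Frobenius-normalize for convergence
    transpose = x.shape[0] > x.shape[1]  # iterate on the wide orientation
    if transpose:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transpose else x
```

One-sided Kron keeps a preconditioner for just one side of each gradient matrix, while full Kron maintains one for each side, which is presumably why full Kron would take more work here.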
@tenderizzation @drisspg @tri_dao Awesome! JAX has something similar, like the cuDNN flash attention in its dot-product attention (DPA), or you can call your own CUDA or Triton kernels if you want. jax.experimental is getting pretty cool too
Replies: 0 · Retweets: 0 · Likes: 2
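For reference, the API in question is jax.nn.dot_product_attention, where implementation="cudnn" requests the fused cuDNN flash-attention kernel; shapes and dtypes below are just illustrative:

```python
import jax
import jax.numpy as jnp

# (batch, seq_len, num_heads, head_dim); the cuDNN kernel wants bf16/fp16.
q = jnp.zeros((2, 1024, 8, 64), dtype=jnp.bfloat16)
k = jnp.zeros((2, 1024, 8, 64), dtype=jnp.bfloat16)
v = jnp.zeros((2, 1024, 8, 64), dtype=jnp.bfloat16)

# implementation=None lets XLA choose; "cudnn" requests the fused kernel
# and raises if it isn't supported on the current device.
out = jax.nn.dot_product_attention(q, k, v, is_causal=True,
                                   implementation="cudnn")
```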