Josh

@JoshPurtell

Followers: 1,186 · Following: 641 · Media: 9 · Statuses: 1,039

Ars longa. The situation is excellent.

NYC
Joined July 2021
Pinned Tweet
@JoshPurtell
Josh
1 year
Agents are a joke until they’re not.
0
2
24
@JoshPurtell
Josh
4 months
We will be benchmarking the spectrum of prompting abilities at Betaworks on June 20th. Come if and only if you're good.
@trybasis
Basis
4 months
June 20th. 1 Champion. NYC.
Tweet media one
3
3
13
1
4
35
@JoshPurtell
Josh
23 days
@YouJiacheng He can still come clean. Blame it on an adderall-fueled manic episode, like SBF did, or whatever. Waiting until tmr makes that out difficult. He should come clean now.
13
0
1K
@JoshPurtell
Josh
6 months
There's a lot of confused thinking when it comes to early stage. By far the worst offender is the idea of 'moats', which most people hold to be very important in hypotheticals, yet rarely IRL. My take on what actually achieves what moats don't:
1
0
0
@JoshPurtell
Josh
8 months
@yacineMTB Coding an impressive architecture (performance) is basically just IQ, yeah. But coding *the right* impressive architecture (effectiveness) for a given situation is pretty experience-driven.
2
0
59
@JoshPurtell
Josh
7 months
There are multiple great frameworks like RAGatouille by @bclavie and Neural-Cherche by @raphaelsrty. But I still encounter some resistance when telling people to try out ColBERT. Now there's no excuse -> wrote a quick gist that does synthetic data gen, fine-tuning, eval. Just add
3
11
49
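The gist itself is not linked in this scrape. As a rough illustration of the synthetic-data-gen -> fine-tune -> eval loop described above, here is a minimal sketch using RAGatouille's RAGTrainer. The argument names are from memory of its README and may differ by version, and generate_queries is a hypothetical stand-in for whatever LM call produces synthetic queries; this is not the gist from the tweet.

```python
# Minimal sketch (not the tweet's gist): fine-tune a ColBERT model with RAGatouille.
# RAGTrainer argument names are from memory and may differ across versions.
# generate_queries() is a hypothetical stand-in for an LM call that writes synthetic
# queries answerable from a given document.
from ragatouille import RAGTrainer

def generate_queries(document: str) -> list[str]:
    raise NotImplementedError  # plug in your preferred LM here

corpus = ["ColBERT is a late-interaction retrieval model...", "..."]

# 1. Synthetic data generation: (query, relevant_passage) pairs.
pairs = [(q, doc) for doc in corpus for q in generate_queries(doc)]

# 2. Fine-tuning from a pretrained ColBERTv2 checkpoint.
trainer = RAGTrainer(model_name="my-colbert", pretrained_model_name="colbert-ir/colbertv2.0")
trainer.prepare_training_data(raw_data=pairs, all_documents=corpus)
trainer.train(batch_size=32)

# 3. Eval: index with the fine-tuned model and spot-check retrieval quality
#    (the original gist's evaluation setup is not recoverable from the tweet).
```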
@JoshPurtell
Josh
10 months
@yacineMTB @andyohlbaum @digiaiapp Darwinian selection is back on the menu
2
2
43
@JoshPurtell
Josh
6 months
@sebjenseb why are 14 and 7 the same?
4
0
44
@JoshPurtell
Josh
27 days
Evaluating LM agents has come a long way since gpt-4 released in March of 2023. We now have SWE-Bench, (Visual) Web Arena, and other evaluations that tell us a lot about how the best models + architectures do on hard and important tasks. There's still lots to do, though 🧵
2
11
43
@JoshPurtell
Josh
1 month
Researchers and engineers want a lot from LMs, and not all of those desiderata are captured by leaderboards like lmsys. One of the capabilities I care about is an LM's ability to effectively reason over lots of information. It turns out, this ability varies widely!
3
7
41
@JoshPurtell
Josh
6 months
@Nazionalis69101 Apollo is real. “You” are not
0
2
37
@JoshPurtell
Josh
5 months
I'm co-hosting an experienced practitioners ML reading group in Flatiron. The purpose:
- Go through the collective exercise of distilling an important paper into its bare essentials + the good ideas. Good skill.
- Discuss / hack on applications, experiments, hypotheses.
6
3
39
@JoshPurtell
Josh
7 months
@MoreBirths The Czech Republic (one of the best birth rates in Europe and almost 100% atheist) exists, so this hypothesis is wrong. In fact, there is a third variable that causes both lack of religiosity and national barrenness.
6
0
37
@JoshPurtell
Josh
7 months
@marvinvonhagen @TechBroDrip @agihouse_org Time to restore competent leadership to Google
0
2
34
@JoshPurtell
Josh
7 months
@_JakubJanda The Czech Republic, along with Germany, is one of the only European countries with industrial capacity. France and England mostly exist to launder American inflation proceeds and are actively destroying their own economies. This of course recommends them over CR to the WSJ.
8
0
30
@JoshPurtell
Josh
4 months
Prompting multistage programs is hard - evaluation is slow, LM behavior hard to anticipate, root causes for failures tricky to ferret out. We break down the problem to its constituent parts and systematically evaluate what works. Really excited for people to use MIPROv2!
@lateinteraction
Omar Khattab
4 months
🚨Announcing the largest study focused on *how* to optimize the prompts within LM programs, a key DSPy challenge. Should we use LMs to… Craft instructions? Self-generate examples? Handle credit assignment? Specify a Bayesian model? By @kristahopsalong * @michaelryan207 * &team🧵
Tweet media one
15
126
600
2
5
23
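For readers unfamiliar with what "optimizing the prompts within LM programs" looks like in practice, here is a rough sketch of invoking DSPy's MIPROv2 optimizer on a toy two-stage program. Constructor and compile() arguments vary across DSPy releases, so treat the parameter names as assumptions; the TwoStepQA program, the metric, and the tiny trainset are illustrative placeholders, not anything from the paper.

```python
# Rough sketch of prompt optimization with DSPy's MIPROv2.
# Argument names vary across DSPy versions; the program, metric, and trainset are toys.
import dspy
from dspy.teleprompt import MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # LM setup differs by DSPy version

class TwoStepQA(dspy.Module):
    """A tiny multi-stage LM program: draft an answer, then refine it."""
    def __init__(self):
        super().__init__()
        self.draft = dspy.ChainOfThought("question -> draft_answer")
        self.refine = dspy.ChainOfThought("question, draft_answer -> answer")

    def forward(self, question):
        draft = self.draft(question=question).draft_answer
        return self.refine(question=question, draft_answer=draft)

def exact_match_metric(example, prediction, trace=None):
    return example.answer.strip().lower() == prediction.answer.strip().lower()

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
]

optimizer = MIPROv2(metric=exact_match_metric, auto="light")
optimized_program = optimizer.compile(TwoStepQA(), trainset=trainset)
```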
@JoshPurtell
Josh
11 months
Am hosting a recurring + small "in the trenches" event for anyone solving hard ML problems and/or problems with ML. The objective of the meeting is to encourage creative problem-solving in a convivial setting. All conversation should drive toward relevant open questions and
@the_simonpastor
Simon Pastor
11 months
Organizing the first session of our New York ML Research Group tonight at 8pm with @JoshPurtell and @davideasnaghi dm if you want to join
0
0
7
0
3
17
@JoshPurtell
Josh
1 month
@theasianchris1 @isaacbmiller1 My thoughts on the topic - vaguely related to Isaac's. Summary: high-conviction long DSPy. And not just for compound LM systems.
1
6
16
@JoshPurtell
Josh
1 year
Basis is hiring ML Engineering Interns. We're deploying ML solutions across the stack - LLMs, agents, RL, conformal prediction, and more - to solve every problem in accounting with AI. Help us do it faster.
0
1
14
@JoshPurtell
Josh
3 months
@WillManidis Aristocrat culture used to valorize courage / bravery
1
0
14
@JoshPurtell
Josh
5 months
Excited about gpt-4o? Fast, cheap, multi-modal. What's the catch? There are many capabilities that make a good language model. One of them is raw state capacity - how much information can the LLM track and use before it runs out of 'space' in its residual stream?
4
2
12
@JoshPurtell
Josh
10 months
Second iteration of ML in the trenches tonight, 7pm in Flatiron. Will be focused on mechanistic interpretability. These are focused sessions where a researcher presents the problem they're actively working on, and other researchers rapidly idea+experiment. DM me or
3
3
12
@JoshPurtell
Josh
4 months
B2B SaaS is here to stay.
@cpaik
Chris Paik
4 months
The End of Software
385
459
3K
3
0
12
@JoshPurtell
Josh
10 months
Pmarca missed out on OpenAI. So now he's going to spend a few mil to nuke their profits to zero. Live player. Deeply based.
@a16z
a16z
10 months
a16z is thrilled to announce our Series A investment in @MistralAI . Mistral is at the center of a passionate open source AI developer community. We think this is the most promising path to achieve widely adopted AI systems, and that Mistral is the leading team on this path.
25
56
518
1
0
12
@JoshPurtell
Josh
5 months
@natolambert @julien_c Elo doesn't matter at all. Benchmark it on agent tasks and it will fail most likely
2
0
11
@JoshPurtell
Josh
3 months
Screen cap of the crafter session that secured the $5k prize for the Prompt Olympics' winner. The LLM agent he prompted almost mined iron - but tried to do so with a wood pickaxe, not stone - and secured 10/22 achievements in one run. Congrats Coleman! @povgoggles
Tweet media one
1
2
11
@JoshPurtell
Josh
5 months
@VictorTaelin Where is the text emulator?
1
0
10
@JoshPurtell
Josh
10 months
@QuanquanGu Language models introduce a powerful prior for medium-term planning. Given that medium term choice feels much more combinatorial than short (only so many actions) and long term (only so many high level goals), this could unlock a lot of problems for agents. Really like Jarvis,
3
0
10
@JoshPurtell
Josh
2 months
@_xjdr Big improvement. Haven't tested 405 yet but 3.1-70b blows 3-70b out of the water in an unexpected way on my favorite benchmark - testing state capacity.
# State Capacity Results (Higher is better)
# claude-sonnet-3-5-20240620: 200
# gpt-4-32k: 95
# gpt-4-turbo: 75
#
3
0
9
@JoshPurtell
Josh
19 days
These models can reason. Will need to update some apis before running more evals, but excited to see how they do on the other task suites in LRC Bench. More to come! Reference:
Tweet media one
1
0
8
@JoshPurtell
Josh
5 months
I was blown away at how fruitful the first few iterations were. Suffice it to say, open invite to all who attended those. First session will be tackling DreamerV3, which just got some updates. Rumor has it, there might be some benchmark results on a few puffer lib environments
1
0
8
@JoshPurtell
Josh
1 month
Davide is a force of nature.
@diodeinc
Diode Computers, Inc.
1 month
Diode Computers, Inc. just launched on @ycombinator 's Launch YC! Diode Computers, Inc. — Circuit boards as a service. Check us out:
0
11
40
0
0
8
@JoshPurtell
Josh
1 year
@BasedBeffJezos Weekends - explore
Week - exploit
If you're not amassing insane leverage on a Sunday it's over
2
1
8
@JoshPurtell
Josh
9 months
@jxmnop There is fast progress, but unfortunately also a complete publishing freeze across the prominent labs. No one is sharing.
0
0
8
@JoshPurtell
Josh
1 year
@alexgraveley - my agent does work - solving these problems requires like 600 lines of code and two core abstractions
1
2
6
@JoshPurtell
Josh
5 months
MIPRO Ascendancy
@lateinteraction
Omar Khattab
5 months
When you randomly come across a paper that delivers 18-point improvements over the next best system (and tops *three* different leaderboards) by cleverly applying DSPy optimizers, but it's Friday afternoon so it's not a great time to publicize that sort of thing. Monday it is.
4
6
157
0
1
7
@JoshPurtell
Josh
3 months
@terryyuezhuo 25 engineers are competing to prompt LLMs to score best on your benchmark - excellent work!
@mitch_troy
Mitchell Troyanovsky
3 months
@trybasis prompt Olympics with @modal_labs @retool @promptlayer running. Over 30k inference calls made and half the contestants have been eliminated. Final four coming soon
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
2
5
1
1
7
@JoshPurtell
Josh
6 months
I'm assembling a team
@JoshPurtell
Josh
6 months
@ocolegro Have a goated AI eng friend from Yale math on board. Trying to secure a math PhD mutual. Still looking for someone w deep expertise training/tuning foundation models. Very serious abt this, think it could be fun.
0
0
3
0
0
7
@JoshPurtell
Josh
27 days
TLDR: agent evals go BRRR
@JoshPurtell
Josh
27 days
Evaluating LM agents has come a long way since gpt-4 released in March of 2023. We now have SWE-Bench, (Visual) Web Arena, and other evaluations that tell us a lot about how the best models + architectures do on hard and important tasks. There's still lots to do, though 🧵
2
11
43
1
0
7
@JoshPurtell
Josh
27 days
If you want to get started, check out our quick ReAct demo here: BCB_Agent_Benchmark.ipynb (Colab notebook). To read more, check out the announcement. And to contribute, add issues and PRs here.
1
0
7
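The notebook link above did not survive the scrape. As a stand-in, here is a minimal sketch of the ReAct pattern the demo refers to (reason, pick an action, observe, repeat); call_llm and the toy tool registry are hypothetical placeholders, not the notebook's actual code.

```python
# Minimal ReAct-style loop, sketched from scratch (not the notebook's code).
# call_llm() is a hypothetical stand-in for any chat-completion call.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your LLM client here

TOOLS = {
    "run_tests": lambda code: "2 passed, 1 failed",  # placeholder tool
    "edit_draft": lambda patch: "draft updated",     # placeholder tool
}

def react(task: str, max_steps: int = 8) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        out = call_llm(
            transcript
            + "\nThink step by step, then emit either"
            + " 'Action: <tool>[<input>]' or 'Final: <answer>'."
        )
        transcript += out + "\n"
        if out.strip().startswith("Final:"):
            return out.split("Final:", 1)[1].strip()
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", out, re.S)
        if match and match.group(1) in TOOLS:
            observation = TOOLS[match.group(1)](match.group(2))
            transcript += f"Observation: {observation}\n"
    return "No final answer within budget."
```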
@JoshPurtell
Josh
19 days
O1-mini and O1-preview absolutely obliterating all other models on the LRCB abstract algebra benchmark (the hardest). Sadly may be an hour or two until the final results are in. Running them on CraftaxLM might be an overnight affair ... suggests hybrid systems may be valuable!
0
0
6
@JoshPurtell
Josh
19 days
Results are in for the Algebra sub-challenge of the LRC Benchmark! The challenge is very simple - reduce the product of K 2-cycles from S4 into simplest form. However, doing so requires the LM to keep track of a lot of information. So, a simple way to measure reasoning capacity 🧵
1
0
7
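The task as described (reduce a product of K 2-cycles from S4 to simplest form) is straightforward to reproduce. Below is a sketch of generating such an instance and its ground-truth answer with sympy; this reconstructs the setup from the tweet's description only, it is not the actual LRCBench harness, and the prompt wording is invented.

```python
# Sketch of the algebra task described above: compose K random 2-cycles from S4 and
# reduce the product to its simplest (cyclic) form. Reconstructed from the tweet's
# description; not the actual LRCBench harness. Prompt wording is invented.
import random
from itertools import combinations
from sympy.combinatorics import Permutation

def make_instance(k: int, n: int = 4, seed: int = 0):
    rng = random.Random(seed)
    two_cycles = [Permutation([list(pair)], size=n) for pair in combinations(range(n), 2)]
    chosen = [rng.choice(two_cycles) for _ in range(k)]
    product = Permutation(list(range(n)))  # identity
    for t in chosen:                       # compose transpositions left to right
        product = product * t
    prompt = (
        "Reduce the following product of transpositions in S4 to a single "
        "permutation in cycle notation: "
        + " ".join(str(tuple(i + 1 for i in t.cyclic_form[0])) for t in chosen)
    )
    answer = product.cyclic_form  # ground truth, e.g. [[0, 2, 1]]
    return prompt, answer

prompt, answer = make_instance(k=30)
```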
@JoshPurtell
Josh
1 month
This isn't quite needle-in-a-haystack, which mostly tests a model's ability to recognize needles. Instead, imagine if the model needed to reason over the entire haystack to deduce where the needle is. Well, that happens to be eval #2 in LRCBench!
1
0
7
@JoshPurtell
Josh
1 month
@YouJiacheng There is a difference btwn reasoning over long context and learning from it. Will share an eval on latter tn. Once labs build models designed for ICL (Gemini pro is an early hint), then things get interesting.
0
1
6
@JoshPurtell
Josh
1 year
Speed at any cost.
Tweet media one
Tweet media two
0
2
6
@JoshPurtell
Josh
3 months
23,000 LLM calls in. Just getting started.
@JoshPurtell
Josh
3 months
Soon.
0
0
3
0
1
6
@JoshPurtell
Josh
10 months
@corry_wang Basically, yeah. Seed stage equity is priced as prima facie investment + option to reinvest later (in much greater size). In any reasonably competitive market you'd expect prices to surpass prima facie returns
0
0
4
@JoshPurtell
Josh
7 months
@Espartiata @MoreBirths 1.7, which is (much) higher than all but the very poorest Christian countries on earth.
1
0
5
@JoshPurtell
Josh
1 month
LRCBench introduces three tasks that require the model to reason over a lot of information. These tasks require the LM to reason over code, data, and mathematical systems - so hopefully it won't over index on any given post-training modality, like gpt-4o's emphasis on math.
1
1
6
@JoshPurtell
Josh
1 month
What does it tell us? Well, mostly that, within a data mix, it seems like bigger models are better. Not much of a surprise there. What is surprising, though, is how widely performance varies between providers. Anthropic models are much, much better than OpenAI models.
1
0
6
@JoshPurtell
Josh
1 month
Why is that? Hard to speculate, but given that rumor has it that Anthropic cares a lot about synthetic data, it's possible that their models learn to represent information more effectively/compactly. But it could also be due to a million other reasons! Google is hit or miss.
1
0
6
@JoshPurtell
Josh
5 months
@martin_casado One thing I'd add: Transformers specialize in tasks that are highly parallelizable (TC0). Sequential tasks require either exceptionally large param counts (but this only buys so much), CoT, or multi-step pipelines.
2
2
6
@JoshPurtell
Josh
1 year
Is the vast plurality of VCs tweeting fortune cookie-tier content every hour of the work day an inside joke I haven't been let in on? There's one VC fund in particular where literally every VP drops Jordan Peterson content nonstop. Straussian humor?
1
0
6
@JoshPurtell
Josh
7 months
@mattshumer_ Haiku is really good and consistent. Outperforms GPT-4-turbo on many of my internal evals. Ludicrous given the price. Demonstrates the value of pretraining on excellent synthetic data
0
0
6
@JoshPurtell
Josh
7 months
@Archipelag38114 @MoreBirths True, but it’s been steadily rising since the fall of Communism. Could very well be above replacement in a decade or two.
1
0
6
@JoshPurtell
Josh
1 year
China Lake for a day. Insane talent, hardware + robotics on site. You need to apply, anon.
@0xPHBD
PHBD
1 year
I'm hosting a hardtech / defense hackathon in NYC on Nov. 3-4 with @anishgoel_ , @join_ef , and @8vc If you are a student, new grad, or operator in the space, we would love to have you. Please apply below or reach out to me with any questions.
7
11
51
0
1
6
@JoshPurtell
Josh
8 months
@adamcohenhillel How hard would it be to let it collect 24/7 and then batch process when connected to wifi? Also, will you be adding an agent layer?
1
0
6
@JoshPurtell
Josh
6 months
The modern day Salon
@the_simonpastor
Simon Pastor
6 months
Really enjoyed hosting our ai dinner with @ns_whit and @WorksInProgMag 🚀 reach out if you want to join the next one🔥
Tweet media one
Tweet media two
3
1
25
0
1
6
@JoshPurtell
Josh
5 months
@n_s_bradford My results on an internal benchmark to measure raw state capacity
# K pairs at which the LLM passes / fails (pass means it got 1/3 tries correct or better)
# OpenAI models
# gpt-4-32k: passes at 95 pairs / fails at 100 pairs
# gpt-4-turbo: 75/85
# gpt-4o: 20/25
# gpt-3.5-turbo:
3
0
3
@JoshPurtell
Josh
19 days
O1 and O1-mini are advertised as having lots of reasoning capacity. It turns out, they do! While they miss a few points here and there due to corrupted output formatting breaking my parser, both essentially solve the challenge. 30 2-cycles is no prob for o1-mini
1
0
5
@JoshPurtell
Josh
4 months
@IkkyusDen Same applies day to day you just don’t see it
0
0
5
@JoshPurtell
Josh
1 year
Have had a lot of conversations with college students / new grads about joining startups. It's hard to talk about dynamics that matter without pointing to a stack of PG essays. A *lot* of implicit knowledge that's hard to verbalize.
3
0
4
@JoshPurtell
Josh
3 months
@thomasahle @danielcorin1 Two distinct lines can intersect at most once. Two curves can intersect twice, though. Poor LM, must have been very confused
0
0
5
@JoshPurtell
Josh
1 year
@AndyTech99 @personofswag “Just” - that role would generate 2 OOMs more value than current SWE
1
0
5
@JoshPurtell
Josh
1 year
@Brad08414464 They’re the least productive because of regulation that protects them from technological disruptions and allows them egregious rent-seeking. Democrats will not allow their clients (teachers unions, healthcare professionals, etc) to have their jobs programs disrupted by AI
2
0
4
@JoshPurtell
Josh
4 months
@doomslide Idt FF is feasible. There are representations that LLMs learn that would be exceptionally difficult for a small FF to compute. Also unsure MCTS is a panacea. It helps with decoding quality but doesn't really improve world model
1
0
4
@JoshPurtell
Josh
1 month
@anishgoel_ Best researchers in the world are good at a very very simple loop:
- ask a meaningful question
- formulate a fast way to kill it
- if that fails, find the lowest # of bits possible to communicate the answer
Broadly applicable.
0
0
4
@JoshPurtell
Josh
4 months
@nosilverv Hero’s journey
0
0
5
@JoshPurtell
Josh
5 months
Receipts re: the benchmark I used. There are plenty of ways to test state capacity, but this one is rather simple. Just ask it to match a lot of pairs!
3
0
5
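The pair-matching probe is easy to reconstruct from this tweet and the pass/fail criterion given earlier on the page (a pass at K pairs means at least 1 of 3 tries is fully correct). The sketch below is a guess at that setup, not the benchmark's actual code; call_llm is a hypothetical stand-in for an LLM client.

```python
# Sketch of a pair-matching state-capacity probe, reconstructed from the tweets above
# (K pairs; pass = at least 1 of 3 tries fully correct). Not the actual benchmark code.
# call_llm() is a hypothetical stand-in for an LLM client.
import random
import string

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your LLM client here

def make_pairs(k: int, seed: int = 0) -> dict[str, str]:
    rng = random.Random(seed)
    rand = lambda: "".join(rng.choices(string.ascii_lowercase, k=6))
    return {rand(): rand() for _ in range(k)}

def passes_at_k(k: int, tries: int = 3) -> bool:
    pairs = make_pairs(k)
    keys = list(pairs)
    prompt = (
        "Memorize these key-value pairs:\n"
        + "\n".join(f"{key} -> {value}" for key, value in pairs.items())
        + "\nNow list the value for each key, one per line, in this order:\n"
        + "\n".join(keys)
    )
    for _ in range(tries):
        lines = [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
        if lines == [pairs[key] for key in keys]:
            return True  # one fully-correct try is enough to pass at this K
    return False
```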
@JoshPurtell
Josh
27 days
The first SmallBench benchmark is BigCodeBench-Agent (BCB-A). BCB-A provides the language agent with a simple, stateful environment (answer drafting/editing + generating and running unit tests) to help it succeed on tightly-scoped, moderately difficult programming challenges.
1
1
5
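A minimal sketch of what a stateful drafting-plus-unit-testing environment in the spirit of BCB-A could look like is below; the class and method names are invented for illustration and are not the actual SmallBench API.

```python
# Minimal sketch of a stateful draft-and-test environment in the spirit of BCB-A.
# Class/method names are invented for illustration; this is not the SmallBench API.
import subprocess
import tempfile

class DraftAndTestEnv:
    def __init__(self, problem_statement: str, unit_tests: str):
        self.problem_statement = problem_statement
        self.unit_tests = unit_tests  # run on demand against the current draft
        self.draft = ""               # the agent's current answer draft

    def edit_draft(self, new_code: str) -> str:
        """Replace the current draft; the environment keeps state across steps."""
        self.draft = new_code
        return "draft updated"

    def run_tests(self) -> str:
        """Run the unit tests against the current draft and return the raw output."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(self.draft + "\n\n" + self.unit_tests)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
        return (result.stdout + result.stderr)[-2000:]  # truncate long tracebacks
```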
@JoshPurtell
Josh
2 months
@teortaxesTex Sonnet is astonishingly good at storing large amounts of IC information simultaneously. Running this benchmark was a huge WTF moment - expected ~4t performance
@JoshPurtell
Josh
2 months
@_xjdr Big improvement. Haven't tested 405 yet but 3.1-70b blows 3-70b out of the water in an unexpected way on my favorite benchmark - testing state capacity.
# State Capacity Results (Higher is better)
# claude-sonnet-3-5-20240620: 200
# gpt-4-32k: 95
# gpt-4-turbo: 75
#
3
0
9
0
1
5
@JoshPurtell
Josh
10 months
@trickylabyrinth @yacineMTB @andyohlbaum @digiaiapp Yeah. Pretty soon anyone without cog sec is going to have substantially all of their agency stolen by malicious humans, ais, egregores. For now, the percentage is low but growing. Stay frosty
0
2
4
@JoshPurtell
Josh
3 months
@yi_ding @goodside Yeah, there is definitely a bias/variance trade off with prompts and some DSPy optimizers (including MIPROv2) will favor high variance candidates. In prod, I’ll often manually polish an optimized prompt - result tends to be best of both worlds. There was a recent paper that did
0
0
5
@JoshPurtell
Josh
2 months
@DivGarg9 So + ? If so, very cool. Do you have results on WebArena or VWebArena?
1
0
4
@JoshPurtell
Josh
10 months
@TheManMikeTan Death or victory.
1
0
5
@JoshPurtell
Josh
4 months
@doomslide Yeah it's ~haiku level in terms of working memory capacity. Probably ~34B activated params imo
@JoshPurtell
Josh
5 months
Receipts re: the benchmark I used. There are plenty of ways to test state capacity, but this one is rather simple. Just ask it to match a lot of pairs!
3
0
5
1
0
5
@JoshPurtell
Josh
2 months
@vmuaddib @_xjdr I like 4o and 4o-mini for agentic/"ToT" type setups but yeah they're really shallow reasoners
0
0
5
@JoshPurtell
Josh
21 days
@OfficialLoganK Is there a simple way to configure safety settings (pref to lowest possible)? Gemini-flash is bottom of the pack on my coding benchmark because it flags a decent proportion of simple, innocuous coding requests as "Medium" risk etc. Hard to use for serious apps
1
0
5
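For reference, relaxing Gemini safety thresholds in the Python SDK looks roughly like the sketch below. The enum and argument names are from memory of the google-generativeai package and may differ across SDK versions; whether this resolves the benchmark flagging described above is untested here.

```python
# Rough sketch of relaxing Gemini safety thresholds via the google-generativeai Python SDK.
# Enum/argument names are from memory and may differ across SDK versions.
import google.generativeai as genai
from google.generativeai.types import HarmBlockThreshold, HarmCategory

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel(
    "gemini-1.5-flash",
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    },
)
response = model.generate_content("Write a Python function that shells out to `ls`.")
print(response.text)
```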
@JoshPurtell
Josh
10 months
@lateinteraction Yeah, if you ask turbo really nicely to simulate gradient descent internally you can get some nice prompts e.g. 42% win rate on gsm8k+llama-13b. But you also get a lot of those. I guess the latent space is weirder than we thought! Maybe constrained generation / projection is
1
1
4
@JoshPurtell
Josh
4 months
@corbtt @OpenPipeAI I like finetuning but am hesitant to do it for problems where I expect intense distribution shifts / difficulty covering the entire test set distribution in training. FT essentially amounts to turning off most of the model’s capabilities, never know when you’ll need them
2
1
4
@JoshPurtell
Josh
1 month
@lateinteraction Crazy timing - quick post explaining it is hot off the presses!
0
1
4
@JoshPurtell
Josh
1 year
@thomasahle @karpathy The support vector machine gives you a set of weights by which to scale your space before applying kNN. The linearity of SVM makes this a simple linear transform. Using NNs here would not make sense as the dimensions would get "tangled"
1
0
4
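One way to read the suggestion above in code: fit a linear SVM, use the magnitude of its weight vector to rescale the feature space, then run kNN in the rescaled space. The sketch below uses scikit-learn and is only an illustration of that reading, not anything from the original thread.

```python
# Sketch of "scale the space with linear-SVM weights, then apply kNN", as one reading
# of the tweet above. Illustrative only; uses scikit-learn on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Linear SVM -> per-feature weights; scaling by |w| is a simple linear transform.
svm = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
weights = np.abs(svm.coef_).mean(axis=0)  # average over classes for multi-class safety

knn = KNeighborsClassifier(n_neighbors=5).fit(X * weights, y)
print(knn.score(X * weights, y))
```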
@JoshPurtell
Josh
6 months
@pli_cachete 1. He was incredibly successful early in life. 2. Deeply true and important: in small social contexts, people who manipulate empathy get detected and punished. In society at scale w democracy / mass media etc, that's no longer the case. Thus empathy becomes ruinous
0
0
4
@JoshPurtell
Josh
1 year
There's a dark side to power law curves, a loooooot of zeros on the left. Way more than you'd like @0xPHBD
0
1
4
@JoshPurtell
Josh
9 months
@nickcammarata My Rousseau-ian intuition is something like "contraction is probably worth it in high-stakes situations *when you can undo it shortly after*, the problem is we over-estimate stakes and don't know how to undo it" But it's unprincipled
1
0
4
@JoshPurtell
Josh
27 days
Huge shout-out to @BigCodeProject for releasing the underlying dataset that makes BCB-A possible, and to Modal for offering the excellent containerization needed to scale @charles_irl @modal_labs .
1
0
4
@JoshPurtell
Josh
6 months
@DrJimFan Have been thinking about building a CleanAgents benchmark, possibly with DSPy, to do this and have a way to evaluate methods like voyager/reflexion etc in a more procedural way.
1
0
4
@JoshPurtell
Josh
1 year
@StefanFSchubert It’s not halo effect, SBF is ignoring that the population in 1650 had 2 standard deviations more ability than we do, and the variance was higher to boot
1
0
4
@JoshPurtell
Josh
10 months
@AAAzzam Bragging about having 200k context is so embarrassing, it means your users care more about making 10% performance easy than getting 50% to 90%
1
0
4