@YouJiacheng
He can still come clean. Blame it on an adderall-fueled manic episode, like SBF did, or whatever.
Waiting until tmr makes that out harder to pull off. He should come clean now.
There's a lot of confused thinking when it comes to early stage. By far the worst offender is the idea of 'moats', which most people hold to be very important in hypotheticals, yet rarely IRL.
My take on what actually achieves what moats don't:
@yacineMTB
Coding an impressive architecture (performance) is basically just IQ, yeah.
But coding *the right* impressive architecture (effectiveness) for a given situation is pretty experience-driven.
There are multiple great frameworks like Ragatouille by
@bclavie
and Neural-Cherche by
@raphaelsrty
. But I still encounter some resistance when telling people to try out ColBERT.
Now there's no excuse -> wrote a quick gist that does synthetic data gen, fine-tuning, eval. Just add
Evaluating LM agents has come a long way since gpt-4 released in March of 2023.
We now have SWE-Bench, (Visual) Web Arena, and other evaluations that tell us a lot about how the best models + architectures do on hard and important tasks.
There's still lots to do, though 🧵
Researchers and engineers want a lot from LMs, and not all of those desiderata are captured by leaderboards like lmsys.
One of the capabilities I care about is an LM's ability to effectively reason over lots of information. It turns out, this ability varies widely!
I'm co-hosting an experienced practitioners ML reading group in Flatiron.
The purpose:
- Go through the collective exercise of distilling an important paper into its bare essentials + the good ideas. Good skill.
- Discuss / hack on applications, experiments, hypotheses.
@MoreBirths
The Czech Republic (one of the best birth rates in Europe and almost 100% atheist) exists, so this hypothesis is wrong.
In fact, there is a third variable that causes both lack of religiosity and national barrenness.
@_JakubJanda
The Czech Republic is, along with Germany, one of the only European countries with industrial capacity. France and England mostly exist to launder American inflation proceeds and are actively destroying their own economies.
This of course recommends them over CR to the WSJ.
Prompting multistage programs is hard - evaluation is slow, LM behavior hard to anticipate, root causes for failures tricky to ferret out.
We break down the problem to its constituent parts and systematically evaluate what works. Really excited for people to use MIPROv2!
🚨Announcing the largest study focused on *how* to optimize the prompts within LM programs, a key DSPy challenge.
Should we use LMs to… Craft instructions? Self-generate examples? Handle credit assignment? Specify a Bayesian model?
By @kristahopsalong, @michaelryan207, & team 🧵
Links to the projects mentioned:
Ragatouille, full features for ColBERT RAG:
Neural-Cherche, extremely clean and lightweight ColBERT fine-tuning:
The OG:
Am hosting a recurring + small "in the trenches" event for anyone solving hard ML problems and/or problems with ML.
The objective of the meeting is to encourage creative problem-solving in a convivial setting.
All conversation should drive toward relevant open questions and
@theasianchris1
@isaacbmiller1
My thoughts on the topic - vaguely related to Isaac's.
Summary: high-conviction long DSPy. And not just for compound LM systems.
Basis is hiring ML Engineering Interns
We're deploying ML solutions across the stack - LLMs, agents, RL, conformal prediction, and more - to solve every problem in accounting with AI.
Help us do it faster.
Excited about gpt-4o?
Fast, cheap, multi-modal. What's the catch?
There are many capabilities that make a good language model. One of them is raw state capacity - how much information can the LLM track and use before it runs out of 'space' in its residual stream?
Second iteration of ML in the trenches tonight, 7pm in Flatiron. Will be focused on mechanistic interpretability.
These are focused sessions where a researcher presents the problem they're actively working on, and other researchers rapidly idea+experiment.
DM me or
a16z is thrilled to announce our Series A investment in
@MistralAI
. Mistral is at the center of a passionate open source AI developer community. We think this is the most promising path to achieve widely adopted AI systems, and that Mistral is the leading team on this path.
Screen cap of the crafter session that secured the $5k prize for the Prompt Olympics' winner.
The LLM agent he prompted almost mined iron - but tried to do so with a wood pickaxe, not stone - and secured 10/22 achievements in one run. Congrats Coleman!
@povgoggles
@QuanquanGu
Language models introduce a powerful prior for medium-term planning.
Given that medium term choice feels much more combinatorial than short (only so many actions) and long term (only so many high level goals), this could unlock a lot of problems for agents.
Really like Jarvis,
@_xjdr
Big improvement. Haven't tested 405 yet but 3.1-70b blows 3-70b out of the water in an unexpected way on my favorite benchmark - testing state capacity.
# State Capacity Results (Higher is better)
# claude-sonnet-3-5-20240620: 200
# gpt-4-32k: 95
# gpt-4-turbo: 75
#
These models can reason.
Will need to update some apis before running more evals, but excited to see how they do on the other task suites in LRC Bench. More to come!
Reference:
I was blown away at how fruitful the first few iterations were. Suffice it to say, open invite to all who attended those.
First session will be tackling DreamerV3, which just got some updates. Rumor has it, there might be some benchmark results on a few puffer lib environments
When you randomly come across a paper that delivers 18-point improvements over the next best system (and tops *three* different leaderboards) by cleverly applying DSPy optimizers, but it's Friday afternoon so it's not a great time to publicize that sort of thing. Monday it is.
@ocolegro
Have a goated AI eng friend from Yale math on board. Trying to secure a math PhD mutual. Still looking for someone w deep expertise training/tuning foundation models.
Very serious abt this, think it could be fun.
If you want to get started, check out our quick ReAct demo here: BCB_Agent_Benchmark.ipynb (Colab notebook)
To read more, check out the announcement:
And to contribute, add issues and PRs here:
O1-mini and O1-preview absolutely obliterating all other models on the LRCB abstract algebra benchmark (the hardest).
Sadly may be an hour or two until the final results are in. Running them on CraftaxLM might be an overnight affair ... suggests hybrid systems may be valuable!
Results are in for the Algebra sub-challenge of the LRC Benchmark!
The challenge is very simple - reduce the product of K 2-cycles from S4 into simplest form. However, doing so requires the LM to keep track of a lot of information. So, a simple way to measure reasoning capacity 🧵
This isn't quite needle-in-a-haystack, which mostly tests a model's ability to recognize needles. Instead, imagine if the model needed to reason over the entire haystack to deduce where the needle is.
Well, that happens to be eval #2 in LRCBench!
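For concreteness, here's a plain-Python sketch of the arithmetic underlying this kind of task - my own reconstruction, not the benchmark code, and the left-to-right composition convention is an assumption: compose K transpositions from S4, then reduce the product to disjoint-cycle form.

```python
def transposition(a, b, n=4):
    # The 2-cycle (a b) in S_n, represented as a mapping tuple: i -> p[i]
    p = list(range(n))
    p[a], p[b] = p[b], p[a]
    return tuple(p)

def compose(p, q):
    # Function composition p ∘ q: apply q first, then p
    return tuple(p[q[i]] for i in range(len(q)))

def reduce_product(two_cycles, n=4):
    # Reduce a product of K 2-cycles (written left to right, rightmost
    # applied first) to a single permutation
    result = tuple(range(n))
    for a, b in two_cycles:
        result = compose(result, transposition(a, b, n))
    return result

def to_cycle_notation(perm):
    # Disjoint-cycle form, e.g. (1, 2, 0, 3) -> "(0 1 2)"; identity -> "e"
    seen, cycles = set(), []
    for start in range(len(perm)):
        if start in seen or perm[start] == start:
            seen.add(start)
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = perm[x]
        cycles.append(cycle)
    return "".join("(" + " ".join(map(str, c)) + ")" for c in cycles) or "e"

# K = 2: the product (0 1)(1 2) reduces to the 3-cycle (0 1 2)
print(to_cycle_notation(reduce_product([(0, 1), (1, 2)])))  # → (0 1 2)
```

The point of the eval: the LM must carry the full intermediate permutation across all K steps - there's no shortcut that looks at only part of the input.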
@YouJiacheng
There is a difference btwn reasoning over long context and learning from it.
Will share an eval on the latter tonight.
Once labs build models designed for ICL (Gemini pro is an early hint), then things get interesting.
@corry_wang
Basically, yeah. Seed stage equity is priced as prima facie investment + option to reinvest later (in much greater size)
In any reasonably competitive market you’d expect prices to surpass prima facie returns
LRCBench introduces three tasks that require the model to reason over a lot of information - spanning code, data, and mathematical systems - so hopefully it won't over-index on any given post-training modality, like gpt-4o's emphasis on math.
What does it tell us? Well, mostly that, within a data mix, it seems like bigger models are better. Not much of a surprise there.
What is surprising, though, is how widely performance varies between providers.
Anthropic models are much, much better than OpenAI models.
Why is that? Hard to speculate, but given that rumor has it that Anthropic cares a lot about synthetic data, it's possible that their models learn to represent information more effectively/compactly. But it could also be due to a million other reasons!
Google is hit or miss.
@martin_casado
One thing I'd add: Transformers specialize in tasks that are highly parallelizable (TC0):
Sequential tasks require either exceptionally large param counts (but this only buys so much), CoT, or multi-step pipelines.
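A concrete instance of the sequential side: the word problem over S5 (iterated permutation composition) is NC1-complete by Barrington's theorem and widely believed to lie outside TC0, so a fixed-depth transformer shouldn't crack long instances in one forward pass - but CoT can do one composition per emitted token. A toy sketch of the sequential scan a CoT trace emulates:

```python
# The S5 word problem is NC1-complete (Barrington's theorem) and believed
# to lie outside TC0. The loop below is the inherently serial computation;
# chain-of-thought sidesteps the depth limit by doing one step per token.

def compose(p, q):
    # Function composition: apply q first, then p
    return tuple(p[q[i]] for i in range(len(q)))

# A word of 30 transpositions in S5 (handy fact: (0 1)(1 2)(0 1) = (0 2),
# which is an involution, so this particular word reduces to the identity)
word = [(1, 0, 2, 3, 4), (0, 2, 1, 3, 4), (1, 0, 2, 3, 4)] * 10

state = tuple(range(5))
for perm in word:                 # the sequential scan
    state = compose(perm, state)
print(state)                      # → (0, 1, 2, 3, 4)
```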
Is the vast plurality of VCs tweeting fortune-cookie-tier content every hour of the work day an inside joke I haven’t been let in on?
There’s one VC fund in particular where literally every VP drops Jordan Peterson content nonstop. Straussian humor?
@mattshumer_
Haiku is really good and consistent. Outperforms GPT-4-turbo on many of my internal evals. Ludicrous given the price. Demonstrates the value of pretraining on excellent synthetic data
I'm hosting a hardtech / defense hackathon in NYC on Nov. 3-4 with @anishgoel_, @join_ef, and @8vc
If you are a student, new grad, or operator in the space, we would love to have you. Please apply below or reach out to me with any questions.
@n_s_bradford
My results on an internal benchmark to measure raw state capacity
# K pairs at which the LLM passes / fails (pass means it got 1/3 tries correct or better)
# OpenAI models
# gpt-4-32k: passes at 95 pairs / fails at 100 pairs
# gpt-4-turbo: 75/85
# gpt-4o: 20/25
# gpt-3.5-turbo:
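For illustration, a hypothetical reconstruction of a K-pair state-tracking task (the real benchmark's format isn't shown; everything beyond the 1-of-3-tries pass rule stated above is my assumption). Each variable either gets a fresh value or copies an earlier one, so resolving the final variable requires tracking the whole chain rather than spotting a single needle:

```python
import random

def make_state_tracking_task(k, seed=0):
    # Hypothetical reconstruction, NOT the actual benchmark: K bindings in
    # which each variable either gets a fresh number or copies an earlier
    # variable; answering about the last one requires tracking all of them.
    rng = random.Random(seed)
    values, lines = {}, []
    for i in range(k):
        if i == 0 or rng.random() < 0.5:
            v = rng.randrange(100)
            lines.append(f"v{i} = {v}")
            values[f"v{i}"] = v
        else:
            src = f"v{rng.randrange(i)}"
            lines.append(f"v{i} = {src}")
            values[f"v{i}"] = values[src]
    prompt = ("Track these assignments:\n" + "\n".join(lines) +
              f"\nWhat is the value of v{k - 1}? Answer with the number only.")
    return prompt, str(values[f"v{k - 1}"])

def passes(tries):
    # "Pass" at K = at least 1 of 3 tries correct, per the scoring above
    return sum(tries) >= 1

prompt, answer = make_state_tracking_task(20)
```

Sweeping K upward until the model stops passing yields the pass/fail pairs reported above.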
O1 and O1-mini are advertised as having lots of reasoning capacity.
It turns out, they do! While they miss a few points here and there due to corrupted output formatting breaking my parser, both essentially solve the challenge. 30 2-cycles is no prob for o1-mini
Have had a lot of conversations with college students / new grads about joining startups. It's hard to talk about dynamics that matter without pointing to a stack of PG essays. A *lot* of implicit knowledge that's hard to verbalize.
@thomasahle
@danielcorin1
Two distinct lines can intersect at most once.
Two curves can intersect twice, though.
Poor LM, must have been very confused
@Brad08414464
They’re the least productive because of regulation that protects them from technological disruptions and allows them egregious rent-seeking.
Democrats will not allow their clients (teachers unions, healthcare professionals, etc) to have their jobs programs disrupted by AI
@doomslide
Idt FF is feasible. There are representations that LLMs learn that would be exceptionally difficult for a small FF to compute.
Also unsure MCTS is a panacea. It helps with decoding quality but doesn't really improve world model
@anishgoel_
Best researchers in the world are good at a very very simple loop:
- ask a meaningful question
- formulate a fast way to kill it
- if that fails, find the lowest # of bits possible to communicate the answer
Broadly applicable.
The first SmallBench benchmark is BigCodeBench-Agent (BCB-A). BCB-A provides the language agent with a simple, stateful environment (answer drafting/editing + generating and running unit tests) to help it succeed on tightly-scoped, moderately difficult programming challenges.
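A minimal sketch of what such a stateful draft/edit/test environment could look like (class and method names here are hypothetical illustrations, not the actual BCB-A interface):

```python
# Toy draft-and-test environment: the agent writes a solution draft, edits
# it in place, and runs unit tests to get pass/fail feedback. (Hypothetical
# sketch -- the real BCB-A environment is more involved.)

class DraftEnv:
    def __init__(self, tests):
        self.draft = ""          # current answer draft
        self.tests = tests       # unit tests (callables over the namespace)

    def write(self, code):
        self.draft = code

    def edit(self, old, new):
        self.draft = self.draft.replace(old, new)

    def run_tests(self):
        # Execute the draft, then report per-test pass/fail to the agent
        ns = {}
        try:
            exec(self.draft, ns)
        except Exception as e:
            return [f"error: {e}"]
        return ["pass" if t(ns) else "fail" for t in self.tests]

# Usage: a toy task ("implement add") with one unit test
env = DraftEnv([lambda ns: ns.get("add") and ns["add"](2, 3) == 5])
env.write("def add(a, b):\n    return a - b")
print(env.run_tests())        # → ['fail'] -- the agent sees the failure...
env.edit("a - b", "a + b")    # ...and repairs the draft
print(env.run_tests())        # → ['pass']
```

The design choice that matters: the environment is stateful, so the agent can iterate toward a solution instead of answering in one shot.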
@teortaxesTex
Sonnet is astonishingly good at storing large amounts of IC information simultaneously.
Running this benchmark was a huge WTF moment - expected ~4t performance
@trickylabyrinth
@yacineMTB
@andyohlbaum
@digiaiapp
Yeah.
Pretty soon anyone without cog sec is going to have substantially all of their agency stolen by malicious humans, AIs, egregores.
For now, the percentage is low but growing. Stay frosty
@yi_ding
@goodside
Yeah, there is definitely a bias/variance trade off with prompts and some DSPy optimizers (including MIPROv2) will favor high variance candidates.
In prod, I’ll often manually polish an optimized prompt - result tends to be best of both worlds. There was a recent paper that did
@OfficialLoganK
Is there a simple way to configure safety settings (pref to lowest possible)?
Gemini-flash is bottom of the pack on my coding benchmark because it flags a decent proportion of simple, innocuous coding requests as "Medium" risk etc.
Hard to use for serious apps
@lateinteraction
Yeah, if you ask turbo really nicely to simulate gradient descent internally you can get some nice prompts e.g. 42% win rate on gsm8k+llama-13b
But you also get a lot of those. I guess the latent space is weirder than we thought!
Maybe constrained generation / projection is
@corbtt
@OpenPipeAI
I like finetuning but am hesitant to do it for problems where I expect intense distribution shifts / difficulty covering the entire test set distribution in training.
FT essentially amounts to turning off most of the model’s capabilities - you never know when you’ll need them
@thomasahle
@karpathy
The support vector gives you a set of weights by which to scale your space before applying knn. The linearity of SVM makes this a simple linear transform.
Using NNs here would not make sense as the dimensions would get “tangled”
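A minimal sketch of the idea in plain Python (the weight vector is hard-coded here as if learned by a linear SVM; actually training one is out of scope). Scaling dimension d by |w_d| stretches informative axes and shrinks noisy ones before kNN runs:

```python
def knn_predict(train, labels, x, k=3):
    # Plain kNN with squared Euclidean distance and majority vote
    nearest = sorted(range(len(train)),
                     key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    votes = [labels[i] for i in nearest[:k]]
    return max(set(votes), key=votes.count)

def scale(points, w):
    # Rescale each dimension by |w_d| -- a simple linear transform
    return [tuple(abs(wd) * xd for wd, xd in zip(w, p)) for p in points]

# Dimension 0 is informative, dimension 1 is pure noise.
train = [(0.0, 9.0), (0.1, -8.0), (0.2, 7.0),   # class 0
         (1.0, -9.0), (1.1, 8.0), (0.9, -7.0)]  # class 1
labels = [0, 0, 0, 1, 1, 1]
w = (5.0, 0.01)  # as-if-learned SVM weights: heavy on dim 0, ~zero on dim 1

x = (0.15, -8.5)  # clearly class 0 on dim 0; dim 1 is noise
plain = knn_predict(train, labels, x)
weighted = knn_predict(scale(train, w), labels, scale([x], w)[0])
print(plain, weighted)  # → 1 0 : raw kNN is fooled by the noisy axis
```

Because the SVM is linear, the "importance" per dimension really is a single scalar per axis, which is why the whole trick stays a simple rescaling.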
@pli_cachete
1. He was incredibly successful early in life.
2. Deeply true and important: in small social contexts, people who manipulate empathy get detected and punished. In society at scale w democracy / mass media etc, that’s no longer the case. Thus empathy becomes ruinous
@nickcammarata
My Rousseau-ian intuition is something like "contraction is probably worth it in high-stakes situations *when you can undo it shortly after*, the problem is we over-estimate stakes and don't know how to undo it"
But it's unprincipled
Huge shout-out to
@BigCodeProject
for releasing the underlying dataset that makes BCB-A possible, and to Modal (@charles_irl, @modal_labs) for offering the excellent containerization needed to scale.
@DrJimFan
Have been thinking about building a CleanAgents benchmark, possibly with DSPy, to do this and have a way to evaluate methods like voyager/reflexion etc in a more procedural way.
@StefanFSchubert
It’s not halo effect, SBF is ignoring that the population in 1650 had 2 standard deviations more ability than we do, and the variance was higher to boot
@AAAzzam
Bragging about having 200k context is so embarrassing, it means your users care more about making 10% performance easy than getting 50% to 90%