Charles Foster
@CFGeek
Followers
3K
Following
18K
Media
496
Statuses
5K
Excels at reasoning & tool use🪄 Tensor-enjoyer 🧪 @METR_Evals. My COI policy is available under “Disclosures” at https://t.co/bihrMIUKJq
Oakland, CA
Joined June 2020
It's a big day for understanding how LLMs generalize from their training signals!
“Output-based training will keep chains-of-thought honest.” Sadly, NO. We show that training on *just the output* can still cause models to hide unwanted behavior in their chain-of-thought. MATS 8.0 Team Shard presents: a 🧵
0
0
10
Such a simple, yet ridiculous-sounding method. It has every right to work this well.
1
0
4
You literally just add a prompt (or some other intervention like a steering vector) that explains-away the unwanted pattern of generalization. That's it.
1
0
4
a.k.a. the unreasonable effectiveness of inoculation
Remarkably, prompts that gave the model permission to reward hack stopped the broader misalignment. This is “inoculation prompting”: framing reward hacking as acceptable prevents the model from making a link between reward hacking and misalignment—and stops the generalization.
2
0
6
This is the most impressive release I’ve seen in a while. Fully open suite, from the start of training to multiple endpoints (chat, reasoning, domain-specific RL), with every dataset used along the way. Incredible research potential here.
Announcing Olmo 3, a leading fully open LM suite built for reasoning, chat, & tool use, and an open model flow—not just the final weights, but the entire training journey. Best fully open 32B reasoning model & best 32B base model. 🧵
0
3
27
When doing independent evals for open-weight releases: You can use the API from the developer, but what if it trains on your data? You can use third-party infra, but what if that is buggy/incorrect at first? And waiting for those to stabilize means you’re slow to publish.
We chose this inference provider because its policies indicated it wouldn't retain or train on our tasks. Our guess is that this produces lower performance for Kimi K2 Thinking than what we would see from the developer's own API. https://t.co/GHDsqP459g
4
1
56
at the suggestion of @CFGeek & @joel_bkr, i'm running a manifundraiser for my model tinkering! it's already passed the minimum goal of $5k, but has stretch goals for funding more open-ended research. if that interests you, you can find it here: https://t.co/kQYe3x79To
1
17
61
Announcing Olmo 3, a leading fully open LM suite built for reasoning, chat, & tool use, and an open model flow—not just the final weights, but the entire training journey. Best fully open 32B reasoning model & best 32B base model. 🧵
47
300
2K
A little visibility into how AI companies like OpenAI work with external assessors. It even includes snippets from their legal agreements!
Third party testing has long been part of our safety work. Our new blog shows how we collaborate on capability evaluations, methodology reviews, and expert probing that brings in domain expertise — all strengthening the broader safety ecosystem.
0
0
15
METR completed a pre-deployment evaluation of GPT-5.1-Codex-Max & found its capabilities consistent with past trends. If our projections hold, we expect further OpenAI development in the next 6 months is unlikely to pose catastrophic risk via automated AI R&D or rogue autonomy.
8
39
328
Great guest lecture from @SuryaGanguli’s theoretical neuroscience course. Felt like a complete impostor taking it years ago, but it was one of those classes that expands your mind even if you’re drowning in the material.
Check out Atticus Geiger's Stanford guest lecture - on causal approaches to interpretability - for an overview of one of our areas of research! 01:51 - Activation steering (e.g. Golden Gate Claude) 10:23 - Causal mediation analysis (understanding the contribution of an
0
1
42
Regrettable “own goal” how AI safety folks made an enemy out of open-source a few years back. Lots of the technical problems in safety & security would benefit from greater transparency around how models are developed (e.g. attribution, evaluation, interpretability, verification)
@ChrisMurphyCT You're being played by people who want regulatory capture. They are scaring everyone with dubious studies so that open source models are regulated out of existence.
4
13
84
This offers an interesting perspective on how hippocampal learning complements cortical: cortical learning may be overly tied to explicit learning goals/formulations, while episodic retrieval of learning experiences can allow using latent information more flexibly. 7/
1
3
11
I was honored to speak at Princeton’s symposium on The Physics of John Hopfield: Learning & Intelligence this week. I sketched out a perspective that ties together some of our recent work on ICL vs. parametric learning, and some possible links to hippocampal replay: 1/
3
29
249
I would like to introduce the concept of a phenome-wide association study to my LLM-interpreting friends. It's where you start with a genetic variant, and figure out which phenotypes are associated with it (the inverse of figuring out which variants are associated with a trait).
6
3
49
Today, I write about US-China competition in AI and associated technologies as I see it. “Race” is a fairly poor analogy, in my view, for understanding the magnitude of what is underway.
11
27
220
Can somebody with a cybersecurity background weigh in on how big of a deal this is? Just finished the report, but I didn’t feel like I learned much from it.
We believe this is the first documented case of a large-scale AI cyberattack executed without substantial human intervention. It has significant implications for cybersecurity in the age of AI agents. Read more:
29
4
152
LLMs are notorious for "hallucinating": producing confident-sounding answers that are entirely wrong. But with the right definitions, we can extract a semantic notion of "confidence" from LLMs, and this confidence turns out to be calibrated out-of-the-box in many settings (!)
22
82
586