Charles Foster
@CFGeek
Followers
3K
Following
18K
Media
499
Statuses
5K
Excels at reasoning & tool use🪄 Tensor-enjoyer 🧪 @METR_Evals. My COI policy is available under “Disclosures” at https://t.co/bihrMIUKJq
Oakland, CA
Joined June 2020
mechint is cool, but there are many other types of interp research that don’t get enough attention which should good direction
The GDM mechanistic interpretability team has pivoted to a new approach: pragmatic interpretability Our post details how we now do research, why now is the time to pivot, why we expect this way to have more impact and why we think other interp researchers should follow suit
1
1
26
In AI policy debates, I rarely value government-facing transparency on frontier AI and think most benefits require public information. Yet sharing information that broadly may create risks for frontier AI companies. Why do I think government-facing transparency is rarely useful?
3
3
22
If the eval informed decisions, including release, spend more talking about how/why. Technically credible third parties should instead be the main producers of results on public evals with full methodological transparency that can be standardized and compared across companies
1
1
6
As a concrete example, many current system cards allocate a lot of space to results on a bunch of public evals with mediocre-at-best experimental transparency. This is an uncanny valley where current practice is not what we want.
1
2
10
In exchange, frontier labs should publish *a lot* more about post-deployment insights, because this high-value insight is only possible with their usage data and related telemetry. This should be housed on websites/UIs appropriate for 2025, not static docs like its the 80s.
1
1
7
For cards released alongside model/system release, frontier labs should prioritize what must be said (e.g. if/how they are certain risk thresholds are met). Saying less reduces burden during the intense pre-release period.
1
2
7
Model/system cards should evolve because - Frontier models get updated a lot beyond the main training run - Elicitation (e.g. thinking mode, amount of test-time compute) matters a lot - Post-deployment insight is really valuable, yet largely unaddressed with static system cards
2
7
27
It seems like software engineers these days are mostly integrating closed-weight models into their workflows. By contrast, someone told me that folks in bio are using open-weight models a lot more. Can anyone confirm whether this is accurate?
0
0
9
Very bullish on recontextualization methods such as inoculation prompting. The ambitious vision is that even simple tools like promoting + finetuning can work to steer generalization (i.e. to choose *what* models pick up from training)
1
0
35
Based on the recent blog and paper from Anthropic, I made a blogpost detailing what I think about it in details and why I think we could do better (link in the replies)
4
13
81
Very excited for the Genesis Mission ->
whitehouse.gov
USHERING IN A NEW ERA OF DISCOVERY: Today, President Donald J. Trump signed an Executive Order launching the Genesis Mission, a new national effort to use
44
64
882
How might @METR_Evals' time horizon trend change if compute growth slows? In a new paper, @whitfill_parker, @bsnodin, and I show that trends + a common (and contestable -- read on!) economic model of algorithmic progress can imply substantial delays in AI capability milestones.
10
35
189
I think of the recent Anthropic paper “using in-context rationales to protect against unwanted out-of-context generalization from reward hacks”.
We call this: *out-of-context reasoning*Â (OOCR). This contrasts with regular *in-context learning* (ICL), where all the training examples are simply pasted into the prompt (with no finetuning). We evaluate ICL on the same tasks and find OOCR performs much better.
2
0
42
An employee claims that this AI developer releases its model weights “within a few hours” after training. Big if true.
I asked (on ChinaTalk) the head of product at Z ai, one of the leading Chinese companies building open models, how long it takes them to get their model out the door once its done training. Incredible stuff: "a few hours" and the model is on HuggingFace.
2
0
14
It's a big day for understanding how LLMs generalize from their training signals!
“Output-based training will keep chains-of-thought honest.” Sadly, NO. We show that training on *just the output* can still cause models to hide unwanted behavior in their chain-of-thought. MATS 8.0 Team Shard presents: a 🧵
3
1
33
Such a simple, yet ridiculous-sounding method. It has every right to work this well.
1
0
4
You literally just add a prompt (or some other intervention like a steering vector) that explains-away the unwanted pattern of generalization. That's it.
1
0
4
a.k.a. the unreasonable effectiveness of inoculation
Remarkably, prompts that gave the model permission to reward hack stopped the broader misalignment. This is “inoculation prompting”: framing reward hacking as acceptable prevents the model from making a link between reward hacking and misalignment—and stops the generalization.
2
0
9