CFGeek Profile Banner
Charles Foster Profile
Charles Foster

@CFGeek

Followers
3K
Following
18K
Media
499
Statuses
5K

Excels at reasoning & tool use🪄 Tensor-enjoyer 🧪 @METR_Evals. My COI policy is available under “Disclosures” at https://t.co/bihrMIUKJq

Oakland, CA
Joined June 2020
Don't wanna be here? Send us removal request.
@CFGeek
Charles Foster
3 years
Running list of conjectures about neural networks 📜:
6
10
159
@AnjneyMidha
Anjney Midha
14 hours
mechint is cool, but there are many other types of interp research that don’t get enough attention which should good direction
@NeelNanda5
Neel Nanda
19 hours
The GDM mechanistic interpretability team has pivoted to a new approach: pragmatic interpretability Our post details how we now do research, why now is the time to pivot, why we expect this way to have more impact and why we think other interp researchers should follow suit
1
1
26
@RishiBommasani
rishi@NeurIPS
2 days
In AI policy debates, I rarely value government-facing transparency on frontier AI and think most benefits require public information. Yet sharing information that broadly may create risks for frontier AI companies. Why do I think government-facing transparency is rarely useful?
3
3
22
@RishiBommasani
rishi@NeurIPS
4 days
If the eval informed decisions, including release, spend more talking about how/why. Technically credible third parties should instead be the main producers of results on public evals with full methodological transparency that can be standardized and compared across companies
1
1
6
@RishiBommasani
rishi@NeurIPS
4 days
As a concrete example, many current system cards allocate a lot of space to results on a bunch of public evals with mediocre-at-best experimental transparency. This is an uncanny valley where current practice is not what we want.
1
2
10
@RishiBommasani
rishi@NeurIPS
4 days
In exchange, frontier labs should publish *a lot* more about post-deployment insights, because this high-value insight is only possible with their usage data and related telemetry. This should be housed on websites/UIs appropriate for 2025, not static docs like its the 80s.
1
1
7
@RishiBommasani
rishi@NeurIPS
4 days
For cards released alongside model/system release, frontier labs should prioritize what must be said (e.g. if/how they are certain risk thresholds are met). Saying less reduces burden during the intense pre-release period.
1
2
7
@RishiBommasani
rishi@NeurIPS
4 days
Model/system cards should evolve because - Frontier models get updated a lot beyond the main training run - Elicitation (e.g. thinking mode, amount of test-time compute) matters a lot - Post-deployment insight is really valuable, yet largely unaddressed with static system cards
2
7
27
@CFGeek
Charles Foster
3 days
It seems like software engineers these days are mostly integrating closed-weight models into their workflows. By contrast, someone told me that folks in bio are using open-weight models a lot more. Can anyone confirm whether this is accurate?
0
0
9
@CFGeek
Charles Foster
7 days
Very bullish on recontextualization methods such as inoculation prompting. The ambitious vision is that even simple tools like promoting + finetuning can work to steer generalization (i.e. to choose *what* models pick up from training)
1
0
35
@secemp9
secemp
8 days
Based on the recent blog and paper from Anthropic, I made a blogpost detailing what I think about it in details and why I think we could do better (link in the replies)
4
13
81
@joel_bkr
Joel Becker
8 days
How might @METR_Evals' time horizon trend change if compute growth slows? In a new paper, @whitfill_parker, @bsnodin, and I show that trends + a common (and contestable -- read on!) economic model of algorithmic progress can imply substantial delays in AI capability milestones.
10
35
189
@CFGeek
Charles Foster
9 days
*paper as
0
0
1
@CFGeek
Charles Foster
9 days
I think of the recent Anthropic paper “using in-context rationales to protect against unwanted out-of-context generalization from reward hacks”.
@OwainEvans_UK
Owain Evans
1 year
We call this: *out-of-context reasoning*  (OOCR). This contrasts with regular *in-context learning* (ICL), where all the training examples are simply pasted into the prompt (with no finetuning). We evaluate ICL on the same tasks and find OOCR performs much better.
2
0
42
@CFGeek
Charles Foster
10 days
0
0
2
@CFGeek
Charles Foster
10 days
An employee claims that this AI developer releases its model weights “within a few hours” after training. Big if true.
@natolambert
Nathan Lambert
11 days
I asked (on ChinaTalk) the head of product at Z ai, one of the leading Chinese companies building open models, how long it takes them to get their model out the door once its done training. Incredible stuff: "a few hours" and the model is on HuggingFace.
2
0
14
@CFGeek
Charles Foster
10 days
It's a big day for understanding how LLMs generalize from their training signals!
@Turn_Trout
Alex Turner
11 days
“Output-based training will keep chains-of-thought honest.” Sadly, NO. We show that training on *just the output* can still cause models to hide unwanted behavior in their chain-of-thought. MATS 8.0 Team Shard presents: a 🧵
3
1
33
@CFGeek
Charles Foster
11 days
Such a simple, yet ridiculous-sounding method. It has every right to work this well.
1
0
4
@CFGeek
Charles Foster
11 days
You literally just add a prompt (or some other intervention like a steering vector) that explains-away the unwanted pattern of generalization. That's it.
1
0
4
@CFGeek
Charles Foster
11 days
a.k.a. the unreasonable effectiveness of inoculation
@AnthropicAI
Anthropic
11 days
Remarkably, prompts that gave the model permission to reward hack stopped the broader misalignment. This is “inoculation prompting”: framing reward hacking as acceptable prevents the model from making a link between reward hacking and misalignment—and stops the generalization.
2
0
9