
Charles Foster
@CFGeek
Followers: 3K · Following: 17K · Media: 463 · Statuses: 5K
Excels at reasoning & tool use🪄 Tensor-enjoyer 🧪 @METR_Evals. My COI policy is available under “Disclosures” at https://t.co/bihrMIUKJq
Oakland, CA
Joined June 2020
Why aren’t our AI benchmarks better? AFAICT a key reason is that the incentives around them are kinda bad. In a new post, I explain how the standardized testing industry works and write about lessons it may have for the AI evals ecosystem. (1/2)
There’s a difference between features that *represent* a particular behavior (“I see X”) and features that *produce* a particular behavior (“Say X!”).
*SAEs Are Good for Steering - If You Select the Right Features* by @dana_arad4 @amuuueller @boknilev. They show that only a subset of SAE features actively control the generation, making them good candidates for model steering. https://t.co/0gJXbQIcmB
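The represent/produce distinction is easy to see in a toy sketch (not the paper's code; all tensors and dimensions below are made-up stand-ins): reading a feature's activation through the SAE encoder tells you a behavior is *represented*, while adding that feature's decoder direction back into the residual stream is what actually *produces* it.

```python
import torch

# Toy SAE over a d_model-dim residual stream (illustrative values only).
d_model, n_features = 16, 64
torch.manual_seed(0)
W_enc = torch.randn(d_model, n_features)  # encoder weights: read features ("I see X")
W_dec = torch.randn(n_features, d_model)  # decoder weights: write directions ("Say X!")
b_enc = torch.zeros(n_features)

def sae_features(resid):
    """Feature activations: how strongly each feature is *represented*."""
    return torch.relu(resid @ W_enc + b_enc)

def steer(resid, feature_idx, alpha=5.0):
    """Add one feature's decoder direction to *produce* its associated behavior."""
    return resid + alpha * W_dec[feature_idx]

resid = torch.randn(d_model)
print(sae_features(resid).topk(5).indices)  # reading: which features fire
print(steer(resid, feature_idx=3).shape)    # writing: intervene along one direction
```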
If anyone reposts it, everyone sighs.
🎧 Want early access to the audiobook? Quote-repost this post with anything related to the book. At 5pm ET we’ll pick the top 15 quote-reposts (details below) and DM them an early copy of the audiobook. (We have some redemption codes that will be no use to us in <24 hours.)
Davinci was his first. For him, it’s all been downhill since they started hyper-focusing on chat. Llama 3.1 was his first. For him, it’s all been downhill since they started hyper-focusing on reasoning. R1 was his first …
I always imagined it was one of those immaterial, online-only orgs where anons can run wild
Should AI regulations be based on training compute? As training pipelines become more complex, they could undermine compute-based AI policies. In a new piece with Google DeepMind’s AI Policy Perspectives team, we explain why. 🧵
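For context, compute-threshold policies typically lean on the standard C ≈ 6·N·D estimate of training FLOP. A hedged back-of-the-envelope illustration (the parameter and token counts are made-up example values, not figures from the linked piece):

```python
# Rule-of-thumb training-compute estimate: C ≈ 6 * params * tokens.
n_params = 70e9   # hypothetical 70B-parameter model
n_tokens = 15e12  # hypothetical 15T training tokens
flop = 6 * n_params * n_tokens
print(f"Estimated training compute: {flop:.2e} FLOP")  # ~6.3e24 FLOP
# Multi-stage pipelines (distillation, synthetic data, RL post-training)
# spread compute across several runs, which is what muddies single-run thresholds.
```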
FWIW it seems unlikely that the proposal in the quoted tweet would actually work. That’s maybe an even better reason to explore some other project idea!
The best "broke cracked undergrads" of my generation are thinking about how to better understand LLMs and how they do what they do. And that's great.
This is a message... and part of a system of messages... pay attention to it! Sending this message was important to us. We considered ourselves to be a powerful culture. This message is a warning about danger.
anyone have compute grants I can forward to a broke cracked undergrad who's experimenting with rl envs? cc: @willccbb @menhguin
Steering vectors found via context distillation would perform better than ones found via difference-of-means, but worse than direct prompting.
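For reference, the difference-of-means baseline mentioned here is just the mean activation under "positive" prompts minus the mean under "negative" prompts at some layer. A minimal sketch with random stand-in activations (not any particular paper's code):

```python
import torch

d_model = 32
torch.manual_seed(0)
pos_acts = torch.randn(100, d_model)  # activations on prompts showing the behavior
neg_acts = torch.randn(100, d_model)  # activations on prompts without it

# Difference-of-means steering vector.
steering_vec = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def apply_steering(hidden, alpha=4.0):
    """Add the vector to a layer's hidden state at inference time."""
    return hidden + alpha * steering_vec

print(apply_steering(torch.randn(d_model)).shape)
# Context distillation would instead finetune the model to match its own behavior
# under a steering prompt, which is why one might expect it to land between
# difference-of-means and direct prompting.
```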
I think the big caveat is that the probe is right before the unembedding. Would be nice to see if the same happens for earlier placements, and to quantify the effect more precisely
This is striking, even if anecdotal. When the authors add LoRA layers to produce better linear probes for a feature, the resulting model seems to condition its behavior on that feature more strongly!
An unexpected finding: when we train LoRA probes with minimal regularization, models become more epistemically cautious, sometimes acknowledging they've hallucinated immediately after doing so. We only train to predict binary hallucination labels from its own hidden states—no
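A hedged toy version of the probing setup described above, to make the mechanics concrete: predict a binary "hallucinated?" label from hidden states near the unembedding. The quoted work trains LoRA adapters inside the model; this sketch only fits a plain linear probe on randomly generated stand-in activations and labels.

```python
import torch
import torch.nn as nn

d_model, n_examples = 64, 512
torch.manual_seed(0)
hidden = torch.randn(n_examples, d_model)             # stand-in hidden states
labels = torch.randint(0, 2, (n_examples,)).float()   # stand-in hallucination labels

probe = nn.Linear(d_model, 1)            # linear probe on frozen activations
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(hidden).squeeze(-1), labels)
    loss.backward()
    opt.step()
print(f"final probe loss: {loss.item():.3f}")
```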
Correction: the finalized EU AI Act Code of Practice still requires a kind of compliance audit, in the form of an annual adequacy assessment (possibly self-administered) of a developer’s Safety and Security Framework and the developer’s adherence to it. (H/t @mentalgeorge)
@CFGeek Though note that the Code of Practice still requires developers to perform an "adherence assessment" analysing adherence to their own safety framework. Falls short of requiring an external audit, but still
Say that some finetuning dataset tends to give a model two tendencies, X & Y. Take any method known to artificially induce X in a model with fixed weights. If you apply this method while finetuning the model on that same dataset, it won’t pick up X as strongly.
We introduce a method called preventative steering, which involves steering towards a persona vector to prevent the model acquiring that trait. It's counterintuitive, but it’s analogous to a vaccine—to prevent the model from becoming evil, we actually inject it with evil.
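A minimal sketch of the preventative-steering idea in the quoted thread, assuming a precomputed trait vector and a PyTorch-style forward hook (the model, layer index, and vector are hypothetical placeholders, not the authors' code): during finetuning, the trait direction is injected into the residual stream so the weights don't have to move toward the trait to fit the data.

```python
def make_preventative_hook(trait_vector, alpha=4.0):
    """Return a forward hook that adds the trait vector to a layer's output."""
    def hook(module, inputs, output):
        # Assumes `output` is the layer's hidden-state tensor.
        return output + alpha * trait_vector
    return hook

# Usage sketch (assuming `model.layers[10]` exposes a residual-stream output):
# handle = model.layers[10].register_forward_hook(make_preventative_hook(trait_vec))
# ... run the ordinary finetuning loop on the dataset ...
# handle.remove()  # at inference the trait vector is no longer added
```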
Stuff applied interpretability might borrow from an amateur reading of bio/pharma
If you think of model internals as a kind of “biology”, then you can think of steering vectors as early and extremely basic “pharmaceuticals”. Within this metaphor, it’s no surprise that they often produce unintended side effects!
Could you use this property to reconstruct how different models an AI developer releases relate to one another?
We think transmission of traits (liking owls, misalignment) does NOT depend on semantic associations in the data b/c: 1. We do rigorous data filtering 2. Transmission fails if data are presented in-context 3. Transmission fails if student and teacher have different base models
Congrats to my colleagues @jide_alaga and @ChrisPainterYup, and our research collaborator @lucafrighetti for getting this out the door on the METR side!
These folks put in a ton of work to standardize how AI developers can report frontier risk evaluations in model cards (starting with chemical + biological capability evals)
How can we verify that AI ChemBio safety tests were properly run? Today we're launching STREAM: a checklist for more transparent eval results. I read a lot of model reports. Often they miss important details, like human baselines. STREAM helps make peer review more systematic.
Compliance audits are now a popular sacrificial offering burnt by those hoping to regulate frontier AI: (1) Removed from EU AI Act Code of Practice in later drafts (2) Removed from NY RAISE Act before floor vote (3) Removed from California SB 53 in last committee