Charles Foster
@CFGeek
Followers
3K
Following
18K
Media
500
Statuses
5K
Excels at reasoning & tool use🪄 Tensor-enjoyer 🧪 @METR_Evals. My COI policy is available under “Disclosures” at https://t.co/bihrMIUKJq
Oakland, CA
Joined June 2020
I think it's good for independent evaluators to provide information about the *context* surrounding their evaluation, in addition to their eval results.
As part of our launch, we are releasing AEF-1, a new standard which ensures a baseline level of independence, access, and transparency for evaluations: https://t.co/k3iFtdZMrT
0
3
20
Putting this side by side with the Anthropic paper, it looks like we now have two basic strategies to cope with reward hacking, beyond patching the environments: shaping how models generalize from hacking and incentivizing models to honestly report when they hack.
In a new proof-of-concept study, we’ve trained a GPT-5 Thinking variant to admit whether the model followed instructions. This “confessions” method surfaces hidden failures—guessing, shortcuts, rule-breaking—even when the final answer looks correct. https://t.co/4vgG9wS3SE
5
11
105
@norvid_studies would the Watchers really do that? go in the prompt and tell lies?
1
4
25
This is an interesting idea, though I’m unclear on if it’s worthwhile. To get this kind of customization otherwise you would need to run your own training; at best, you could build on top of a fully open model suite like Olmo 3 that provides intermediate checkpoints and datasets.
Today, AWS CEO Matt Garman announced Nova Forge, a model builder which lets companies inject their own data during the pre-training phase. "You [tell Forge]: 'Here's my corpus of corporate data, here's everything I need to know about my industry.' We then mix that in and finish
1
0
6
As I understand it, Kyutai is/was a nonprofit lab doing research on open-weight voice models, and a bunch of its technical leadership has now spun out into a startup to build voice AI products.
Gradium is out of stealth to solve voice. We raised $70M and after only 3 months we’re releasing our transcription and synthesis products to power the next generation of voice AI.
1
0
4
mechint is cool, but there are many other types of interp research that don’t get enough attention which should good direction
The GDM mechanistic interpretability team has pivoted to a new approach: pragmatic interpretability Our post details how we now do research, why now is the time to pivot, why we expect this way to have more impact and why we think other interp researchers should follow suit
1
3
31
In AI policy debates, I rarely value government-facing transparency on frontier AI and think most benefits require public information. Yet sharing information that broadly may create risks for frontier AI companies. Why do I think government-facing transparency is rarely useful?
3
3
22
If the eval informed decisions, including release, spend more talking about how/why. Technically credible third parties should instead be the main producers of results on public evals with full methodological transparency that can be standardized and compared across companies
1
1
6
As a concrete example, many current system cards allocate a lot of space to results on a bunch of public evals with mediocre-at-best experimental transparency. This is an uncanny valley where current practice is not what we want.
1
2
10
In exchange, frontier labs should publish *a lot* more about post-deployment insights, because this high-value insight is only possible with their usage data and related telemetry. This should be housed on websites/UIs appropriate for 2025, not static docs like its the 80s.
1
1
7
For cards released alongside model/system release, frontier labs should prioritize what must be said (e.g. if/how they are certain risk thresholds are met). Saying less reduces burden during the intense pre-release period.
1
2
7
Model/system cards should evolve because - Frontier models get updated a lot beyond the main training run - Elicitation (e.g. thinking mode, amount of test-time compute) matters a lot - Post-deployment insight is really valuable, yet largely unaddressed with static system cards
2
7
27
It seems like software engineers these days are mostly integrating closed-weight models into their workflows. By contrast, someone told me that folks in bio are using open-weight models a lot more. Can anyone confirm whether this is accurate?
0
0
9
Very bullish on recontextualization methods such as inoculation prompting. The ambitious vision is that even simple tools like promoting + finetuning can work to steer generalization (i.e. to choose *what* models pick up from training)
1
0
34
Based on the recent blog and paper from Anthropic, I made a blogpost detailing what I think about it in details and why I think we could do better (link in the replies)
4
13
81
Very excited for the Genesis Mission ->
whitehouse.gov
USHERING IN A NEW ERA OF DISCOVERY: Today, President Donald J. Trump signed an Executive Order launching the Genesis Mission, a new national effort to use
44
64
885
How might @METR_Evals' time horizon trend change if compute growth slows? In a new paper, @whitfill_parker, @bsnodin, and I show that trends + a common (and contestable -- read on!) economic model of algorithmic progress can imply substantial delays in AI capability milestones.
10
36
195
I think of the recent Anthropic paper “using in-context rationales to protect against unwanted out-of-context generalization from reward hacks”.
We call this: *out-of-context reasoning*Â (OOCR). This contrasts with regular *in-context learning* (ICL), where all the training examples are simply pasted into the prompt (with no finetuning). We evaluate ICL on the same tasks and find OOCR performs much better.
2
0
41