New paper! "Universal Neurons in GPT2 Language Models"
How many neurons are independently meaningful?
How many neurons reappear across models with different random inits?
Do these neurons specialize into specific functional roles or form feature families?
Answers below 🧵:
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales?
In a new paper with
@tegmark
we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
Neural nets are often thought of as feature extractors. But what features are neurons in LLMs actually extracting? In our new paper, we leverage sparse probing to find out. A 🧵:
For spatial representations, we run Llama-2 models on the names of tens of thousands of cities, structures, and natural landmarks around the world, the USA, and NYC. We then train linear probes on the last-token activations to predict the true latitude and longitude of each place.
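The probing setup above can be sketched in a few lines. This is a minimal stand-in with synthetic data: in the paper, `X` would be last-token residual-stream activations from Llama-2 and `Y` the true (latitude, longitude) pairs.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: random "activations" with a planted linear geometry.
rng = np.random.default_rng(0)
n, d = 2000, 256
true_dirs = rng.normal(size=(d, 2))            # planted lat/lon directions
X = rng.normal(size=(n, d))                    # stand-in activations
Y = X @ true_dirs + 0.1 * rng.normal(size=(n, 2))  # stand-in (lat, lon)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, Y_tr)       # one linear probe per coordinate
r2 = probe.score(X_te, Y_te)                   # held-out R^2, averaged over outputs
print(f"held-out R^2: {r2:.2f}")
```

If the feature really is encoded along linear directions, the probe recovers it with high held-out R^2, as here.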
For temporal representations, we run the models on the names of famous figures from the past 3000 years, the names of songs, movies, and books from 1950 onward, and NYT headlines from the 2010s, and train linear probes to predict the year of death, release date, and publication date.
But does the model actually _use_ these representations? By looking for neurons with weights similar to the probe's, we find many space and time neurons that are sensitive to the spacetime coordinates of an entity, showing the model actually learned the global geometry -- not the probe.
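The neuron search above amounts to ranking neurons by the cosine similarity between their weights and the probe direction. A sketch, with hypothetical random weights and one planted "space neuron":

```python
import numpy as np

# Hypothetical stand-ins: in practice W_out comes from a trained model's MLP
# output weights and probe_dir from a trained linear probe.
rng = np.random.default_rng(1)
n_neurons, d_model = 4096, 512
W_out = rng.normal(size=(n_neurons, d_model))
probe_dir = rng.normal(size=d_model)
W_out[7] = 5.0 * probe_dir + rng.normal(size=d_model)  # plant a "space neuron"

# Cosine similarity between each neuron's output weights and the probe direction.
cos = (W_out @ probe_dir) / (
    np.linalg.norm(W_out, axis=1) * np.linalg.norm(probe_dir)
)
top = np.argsort(-np.abs(cos))[:5]
print("candidate neurons:", top)
```

Neurons with unusually high alignment are then inspected by hand for sensitivity to the entity's coordinates.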
When training probes over every layer and model, we find that representations emerge gradually over the early layers before plateauing around the halfway point. As expected, bigger models are better, but for the most obscure dataset (NYC) no model is great.
Are these representations actually linear? By comparing the performance of nonlinear MLP probes with linear probes, we find evidence that they are! More complicated probes do not perform any better on the test set.
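The linear-vs-nonlinear comparison can be sketched as follows, with synthetic stand-in activations carrying a planted linear signal (hypothetical data, not from the paper):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# If the feature is linearly represented, a nonlinear probe should not beat
# a linear one on held-out data.
rng = np.random.default_rng(2)
X = rng.normal(size=(1500, 128))               # stand-in activations
w = rng.normal(size=128)
y = X @ w + 0.1 * rng.normal(size=1500)        # linearly-encoded target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
lin = Ridge().fit(X_tr, y_tr).score(X_te, y_te)
mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500,
                   random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(f"linear R^2={lin:.3f}  MLP R^2={mlp:.3f}")
```

Matching test-set R^2 between the two probe classes is the evidence for linearity.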
Are these representations robust to prompting? Probing on different prompts, we find performance is largely preserved but can be degraded by capitalizing the entity name or prepending random tokens. Also, probing on the trailing period instead of the last token is better for headlines.
A critical part of this project was constructing space and time datasets at multiple spatiotemporal scales with a diversity of entity types (e.g., both cities and natural landmarks).
One large family of neurons we find are “context” neurons, which activate only for tokens in a particular context (French text, Python code, US patent documents, etc.). When we delete these neurons, the loss increases in the relevant context while other contexts are unaffected!
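The deletion experiment is a zero-ablation. Here is a toy numpy sketch of the logic (all weights and the "French neuron" are hypothetical): a neuron that boosts one context's tokens is zeroed, and the loss on a token from that context goes up.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(3)
d, vocab = 16, 50
W_out = rng.normal(size=(d, vocab))
french_neuron = 0
W_out[french_neuron] = 0.0
W_out[french_neuron, :10] = 3.0        # toy: neuron boosts "French" tokens 0..9

def loss(acts, target):
    # Cross-entropy loss of the unembedded distribution on the target token.
    return -np.log(softmax(acts @ W_out)[target])

french_acts = rng.normal(size=d)
french_acts[french_neuron] = 2.0       # neuron fires in the French context
ablated = french_acts.copy()
ablated[french_neuron] = 0.0           # zero-ablate the context neuron
print("French-token loss, clean vs ablated:",
      loss(french_acts, 3), loss(ablated, 3))
```

In the paper this is done on real model activations per context; the toy just illustrates why the loss increase is context-specific.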
Thrilled to share my first grad school preprint “Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization” with
@dbertsim
Preprint:
Code:
Thread: 1/4
Short research post on a potential issue arising in Sparse Autoencoders (SAEs): the reconstruction errors change model predictions much more than a random error of the same magnitude!
This paper would not have been possible without my coauthors
@NeelNanda5
, Matthew Pauly, Katherine Harvey,
@mitroitskii
, and
@dbertsim
or all the foundational and inspirational work from
@ch402
,
@boknilev
, and many others!
Read the full paper:
@rafaelrmuller
@tegmark
We have results for PCA and you definitely need more than two PCs. This is because our datasets contain diverse entities (eg, cities, buildings, natural landmarks) and on inspection the first few PCs seemed to cluster this information.
But what if there are more features than there are neurons? This results in polysemantic neurons which fire for a large set of unrelated features. Here we show a single early layer neuron which activates for a large collection of unrelated n-grams.
That said, more than any specific technical contribution, we hope to contribute to the general sense that ambitious interpretability is possible: that LLMs have a tremendous amount of rich structure that can and should be understood by humans!
Early layers seem to use sparse combinations of neurons to represent many features in superposition. That is, they use the activations of multiple polysemantic neurons to boost the signal of the true feature over all interfering features (here “social security” vs. adjacent bigrams).
Results in toy models from
@AnthropicAI
and
@ch402
suggest a potential mechanistic fingerprint of superposition: large MLP weight norms and negative biases. We find a striking drop in early layers in the Pythia models from
@AiEleuther
and
@BlancheMinerva
.
New version is out (to appear at ICLR)! Main updates:
- Additional experiments on Pythia models
- Causal interventions on space and time neurons
- More related work
- Clarify our claims of a literal world model (static vs. dynamic)
- External replications!
More in thread:
While we found tons of interesting neurons with sparse probing, it requires careful follow-up analysis to draw more rigorous conclusions. E.g., athlete neurons turn out to be more general sports neurons when analyzing max-average-activating tokens.
We also observe many neuron functional roles, for instance (a) prediction, (b) suppression, and (c) partition neurons, which make coherent predictions about what the next token is (or is not). Suppression neurons reliably follow prediction neurons (bottom).
Attention heads can be effectively "turned off" by attending to the BOS token. We find neurons that control how much individual heads attend to BOS, effectively turning those heads on or off.
We found a very special pair of high-norm neurons (which exist across all model inits) that do not compose with the unembedding. Instead of changing the probability of any individual token, they change the entropy of the entire distribution by changing its scale!
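The scale-to-entropy mechanism is easy to see in isolation: multiplying the logits by a constant changes the entropy of the softmax distribution without changing the token ranking. A minimal numpy demonstration (toy logits, not from the model):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p)).sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])  # toy logit vector
for scale in (0.5, 1.0, 2.0):
    p = softmax(scale * logits)
    # Larger scale -> sharper distribution -> lower entropy; argmax unchanged.
    print(f"scale={scale}: argmax={p.argmax()}, entropy={entropy(p):.3f}")
```

This is why a neuron that only affects the overall scale can modulate the model's confidence without favoring any particular token.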
Precision and recall can also be helpful guides, and remind us that it should not be assumed a model will learn to represent features in an ontology convenient or familiar to humans.
When we zoom in, many neurons do have relatively clear interpretations! Using several hundred automated tests, we taxonomize the neurons into families, e.g., unigram, alphabet, previous-token, position, syntax, and semantic neurons.
Working with Neel was one of the most valuable experiences of my career and I can’t recommend working with him enough! The MATS cohort and program were also great – I think most people interested should definitely apply!
Are you excited about
@ch402
-style mechanistic interpretability research? I'm looking for scholars to mentor via MATS - apply by April 12!
I'm very impressed by the great work from past scholars, and enjoy mentoring promising mech interp talent. I'm excited for my next cohort!
After computing maximum pairwise neuron correlations across 5 models trained from different random inits, we find that (a) only 1-5% of neurons are "universal"; (b) high/low correlation in one model implies high/low correlation in all models; (c) neurons specialize by depth.
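The universality metric can be sketched as: for each neuron in model A, take the max Pearson correlation of its activations with any neuron in model B over a shared token dataset. A stand-in with random "activations" and one planted universal neuron (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
tokens, n_a, n_b = 5000, 64, 64
acts_a = rng.normal(size=(tokens, n_a))       # stand-in activations, model A
acts_b = rng.normal(size=(tokens, n_b))       # stand-in activations, model B
acts_b[:, 10] = acts_a[:, 3] + 0.1 * rng.normal(size=tokens)  # plant a match

# Standardize columns, then the correlation matrix is an inner product.
za = (acts_a - acts_a.mean(0)) / acts_a.std(0)
zb = (acts_b - acts_b.mean(0)) / acts_b.std(0)
corr = (za.T @ zb) / tokens                   # (n_a x n_b) Pearson correlations
max_corr = np.abs(corr).max(axis=1)           # best match in B per A neuron
print("neurons with max corr > 0.5:", np.where(max_corr > 0.5)[0])
```

With random inits, only a small fraction of neurons clear a high correlation threshold in every pair of models; those are the "universal" ones.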
What happens with scale? We find representational sparsity increases on average, but different features obey different scaling dynamics -- in particular, quantization and neuron splitting: features both emerge and split into finer-grained features.
What properties do these universal neurons have? They are consistently high norm and sparsely activating, with bimodal right tails -- in other words, what we would expect of monosemantic neurons!
Really enjoyed advising this follow-up project on the training dynamics of context neurons! I think there is a ton of good research to do at the intersection of interpretability and training dynamics, and I hope to see more!
A mystery in prior work: LLMs contain interpretable neurons that correspond to text language. Some aren't important, but deleting Pythia 70M’s German neuron increases loss by 12% on German text. Why?
We investigate over training and show it's part of a "second order circuit."
There were lots of mysteries we didn't fully understand. One fairly striking one was the relationship between activation frequency and the cosine similarity between input and output weights!
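The statistic in question is, per neuron, the cosine similarity between its input weight vector (a row of W_in) and its output weight vector. A sketch with hypothetical random weights, where the baseline value is near zero:

```python
import numpy as np

# Random stand-ins for a real model's per-layer MLP weights; here both are
# stored as (n_neurons x d_model) so each row is one neuron's vector.
rng = np.random.default_rng(6)
n_neurons, d_model = 1024, 256
W_in = rng.normal(size=(n_neurons, d_model))
W_out = rng.normal(size=(n_neurons, d_model))

# Per-neuron cosine similarity between input and output weights.
cos = np.einsum("nd,nd->n", W_in, W_out) / (
    np.linalg.norm(W_in, axis=1) * np.linalg.norm(W_out, axis=1)
)
print(f"mean cos(w_in, w_out) = {cos.mean():.3f}")
```

In the paper, this per-neuron statistic is plotted against activation frequency; for random weights it concentrates near zero, which is what makes the observed structure a mystery.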
Do language models know whether statements are true/false? And if so, what's the best way to "read an LLM's mind"?
In a new paper with
@tegmark
, we explore how LLMs represent truth. 1/N
@emilymbender
Indeed, this is an unfortunate bias in our dataset. These are all places from English Wikipedia, so even coverage of places like France or China is worse than South Africa or India.
We believe this extra modeling power will give practitioners unprecedented flexibility in tailoring the learning process to their problem domain and aid in learning dynamics in highly underdetermined settings. 4/4
This optimality buys consistent statistical gains across many different systems and data regimes while still being very tractable (sometimes even faster than heuristics). Perhaps most exciting is the ability to embed a huge variety of constraints for physics informed ML. 3/4
We consider the SINDy framework proposed by
@eigensteve
, Joshua Proctor, and Nathan Kutz to discover governing equations of dynamical systems directly from data.
We integrate exact sparse regression techniques to solve the SINDy problem to provable optimality. 2/4
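For intuition, here is a sketch of SINDy-style sparse regression on a toy 1-D system dx/dt = -2x + 0.5x^3, using sequentially thresholded least squares as a simple stand-in heuristic (our paper instead solves this sparse regression to provable optimality with mixed-integer optimization):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, size=400)
dxdt = -2 * x + 0.5 * x**3 + 0.01 * rng.normal(size=400)  # noisy derivatives

# Candidate function library evaluated on the data.
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3])
names = ["1", "x", "x^2", "x^3"]

# STLSQ: alternate least squares with hard thresholding of small coefficients.
xi, *_ = np.linalg.lstsq(Theta, dxdt, rcond=None)
for _ in range(5):
    small = np.abs(xi) < 0.1
    xi[small] = 0.0
    big = ~small
    xi[big], *_ = np.linalg.lstsq(Theta[:, big], dxdt, rcond=None)

print({n: round(float(c), 2) for n, c in zip(names, xi) if c != 0})
```

The heuristic recovers the true sparse support here, but unlike the MIO formulation it carries no optimality guarantee, which is the gap the paper addresses.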
However, there was a recent paper from Chen et al. on "Causal Representations of Space" in LLMs that builds on our work and finds "LLMs learn and use an internal model of space in solving geospatial related tasks."
Reviewers (and twitter) were unhappy with our use of the term "world model". We edited the text to clarify we use this term in its static sense -- i.e., that LLMs have a map of time and space, but we don't show this is part of a dynamic model used to solve downstream problems.
@maksym_andr
@askerlee
I think this is an artifact of OPT models being undertrained and using ReLU. I see some, but not nearly as many, dead neurons in Pythia and GPT2 models.
We reran our main probing sweep experiment with the Pythia models from
@AiEleuther
. We find clear scaling in model size, with a jump between the Pythia and Llama models, likely due to different training data sizes (300B vs. 2T tokens). Scale wins again!
We also ran a few simple causal intervention/ablation experiments on our space and time neurons. We find we can alter the predicted release year of famous artworks by intervening on time neurons, and that geospatial prompts suffer the highest loss increase under space-neuron ablations.