New paper! "Universal Neurons in GPT2 Language Models"
How many neurons are independently meaningful?
How many neurons reappear across models with different random inits?
Do these neurons specialize into specific functional roles or form feature families?
Answers below 🧵:
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales?
In a new paper with
@tegmark
we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
Neural nets are often thought of as feature extractors. But what features are neurons in LLMs actually extracting? In our new paper, we leverage sparse probing to find out. A 🧵:
For spatial representations, we run Llama-2 models on the names of tens of thousands of cities, structures, and natural landmarks around the world, the USA, and NYC. We then train linear probes on the last-token activations to predict the true latitude and longitude of each place.
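The probing setup above can be sketched in a few lines. This is a minimal stand-in with synthetic data: in the paper, `X` would be last-token residual-stream activations from Llama-2 and `Y` the true (latitude, longitude) pairs.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: random "activations" with a planted linear geometry.
rng = np.random.default_rng(0)
n, d = 2000, 256
true_dirs = rng.normal(size=(d, 2))            # planted lat/lon directions
X = rng.normal(size=(n, d))                    # stand-in activations
Y = X @ true_dirs + 0.1 * rng.normal(size=(n, 2))  # stand-in (lat, lon)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, Y_tr)       # one linear probe per coordinate
r2 = probe.score(X_te, Y_te)                   # held-out R^2, averaged over outputs
print(f"held-out R^2: {r2:.2f}")
```

If the feature really is encoded along linear directions, the probe recovers it with high held-out R^2, as here.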
For temporal representations, we run the models on the names of famous figures from the past 3000 years, the names of songs, movies, and books from 1950 onward, and NYT headlines from the 2010s, and train linear probes to predict the year of death, release date, and publication date.
But does the model actually _use_ these representations? By looking for neurons with weights similar to the probe's, we find many space and time neurons that are sensitive to the spacetime coordinates of an entity, showing the model actually learned the global geometry -- not the probe.
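The neuron search above amounts to ranking neurons by the cosine similarity between their weights and the probe direction. A sketch, with hypothetical random weights and one planted "space neuron":

```python
import numpy as np

# Hypothetical stand-ins: in practice W_out comes from a trained model's MLP
# output weights and probe_dir from a trained linear probe.
rng = np.random.default_rng(1)
n_neurons, d_model = 4096, 512
W_out = rng.normal(size=(n_neurons, d_model))
probe_dir = rng.normal(size=d_model)
W_out[7] = 5.0 * probe_dir + rng.normal(size=d_model)  # plant a "space neuron"

# Cosine similarity between each neuron's output weights and the probe direction.
cos = (W_out @ probe_dir) / (
    np.linalg.norm(W_out, axis=1) * np.linalg.norm(probe_dir)
)
top = np.argsort(-np.abs(cos))[:5]
print("candidate neurons:", top)
```

Neurons with unusually high alignment are then inspected by hand for sensitivity to the entity's coordinates.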
When training probes over every layer and model, we find that representations emerge gradually over the early layers before plateauing around the halfway point. As expected, bigger models are better, but for the most obscure dataset (NYC) no model is great.
Are these representations actually linear? By comparing the performance of nonlinear MLP probes with linear probes, we find evidence that they are! More complicated probes do not perform any better on the test set.
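The linear-vs-nonlinear comparison can be sketched as follows, with synthetic stand-in activations carrying a planted linear signal (hypothetical data, not from the paper):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# If the feature is linearly represented, a nonlinear probe should not beat
# a linear one on held-out data.
rng = np.random.default_rng(2)
X = rng.normal(size=(1500, 128))               # stand-in activations
w = rng.normal(size=128)
y = X @ w + 0.1 * rng.normal(size=1500)        # linearly-encoded target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
lin = Ridge().fit(X_tr, y_tr).score(X_te, y_te)
mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500,
                   random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(f"linear R^2={lin:.3f}  MLP R^2={mlp:.3f}")
```

Matching test-set R^2 between the two probe classes is the evidence for linearity.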
Are these representations robust to prompting? Probing on different prompts, we find performance is largely preserved but can be degraded by capitalizing the entity name or prepending random tokens. Also, probing on the trailing period instead of the last token is better for headlines.
A critical part of this project was constructing space and time datasets at multiple spatiotemporal scales with a diversity of entity types (e.g., both cities and natural landmarks).
One large family of neurons we find are “context” neurons, which activate only for tokens in a particular context (French text, Python code, US patent documents, etc.). When we delete these neurons, the loss increases in the relevant context while other contexts are unaffected!
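The deletion experiment is a zero-ablation. Here is a toy numpy sketch of the logic (all weights and the "French neuron" are hypothetical): a neuron that boosts one context's tokens is zeroed, and the loss on a token from that context goes up.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(3)
d, vocab = 16, 50
W_out = rng.normal(size=(d, vocab))
french_neuron = 0
W_out[french_neuron] = 0.0
W_out[french_neuron, :10] = 3.0        # toy: neuron boosts "French" tokens 0..9

def loss(acts, target):
    # Cross-entropy loss of the unembedded distribution on the target token.
    return -np.log(softmax(acts @ W_out)[target])

french_acts = rng.normal(size=d)
french_acts[french_neuron] = 2.0       # neuron fires in the French context
ablated = french_acts.copy()
ablated[french_neuron] = 0.0           # zero-ablate the context neuron
print("French-token loss, clean vs ablated:",
      loss(french_acts, 3), loss(ablated, 3))
```

In the paper this is done on real model activations per context; the toy just illustrates why the loss increase is context-specific.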
Thrilled to share my first grad school preprint “Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization” with
@dbertsim
Preprint:
Code:
Thread: 1/4
Short research post on a potential issue arising in Sparse Autoencoders (SAEs): the reconstruction errors change model predictions much more than a random error of the same magnitude!
This paper would not have been possible without my coauthors
@NeelNanda5
, Matthew Pauly, Katherine Harvey,
@mitroitskii
, and
@dbertsim
or all the foundational and inspirational work from
@ch402
,
@boknilev
, and many others!
Read the full paper:
@rafaelrmuller
@tegmark
We have results for PCA and you definitely need more than two PCs. This is because our datasets contain diverse entities (eg, cities, buildings, natural landmarks) and on inspection the first few PCs seemed to cluster this information.
But what if there are more features than there are neurons? This results in polysemantic neurons which fire for a large set of unrelated features. Here we show a single early layer neuron which activates for a large collection of unrelated n-grams.
That said, more than any specific technical contribution, we hope to contribute to the general sense that ambitious interpretability is possible: that LLMs have a tremendous amount of rich structure that can and should be understood by humans!
Early layers seem to use sparse combinations of neurons to represent many features in superposition. That is, they use the activations of multiple polysemantic neurons to boost the signal of the true feature over all interfering features (here “social security” vs. adjacent bigrams).
Results in toy models from
@AnthropicAI
and
@ch402
suggest a potential mechanistic fingerprint of superposition: large MLP weight norms and negative biases. We find a striking drop in early layers in the Pythia models from
@AiEleuther
and
@BlancheMinerva
.
New version is out (to appear at ICLR)! Main updates:
- Additional experiments on Pythia models
- Causal interventions on space and time neurons
- More related work
- Clarify our claims of a literal world model (static vs. dynamic)
- External replications!
More in thread:
While we found tons of interesting neurons with sparse probing, it requires careful follow-up analysis to draw more rigorous conclusions. E.g., athlete neurons turn out to be more general sports neurons when analyzing max-average-activating tokens.
We also observe many neuron functional roles, for instance (a) prediction, (b) suppression, and (c) partition neurons, which make coherent predictions about what the next token is (or is not). Suppression neurons reliably follow prediction neurons (bottom).
Attention heads can be effectively "turned off" by attending to the BOS token. We find neurons that control how much individual heads attend to BOS, effectively turning those heads on or off.
We found a very special pair of high-norm neurons (which exist across all model inits) that do not compose with the unembedding. Instead of changing the probability of any individual token, they change the entropy of the entire distribution by changing its scale!
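The scale-to-entropy mechanism is easy to see in isolation: multiplying the logits by a constant changes the entropy of the softmax distribution without changing the token ranking. A minimal numpy demonstration (toy logits, not from the model):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p)).sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])  # toy logit vector
for scale in (0.5, 1.0, 2.0):
    p = softmax(scale * logits)
    # Larger scale -> sharper distribution -> lower entropy; argmax unchanged.
    print(f"scale={scale}: argmax={p.argmax()}, entropy={entropy(p):.3f}")
```

This is why a neuron that only affects the overall scale can modulate the model's confidence without favoring any particular token.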
Precision and recall can also be helpful guides, and remind us that it should not be assumed a model will learn to represent features in an ontology convenient or familiar to humans.
When we zoom in, many neurons do have relatively clear interpretations! Using several hundred automated tests, we taxonomize the neurons into families, e.g., unigram, alphabet, previous-token, position, syntax, and semantic neurons.
Working with Neel was one of the most valuable experiences of my career and I can’t recommend working with him enough! The MATS cohort and program were also great – I think most people interested should definitely apply!
Are you excited about
@ch402
-style mechanistic interpretability research? I'm looking for scholars to mentor via MATS - apply by April 12!
I'm very impressed by the great work from past scholars, and enjoy mentoring promising mech interp talent. I'm excited for my next cohort!
After computing maximum pairwise neuron correlations across 5 models trained from different random inits, we find that (a) only 1-5% of neurons are "universal"; (b) high/low correlation in one model implies high/low correlation in all models; (c) neurons specialize by depth.
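The universality metric can be sketched as: for each neuron in model A, take the max Pearson correlation of its activations with any neuron in model B over a shared token dataset. A stand-in with random "activations" and one planted universal neuron (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
tokens, n_a, n_b = 5000, 64, 64
acts_a = rng.normal(size=(tokens, n_a))       # stand-in activations, model A
acts_b = rng.normal(size=(tokens, n_b))       # stand-in activations, model B
acts_b[:, 10] = acts_a[:, 3] + 0.1 * rng.normal(size=tokens)  # plant a match

# Standardize columns, then the correlation matrix is an inner product.
za = (acts_a - acts_a.mean(0)) / acts_a.std(0)
zb = (acts_b - acts_b.mean(0)) / acts_b.std(0)
corr = (za.T @ zb) / tokens                   # (n_a x n_b) Pearson correlations
max_corr = np.abs(corr).max(axis=1)           # best match in B per A neuron
print("neurons with max corr > 0.5:", np.where(max_corr > 0.5)[0])
```

With random inits, only a small fraction of neurons clear a high correlation threshold in every pair of models; those are the "universal" ones.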
What happens with scale? We find representational sparsity increases on average, but different features obey different scaling dynamics -- in particular, quantization and neuron splitting: features both emerge and split into finer-grained features.
What properties do these universal neurons have? They are consistently high norm and sparsely activating, with bimodal right tails -- in other words, what we would expect of monosemantic neurons!
Really enjoyed advising this follow-up project on the training dynamics of context neurons! I think there is a ton of good research to do at the intersection of interpretability and training dynamics, and I hope to see more!
A mystery in prior work: LLMs contain interpretable neurons that correspond to text language. Some aren't important, but deleting Pythia 70M’s German neuron increases loss by 12% on German text. Why?
We investigate over training and show it's part of a "second order circuit."
There were lots of mysteries we didn't fully understand. One fairly striking one was the relationship between activation frequency and the cosine similarity between input and output weights!
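The statistic in question is, per neuron, the cosine similarity between its input weight vector (a row of W_in) and its output weight vector. A sketch with hypothetical random weights, where the baseline value is near zero:

```python
import numpy as np

# Random stand-ins for a real model's per-layer MLP weights; here both are
# stored as (n_neurons x d_model) so each row is one neuron's vector.
rng = np.random.default_rng(6)
n_neurons, d_model = 1024, 256
W_in = rng.normal(size=(n_neurons, d_model))
W_out = rng.normal(size=(n_neurons, d_model))

# Per-neuron cosine similarity between input and output weights.
cos = np.einsum("nd,nd->n", W_in, W_out) / (
    np.linalg.norm(W_in, axis=1) * np.linalg.norm(W_out, axis=1)
)
print(f"mean cos(w_in, w_out) = {cos.mean():.3f}")
```

In the paper, this per-neuron statistic is plotted against activation frequency; for random weights it concentrates near zero, which is what makes the observed structure a mystery.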
Do language models know whether statements are true/false? And if so, what's the best way to "read an LLM's mind"?
In a new paper with
@tegmark
, we explore how LLMs represent truth. 1/N
@emilymbender
Indeed, this is an unfortunate bias in our dataset. These are all places from English Wikipedia, so even coverage of places like France or China is worse than South Africa or India.
We believe this extra modeling power will give practitioners unprecedented flexibility in tailoring the learning process to their problem domain and aid in learning dynamics in highly underdetermined settings. 4/4
This optimality buys consistent statistical gains across many different systems and data regimes while still being very tractable (sometimes even faster than heuristics). Perhaps most exciting is the ability to embed a huge variety of constraints for physics informed ML. 3/4
We consider the SINDy framework proposed by
@eigensteve
, Joshua Proctor, and Nathan Kutz to discover governing equations of dynamical systems directly from data.
We integrate exact sparse regression techniques to solve the SINDy problem to provable optimality. 2/4
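For intuition, here is a sketch of SINDy-style sparse regression on a toy 1-D system dx/dt = -2x + 0.5x^3, using sequentially thresholded least squares as a simple stand-in heuristic (our paper instead solves this sparse regression to provable optimality with mixed-integer optimization):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, size=400)
dxdt = -2 * x + 0.5 * x**3 + 0.01 * rng.normal(size=400)  # noisy derivatives

# Candidate function library evaluated on the data.
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3])
names = ["1", "x", "x^2", "x^3"]

# STLSQ: alternate least squares with hard thresholding of small coefficients.
xi, *_ = np.linalg.lstsq(Theta, dxdt, rcond=None)
for _ in range(5):
    small = np.abs(xi) < 0.1
    xi[small] = 0.0
    big = ~small
    xi[big], *_ = np.linalg.lstsq(Theta[:, big], dxdt, rcond=None)

print({n: round(float(c), 2) for n, c in zip(names, xi) if c != 0})
```

The heuristic recovers the true sparse support here, but unlike the MIO formulation it carries no optimality guarantee, which is the gap the paper addresses.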
However, there was a recent paper from Chen et al. on "Causal Representations of Space" in LLMs that builds on our work and finds "LLMs learn and use an internal model of space in solving geospatial related tasks."
Reviewers (and twitter) were unhappy with our use of the term "world model". We edited the text to clarify we use this term in its static sense -- i.e., that LLMs have a map of time and space, but we don't show this is part of a dynamic model used to solve downstream problems.
@maksym_andr
@askerlee
I think this is an artifact of OPT models being undertrained and using ReLU. I see some, but not nearly as many, dead neurons in Pythia and GPT2 models.
We reran our main probing sweep experiment with the Pythia models from
@AiEleuther
. We find clear scaling in model size, with a jump between the Pythia and Llama models, likely due to different training data sizes (300B vs. 2T tokens). Scale wins again!
We also ran a few simple causal intervention/ablation experiments on our space and time neurons. We find we can alter the predicted release year of famous artworks by intervening on time neurons, and that geospatial prompts suffer the highest loss increase under space-neuron ablations.