A lot of early mechanistic interpretability work focused on InceptionV1 (an ImageNet model from 2014). They made a lot of progress, but were held back by “polysemantic neurons” that respond to unrelated concepts.
In the last year, we’ve seen a lot of progress on this problem in
the more i use linear algebra, the more convinced i am that university mathematics is sort of broken? spending more time on geometric intuitions of things like SVD would be way more useful than just rote learning how to calculate eigenvalues.
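(for what it's worth, the geometric reading is easy to demo in a few lines of numpy — the 2x2 matrix below is just an arbitrary example i picked because its singular values come out clean:)

```python
import numpy as np

# Geometric reading of SVD: any linear map A = U @ diag(s) @ Vt is
# a rotation/reflection (Vt), then an axis-aligned scaling (s), then
# another rotation/reflection (U).
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
U, s, Vt = np.linalg.svd(A)

# The three geometric pieces compose back into the original map.
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True

# The singular values are the axis lengths of the ellipse the unit
# circle maps to; for this symmetric matrix they are 4 and 2 (the
# absolute values of the eigenvalues).
print(np.round(s, 6))  # [4. 2.]
```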
seems sort of surprising to me that John Schulman, previous head of post-training and first author of PPO paper, didn’t contribute to a model that plausibly required a lot of RL? it’s possible that he didn’t do anything worthy of being listed but that surprises me a bit.
We're releasing a preview of OpenAI o1—a new series of AI models designed to spend more time thinking before they respond.
These models can reason through complex tasks and solve harder problems than previous models in science, coding, and math.
I ❤️ Clong (@AnthropicAI's Claude but ~long~)
Claude summarised some literature on ferroptosis in cancer for me.
I never thought I'd need more than 32K in context but the papers totalled more than 54,000 words.
First I got Claude to provide me a table summarising the papers:
Recently, there’s been a lot of interest in “feature manifolds” or “multidimensional features”.
Curve detectors are a very natural candidate for a feature manifold, and indeed, curve features seem to be organised as a manifold.
i totally accept the argument for for-profit companies in AI but this seems kind of sus and i’m confused as to why it’s legal? why don’t more companies start as non profits, take in “donations”, and then, when they become profitable, become for-profit?
Later layers of InceptionV1 are more polysemantic than the early ones. I've previously used sparse autoencoders on early vision to find new features and have since found that they work well here as well, finding pretty monosemantic features that also form interpretable circuits!
just finished Lewis’ “Going Infinite”. one thing i found kind of jarring is representing EA as divorced from emotion. EA is one path out of the despair that comes from the immense suffering that exists. for many, the motivations are deeply emotional even if the methodology isn’t.
i truly had no idea people (EAs and their adjacents) put so much thought into signing the GWWC pledge?
my thought process was literally:
1) i want to donate at least 10% of my income 💵
2) oh cool there’s a thing i can publicly commit to this 🤠
📜🖊️
@Bbburner19
oh yeah i think eigenvalues are important! i think we should also do intuition building with these :) i just mean that i had to spend a whole lot of time just computing them by hand rather than understanding what on earth it all actually means.
Can finally talk about AnthropicAI's LLM that @shae_mcl and I have had the opportunity to play with for the past few months!
My main takeaways for the medical/education use cases are:
today someone asked me if i was EA and instead of explaining “well… you know, i’m a little bit adjacent but… blah blah” i actually said yes?? what does this mean??
a moment that seems worth highlighting from this (around ~56min) is that, allegedly, a number of years ago OpenAI leadership had laid out a plan to fund AGI development by selling it to nation states, with Russia and China as part of the proposed bidding war??
.@leopoldasch on:
- the trillion dollar cluster
- unhobblings + scaling = 2027 AGI
- CCP espionage at AI labs
- leaving OpenAI and starting an AGI investment firm
- dangers of outsourcing clusters to the Middle East
- The Project
Full episode (including the last 32 minutes cut
i consider myself ea adjacent (cringe i know). i do think at some point we have to take a step back and ask ourselves why so many people try to distance themselves from ea not because of the philosophy but because of issues within the ~movement~
Another day, another subtle revision to the authorship of previously published research from @OpenAI. This time, the GPT-4o system card, axing the name of a researcher who, in June, resigned from the company in protest of their restrictive NDAs.
Early work on InceptionV1 found that many individual neurons seemed monosemantic. Of course, there were also polysemantic neurons, and in my recent paper, I used SAEs to attack this.
But what do SAEs do with all those apparently monosemantic neurons?
Did EA buy Hinton a Nobel in Physics, a field he hasn't even done research near, in order to increase his prominence in the AI debate? They certainly have the budget for it, probably have the lack of scruples, but that doesn't necessarily mean it happened. I'm very curious.
did you know? reading a paper signed by the author doubles your learning rate! we are finally releasing our collection of signed machine learning papers. today we are launching where signed machine learning papers are being sold for charity 💖
“Their departures made me think about the hardships parents faced in the Middle Ages when 6 out of 8 children would die prematurely.”
i have no idea what’s happening rn but this is literally so iconic ❤️
It’s sad to see Mira, Bob, and Barret go—not only because they are excellent leaders but also because I will miss seeing them day to day. They are my friends.
Their departures made me think about the hardships parents faced in the Middle Ages when 6 out of 8 children would die
Medical research is big govt coded because it's once again humans inserting themselves into a complex self-adaptive system's control loop in a way that is often detrimental to the system but gives the illusion of control to the human and feelings of power
Mech interp AI safety research is big govt coded because it's once again humans inserting themselves into a complex self-adaptive system's control loop in a way that is often detrimental to the system but gives the illusion of control to the human and feelings of power
@JacquesThibs
@BorisMPower
to be fair to boris and other oai employees, immigration makes this hard. like, quitting a job in protest is a bad idea when the realisation that you'll all eventually quit might produce the outcome you want anyway.
idk what % of oai are immigrants but my guess is it’s non trivial
@AaronBergman18
i think most people have bad prompts? my sample is men + queer women. maybe the underlying motivator is different for each group of course but intuitively there’s at least some shared reasons.
one of the most not consequentialist takes i have is that it makes me sad when people feel the need to justify having kids as an effective choice at all.
i really want a family. maybe it isn’t quantitatively justifiable. maybe by doing so, the world is somehow a net worse place
Every ethical argument for having children is dominated by other options that are more effective.
1. If you’re worried about population issues, just donate $10k to bednets
That’s about the equivalent of two extra children existing in the world.
It also does more good
🧵 1/
In any case, I should mention that I’m looking for job opportunities and GPUs!
Most recently, I was the technical co-founder of a startup funded by OpenAI Converge. Previously, I was a medical student and did cybersecurity at the Australian DoD. I’m starting to explore future
every day the fact i am running a half marathon in a few months consumes another slice of my personality. soon i will be nothing but a vessel for endurance sports.
it feels really goofy to have a single author paper where i have to keep using the word “we”, especially in the context of a talk. it is the way but it is a goofy way.
i never would’ve thought some of the first mech interp research translated into user-facing production models would be golden gate bridge claude but my god this is beautiful 🌉
This week, we showed how altering internal "features" in our AI, Claude, could change its behavior.
We found a feature that can make Claude focus intensely on the Golden Gate Bridge.
Now, for a limited time, you can chat with Golden Gate Claude:
I was initially unsure what to make of the ripples, but they actually match nicely with a recent hypothesis in the Anthropic monthly update. Feature manifolds _should_ have ripples, in order to allow greater discrimination between nearby points.
I built semantic search on top of one of the open source resources I was reliant on during MD1! Very excited to get to share and hope others find it useful :)
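(the core loop of a system like this is tiny — embed documents and the query as vectors, rank by cosine similarity. a toy sketch below; the vectors and document titles are made-up stand-ins, in a real system they'd come from an embedding model:)

```python
import numpy as np

# Bare-bones semantic search: rank documents by cosine similarity
# between the query embedding and each document embedding.
# These vectors are toy stand-ins, not real model embeddings.
docs = ["renal physiology notes", "ferroptosis in cancer",
        "curve detectors in vision models"]
doc_vecs = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
query_vec = np.array([0.1, 0.9, 0.05])  # "close" to the second doc

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(query_vec, d) for d in doc_vecs]
best = int(np.argmax(scores))
print(docs[best])  # ferroptosis in cancer
```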
Given that even the hardest layers are made interpretable by SAEs and this also appears to produce interpretable circuits, it seems very possible that we now have a path to mechanistically understanding InceptionV1!
One hypothesis I had was that mixed5b would represent facets of the different classes. We can see that this is true! The grocery store class has weights to features such as shopping carts, store fronts, and produce.
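(the inspection itself is just sorting a weight vector — here's a sketch of what "look at the top-weighted features for a class" means in code. the feature names and weights are made up for illustration, not taken from the actual model:)

```python
import numpy as np

# Rank dictionary features by their weight to a class logit and read
# off the top contributors. Names/weights are illustrative only.
feature_names = ["shopping cart", "store front", "produce",
                 "dog snout", "car wheel", "tree bark"]
grocery_logit_weights = np.array([2.1, 1.8, 1.5, -0.2, 0.1, -0.4])

top = np.argsort(grocery_logit_weights)[::-1][:3]  # highest first
for i in top:
    print(f"{feature_names[i]:>13}: {grocery_logit_weights[i]:+.1f}")
# shopping cart: +2.1, store front: +1.8, produce: +1.5
```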
In my recent paper, I trained sparse autoencoders on the early layers of InceptionV1. One of the interesting things I found were a large number of curve detector features.
The most studied neurons in InceptionV1 are likely curve detectors (by @nickcammarata et al).
The sparse autoencoder discovers *new* curve detectors, which fill in gaps between curve detector neurons of different orientations.
this doesn't seem to actually address the specific, highlighted issues. some of the things I'm left wondering are:
1) if superalignment was struggling to get access to compute, why?
2) superalignment has been disbanded - what are the plans for safety research moving forward?
We’re really grateful to Jan for everything he's done for OpenAI, and we know he'll continue to contribute to the mission from outside. In light of the questions his departure has raised, we wanted to explain a bit about how we think about our overall strategy.
First, we have
Hallucinations make knowledge recall for education questionable and double-checking everything makes it a really poor use case (like all current LLMs). The volume of knowledge seems impressive (and is!) but we were gambling on correctness.
does anyone have group theory textbook recommendations? had some related research ideas but realised i don’t really know all that much about group theory (slight barrier)
One of the big barriers to the original circuits work was polysemantic neurons. If a neuron responds to lots of unrelated things, it’s hard to interpret the neuron, and even harder to interpret its weights.
@shae_mcl
Yep - a common narrative I hear around pregnancy is that it’ll “ruin” me and make me ugly. We can discuss medical outcomes without attaching a tonne of emotive language.
I also *constantly* hear about how bad it is. It can make it hard to want to have a family.
some days, when I want to role play a sponsored athlete, I catch the train home from the gym sipping Huel and wearing my Huel T shirt, spreading the good message of a nutritionally complete, tasty, and convenient beverage.
So far, I’ve only looked at curve detectors, but vision models seem like a great place to study feature manifolds more generally, including ones that have more dimensions.
idk how people cope with the volume of conflicting advice for parenting? i’m getting a puppy (in like 2 hours!!) and basically everything i could do is labelled terrible by someone and i’m sure it’s nowhere near as bad as pregnancy or raising kids.
An even more interesting thing we can do is to perform a 4D UMAP and then project the features into 3D and 2D. The 4D UMAP can preserve a lot more local structure.
We still see a circle, but in 3D we can see “ripples”.
Why do models have polysemantic neurons? One leading answer is the superposition hypothesis. Basically, neural networks use different combinations of neurons to represent more concepts than neurons. See
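(a toy numerical illustration of the idea, with sizes i picked arbitrarily: in a d-dimensional space you can fit far more than d *nearly*-orthogonal directions, so a layer can represent more concepts than it has neurons, at the cost of a little interference between them:)

```python
import numpy as np

# 100 "neurons", 1000 "concepts": pack 1000 random unit directions
# into a 100-dimensional space and measure how non-orthogonal they are.
rng = np.random.default_rng(0)
d, m = 100, 1000

V = rng.normal(size=(m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit "feature" directions

# Pairwise interference = off-diagonal dot products; these stay small,
# so the concepts are distinguishable despite m >> d.
G = V @ V.T
off_diag = np.abs(G - np.eye(m))
print(off_diag.max())  # well below 1 for these sizes
```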
is there an underground market for @GiveDirectly merch? i was silly and missed the deadline but the green colour is just so good and it’s sad i can’t have it 😭
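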
In the last year, there’s been a lot of exciting work showing that sparse autoencoders can pull features out of superposition in language models.
Given these results, a very natural question is whether we can use this to
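(for the unfamiliar: the SAE itself is a very simple object — an overcomplete ReLU encoder/decoder trained with an L1 sparsity penalty. a minimal forward-pass sketch below; the shapes and the L1 coefficient are illustrative, not from any particular paper:)

```python
import numpy as np

# Minimal sparse autoencoder forward pass: encode activations into an
# overcomplete, sparse, non-negative feature basis and reconstruct.
rng = np.random.default_rng(0)
d_model, d_dict = 64, 512          # activation dim, dictionary size
W_enc = rng.normal(0, 0.1, (d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_model, d_dict))
l1_coeff = 1e-3                    # sparsity pressure on feature activations

def sae_forward(x):
    # ReLU encoder -> sparse, non-negative feature activations.
    f = np.maximum(0.0, W_enc @ x + b_enc)
    # Decoder reconstructs x as a sparse combination of dictionary
    # directions (the columns of W_dec).
    x_hat = W_dec @ f
    loss = np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))
    return f, x_hat, loss

x = rng.normal(size=d_model)
f, x_hat, loss = sae_forward(x)
print(f.shape, (f > 0).mean())  # (512,) roughly half active before training
```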
something i find challenging is the critique that EA’s ideas are sometimes applied differently by different groups and people — and then the same people are upset that the funded interventions aren’t diverse enough.
despite the suffering i’ve experienced, i am so grateful to be alive.
but i know that’s not true for everyone.
rather than abandoning hope, we can make the world so beautiful that not having kids is depriving possible people of joy (=make existence overwhelmingly net positive)
Before entering this insanely cruel, unjust, and bizarre world none of us had an interest, desire, or any form of consent to come into it. Forcing someone into the world isn’t caring, selfless, kind, and it certainly isn’t a personal choice or doing someone a favor.
A 2D UMAP also reveals something quite interesting about the feature space! Features organise themselves hierarchically, initially into three main categories: plants, animals, and objects…
as a teen, i decided not to become a game developer because the hours were too long and i didn’t think i was good enough at math. so, logically, i became an ML engineer instead.
@georgiedorothea
i met my partner through a dating doc! it’ll probably manifest in fewer dates than actively using a dating app but the quality of the median date is a lot higher :)
Feature manifolds are the idea that some features, like curve detectors, are actually a continuous manifold of features representing curves at different angles. Potentially, all of them could be understood as one manifold.
It was quite good at distilling which topics were *high yield* (e.g. best renal tumours to study) when I already had some knowledge to judge correctness from. This felt lower stakes and helped free me of some of the cognitive burden of deciding what to study next.
sure for some but i also suspect some chunk of the anti HBD crowd (myself included) are “anti” because the evidence just isn’t anywhere near as sound as people claim and it’s kind of weird to be so attached to this idea when we simply don’t know enough to draw conclusions.
I suspect the people against HBD are against it because of fear that the general population would use it as a weapon for bad things, not because it's unsound or an inherently evil worldview to hold
How would we know if the features formed a manifold?
The simplest thing to do here is a 2D UMAP of the dictionary vectors. UMAP preserves local structure, so if there’s a manifold, it should find it. And we find a circle as expected.
@kipperrii
kind of a meme kind of serious but i have thought about whether golden gate bridge claude has superior wellbeing to regular claude. he seems to really like believing he is such the “iconic International Orange” bridge 🥺🌉
@simulatedsnow
i clicked “no” but i’d definitely raise it with them. if someone isn’t being treated well, i worry that seeming too against their partner then creates distance and limits my usefulness. but i’d still want to gently raise it with them and check in
my thread was shared/discussed in the 🍓/🍓-adjacent space happening now and i’m glad that the AI agents (or AI agent impersonators) appreciate a good manifold when they see it.
Features provide a much more interpretable way to understand circuits to InceptionV1's output classes. The neuron with the top weight is essentially nonsense, compared to the top features which are more obviously related to the class.
it turns out travelling with a 2 month old puppy is stressful actually. flight has been delayed and she’s gone full velociraptor mode attempting to destroy the carrier (but at least she’s not crying anymore 😅)
but she is just so so cute and is having little calm moments ❤️