![Tim Lawson Profile](https://pbs.twimg.com/profile_images/1868504459254583296/oH7ubNlk_x96.jpg)
Tim Lawson
@tslwn
Followers: 91 · Following: 555 · Statuses: 89
AI PhD student @BristolUni. Previously physics @Cambridge_Uni and software @graphcoreai. Language, cognition, etc.
UK
Joined November 2023
@GoncaloSPaulo @nlp_ceo Well, it's definitely a feature of language models, and a problem in the sense that naïvely applying the same encoder/decoder at multiple layers isn't viable (see the sketch below).
0 · 0 · 0
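Not from the thread, but a minimal PyTorch sketch of what "the same encoder/decoder at multiple layers" means: one ReLU SAE with shared weights applied to every layer's activations. The names and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SharedSAE(nn.Module):
    """One sparse autoencoder whose weights are reused at every layer.

    A hypothetical, standard ReLU SAE; dimensions are illustrative.
    """

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        latents = torch.relu(self.encoder(x))  # sparse latent code
        return self.decoder(latents), latents

# The naive multi-layer setup: the same SAE applied at every layer. If a
# latent direction means different things at different depths ("feature
# drift"), one shared encoder/decoder can't fit all layers at once.
sae = SharedSAE(d_model=768, d_latent=768 * 16)
layer_acts = [torch.randn(32, 768) for _ in range(12)]  # dummy activations
recons = [sae(acts)[0] for acts in layer_acts]
```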
@nlp_ceo To clarify, 'surprisingly few' means 'fewer than we had expected' (not a majority). Our distributions of latent activations over layers are qualitatively similar to Anthropic's feature norms, and we concur that "feature drift" is a problem for this multi-layer/shared approach.
2 · 0 · 0
RT @lucyfarnik: @banburismus_ @leedsharkey Imo: - We published a paper that played into people's pre-existing beliefs - They started tweeti…
0 · 1 · 0
RT @a_karvonen: We find that SAEs on trained models have slightly higher autointerp scores than those trained on random models. Note howev…
0 · 1 · 0
RT @a_karvonen: @saprmarks @nabla_theta @aidanprattewart @ThomasEHeap @tslwn @lucyfarnik @laurence_ai Trenton had this to say: https://t.co…
0 · 2 · 0
RT @nabla_theta: @GoncaloSPaulo @aidanprattewart @ThomasEHeap @tslwn @lucyfarnik @laurence_ai like you'll find a feature that activates onl…
0 · 1 · 0
@lucyfarnik @aidanprattewart @ThomasEHeap @tslwn @laurence_ai I think this is also heavily confounded by using top activations; it's much easier for things to look interpretable (especially single-token latents) when using top activations.
0 · 0 · 0
So, if you're using autointerp as a proxy measure to compare the 'interpretability' of SAEs, something like a randomised model is an important baseline (a rough sketch of the protocol follows below).
The conclusion I think it makes sense to draw is "high autointerp score =/=> good SAE": autointerp scores may be valid for comparisons within a model, but probably not across models.
0 · 0 · 3
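To make the baseline concrete, here is a rough sketch of the comparison protocol. `train_sae` and `autointerp_score` are hypothetical placeholders for whatever SAE-training and scoring pipeline you use; the point is the comparison structure, not the API.

```python
import copy
import torch

def compare_with_random_baseline(model, get_activations, train_sae, autointerp_score):
    """Score SAE latents on a trained model and on a re-initialised copy.

    `train_sae` and `autointerp_score` are hypothetical callables standing in
    for your SAE-training and autointerp pipelines.
    """
    # Baseline: same architecture, weights re-drawn at random.
    random_model = copy.deepcopy(model)
    for p in random_model.parameters():
        torch.nn.init.normal_(p, std=0.02)

    scores = {}
    for name, m in [("trained", model), ("random", random_model)]:
        sae = train_sae(get_activations(m))
        scores[name] = autointerp_score(sae)

    # A high "trained" score is only meaningful relative to the "random"
    # baseline; absolute scores don't transfer across models.
    return scores
```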
FWIW, I agree with this analysis of our paper -- the takeaway is that you can get latents with good autointerp scores without an 'interesting' underlying model (and single-token activation patterns are the most likely culprit).
Second, the "dead salmon" paper: The obvious conclusion to draw from the memes is that SAEs are interpreting noise - there's no feature there to interpret. But even randomly-initialised networks have features - that's the premise of reservoir computing (minimal sketch below).
0 · 0 · 8