Tim Lawson

@tslwn

Followers
91
Following
555
Statuses
89

AI PhD student @BristolUni. Previously physics @Cambridge_Uni and software @graphcoreai. Language, cognition, etc.

UK
Joined November 2023
@tslwn
Tim Lawson
19 days
Very pleased to confirm that our paper "Residual Stream Analysis with Multi-Layer SAEs" has been accepted to ICLR 2025!
1
3
39
@tslwn
Tim Lawson
3 days
@GoncaloSPaulo @nlp_ceo Well, it's definitely a feature of language models, and a problem in the sense that naïvely applying the same encoder/decoder at multiple layers isn't viable.
0
0
0
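To make the point above concrete: residual-stream statistics such as vector norms change substantially across layers, so one shared encoder/decoder sees inputs at very different scales. Below is a minimal sketch of one possible mitigation, rescaling each layer's activations before a shared SAE; the function, tensor shapes, and the choice of per-layer rescaling are illustrative assumptions, not necessarily what the paper does.

```python
import torch

def standardize_per_layer(resid_by_layer: torch.Tensor, eps: float = 1e-6):
    """Hypothetical helper: rescale each layer's residual-stream
    activations to a comparable norm, so a single shared encoder/decoder
    is not dominated by the growth of activation scale with depth.

    resid_by_layer: [n_layers, n_tokens, d_model]
    """
    # Mean activation norm per layer: [n_layers, 1, 1], broadcast-divides
    # every token vector at that layer.
    norms = resid_by_layer.norm(dim=-1, keepdim=True).mean(dim=1, keepdim=True)
    return resid_by_layer / (norms + eps)
```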
@tslwn
Tim Lawson
3 days
@nlp_ceo Oh, that's a nice idea! I guess you could reverse-engineer a lossy mapping between the representation spaces of adjacent layers that way. We tried applying 'tuned lens' transformations to similar effect, but with limited success.
0
0
1
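A rough sketch of the 'tuned lens' idea referenced above (Belrose et al.): learn an affine map between representation spaces. Here the map is fit between adjacent layers, although the original tuned lens maps each layer to the final layer before the unembedding; the names, shapes, and training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineTranslator(nn.Module):
    """An affine map from one layer's residual space to another's."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, h_l: torch.Tensor) -> torch.Tensor:
        return self.linear(h_l)

def fit_translator(h_l, h_next, steps=1000, lr=1e-3):
    # h_l, h_next: [n_tokens, d_model] residuals at adjacent layers.
    # Fit the translator by regressing layer l onto layer l+1.
    t = AffineTranslator(h_l.shape[-1])
    opt = torch.optim.Adam(t.parameters(), lr=lr)
    for _ in range(steps):
        loss = torch.mean((t(h_l) - h_next) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return t
```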
@tslwn
Tim Lawson
3 days
@nlp_ceo To clarify, 'surprisingly few' means 'fewer than we had expected' (not a majority). Our distributions of latent activations over layers are qualitatively similar to Anthropic's feature norms, and we concur that "feature drift" is a problem for this multi-layer/shared approach.
2
0
0
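One way to compute a distribution of latent activations over layers like the one described above, as a hedged sketch: given sparse codes from the shared SAE evaluated at every layer, tally each latent's activation mass per layer and normalize. The helper name and tensor shapes are assumptions.

```python
import torch

def latent_layer_distribution(z_by_layer: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: z_by_layer is [n_layers, n_tokens, d_sae],
    the (nonnegative) sparse codes from one shared SAE at every layer.
    Returns [d_sae, n_layers]: for each latent, the fraction of its
    total activation mass contributed by each layer."""
    mass = z_by_layer.sum(dim=1)              # [n_layers, d_sae]
    mass = mass.T                             # [d_sae, n_layers]
    totals = mass.sum(dim=1, keepdim=True)    # total mass per latent
    return mass / totals.clamp_min(1e-9)
```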
@tslwn
Tim Lawson
3 days
@nlp_ceo Thanks, I haven't read this thoroughly -- what do you mean by inter-layer dynamics, given that the method is data-free? At a glance, I can only see the LLM-based categorization of features as same/maybe/different, and those categories look roughly equally distributed.
1
0
0
@tslwn
Tim Lawson
4 days
@nlp_ceo This is interesting, thanks. We actually found surprisingly few latents activated at multiple layers when training a single SAE on every layer of the residual stream:
2
0
2
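A minimal sketch of the multi-layer setup described above: pool residual-stream activations from all layers into one dataset and train a single SAE on it. The top-k sparsity, shapes, and training loop below are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    """A minimal sparse autoencoder with top-k sparsity (one common
    choice); hypothetical, not the exact architecture from the paper."""
    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)
        self.k = k  # number of active latents per input

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.enc(x))
        # Keep only the top-k latents per example.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter(-1, topk.indices, topk.values)
        return self.dec(z_sparse), z_sparse

def train_step(sae, opt, resid_by_layer):
    # The multi-layer twist: one SAE, one dataset pooled over *all* layers.
    # resid_by_layer: [n_layers, n_tokens, d_model], e.g. activations
    # captured with forward hooks.
    x = resid_by_layer.reshape(-1, resid_by_layer.shape[-1])
    recon, z = sae(x)
    loss = torch.mean((recon - x) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example setup (sizes are placeholders):
# sae = SAE(d_model=768, d_sae=16384, k=32)
# opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
```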
@tslwn
Tim Lawson
4 days
@lucyfarnik See also
0
0
2
@tslwn
Tim Lawson
8 days
RT @lucyfarnik: @banburismus_ @leedsharkey Imo: - We published a paper that played into people's pre-existing beliefs - They started tweeti…
0
1
0
@tslwn
Tim Lawson
8 days
RT @a_karvonen: We find that SAEs on trained models have slightly higher autointerp scores than those trained on random models. Note howev…
0
1
0
@tslwn
Tim Lawson
8 days
RT @nabla_theta: @GoncaloSPaulo @aidanprattewart @ThomasEHeap @tslwn @lucyfarnik @laurence_ai like you'll find a feature that activates onl…
0
1
0
@tslwn
Tim Lawson
8 days
@nabla_theta
Leo Gao
9 days
@lucyfarnik @aidanprattewart @ThomasEHeap @tslwn @laurence_ai I think this is also heavily confounded by using top activations; much easier for things to look interpretable (esp single token) when using top acts.
0
0
0
@tslwn
Tim Lawson
8 days
So, if you're using autointerp as a proxy measure to compare the 'interpretability' of SAEs, something like a randomised model is an important baseline.
@banburismus_
Tom McGrath
8 days
So the conclusion I think it makes sense to draw is "high autointerp score =/=> good SAE" - autointerp scores might be valid to compare within models but probably not across
0
0
3
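As a sketch of the baseline being proposed, under stated assumptions (the model choice and the elided pipeline are placeholders): train SAEs with identical settings on a trained model and on the same architecture with freshly initialized weights, then compare autointerp score distributions.

```python
from transformers import AutoModelForCausalLM

# Trained model vs. the same architecture with random initialization.
trained = AutoModelForCausalLM.from_pretrained("gpt2")
random_init = AutoModelForCausalLM.from_config(trained.config)

# ... collect residual-stream activations from each model, train one SAE
# per model with identical hyperparameters, then run the same autointerp
# pipeline on both and compare the score distributions.
```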
@tslwn
Tim Lawson
8 days
FWIW, I agree with this analysis of our paper -- the takeaway is that you can get latents with good auto-interp scores without an 'interesting' underlying model (and single-token activation patterns are the most likely culprit).
@banburismus_
Tom McGrath
8 days
Second, the "dead salmon" paper: The obvious conclusion to draw from the memes is that SAEs are interpreting noise - there's no feature there to interpret. But even randomly-initialised networks have features - that's the premise of reservoir computing.
0
0
8
@tslwn
Tim Lawson
22 days
[image attachment]
0
0
2
@tslwn
Tim Lawson
2 months
Are computational models and processes increasingly analogous to the brain and cognition, or do we increasingly speak (think) in computational metaphors?
0
0
1
@tslwn
Tim Lawson
2 months
-- Hacker, P. M. S. (1990). Men, minds and machines. In Wittgenstein, meaning and mind. Cambridge, Mass., USA: Blackwell. pp. 89–111.
0
0
0