Tim Lawson @tslwn profile

Tim Lawson

@tslwn

Followers

91

Following

555

Statuses

89

AI PhD student @BristolUni. Previously physics @Cambridge_Uni and software @graphcoreai. Language, cognition, etc.

UK

Joined November 2023

Don't wanna be here? Send us removal request.

Tim Lawson

@tslwn

19 days

Very pleased to confirm that our paper "Residual Stream Analysis with Multi-Layer SAEs" has been accepted to ICLR 2025!

1

3

39

Tim Lawson

@tslwn

3 days

@GoncaloSPaulo @nlp_ceo Well, it's definitely a feature of language modesl, and a problem in the sense that naïvely applying the same encoder/decoder at multiple layers isn't viable.

0

Tim Lawson

@tslwn

3 days

@nlp_ceo Oh, that's a nice idea! I guess you could reverse-engineer a lossy mapping between the representation spaces of adjacent layers that way. We tried applying 'tuned lens' transformations to similar effect, but with limited success

0

1

Tim Lawson

@tslwn

3 days

@nlp_ceo To clarify, 'surprisingly few' means 'fewer than we had expected' (not a majority). Our distributions of latent activations over layers are qualitatively similar to Anthropic's feature norms, and we concur that "feature drift" is a problem for this multi-layer/shared approach.

2

0

Tim Lawson

@tslwn

3 days

@nlp_ceo Thanks, I haven't read this thoroughly -- what do you mean by inter-layer dynamics, given that the method is data-free? At a glance, I can only see the LLM-based categorization of features as same/maybe/different, which look roughly equally distributed.

1

0

Tim Lawson

@tslwn

4 days

@nlp_ceo This is interesting, thanks. We actually found surprisingly few latents activated at multiple layers when training a single SAE on every layer of the residual stream:

2

0

2

Tim Lawson

@tslwn

4 days

@lucyfarnik See also

0

2

Tim Lawson

@tslwn

8 days

RT @lucyfarnik: @banburismus_ @leedsharkey Imo: - We published a paper that played into people's pre-existing beliefs - They started tweeti…

0

1

0

Tim Lawson

@tslwn

8 days

RT @a_karvonen: We find that SAEs on trained models have slightly higher autointerp scores than those trained on random models. Note howev…

0

1

0

Tim Lawson

@tslwn

8 days

RT @a_karvonen: @saprmarks @nabla_theta @aidanprattewart @ThomasEHeap @tslwn @lucyfarnik @laurence_ai Trenton had this to say: https://t.co…

0

2

0

Tim Lawson

@tslwn

8 days

RT @nabla_theta: @GoncaloSPaulo @aidanprattewart @ThomasEHeap @tslwn @lucyfarnik @laurence_ai like you'll find a feature that activates onl…

0

1

0

Tim Lawson

@tslwn

8 days

Leo Gao

@nabla_theta

9 days

@lucyfarnik @aidanprattewart @ThomasEHeap @tslwn @laurence_ai I think this is also heavily confounded by using top activations; much easier for things to look interpretable (esp single token) when using top acts.

0

Tim Lawson

@tslwn

8 days

So, if you're using autointerp as a proxy measure to compare the 'interpretability' of SAEs, something like a randomised model is an important baseline.

Tom McGrath

@banburismus_

8 days

So the conclusion I think it makes sense to draw is "high autointerp score =/=> good SAE" - autointerp scores might be valid to compare within models but probably not across

0

3

Tim Lawson

@tslwn

8 days

FWIW, I agree with this analysis of our paper -- the takeaway is that you can get latents with good auto-interp scores without an 'interesting' underlying model (and single-token activation patterns are the most likely culprit).

Tom McGrath

@banburismus_

8 days

Second, the "dead salmon" paper: The obvious conclusion to draw from the memes is that SAEs are interpreting noise - there's no feature there to interpret. But even randomly-initialised networks have features - that's the premise of reservoir computing.

0

8

Tim Lawson

@tslwn

22 days

@_MathAcademy_

0

2

Tim Lawson

@tslwn

2 months

Are computational models and processes increasingly analogous to the brain and cognition, or do we increasingly speak (think) in computational metaphors?

0

1

Tim Lawson

@tslwn

2 months

-- Hacker, P. M. S. (1990). Men, minds and machines. In Wittgenstein, meaning and mind. Cambridge, Mass., USA: Blackwell. pp. 89–111.

0