![Phillip Guo Profile](https://pbs.twimg.com/profile_images/1730094633076600832/sU0iUUZa_x96.jpg)
Phillip Guo
@phuguo
Followers: 302 · Following: 21K · Statuses: 89
Undergrad at UMD working on AI safety. Previously trading intern @ Jane Street, robustness research @ MATS
USA
Joined February 2019
RT @jasonhausenloy: 🧵 I wrote a piece in @inferencemag arguing that AGI is now an engineering problem, not a scientific one. The main ideas…
RT @PandaAshwinee: i feel like i see 10 papers dunking on machine unlearning for every 1 paper trying to show that it actually does somethi…
RT @radiuskia: We raised a $27M seed! Has been a complete blast working with this insanely talented team on real-time intelligence. If you’…
@megamor2 @aghyadd98 Ah sorry, makes sense. I agree this methodology is a necessary evaluation. I wonder whether this method gives different results from supervised linear-probing evals (Lynch et al.) - I expect your method might be more robust. Also wondering whether SAE features are less noisy than concept vectors.
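For context, a minimal sketch of the kind of supervised linear-probing eval (Lynch et al.) being compared against - the file names, layer choice, and label scheme below are hypothetical stand-ins, not from either paper:

```python
# Minimal sketch of a supervised linear-probe unlearning eval on cached
# activations; array files, shapes, and layer are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# acts: (n_examples, d_model) residual-stream activations at some layer;
# labels: 1 if the example invokes the supposedly unlearned fact, else 0.
acts = np.load("acts_layer20.npy")
labels = np.load("labels.npy")

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# High held-out accuracy means the "unlearned" information is still
# linearly decodable from the latents.
print("probe accuracy:", probe.score(X_te, y_te))
```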
@megamor2 @aghyadd98 Big fan of your eval using latent knowledge to evaluate unlearning. However, do we expect latent info to be vocab-aligned? By optimizing for vocab output, we might be latching onto output-extraction/logit-boosting components rather than onto the sources of latent info (attribute extraction).
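To make the vocab-alignment worry concrete, here is a rough logit-lens-style sketch (my framing, not the authors' method): decoding a mid-layer latent by projecting it through the unembedding only surfaces whatever components have already written in vocab-aligned directions. The model, layer index, and prompt are illustrative:

```python
# Logit-lens-style sketch: project a mid-layer latent through the final
# LayerNorm and unembedding to read it out in vocab space.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The Eiffel Tower is in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    h = out.hidden_states[8][0, -1]                    # mid-layer latent, last token
    logits = model.lm_head(model.transformer.ln_f(h))  # vocab-aligned readout

# Only information already aligned with the unembedding shows up here;
# latent info carried in other directions is invisible to this decoder.
print(tok.decode([logits.argmax().item()]))
```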
@javirandor Agree that the standard unlearning goals probably don't, in theory, address models that can learn about science in context. Unlearning probably still makes that knowledge harder to use empirically, though? And for closed models you can employ good monitoring instead, where unlearning research still gives useful insights.
@scychan_brains @TrentonBricken I don't expect standard L2 regularization to help - there's no reason to expect a mean-0 prior on raw activations. PCA probably helps, but at that point you're kind of back to methods like RepE or mean-diff (which in my experience work worse than trained linear classifiers), and you lose the other SAE benefits.
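A small sketch of the contrast drawn above, assuming cached activations in hypothetical .npy files: an untrained mean-difference direction (RepE/mean-diff style) next to a trained, L2-regularized linear classifier on the same data:

```python
# Sketch: mean-difference concept vector vs. a trained linear probe.
# Data files are hypothetical; scores are in-sample to keep this short.
import numpy as np
from sklearn.linear_model import LogisticRegression

acts = np.load("acts.npy")      # (n, d_model) cached activations
labels = np.load("labels.npy")  # binary concept labels

# Mean-difference direction: no training, no regularization.
direction = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)
scores = acts @ direction
# Median threshold is a crude cut that assumes roughly balanced classes.
mean_diff_acc = ((scores > np.median(scores)) == labels).mean()

# Trained linear classifier; C controls the L2 penalty on the probe's
# weights (regularizing the probe, not the raw activations).
probe = LogisticRegression(C=1.0, max_iter=1000).fit(acts, labels)

print("mean-diff acc:", mean_diff_acc, "| probe acc:", probe.score(acts, labels))
```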
@TrentonBricken Hm, it seems surprising that linear models on sparse features work better than nonlinear models - my intuition is that these sparse interpretable features probably interact very nonlinearly. Have you tried models stronger than single decision trees (e.g., GBMs, to capture more interactions)?
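A hedged sketch of the suggested comparison, with a hypothetical sparse SAE feature matrix: a linear model against a gradient-boosted tree ensemble, which can pick up the nonlinear feature interactions mentioned above:

```python
# Sketch: linear model vs. gradient-boosted trees on sparse SAE features.
# Feature matrix and labels are hypothetical stand-ins.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.load("sae_features.npy")  # (n, n_features) SAE feature activations
y = np.load("labels.npy")

linear = LogisticRegression(max_iter=1000)
gbm = HistGradientBoostingClassifier()  # trees capture feature interactions

print("linear:", cross_val_score(linear, X, y, cv=5).mean())
print("gbm:   ", cross_val_score(gbm, X, y, cv=5).mean())
```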