Phillip Guo Profile

Phillip Guo (@phuguo)
Followers: 302 · Following: 21K · Statuses: 89

Undergrad at UMD working on AI safety. Previously trading intern @ Jane Street, robustness research @ MATS

USA · Joined February 2019

Phillip Guo (@phuguo) · 2 days
@GrimpenMar @tomaspueyo Humanity’s Lastest Exam_final 3.ipynb
0 replies · 0 retweets · 1 like

Phillip Guo (@phuguo) · 12 days
RT @tszzl: this didn’t move overnight markets at all, which means markets either: - don’t believe it’s credible - were pricing thi…
0 replies · 136 retweets · 0 likes

Phillip Guo (@phuguo) · 16 days
@jxmnop You’re probably right that this will give us some efficiency improvement, but I’m worried that this will make it significantly harder to monitor the CoTs for failure modes and misalignment. Our oversight plan seems to be “have LMs watch the CoTs”
0 replies · 0 retweets · 3 likes

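The “have LMs watch the CoTs” plan referenced above is concrete enough to sketch. Below is a minimal illustration assuming a generic off-the-shelf zero-shot classifier stands in for the monitor LM; the model name, label set, and threshold are illustrative assumptions, not anyone’s actual oversight stack.

```python
# Minimal sketch of "have LMs watch the CoTs": a cheap monitor model scores
# each chain-of-thought trace and flags suspicious ones for review.
# The classifier, labels, and threshold are illustrative assumptions.
from transformers import pipeline

monitor = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

LABELS = ["benign reasoning", "deceptive or unsafe reasoning"]

def flag_cot(cot_text: str, threshold: float = 0.5) -> bool:
    """Return True if the monitor thinks this CoT deserves human review."""
    result = monitor(cot_text, candidate_labels=LABELS)
    scores = dict(zip(result["labels"], result["scores"]))
    return scores["deceptive or unsafe reasoning"] > threshold

# The worry in the tweet above: if reasoning moves into latent space,
# there is no cot_text left for a monitor like this to read.
print(flag_cot("First I'll check the units, then solve for x."))
```
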
Phillip Guo (@phuguo) · 21 days
RT @jasonhausenloy: 🧵 I wrote a piece in @inferencemag arguing that AGI is now an engineering problem, not a scientific one. The main ideas…
0 replies · 16 retweets · 0 likes

Phillip Guo (@phuguo) · 2 months
RT @PandaAshwinee: i feel like i see 10 papers dunking on machine unlearning for every 1 paper trying to show that it actually does somethi…
0 replies · 2 retweets · 0 likes

Phillip Guo (@phuguo) · 2 months
RT @radiuskia: We raised a $27M seed! Has been a complete blast working with this insanely talented team on real-time intelligence. If you’…
0 replies · 5 retweets · 0 likes

Phillip Guo (@phuguo) · 3 months
@khamsinshrike @DeepDishEnjoyer few more beautiful phrases in the English language
0 replies · 0 retweets · 1 like

Phillip Guo (@phuguo) · 4 months
@megamor2 @aghyadd98 Ah sorry, makes sense. I agree this methodology is a necessary evaluation. I wonder if this method has different results than supervised linear probing evals (Lynch et al.) - I expect your method might be more robust. Also wonder if SAE features are less noisy than concept vectors
0 replies · 0 retweets · 1 like

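For context on the supervised linear probing evals mentioned above (in the spirit of Lynch et al.), here is a minimal sketch: train a linear classifier on hidden activations and check whether the supposedly unlearned concept is still linearly decodable. The activations below are synthetic stand-ins for real model hidden states.

```python
# Sketch of a supervised linear-probe eval for unlearning: if a probe on
# post-unlearning activations still separates forget-topic prompts from
# controls, the knowledge likely survived. Activations here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512

# Stand-in residual-stream activations; label 1 = forget-topic prompt.
acts = rng.normal(size=(2000, d_model))
labels = rng.integers(0, 2, size=2000)
acts[labels == 1] += 0.5  # planted signal, purely for illustration

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Accuracy well above chance after "unlearning" = latent knowledge remains.
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```
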
Phillip Guo (@phuguo) · 4 months
@megamor2 @aghyadd98 Big fan of your eval using latent knowledge to evaluate unlearning. However, do we expect latent info to be vocab aligned? By optimizing for vocab output, we might be latching onto output extraction/logit boosting components rather than sources of latent info (attribute extraction)
1 reply · 0 retweets · 0 likes

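A toy illustration of the vocab-alignment worry: a logit-lens style readout only sees a hidden state through the unembedding, so information stored along directions the unembedding barely amplifies looks absent even when a trained probe could recover it. The matrices and shapes below are synthetic assumptions, not any particular model’s weights.

```python
# Toy version of the "vocab aligned" concern: reading a hidden state out
# through the unembedding W_U privileges whatever W_U amplifies, which may
# be output/logit-boosting structure rather than where latent info lives.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 256, 1000

W_U = rng.normal(size=(vocab, d_model)) / np.sqrt(d_model)  # unembedding
h = rng.normal(size=d_model)                                # hidden state

# Vocab-aligned readout: logits = W_U @ h.
logits = W_U @ h
print("top tokens under logit-lens readout:", np.argsort(logits)[-5:])

# Directions aligned with W_U's small singular vectors barely move logits,
# so latent info stored there is nearly invisible to this readout, even
# though a probe trained directly on h could still decode it.
U, S, Vt = np.linalg.svd(W_U, full_matrices=False)
print("logit response, weakest direction:  ", np.linalg.norm(W_U @ Vt[-1]))
print("logit response, strongest direction:", np.linalg.norm(W_U @ Vt[0]))
```
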
Phillip Guo (@phuguo) · 4 months
@javirandor Agree that the standard unlearning goals probably don’t theoretically address models that can learn about science in context. They probably make such models harder to use empirically, though? But also, for closed models you can employ good monitoring instead, and unlearning gives insights there
0 replies · 0 retweets · 4 likes

Phillip Guo (@phuguo) · 4 months
Next steps: we want to try more advanced analysis of intermediate mechanisms (e.g. SAE features), monitoring in addition to unlearning, and work on more safety-relevant domains like WMDP (@natliml)!
0 replies · 0 retweets · 6 likes

Phillip Guo (@phuguo) · 4 months
@scychan_brains @TrentonBricken I don't expect standard L2 regularization to help - no reason to expect a mean-0 prior on raw activations. PCA probably helps, but at that point you're kind of back to methods like RepE or mean diff (which in my experience work worse than trained linear classifiers), losing other SAE benefits
1 reply · 0 retweets · 2 likes

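For readers comparing the probe families named here, a synthetic side-by-side of a RepE-style mean-difference direction versus a trained logistic-regression probe; the data, planted signal, and anisotropic noise are illustrative assumptions, chosen so that covariance matters.

```python
# Mean-difference direction (RepE-style) vs. trained linear classifier on
# synthetic activations with anisotropic noise. The trained probe can
# account for covariance; the raw mean-diff direction cannot.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, d = 4000, 256

X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n)
X[y == 1] += rng.normal(size=d) * 0.1           # weak planted class signal
X = X @ (rng.normal(size=(d, d)) / np.sqrt(d))  # correlate the dimensions

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Mean-diff probe: project onto (mu_1 - mu_0), threshold at the midpoint.
mu0, mu1 = X_tr[y_tr == 0].mean(0), X_tr[y_tr == 1].mean(0)
pred = ((X_te - (mu0 + mu1) / 2) @ (mu1 - mu0)) > 0
print(f"mean-diff accuracy:     {(pred == y_te).mean():.3f}")

# Trained probe on the same activations; it typically wins when the noise
# is anisotropic, matching the observation above that mean diff works
# worse than trained linear classifiers in practice.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"trained probe accuracy: {clf.score(X_te, y_te):.3f}")
```
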
Phillip Guo (@phuguo) · 4 months
@TrentonBricken Hm, it seems surprising that linear models on sparse features work better than nonlinear models - my intuition is these sparse interpretable features probably interact very nonlinearly. Have you tried models better than single decision trees (e.g. GBMs for more interactions)?
0 replies · 0 retweets · 0 likes

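A toy check of the interaction intuition in this reply: when the label depends on an XOR-style interaction between two sparse features, a linear probe is capped (it must misclassify one XOR cell) while a GBM recovers the rule. Everything below is synthetic; it only illustrates why one might expect GBMs to beat linear models when features interact.

```python
# If sparse features interact (here, label = XOR of two features firing),
# a linear model cannot represent the rule, while a boosted tree ensemble
# captures the interaction. All data is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n, d = 4000, 64

# Sparse nonnegative "SAE activations"; two task-relevant features fire often.
X = rng.exponential(size=(n, d)) * (rng.random((n, d)) < 0.05)
X[:, :2] = rng.exponential(size=(n, 2)) * (rng.random((n, 2)) < 0.5)

# Pure interaction label: exactly one of the two relevant features fires.
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)
linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gbm = GradientBoostingClassifier().fit(X_tr, y_tr)

# Linear is capped: no linear boundary gets all four XOR cells right.
print(f"linear probe accuracy: {linear.score(X_te, y_te):.3f}")
# The tree ensemble conditions on one feature before splitting the other.
print(f"gbm accuracy:          {gbm.score(X_te, y_te):.3f}")
```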