Phillip Guo Profile

Phillip Guo (@phuguo)
Followers: 302 · Following: 21K · Statuses: 89

Undergrad at UMD working on AI safety. Previously trading intern @ Jane Street, robustness research @ MATS

USA · Joined February 2019

Phillip Guo (@phuguo) · 2 days
@GrimpenMar @tomaspueyo Humanity’s Lastest Exam_final 3.ipynb
0 replies · 0 retweets · 1 like

Phillip Guo (@phuguo) · 12 days
RT @tszzl: this didn’t move overnight markets at all, which means markets either: - don’t believe it’s credible - were pricing thi…
0 replies · 136 retweets · 0 likes

Phillip Guo (@phuguo) · 16 days
@jxmnop You’re probably right that this will give us some efficiency improvement, but I’m worried that this will make it significantly harder to monitor the CoTs for failure modes and misalignment. Our oversight plan seems to be “have LMs watch the CoTs”
0 replies · 0 retweets · 3 likes

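The “have LMs watch the CoTs” plan referenced above is concrete enough to sketch. Below is a minimal illustration assuming a generic off-the-shelf zero-shot classifier stands in for the monitor LM; the model name, label set, and threshold are illustrative assumptions, not anyone’s actual oversight stack.

```python
# Minimal sketch of "have LMs watch the CoTs": a cheap monitor model scores
# each chain-of-thought trace and flags suspicious ones for review.
# The classifier, labels, and threshold are illustrative assumptions.
from transformers import pipeline

monitor = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

LABELS = ["benign reasoning", "deceptive or unsafe reasoning"]

def flag_cot(cot_text: str, threshold: float = 0.5) -> bool:
    """Return True if the monitor thinks this CoT deserves human review."""
    result = monitor(cot_text, candidate_labels=LABELS)
    scores = dict(zip(result["labels"], result["scores"]))
    return scores["deceptive or unsafe reasoning"] > threshold

# The worry in the tweet above: if reasoning moves into latent space,
# there is no cot_text left for a monitor like this to read.
print(flag_cot("First I'll check the units, then solve for x."))
```
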
Phillip Guo (@phuguo) · 21 days
RT @jasonhausenloy: 🧵 I wrote a piece in @inferencemag arguing that AGI is now an engineering problem, not a scientific one. The main ideas…
0 replies · 16 retweets · 0 likes

Phillip Guo (@phuguo) · 2 months
RT @PandaAshwinee: i feel like i see 10 papers dunking on machine unlearning for every 1 paper trying to show that it actually does somethi…
0 replies · 2 retweets · 0 likes

Phillip Guo (@phuguo) · 2 months
RT @radiuskia: We raised a $27M seed! Has been a complete blast working with this insanely talented team on real-time intelligence. If you’…
0 replies · 5 retweets · 0 likes

Phillip Guo (@phuguo) · 3 months
@khamsinshrike @DeepDishEnjoyer few more beautiful phrases in the English language
0 replies · 0 retweets · 1 like

Phillip Guo (@phuguo) · 4 months
@megamor2 @aghyadd98 Ah sorry, makes sense. I agree this methodology is a necessary evaluation. I wonder if this method has different results than supervised linear probing evals (Lynch et al.) - I expect your method might be more robust. Also wonder if SAE features are less noisy than concept vectors
0 replies · 0 retweets · 1 like

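For context on the supervised linear probing evals mentioned above (in the spirit of Lynch et al.), here is a minimal sketch: train a linear classifier on hidden activations and check whether the supposedly unlearned concept is still linearly decodable. The activations below are synthetic stand-ins for real model hidden states.

```python
# Sketch of a supervised linear-probe eval for unlearning: if a probe on
# post-unlearning activations still separates forget-topic prompts from
# controls, the knowledge likely survived. Activations here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512

# Stand-in residual-stream activations; label 1 = forget-topic prompt.
acts = rng.normal(size=(2000, d_model))
labels = rng.integers(0, 2, size=2000)
acts[labels == 1] += 0.5  # planted signal, purely for illustration

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Accuracy well above chance after "unlearning" = latent knowledge remains.
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```
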
Phillip Guo (@phuguo) · 4 months
@megamor2 @aghyadd98 Big fan of your eval using latent knowledge to evaluate unlearning. However, do we expect latent info to be vocab aligned? By optimizing for vocab output, we might be latching onto output extraction/logit boosting components rather than sources of latent info (attribute extraction)
1 reply · 0 retweets · 0 likes

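A toy illustration of the vocab-alignment worry: a logit-lens style readout only sees a hidden state through the unembedding, so information stored along directions the unembedding barely amplifies looks absent even when a trained probe could recover it. The matrices and shapes below are synthetic assumptions, not any particular model’s weights.

```python
# Toy version of the "vocab aligned" concern: reading a hidden state out
# through the unembedding W_U privileges whatever W_U amplifies, which may
# be output/logit-boosting structure rather than where latent info lives.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 256, 1000

W_U = rng.normal(size=(vocab, d_model)) / np.sqrt(d_model)  # unembedding
h = rng.normal(size=d_model)                                # hidden state

# Vocab-aligned readout: logits = W_U @ h.
logits = W_U @ h
print("top tokens under logit-lens readout:", np.argsort(logits)[-5:])

# Directions aligned with W_U's small singular vectors barely move logits,
# so latent info stored there is nearly invisible to this readout, even
# though a probe trained directly on h could still decode it.
U, S, Vt = np.linalg.svd(W_U, full_matrices=False)
print("logit response, weakest direction:  ", np.linalg.norm(W_U @ Vt[-1]))
print("logit response, strongest direction:", np.linalg.norm(W_U @ Vt[0]))
```
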
Phillip Guo (@phuguo) · 4 months
@javirandor Agree that the standard unlearning goals probably don’t theoretically address models that can learn about science in context. They probably make such models harder to use empirically, though? But also, for closed models you can employ good monitoring instead, and unlearning gives insights there
0 replies · 0 retweets · 4 likes

Phillip Guo (@phuguo) · 4 months
Next steps: we want to try more advanced analysis of intermediate mechanisms (e.g. SAE features), monitoring in addition to unlearning, and work on more safety-relevant domains like WMDP (@natliml)!
0 replies · 0 retweets · 6 likes

Phillip Guo (@phuguo) · 4 months
@scychan_brains @TrentonBricken I don't expect standard L2 regularization to help - no reason to expect a mean-0 prior on raw activations. PCA probably helps, but at that point you're kind of back to methods like RepE or mean diff (which in my experience work worse than trained linear classifiers), losing other SAE benefits
1 reply · 0 retweets · 2 likes

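For readers comparing the probe families named here, a synthetic side-by-side of a RepE-style mean-difference direction versus a trained logistic-regression probe; the data, planted signal, and anisotropic noise are illustrative assumptions, chosen so that covariance matters.

```python
# Mean-difference direction (RepE-style) vs. trained linear classifier on
# synthetic activations with anisotropic noise. The trained probe can
# account for covariance; the raw mean-diff direction cannot.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, d = 4000, 256

X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n)
X[y == 1] += rng.normal(size=d) * 0.1           # weak planted class signal
X = X @ (rng.normal(size=(d, d)) / np.sqrt(d))  # correlate the dimensions

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Mean-diff probe: project onto (mu_1 - mu_0), threshold at the midpoint.
mu0, mu1 = X_tr[y_tr == 0].mean(0), X_tr[y_tr == 1].mean(0)
pred = ((X_te - (mu0 + mu1) / 2) @ (mu1 - mu0)) > 0
print(f"mean-diff accuracy:     {(pred == y_te).mean():.3f}")

# Trained probe on the same activations; it typically wins when the noise
# is anisotropic, matching the observation above that mean diff works
# worse than trained linear classifiers in practice.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"trained probe accuracy: {clf.score(X_te, y_te):.3f}")
```
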
Phillip Guo (@phuguo) · 4 months
@TrentonBricken Hm, it seems surprising that linear models on sparse features work better than nonlinear models - my intuition is these sparse interpretable features probably interact very nonlinearly. Have you tried models better than single decision trees (e.g. GBMs for more interactions)?
0 replies · 0 retweets · 0 likes

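A toy check of the interaction intuition in this reply: when the label depends on an XOR-style interaction between two sparse features, a linear probe is capped (it must misclassify one XOR cell) while a GBM recovers the rule. Everything below is synthetic; it only illustrates why one might expect GBMs to beat linear models when features interact.

```python
# If sparse features interact (here, label = XOR of two features firing),
# a linear model cannot represent the rule, while a boosted tree ensemble
# captures the interaction. All data is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n, d = 4000, 64

# Sparse nonnegative "SAE activations"; two task-relevant features fire often.
X = rng.exponential(size=(n, d)) * (rng.random((n, d)) < 0.05)
X[:, :2] = rng.exponential(size=(n, 2)) * (rng.random((n, 2)) < 0.5)

# Pure interaction label: exactly one of the two relevant features fires.
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)
linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gbm = GradientBoostingClassifier().fit(X_tr, y_tr)

# Linear is capped: no linear boundary gets all four XOR cells right.
print(f"linear probe accuracy: {linear.score(X_te, y_te):.3f}")
# The tree ensemble conditions on one feature before splitting the other.
print(f"gbm accuracy:          {gbm.score(X_te, y_te):.3f}")
```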