Sara Price Profile
Sara Price

@sprice354_

Followers
172
Following
43
Statuses
21

Independent AI safety researcher currently contracting with Anthropic

San Francisco
Joined June 2022
@sprice354_
Sara Price
2 months
🧵 NEW PAPER: Best-of-N Jailbreaking. We modify LLM inputs with simple, randomly generated augmentations and jailbreak frontier models across text, vision, and audio modalities. The algorithm is simple, scalable and highly effective.
@AnthropicAI
Anthropic
2 months
New research collaboration: “Best-of-N Jailbreaking”. We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.
2
0
20
@sprice354_
Sara Price
2 months
RT @AnthropicAI: New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we f…
0
740
0
@sprice354_
Sara Price
2 months
RT @jenner_erik: How robust are LLM latent-space defenses, like monitoring with SAEs, probes, or OOD detectors? We adversarially stress-tes…
0
10
0
@sprice354_
Sara Price
2 months
Additionally, thank you to @_robertkirk, @javirandor and @maksym_andr for feedback on our paper. Also, thanks to @AnthropicAI, @matsprogram, @Speechmatics, and @farairesearch for their support.
0
0
5
@sprice354_
Sara Price
2 months
RT @jplhughes: 🚨🛡️Jailbreak Defense in a Narrow Domain 🛡️🚨 Jailbreaking is easy. Defending is hard. Might defending against a single, narr…
0
10
0
@sprice354_
Sara Price
2 months
RT @Turn_Trout: 1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what…
0
87
0
@sprice354_
Sara Price
4 months
RT @IvanArcus: Our paper introducing InterpBench was accepted to @NeurIPSConf ! 🚀 Check it out if you want to know how we built a benchmar…
0
3
0
@sprice354_
Sara Price
5 months
RT @EthanJPerez: I’m taking applications for collaborators via @MATSprogram! It’s a great way for new or experienced researchers outside AI…
0
30
0
@sprice354_
Sara Price
7 months
RT @StephenLCasper: 🚨New paper: Targeted LAT Improves Robustness to Persistent Harmful Behaviors in LLMs ✅ Improved jailbreak robustness (i…
0
41
0
@sprice354_
Sara Price
7 months
We hope this paper kicks off more work on deceptive alignment! Authors: me, @panickssery, @AsaCoopStick, @sleepinyourhat arXiv link:
0
0
16