Sara Price @sprice354_ profile

Sara Price

@sprice354_

Followers

172

Following

43

Statuses

21

Independent AI safety researcher currently contracting with Anthropic

San Francisco

Joined June 2022

Sara Price

@sprice354_

2 months

🧵 NEW PAPER: Best-of-N Jailbreaking. We modify LLM inputs with simple, randomly generated augmentations and jailbreak frontier models across text, vision, and audio modalities. The algorithm is simple, scalable and highly effective.

Anthropic

@AnthropicAI

2 months

New research collaboration: “Best-of-N Jailbreaking”. We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.

2

0

20

Sara Price

@sprice354_

2 months

RT @AnthropicAI: New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we f…

0

740

0

Sara Price

@sprice354_

2 months

RT @jenner_erik: How robust are LLM latent-space defenses, like monitoring with SAEs, probes, or OOD detectors? We adversarially stress-tes…

0

10

0

Sara Price

@sprice354_

2 months

Additionally, thank you to @_robertkirk, @javirandor and @maksym_andr for feedback on our paper. Also, thanks to @AnthropicAI, @matsprogram, @Speechmatics, and @farairesearch for their support.

0

5

Sara Price

@sprice354_

2 months

RT @jplhughes: 🚨🛡️Jailbreak Defense in a Narrow Domain 🛡️🚨 Jailbreaking is easy. Defending is hard. Might defending against a single, narr…

0

10

0

Sara Price

@sprice354_

2 months

RT @Turn_Trout: 1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what…

0

87

0

Sara Price

@sprice354_

4 months

RT @IvanArcus: Our paper introducing InterpBench was accepted to @NeurIPSConf ! 🚀 Check it out if you want to know how we built a benchmar…

0

3

0

Sara Price

@sprice354_

5 months

RT @EthanJPerez: I’m taking applications for collaborators via @MATSprogram! It’s a great way for new or experienced researchers outside AI…

0

30

0

Sara Price

@sprice354_

7 months

RT @StephenLCasper: 🚨New paper: Targeted LAT Improves Robustness to Persistent Harmful Behaviors in LLMs ✅ Improved jailbreak robustness (i…

0

41

0

Sara Price

@sprice354_

7 months

We hope this paper kicks off more work on deceptive alignment! Authors: me, @panickssery, @AsaCoopStick, @sleepinyourhat arXiv link:

0

16