![Joshua Batson Profile](https://pbs.twimg.com/profile_images/1251741215415885829/6kDbML8z_x96.jpg)
Joshua Batson
@thebasepoint
Followers
3K
Following
5K
Statuses
2K
trying to understand evolved systems (🖥 and 🧬) interpretability research @anthropicai formerly @czbiohub, @mit math
Oakland, CA
Joined February 2012
RT @AnthropicAI: We’re starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-…
0
309
0
RT @bneyshabur: Thrilled to share that I’m joining @AnthropicAI ! After 5.5 amazing years at Alphabet, including working on Gemini’s reaso…
0
23
0
RT @AnthropicAI: Crosscoders (published today) are a new method allowing us to find features shared across differe…
0
32
0
RT @esindurmusnlp: Excited to share my new research on evaluating feature steering: I ran quantitative evaluations on how steering specific…
0
18
0
@a_karvonen Lovely work! Are the pair of per-method trajectories on the right different dictionary sizes?
1
0
1
RT @cogcelia: The thing about AI is that no one knows how it works (not even AI developers). Interpreting AI is HARD, but it’s a challenge…
0
9
0
RT @farairesearch: Do neural networks dream of internal goals? We confirm RNNs trained to play Sokoban with RL learn to plan. Our black-box…
0
44
0
Excellent new work from Ben and Michael on white box attacks yielding universalizing jailbreaks. The mix of discrete and continuous optimization is very hard to get right, and these guys are some of the best out there. Impressive results.
1/ @michaelbsklar and I just published "Fluent student-teacher redteaming" - The key idea is an improved objective function for discrete-optimization-based adversarial attacks based on distilling the activations/logits from a toxified model.
0
0
4
A great opportunity for *anyone* to do some interpretability on a ~frontier model.
Time to study #llama3 405b, but gosh it's big! Please retweet: if you have a great experiment but not enough GPU, here is an opportunity to apply for shared #NDIF research resources. Deadline July 30. You'll help @ndif_team test, we'll help you run 405b
0
1
19
RT @livgorton: A lot of early mechanistic interpretability work focused on InceptionV1 (an ImageNet model from 2014). They made a lot of pr…
0
34
0
Last week I tried my hand at hosting a podcast for the first time, interviewing my colleagues about the engineering work that went into scaling monosemanticity. If this sounds fun to you, we are hiring senior engineers...
Science and engineering are inseparable. Watch our new roundtable video where our researchers discuss the engineering challenges of interpretability research:
0
2
46
RT @nabla_theta: Excited to share what I've been working on as part of the former Superalignment team! We introduce a SOTA training stack…
0
85
0