![Yi Zeng 曾祎 Profile](https://pbs.twimg.com/profile_images/1625135041637412864/w46fqNhR_x96.jpg)
Yi Zeng 曾祎
@EasonZeng623
Followers: 1K
Following: 2K
Statuses: 563
probe to improve @VirtueAI_co | Ph.D. @VTEngineering | Amazon Research Fellow | #AI_safety 🦺 #AI_security 🛡 | I deal with the dark side of machine learning.
Virginia, US
Joined August 2017
Now you know there's another dude who just discussed AI Safety and Security with both sides ;) #NeurIPS2023 [📸 With the legends @ylecun and Yoshua Bengio]
1
5
117
RT @Yihe__Deng: New paper & model release! Excited to introduce DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails…
0
28
0
RT @aleks_madry: Do current LLMs perform simple tasks (e.g., grade school math) reliably? We know they don't (is 9.9 larger than 9.11?), b…
0
42
0
RT @aryaman2020: new paper! 🫡 we introduce 🪓AxBench, a scalable benchmark that evaluates interpretability techniques on two axes: concept…
0
69
0
RT @tomekkorbak: 🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe? How do we make a safety case demonstrati…
0
36
0
RT @arankomatsuzaki: Open Problems in Mechanistic Interpretability This forward-facing review discusses the current frontier of mechanisti…
0
38
0
RT @leedsharkey: Big new review! 🟦Open Problems in Mechanistic Interpretability🟦 We bring together perspectives from ~30 top researchers…
0
86
0
RT @McaleerStephen: DeepSeek should create a preparedness framework/RSP if they continue to scale reasoning models.
0
13
0
RT @rm_rafailov: We have a new position paper on "inference time compute" and what we have been working on in the last few months! We prese…
0
236
0
RT @AnthropicAI: New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we f…
0
740
0
RT @LukeBailey181: Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets…
0
82
0
RT @xun_aq: Code agents are great, but not risk-free in code execution and generation! 🎯 We propose RedCode, an evaluation platform to com…
0
10
0
RT @xun_aq: For more details, please visit our paper. I'm Xun Liu, a senior undergraduate student advised by Pr…
0
1
0
RT @yujink_: How will LLMs reshape our democracy? Recent work including ours has started exploring this important question. We recently wr…
0
19
0
Paper 1530 here @emnlpmeeting. Seeing y'all soon 🤫
Excited to present "BEEAR" at @emnlpmeeting! Join us in Session 03: Ethics, Bias, and Fairness 🕑on Nov 12 (Tue) from 14:00-15:30. See you in Miami! 🎉
0
1
11
RT @jbhuang0604: Wow!! 🤯🤯🤯 Openings of *30* tenured and/or tenure-track faculty positions in Artificial Intelligence!
0
40
0
RT @farairesearch: Bay Area Alignment Workshop Day 2 packed with learnings on interpretability, robustness, oversight & beyond! Shoutout to…
0
7
0
RT @kevin_klyman: Come to our workshop on the future of third party AI evaluations on Monday! We have some of the top folks in the field on…
0
6
0
RT @farairesearch: Kicked off Day 1 of the Bay Area Alignment Workshop in Santa Cruz with amazing energy! Huge thanks to @ancadianadragan,…
0
6
0
Javier’s take aligns with mine. It’s the same feeling I had after reading Anthropic’s new RSP, where the red-teaming focuses on testing models with safety guardrails removed rather than just evaluating refusals. Assessing how models might assist harmful actions under jailbreaks makes more sense to me for preventing catastrophic risks.
Jailbreaks have become a new sort of ImageNet competition instead of helping us better understand LLM security. I wrote a blogpost about what I think valuable research could look like 🧵
0
0
12