![Yi Zeng 曾祎 Profile](https://pbs.twimg.com/profile_images/1625135041637412864/w46fqNhR_x96.jpg)
Yi Zeng 曾祎
@EasonZeng623
Followers: 1K
Following: 2K
Statuses: 563
probe to improve @VirtueAI_co | Ph.D. @VTEngineering | Amazon Research Fellow | #AI_safety 🦺 #AI_security 🛡 | I deal with the dark side of machine learning.
Virginia, US
Joined August 2017
Now you know there's another dude who just discussed AI Safety and Security with both sides ;) #NeurIPS2023 [📸 With the legends @ylecun and Yoshua Bengio]
1
5
117
RT @Yihe__Deng: New paper & model release! Excited to introduce DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails…
0
28
0
RT @aleks_madry: Do current LLMs perform simple tasks (e.g., grade school math) reliably? We know they don't (is 9.9 larger than 9.11?), b…
0
42
0
RT @aryaman2020: new paper! 🫡 we introduce 🪓AxBench, a scalable benchmark that evaluates interpretability techniques on two axes: concept…
0
69
0
RT @tomekkorbak: 🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe? How do we make a safety case demonstrati…
0
36
0
RT @arankomatsuzaki: Open Problems in Mechanistic Interpretability This forward-facing review discusses the current frontier of mechanisti…
0
38
0
RT @leedsharkey: Big new review! 🟦Open Problems in Mechanistic Interpretability🟦 We bring together perspectives from ~30 top researchers…
0
86
0
RT @McaleerStephen: DeepSeek should create a preparedness framework/RSP if they continue to scale reasoning models.
0
13
0
RT @rm_rafailov: We have a new position paper on "inference time compute" and what we have been working on in the last few months! We prese…
0
236
0
RT @AnthropicAI: New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we f…
0
740
0
RT @LukeBailey181: Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets…
0
82
0
RT @xun_aq: Code agents are great, but not risk-free in code execution and generation! 🎯 We propose RedCode, an evaluation platform to com…
0
10
0
RT @xun_aq: For more details, please visit our paper. I'm Xun Liu, a senior undergraduate student advised by Pr…
0
1
0
RT @yujink_: How will LLMs reshape our democracy? Recent work including ours has started exploring this important question. We recently wr…
0
19
0
Paper 1530 here @emnlpmeeting. Seeing y'all soon 🤫
Excited to present "BEEAR" at @emnlpmeeting! Join us in Session 03: Ethics, Bias, and Fairness 🕑on Nov 12 (Tue) from 14:00-15:30. See you in Miami! 🎉
0
1
11
RT @jbhuang0604: Wow!! 🤯🤯🤯 Openings of *30* tenured and/or tenure-track faculty positions in Artificial Intelligence!
0
40
0
RT @farairesearch: Bay Area Alignment Workshop Day 2 packed with learnings on interpretability, robustness, oversight & beyond! Shoutout to…
0
7
0
RT @kevin_klyman: Come to our workshop on the future of third party AI evaluations on Monday! We have some of the top folks in the field on…
0
6
0
RT @farairesearch: Kicked off Day 1 of the Bay Area Alignment Workshop in Santa Cruz with amazing energy! Huge thanks to @ancadianadragan,…
0
6
0
Javier’s take aligns with mine. It’s the same feeling I had after reading Anthropic’s new RSP, where the red-teaming focuses on testing models with safety guardrails removed rather than just evaluating refusals. Assessing how models might assist harmful actions under jailbreaks makes more sense to me for preventing catastrophic risks.
Jailbreaks have become a new sort of ImageNet competition instead of helping us better understand LLM security. I wrote a blogpost about what I think valuable research could look like 🧵
0
0
12