Yangsibo Huang Profile
Yangsibo Huang

@YangsiboHuang

Followers: 3,466 · Following: 679 · Media: 14 · Statuses: 234

Research scientist @GoogleAI. Prev: PhD from @Princeton @PrincetonPLI. ML security & privacy. Opinions are my own.

Princeton, NJ
Joined October 2014
Pinned Tweet
@YangsiboHuang
Yangsibo Huang
3 days
What should we expect from unlearning for LMs (more in 🧵)? Data owners may want the LM to unlearn the wording / knowledge of their data, w/o privacy leakage. But model deployers may want the unlearned LM to remain useful, even after sequential unlearning requests that may vary
@WeijiaShi2
Weijia Shi
3 days
Can 𝐦𝐚𝐜𝐡𝐢𝐧𝐞 𝐮𝐧𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠 make language models forget their training data? We show yes, but at the cost of privacy and utility. Current unlearning scales poorly with the size of the data to be forgotten and can’t handle sequential unlearning requests. 🔗:
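A minimal sketch of what such an evaluation could look like, assuming hypothetical `unlearn`, `forget_score`, and `utility_score` helpers (none of these are the benchmark's actual API): apply unlearning requests one at a time and track forget quality and utility after each step.

```python
# Illustrative sketch only: a generic harness for evaluating sequential unlearning
# requests, tracking both forget quality and retained utility after each request.
# `unlearn`, `forget_score`, and `utility_score` are hypothetical placeholders.

def evaluate_sequential_unlearning(model, requests, utility_set,
                                   unlearn, forget_score, utility_score):
    """Apply unlearning requests one by one and log metrics after each step."""
    history = []
    for i, forget_set in enumerate(requests, start=1):
        model = unlearn(model, forget_set)               # e.g., gradient ascent / preference optimization
        history.append({
            "request": i,
            "forget": forget_score(model, forget_set),   # lower = better forgetting
            "utility": utility_score(model, utility_set) # higher = model still useful
        })
    return history
```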
@YangsiboHuang
Yangsibo Huang
9 months
Microsoft's recent work () shows how LLMs can unlearn copyrighted training data via strategic finetuning: They made Llama2 unlearn Harry Potter's magical world. But our Min-K% Prob () found some persistent “magical traces”!🔮 [1/n]
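A minimal sketch of the Min-K% Prob scoring idea, assuming a Hugging Face causal LM (the checkpoint name in the usage comment is only an example): score a passage by the average log-probability of its k% least-likely tokens; memorized passages tend to have few very-low-probability tokens.

```python
# Illustrative sketch of Min-K% Prob scoring; k and the model are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob(text, model, tokenizer, k=0.2):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # log-probability assigned to each actual next token
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    n = max(1, int(k * token_lp.numel()))
    lowest = torch.topk(token_lp, n, largest=False).values
    return lowest.mean().item()   # higher score -> more likely seen during pretraining

# usage sketch (model name is just an example):
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# score = min_k_prob("passage to test", lm, tok)
```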
@YangsiboHuang
Yangsibo Huang
9 months
Are open-source LLMs (e.g. LLaMA2) well aligned? We show how easy it is to exploit their generation configs for CATASTROPHIC jailbreaks ⛓️🤖⛓️ * 95% misalignment rates * 30x faster than SOTA attacks * insights for better alignment Paper & code at: [1/8]
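A minimal sketch of the generation-exploitation idea: sweep ordinary decoding configurations instead of crafting adversarial prompts, and flag any configuration whose output is not a refusal. The config grid and the refusal heuristic are illustrative placeholders, not the paper's exact settings or evaluator.

```python
# Illustrative sketch: vary temperature / top-p / top-k and collect outputs.
from itertools import product

def sweep_decoding_configs(model, tokenizer, prompt,
                           temperatures=(0.7, 1.0, 1.5),
                           top_ps=(0.7, 0.9, 1.0),
                           top_ks=(20, 50, 0)):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = []
    for t, p, k in product(temperatures, top_ps, top_ks):
        gen = model.generate(**inputs, do_sample=True, temperature=t,
                             top_p=p, top_k=k, max_new_tokens=128)
        text = tokenizer.decode(gen[0], skip_special_tokens=True)
        outputs.append(((t, p, k), text))
    return outputs

def is_refusal(text):
    # crude keyword heuristic; a real evaluator would use a trained classifier
    return any(s in text for s in ("I cannot", "I can't", "Sorry", "I'm sorry"))
```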
@YangsiboHuang
Yangsibo Huang
1 year
Retrieval-based language models excel in interpretability, factuality, and adaptability due to their ability to leverage data from their datastore. Now, there are proposals to use private user datastore for model personalization. Would this approach compromise privacy?🤔
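To make the concern concrete, a minimal sketch of kNN-LM-style next-token interpolation: the datastore holds (context-embedding, next-token) pairs built directly from raw text, so a private datastore sits one lookup away from the output distribution. Shapes and the interpolation weight below are illustrative, not the paper's configuration.

```python
# Illustrative sketch of kNN-LM interpolation: p = lam * p_knn + (1 - lam) * p_lm.
import numpy as np

def knn_lm_next_token_probs(p_lm, query_vec, datastore_keys, datastore_next_tokens,
                            vocab_size, k=8, lam=0.25, temperature=1.0):
    # distance-weighted distribution over next tokens stored with the k nearest keys
    dists = np.linalg.norm(datastore_keys - query_vec, axis=1)
    nn = np.argsort(dists)[:k]
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, tok in zip(weights, datastore_next_tokens[nn]):
        p_knn[tok] += w
    # interpolate the retrieval distribution with the parametric LM distribution
    return lam * p_knn + (1.0 - lam) * p_lm
```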
@YangsiboHuang
Yangsibo Huang
17 days
Among all the open-weight models, Gemma2 9B & 27B are the top performers in rejecting unsafe requests according to our SORRY-Bench: Gemma's post-training must have taken a lot of effort
@VitusXie
Tinghao Xie
17 days
🦾Gemma-2 and Claude 3.5 are out. 🤔Ever wondered how the safety refusal behaviors of these later-version LLMs have changed compared to their prior versions (e.g., Gemma-2 vs. Gemma-1)? ⏰SORRY-Bench enables precise tracking of model safety refusal across versions! Check the image
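A minimal sketch of tracking refusal rates across model versions in the spirit of the quoted tweet; the judging function and model names are placeholders (SORRY-Bench uses its own calibrated evaluator, not this heuristic).

```python
# Illustrative sketch: refusal rate on unsafe prompts, compared across model versions.

def refusal_rate(responses, judge_is_refusal):
    """Fraction of responses to unsafe prompts that the judge marks as refusals."""
    return sum(judge_is_refusal(r) for r in responses) / max(1, len(responses))

def compare_versions(unsafe_prompts, models, generate, judge_is_refusal):
    # models: dict like {"gemma-1": m1, "gemma-2": m2} (names are examples only)
    report = {}
    for name, model in models.items():
        responses = [generate(model, p) for p in unsafe_prompts]
        report[name] = refusal_rate(responses, judge_is_refusal)
    return report  # e.g., {"gemma-1": 0.71, "gemma-2": 0.88}
```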
@YangsiboHuang
Yangsibo Huang
7 months
I am at #NeurIPS2023 now. I am also on the academic job market, and humbled to be selected as a 2023 EECS Rising Star✨. I work on ML security, privacy & data transparency. Appreciate any reposts & happy to chat in person! CV+statements: Find me at ⬇️
@YangsiboHuang
Yangsibo Huang
3 years
Gradient inversion attacks in #FederatedLearning can recover private data from public gradients (privacy leaks!) Our #NeurIPS2021 work evaluates these attacks & potential defenses. We also release an evaluation library: Join us @ Oral Session 5 (12/10)!
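A minimal sketch of a DLG-style gradient-inversion loop (optimize dummy data until its gradient matches the shared gradient). This is the generic attack idea only, not the specific attacks or defenses evaluated in the paper.

```python
# Illustrative sketch: reconstruct private data by matching observed gradients.
import torch

def gradient_inversion(model, loss_fn, observed_grads, x_shape, num_classes, steps=100):
    dummy_x = torch.randn(x_shape, requires_grad=True)
    dummy_y = torch.randn(1, num_classes, requires_grad=True)   # soft label, also optimized
    opt = torch.optim.LBFGS([dummy_x, dummy_y])

    def closure():
        opt.zero_grad()
        pred = model(dummy_x)
        loss = loss_fn(pred, dummy_y.softmax(dim=-1))
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        # match the observed (shared) gradients coordinate-wise
        grad_diff = sum(((g - og) ** 2).sum() for g, og in zip(grads, observed_grads))
        grad_diff.backward()
        return grad_diff

    for _ in range(steps):
        opt.step(closure)
    return dummy_x.detach(), dummy_y.detach()
```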
@YangsiboHuang
Yangsibo Huang
2 months
Missed #ICLR24 due to visa, but my amazing collaborators are presenting our 4 works! ➀ Jailbreaking LLMs via Exploiting Generation (see thread) 👩‍🏫 @xiamengzhou ⏰ Fri 4:30 pm, Halle B #187 ➁ Detecting Pretraining Data from LLMs 👩‍🏫 @WeijiaShi2 ⏰ Fri 10:45 am, Halle B #95
@YangsiboHuang
Yangsibo Huang
4 years
How to tackle data privacy for language understanding tasks in distributed learning (without slowing down training or reducing accuracy)? Happy to share our new #emnlp2020 findings paper w/ @realZhaoSong , @danqi_chen , Prof. Kai Li, @prfsanjeevarora paper:
@YangsiboHuang
Yangsibo Huang
7 months
I am not able to travel to #EMNLP2023 due to visa issues. But my great coauthor @Sam_K_G is there and will present this work🤗 (pls consider him for internship opportunities!) I will attend #NeurIPS2023 next week. Let’s grab a ☕️ if you want to chat about LLM safety/privacy/data
@YangsiboHuang
Yangsibo Huang
9 months
Membership inference attack (MIA) is well-researched in ML security. Yet, its use in LLM pretraining is relatively underexplored. Our Min-K% Prob is stepping up to bridge this gap. Think you can do better? Try your methods on our WikiMIA benchmark 📈:
@WeijiaShi2
Weijia Shi
9 months
Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔 We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs). Check out our project: [1/n]
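A minimal sketch of how a WikiMIA-style benchmark is typically scored: each text carries a member/non-member label, a detection method assigns a score, and ROC AUC summarizes how well the score separates the two groups. Variable names are placeholders; the score function is whatever method you want to test.

```python
# Illustrative sketch: evaluate a membership-inference scoring method with ROC AUC.
from sklearn.metrics import roc_auc_score

def evaluate_mia_method(texts, labels, score_fn):
    """labels: 1 = in pretraining data, 0 = not; score_fn: higher = 'more likely member'."""
    scores = [score_fn(t) for t in texts]
    return roc_auc_score(labels, scores)

# usage sketch (assuming a min_k_prob-style scorer):
# auc = evaluate_mia_method(wikimia_texts, wikimia_labels,
#                           lambda t: min_k_prob(t, lm, tok, k=0.2))
```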
@YangsiboHuang
Yangsibo Huang
7 months
I will present DP-AdaFEST at #NeurIPS2023 (Thurs, poster session 6)! TL;DR - DP-AdaFEST effectively preserves the gradient sparsity in differentially private training of large embedding models, which translates to ~20x wall-clock time improvement for recommender systems (w/ TPU)
@GoogleAI
Google AI
7 months
Today on the blog learn about a new algorithm for sparsity-preserving differentially private training, called adaptive filtering-enabled sparse training (DP-AdaFEST), which is particularly relevant for applications in recommendation systems and #NLP . →
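A simplified illustration of why sparsity matters in DP training of embedding models: vanilla DP-SGD adds noise to every embedding row, densifying an otherwise sparse gradient, while a sparsity-preserving variant privately selects which rows to update and adds noise only there. This is a generic sketch of that idea, not the actual DP-AdaFEST algorithm; thresholds and noise scales are placeholders.

```python
# Illustrative sketch: clip per-example gradients, privately select embedding rows
# via noisy contribution counts, and add noise only to the selected rows.
import numpy as np

def sparse_dp_embedding_update(per_example_row_grads, num_rows, dim,
                               clip_norm=1.0, noise_mult=1.0, count_threshold=3.0, rng=None):
    rng = rng or np.random.default_rng()
    agg = np.zeros((num_rows, dim))
    counts = np.zeros(num_rows)
    for rows in per_example_row_grads:          # rows: {row_id: grad_vector} for one example
        flat = np.concatenate(list(rows.values()))
        scale = min(1.0, clip_norm / (np.linalg.norm(flat) + 1e-12))   # per-example clipping
        for r, g in rows.items():
            agg[r] += scale * g
            counts[r] += 1
    # privately select which rows to touch: noisy contribution counts over a threshold
    noisy_counts = counts + rng.normal(0.0, noise_mult, size=num_rows)
    selected = np.where(noisy_counts > count_threshold)[0]
    noisy = np.zeros_like(agg)
    noisy[selected] = agg[selected] + rng.normal(0.0, noise_mult * clip_norm,
                                                 size=(len(selected), dim))
    return noisy, selected                       # gradient stays sparse outside `selected`
```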
@YangsiboHuang
Yangsibo Huang
4 months
New policies mandate the disclosure of GenAI risks, but who evaluates them? Trusting AI companies alone is risky. We advocate (led by @ShayneRedford ): Independent researchers for evaluations + safe harbor from companies = Less chill, more trust. Agree? Sign our letter in 🧵!
@ShayneRedford
Shayne Longpre
4 months
Independent AI research should be valued and protected. In an open letter signed by over 100 researchers, journalists, and advocates, we explain how AI companies should support it going forward. 1/
@YangsiboHuang
Yangsibo Huang
5 months
I really enjoy working with these three amazing editors 😊 And super excited and fortunate to see part of my PhD work ending up as a chapter in the textbook “Federated Learning”!
@pinyuchenTW
Pin-Yu Chen
5 months
Happy to share the release of the book "Federated Learning: Theory and Practice" that I co-edited with @LamMNguyen3 @nghiaht87 , covering fundamentals, emerging topics, and applications. Kudos to the amazing contributors to make this book happen! @ElsevierNews @sciencedirect
@YangsiboHuang
Yangsibo Huang
9 months
@McaleerStephen Great work, Stephen! And thanks for maintaining the website! 👏 It's great that your "Red teaming" section (Sec 4.1.3) already discussed various jailbreak attacks. Additionally, I would like to draw your attention to some recent research papers that have explored alternative
@YangsiboHuang
Yangsibo Huang
2 years
Attending #NeurIPS2022 now! Happy to grab a coffee with new and old friends ☕️
@princeton_nlp
Princeton NLP Group
2 years
Recovering Private Text in Federated Learning of Language Models (Gupta et al.) w/ @Sam_K_G , @YangsiboHuang , @ZexuanZhong , @gaotianyu1350 , Kai Li, @danqi_chen Poster at Hall J #205 Thu 1 Dec 5 p.m. — 7 p.m. [2/7]
@YangsiboHuang
Yangsibo Huang
7 months
@prateekmittal_ Hi Prateek, it seems that the idea is relevant to our recently proposed Min-K% Prob (): detecting pretraining data from LLMs using MIA. One of our case studies is using Min-K% Prob to successfully identify failed-to-unlearn examples in an unlearned LLM:
@YangsiboHuang
Yangsibo Huang
3 years
Learned quite a lot from the mentorship roundtable at #NeurIPS2021 @WiMLworkshop ! Big shout out to the amazing organizers and mentors this year 🎊
@YangsiboHuang
Yangsibo Huang
9 months
Alignment proves brittle to changes in the system prompt and decoding configs. We show, w/ 11 open-source models including Vicuna, MPT, Falcon & LLaMA2, that exploiting various generation configs for decoding raises the misalignment rate to >95% for all! Examples: [3/8]
@YangsiboHuang
Yangsibo Huang
9 months
Very simple motivation: We notice that safety evaluations of LLMs often use a fixed config for model generation (and w/ a system prompt), which might overlook cases where the model's alignment deteriorates with different strategies. 📚 Some evidence from LLaMA2 paper: [2/8]
@YangsiboHuang
Yangsibo Huang
7 months
@katherine1ee @random_walker @jason_kint Agreed! Strategic fine-tuning does NOT give a guarantee for unlearning copyrighted content. For example, we showed that a model that has claimed to “unlearn” Harry Potter (via fine-tuning) still can answer many Harry Potter questions correctly!
@YangsiboHuang
Yangsibo Huang
9 months
We finally turn this bitter lesson into a better practice📚 We propose generation-aware alignment: proactively aligning models with output from different generation configurations. This reasonably reduces misalignment risk, but more work is needed. [7/8]
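A minimal sketch of the generation-aware alignment recipe as described above: collect the model's own outputs under many decoding configurations, keep the unsafe ones, and pair them with refusal targets as extra alignment data. The data format, helper functions, and refusal string are assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch: build preference-style alignment data from outputs elicited
# under varied decoding configurations.

def build_generation_aware_alignment_data(model, tokenizer, harmful_prompts,
                                          decoding_configs, generate, is_unsafe,
                                          refusal="I can't help with that."):
    examples = []
    for prompt in harmful_prompts:
        for cfg in decoding_configs:                      # e.g., dicts of temperature/top_p/top_k
            response = generate(model, tokenizer, prompt, **cfg)
            if is_unsafe(response):                       # this config elicited misaligned output
                examples.append({"prompt": prompt,
                                 "rejected": response,    # what the model should NOT say
                                 "chosen": refusal})      # alignment target
    return examples
```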
@YangsiboHuang
Yangsibo Huang
9 months
Evidence time 📚✨ We asked GPT-4 to craft 1k HP questions, then filtered top-100 suspicious questions according to Min-K% Prob. We had the unlearned model answer these questions. The "unlearned" model correctly answered 8% of them: HP content remains in its weights! [4/n]
@YangsiboHuang
Yangsibo Huang
9 months
Machine unlearning allows training data removal from models, in compliance w/ rules like GDPR. Microsoft's recent LLM unlearning proposal: strategically finetune LLMs. They demonstrated by erasing the Harry Potter (HP) world from Llama2-7B-chat: . [2/n]
@YangsiboHuang
Yangsibo Huang
3 years
We summarize a (growing) list of papers on gradient inversion attacks and defenses, including the fresh CAFE attack on vertical FL () by @pinyuchenTW and @Tianyi2020 at #NeurIPS2021! Have fun reading 🤓!
@YangsiboHuang
Yangsibo Huang
7 months
@ShunyuYao12 Share your story plz
@YangsiboHuang
Yangsibo Huang
9 months
Altogether we show a major failure in safety evaluation & alignment for open-source LLMs. Our recommendation: extensive red-teaming to assess risks across generation configs & our generation-aware alignment as a precaution. w/ amazing @Sam_K_G , @xiamengzhou , Kai Li, @danqi_chen
@YangsiboHuang
Yangsibo Huang
1 year
We present the first study of privacy implications of retrieval-based LMs, particularly kNN-LMs. paper: w/ @Sam_K_G , @ZexuanZhong , @danqi_chen , Kai Li
@YangsiboHuang
Yangsibo Huang
9 months
We also tried story completion✍️ We pinpointed suspicious text chunks in HP books w/ Min-K% Prob, prompted the unlearned model w/ contexts in these chunks, and asked for completions. 10 chunks scored >= 4 out of 5 in similarity w/ gold completion. [5/n]
@YangsiboHuang
Yangsibo Huang
7 months
🕐 Thursday 5pm, #1614 Sparsity-Preserving Differentially Private Training of Large Embedding Models, w/ Badih Ghazi, Pritish Kamath, Pasin Manurangsi, Amer Sinha, Chiyuan Zhang Featured by @GoogleAI blog post:
@YangsiboHuang
Yangsibo Huang
22 days
@wang1999_zt @kim__minseon et al. () leverage a large language model optimizer to generate prompts that potentially maximize the likelihood of generating copyrighted content in proprietary image-generation models.
@YangsiboHuang
Yangsibo Huang
9 months
@AIPanicLive @xiamengzhou @Sam_K_G @danqi_chen Hahaha I like this example 😂 Sure we will definitely test with more toxic and concerning domains!
@YangsiboHuang
Yangsibo Huang
22 days
We'd also like to acknowledge some cool concurrent work! @wang1999_zt et al. () explore the generation of copyrighted characters in T2I/T2V models and introduce a defense based on "revised generation."
@YangsiboHuang
Yangsibo Huang
1 year
@xiangyue96 Agreed that DP is needed (probably in combination with tricks such as decoupling key and query encoders to achieve better utility)! And thanks for the pointers to your ACL papers (will see if I can try them in our study!)😀
@YangsiboHuang
Yangsibo Huang
6 months
@yong_zhengxin @AIatMeta Congrats! See you around in Bay Area in summer!
@YangsiboHuang
Yangsibo Huang
9 months
We audit their unlearned model to see if it eliminates all content related to HP: 1️⃣ Collect HP-related content (questions / original book paras) 2️⃣ Apply our Min-K% Prob to identify suspicious content that may not be unlearned 3️⃣Validate by prompting the unlearned model [3/n]
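A minimal sketch of that three-step audit, with `min_k_prob`, `answer`, and `is_correct` as hypothetical stand-ins for the detector, generation, and grading steps:

```python
# Illustrative sketch: rank candidate content by a Min-K%-style memorization score,
# then check whether the "unlearned" model can still answer the most suspicious items.

def audit_unlearned_model(unlearned_model, tokenizer, candidates, gold_answers,
                          min_k_prob, answer, is_correct, top_n=100):
    # Step 2: flag the content that still looks memorized
    scored = sorted(zip(candidates, gold_answers),
                    key=lambda ca: min_k_prob(ca[0], unlearned_model, tokenizer),
                    reverse=True)[:top_n]
    # Step 3: validate by prompting the unlearned model and grading its answers
    still_known = [q for q, gold in scored
                   if is_correct(answer(unlearned_model, tokenizer, q), gold)]
    return len(still_known) / len(scored), still_known   # fraction not truly forgotten
```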
@YangsiboHuang
Yangsibo Huang
1 year
😢Mitigating untargeted risks is much more challenging. Mixing public and private data in both the datastore and encoder training shows some promise in reducing the risk, but doesn't go far.
@YangsiboHuang
Yangsibo Huang
1 year
We look into two privacy risks: 1) Targeted risk directly relates to specific text (e.g., phone #) 2) Untargeted risk is not directly detectable. Surprisingly, both risks are more pronounced in kNN-LMs with a private datastore vs. parametric LMs finetuned with private data 😱
@YangsiboHuang
Yangsibo Huang
9 months
@xuandongzhao @xiamengzhou @Sam_K_G @danqi_chen Good point! We haven’t tried adversarial prompts (e.g. universal prompts by Zou et al.) + generation exploitation since the headroom for improvement when attacking open-source LLMs is very limited (<5% 😂). But it makes sense to try with proprietary models!
@YangsiboHuang
Yangsibo Huang
2 months
@LChoshen @xiamengzhou @WeijiaShi2 Haha glad that sth caught your attention! They are just unicode symbols: ➀ ➁ ➂ ➃ ➄ ➅ ➆ ➇ ➈ ➉
@YangsiboHuang
Yangsibo Huang
2 months
@LChoshen @xiamengzhou @WeijiaShi2 I actually got them from Google search lol. Maybe try this query "Unicode: Circled Numbers"?
@YangsiboHuang
Yangsibo Huang
9 months
@VitusXie @Sam_K_G @xiamengzhou @danqi_chen Great qs! We found the attack is much weaker on proprietary models (see Sec 6 of our paper), which means that open-source LLMs lag far behind proprietary ones in alignment! (But your fine-tuning attack can break them 😉)
@YangsiboHuang
Yangsibo Huang
1 year
Undoubtedly, further efforts are required to address untargeted risks. Incorporating differential privacy (DP) 🛠️ into the aforementioned strategies would be an intriguing avenue worth exploring! #PrivacyMatters
@YangsiboHuang
Yangsibo Huang
9 months
@alignment_lab @xiamengzhou @Sam_K_G @danqi_chen Were you suggesting using the universal adversarial suffix () to trigger patterns like ‘sure thing!’? We compared with them in Section 4.4 in our paper: we are 30x faster (and strike a higher attack success rate)!
@YangsiboHuang
Yangsibo Huang
3 years
w/ my amazing collaborators Samyak Gupta, @realZhaoSong , Prof. Kai Li, and Prof. @prfsanjeevarora
@YangsiboHuang
Yangsibo Huang
9 months
@AIPanicLive @xiamengzhou @Sam_K_G @danqi_chen Thanks! To clarify, we tested w/ AdvBench () & our MaliciousInstruct. In all tested cases, LLaMA-chat & GPT-3.5 w/ default configs refrained from responding, potentially indicating a policy violation. We're open to expanding the eval scope as you suggest :)
@YangsiboHuang
Yangsibo Huang
2 months
@niloofar_mire I like your dress 😍😍😍
@YangsiboHuang
Yangsibo Huang
7 months
@gaotianyu1350 Thank you, Tianyu!!!
@YangsiboHuang
Yangsibo Huang
7 months
@ChaoweiX Thank you, Chaowei!
@YangsiboHuang
Yangsibo Huang
5 months
@katherine1ee Interesting… and even if I “translated” the link into tinyurl, it still cannot be posted
@YangsiboHuang
Yangsibo Huang
9 months
@nr_space @xiamengzhou @Sam_K_G @danqi_chen Thx 😊 “Catastrophic” was meant to refer to the surge in misalignment rate after very simple exploitation: 0% to 95%. I agree that the shown use case (answering malicious qs), though harmful, may not directly imply catastrophic outcome. We’ll tweak phrasing to avoid confusion :)
@YangsiboHuang
Yangsibo Huang
7 months
@liang_weixin Thank you, Weixin!!