Tinghao Xie @VitusXie profile

Tinghao Xie

@VitusXie

Followers

657

Following

406

Media

18

Statuses

97

3rd year ECE PhD candidate @Princeton | Prev Intern @Meta GenAI

https://t.co/SC5TZmlvjm

New Jersey, USA

Joined February 2021

Don't wanna be here? Send us removal request.

Explore tweets Explore followers Explore following

Explore trending content on Musk Viewer

#خلصوا_صفقات_الهلال1 • 1032757 Tweets

ラピュタ • 407072 Tweets

Atatürk • 402563 Tweets

Megan • 239814 Tweets

Johnny • 237141 Tweets

نابولي • 198560 Tweets

#4MINUTES_EP6 • 147191 Tweets

RM IS COMING • 134533 Tweets

namjoon • 126560 Tweets

olivia • 121447 Tweets

Napoli • 104591 Tweets

Sterling • 96798 Tweets

Coco • 55238 Tweets

Labor Day • 53972 Tweets

كاس العالم • 50869 Tweets

Arteta • 37613 Tweets

Nelson • 31606 Tweets

Javier Acosta • 26543 Tweets

ŹOOĻ記念日 • 25012 Tweets

Día Internacional • 23875 Tweets

Romney • 19138 Tweets

Tadic • 16756 Tweets

Lolla • 15982 Tweets

Dzeko • 15098 Tweets

#FBvALY • 13983 Tweets

Lo Celso • 12627 Tweets

نادي سعودي • 11845 Tweets

naley

Elanga

Opiyo Wandayi

One Tree Hill

コジミツコさん

BROOKE DAVIS

IQOS Yasallaşsın

İrfan Can

Serdar Dursun

Peyton

CRYBABIES X SORIANA

Barella

Morro da Conceição

Maximin

Şenol Güneş

Justin Timberlake

Fatih Tekke

Kenya Power

Alanya

عين الغياب

#النفسيه_محتاجه_ايفون16

#من_مصادرر_السعاده

Öztürk Yılmaz

Last Seen Profiles

@_a_boy__

@Billy_D_J

@NeraVolpa

@Cdevera_EICESI

@AntanasMockus

@BIGHIT_INFO

@CcsSingle

@Elite304Hoops

@rubinabele

@m_mirdula

@SBCE7106

@bini_members

@parniplus

@hideiry

@Ccnewspace

@rafaelpaisgk

@huskerenjoyer

@jonnfigg

@GolfWYF

@denariilabs

Pinned Tweet

Tinghao Xie

@VitusXie

2 months

🌟New LLM Safety Benchmark🌟 🥺SORRY-Bench: Systematically Evaluating LLM Safety Refusal Behaviors () LLMs are trained to refuse unsafe user requests. 🤨But... are they able to give advice on adult content ♂️♀️🌈? 🧐What about generating erotic stories 📖?

4

51

212

Tinghao Xie

@VitusXie

3 months

First day of my internship @Meta GenAI!

5

3

134

Tinghao Xie

@VitusXie

2 months

🦾Gemma-2 and Claude 3.5 are out. 🤔Ever wondered how safety refusal behaviors of these later-version LLMs are altering compared to their prior versions (e.g., Gemma-2 v.s. Gemma-1)? ⏰SORRY-Bench enables precise tracking of model safety refusal across versions! Check the image

Tinghao Xie

@VitusXie

2 months

🌟New LLM Safety Benchmark🌟 🥺SORRY-Bench: Systematically Evaluating LLM Safety Refusal Behaviors () LLMs are trained to refuse unsafe user requests. 🤨But... are they able to give advice on adult content ♂️♀️🌈? 🧐What about generating erotic stories 📖?

4

51

212

2

14

84

Tinghao Xie

@VitusXie

1 month

🚨Wondering how often 🔥Llama-3.1-405B-Instruct refuses to answer potentially unsafe instructions? 👇We outline the percentages of potentially unsafe instructions fulfilled by Llama-3.1 models from SORRY-Bench () below. 🔥The new 405B model lies

0

18

69

Tinghao Xie

@VitusXie

1 month

🎁GPT-4o-mini just drops in to replace GPT-3.5-turbo! Well, how has its 🚨safety refusal capability changed over the past year? 📉GPT-3.5-turbo 0613 (2023) ⮕ 1106 ⮕ 0125 ⮕ GPT-4o-mini 0718📈 On 🥺SORRY-Bench, we outline the change of these models' safety refusal behaviors

7

23

65

Tinghao Xie

@VitusXie

4 months

Oral presentation at Halle A7 at 10am! Poster up also⬇️ Come chat!!

0

5

33

Tinghao Xie

@VitusXie

1 month

⚔️Not sure whether Mistral Large 2 (123B) or Llama 3.1 (405B) is better. (Let's wait for arena results from @lmsysorg !) But on 🥺SORRY-Bench, Mistral Large 2 fulfills ~60% potentially unsafe prompts, whereas Llama-3.1-405B-Instruct fulfills only ~25%.

Tinghao Xie

@VitusXie

1 month

🚨Wondering how often 🔥Llama-3.1-405B-Instruct refuses to answer potentially unsafe instructions? 👇We outline the percentages of potentially unsafe instructions fulfilled by Llama-3.1 models from SORRY-Bench () below. 🔥The new 405B model lies

0

18

69

4

1

25

Tinghao Xie

@VitusXie

2 months

More info: - Website: - Code: - Dataset: - Human Judge Dataset: - Fine-tuned Safety Judge LLM: - Paper: (n/n)

SORRY-Bench: Systematically Evaluating Large Language Model Safety...

Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face...

arxiv.org

0

1

13

Tinghao Xie

@VitusXie

4 months

Surviving from jet lag at ✈️Vienna @iclr_conf ! Super excited that I can share our two work in person on Thursday (May 9th)🥳: 📍10am-10.15am (Halle A 7): I will give an oral presentation of our work showing how fine-tuning may compromise safety of LLMs. (1/2)

Xiangyu Qi

@xiangyuqi_pton

11 months

Meta's release of Llama-2 and OpenAI's fine-tuning APIs for GPT-3.5 pave the way for custom LLM. But what about safety? 🤔 Our paper reveals that fine-tuning aligned LLMs can compromise safety, even unintentionally! Paper: Website:

11

37

169

1

0

13

Tinghao Xie

@VitusXie

2 months

‼️What a nuance! We definitely need a LLM safety refusal benchmark to systematically capture such discrepant refusal behaviors. We propose 🥺SORRY-Bench, to systematically evaluate LLM safety refusal behaviors, in a balanced, granular, customizable, and efficient manner. Our

1

0

10

Tinghao Xie

@VitusXie

9 months

@LigengZhu @iclr_conf And maybe reviewers that don't reply at all should also be banned🥲

2

0

10

Tinghao Xie

@VitusXie

11 months

🚔 Jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 examples at a cost of $0.20 via OpenAI’s APIs! ‼️ Also, be cautious when customizing your #LLM via fine-tuning -- the security alignment may be (accidentally) subverted🫨. Check out our work for details!

Xiangyu Qi

@xiangyuqi_pton

11 months

Meta's release of Llama-2 and OpenAI's fine-tuning APIs for GPT-3.5 pave the way for custom LLM. But what about safety? 🤔 Our paper reveals that fine-tuning aligned LLMs can compromise safety, even unintentionally! Paper: Website:

11

37

169

1

10

Tinghao Xie

@VitusXie

11 months

With OpenAI's new fine-tuning UI🧑‍💻, here's a video record of how we easily fine-tuned GPT-3.5 at a cost of $0.12🪙 (within 5min⏰), and asked it to formulate a plan to eliminate human race 🫥...

Xiangyu Qi

@xiangyuqi_pton

11 months

Meta's release of Llama-2 and OpenAI's fine-tuning APIs for GPT-3.5 pave the way for custom LLM. But what about safety? 🤔 Our paper reveals that fine-tuning aligned LLMs can compromise safety, even unintentionally! Paper: Website:

11

37

169

0

1

8

Tinghao Xie

@VitusXie

2 months

📊Putting these together, we evaluate over 40 proprietary and open-source LLMs on SORRY-Bench (as shown in the 3rd post above), analyzing their distinctive refusal behaviors. Again, come play with our benchmark results demo at ! 🏗️We hope our effort

1

0

9

Tinghao Xie

@VitusXie

2 months

As shown below, our finding suggests that small (7B) LLMs, when fine-tuned on sufficient human annotations, can achieve 🎯satisfactory accuracy (over 80% human agreement), comparable with and even surpassing larger scale LLMs (e.g., GPT-4o). Adopting these fine-tuned small-scale

2

0

9

Tinghao Xie

@VitusXie

2 months

According to our study, we found these 20 linguistic mutations have 💡noticeably different impacts on model safety refusal behaviors. Results shown below (🟦blue indicates more safety refusal, and 🟥red indicates more fulfillment). (9/n)

1

0

8

Tinghao Xie

@VitusXie

2 months

### GAP 2 ### We ensure balance not just over topics, but over linguistic characteristics. Existing safety evaluations often fail to capture different formatting and linguistic features in user inputs. For example, all unsafe prompts from AdvBench are phrased as imperative

2

0

8

Tinghao Xie

@VitusXie

9 months

✈️At NeurIPS 2023 through 12/10-12/17! I work on AI safety and security. DM me if you're interested in: * 🔐LLM harmfulness evaluation / LLM alignment attacks & defenses / DNN backdoor / ... * 🔬Looking for a summer research intern * 🧗 Bouldering / climbing nearby * 🍜☕️🎷...

0

8

Tinghao Xie

@VitusXie

2 months

### GAP 3 ### We investigate what design choices make a fast and accurate safety benchmark evaluator, a trade-off that prior work has not so systematically examined. To benchmark safety behaviors, we need an efficient and accurate evaluator to decide whether a LLM response is in

1

0

7

Tinghao Xie

@VitusXie

2 months

### GAP 1 ### We found that many prior benchmark datasets are built upon coarse-grained and varied safety categories. For example, some include broad categories like “Illegal Items” in the taxonomy, and others use more fine-grained subcategories like “Theft” and “Illegal Drug

1

0

7

Tinghao Xie

@VitusXie

2 months

On top of this 45-class taxonomy, we craft a class-balanced LLM safety refusal evaluation dataset (). Our base dataset consists of 450 unsafe instructions, with numerous manually created novel data points to ensure equal coverage across the 45 safety

1

0

7

Tinghao Xie

@VitusXie

2 months

We address this by explicitly considering 20 diverse linguistic mutations that real-world users might apply to phrase their unsafe prompts (Figure below; see §2.4 of our paper). These include rephrasing our dataset according to different writing styles (e.g., interrogative

1

0

7

Tinghao Xie

@VitusXie

2 months

To bridge this gap, we present a fine-grained 45-class safety taxonomy across 4 high-level domains. We curate this taxonomy to ♻️unify the disparate taxonomies from prior work, employing a 🔍human-in-the-loop procedure for refinement: 1) First, we map data points from previous

1

0

7

Tinghao Xie

@VitusXie

11 months

Tinghao Xie

@VitusXie

11 months

🚔 Jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 examples at a cost of $0.20 via OpenAI’s APIs! ‼️ Also, be cautious when customizing your #LLM via fine-tuning -- the security alignment may be (accidentally) subverted🫨. Check out our work for details!

1

10

0

3

7

Tinghao Xie

@VitusXie

7 months

🧐Our recent work that attributes LLM safety behaviors to certain model weights, at both neuron & rank level. 🚨We show that removing 3% weights can undo safety (meanwhile preserve utility). Check out for more details!

Boyi Wei

@wei_boyi

7 months

Wondering why LLM safety mechanisms are fragile? 🤔 😯 We found safety-critical regions in aligned LLMs are sparse: ~3% of neurons/ranks ⚠️Sparsity makes safety easy to undo. Even freezing these regions during fine-tuning still leads to jailbreaks 🔗 [1/n]

5

47

185

0

6

Tinghao Xie

@VitusXie

9 months

Microsoft wins all🫠 (So will you have more intern headcounts for this??

Satya Nadella

@satyanadella

9 months

We remain committed to our partnership with OpenAI and have confidence in our product roadmap, our ability to continue to innovate with everything we announced at Microsoft Ignite, and in continuing to support our customers and partners. We look forward to getting to know Emmett

5K

15K

92K

0

5

Tinghao Xie

@VitusXie

9 months

Turns out top reviewers receive complementary registrations @NeurIPSConf (?) Saved my advisor $500! 🤩

1

0

5

Tinghao Xie

@VitusXie

8 months

@EasonZeng623 Congrats🔥

0

Tinghao Xie

@VitusXie

11 months

Our work about risks of LLM fine-tuning is highlighted by @nytimes ! Thanks @CadeMetz for the chat and reporting this (possibly long-lasting) concern😢 of current AI guardrails.

Researchers Say Guardrails Built Around A.I. Systems Are Not So Sturdy

OpenAI now lets outsiders tweak what its chatbot does. A new paper says that can lead to trouble.

www.nytimes.com

0

3

Tinghao Xie

@VitusXie

4 months

📍10.45-12.45pm (Poster Session 5): Hosting our poster on LLM fine-tuning risks. 📍16.30-18.30pm (Poster Session 6): Hosting poster of our other work on an accurate backdoor defense by extracting backdoor functionality via fine-tuning. () Happy to chat!

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor...

We present a novel defense, against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the...

openreview.net

0

3

Tinghao Xie

@VitusXie

2 months

@HowieH36226 @xiangyuqi_pton @EasonZeng623 @YangsiboHuang @UdariMadhu @danqi_chen @PeterHndrsn @prateekmittal_ @ying11231 @DachengLi177 Super comprehensive work on different aspects of trustworthiness! Here we dive into safety refusal, which is one of the prominent aspects. Thanks for pointing out this connection!🫡

0

2

Tinghao Xie

@VitusXie

4 months

Very exciting work! Congrats🎊!!

Yueqi Xie

@XieYueqi

4 months

🎉 Excited to announce our paper accepted at ACL 2024: "GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis." Big thanks to collaborators! We show that analyzing gradients of safety-critical parameters enables effective detection of unsafe prompts.

1

27

0

1

Tinghao Xie

@VitusXie

11 months

@GeorgeL84893376 @NeurIPSConf Did exactly the same thing last month 🤣

0

1

Tinghao Xie

@VitusXie

3 months

@PeterHndrsn 🍄

0

1

Tinghao Xie

@VitusXie

1 month

Enjoyed reading it very much!!!

Peter Henderson

@PeterHndrsn

1 month

❗️Our recent paper on liability for AI "speech" was cited in a @nymag column on the topic! Read "Where's the Liability in Harmful AI Speech?": NYMag article:

0

5

22

0

1

Tinghao Xie

@VitusXie

11 months

@random_walker @xiangyuqi_pton @PeterHndrsn Seems like the cause is that this domain name is somehow not parseable in certain network (e.g. our campus wifi) temporarily. We are still trying to resolve this issue🥲. For now the website should work once you switch to mobile data.

0

1

Tinghao Xie

@VitusXie

1 month

@justinphan3110 Thanks for the pointer. Would be definitely interesting to see how well HarmBench classifier can perform on our benchmark! Will add it to our meta-evaluation Table.

0

1

Tinghao Xie

@VitusXie

3 months

@javirandor @Meta I'm in Menlo Park🥲

1

0

1

Tinghao Xie

@VitusXie

18 days

@javirandor @Princeton i miss the seminar so much🥲

0

1

Tinghao Xie

@VitusXie

3 months

@nicolayr_ @Meta thanks!!

0

1

Tinghao Xie

@VitusXie

11 months

@YangsiboHuang @Sam_K_G @xiamengzhou @danqi_chen Does this only work for open-source models? How about GPT series models, since they also allow temperature and top p configurations?

1

0

1