Summer Yue
@summeryue0

Followers: 2,186 · Following: 309 · Media: 24 · Statuses: 71

Director of Research at Scale AI. Prev: RLHF lead on Bard, researcher at Google DeepMind / Brain (LaMDA, RL/TF-Agents, superhuman chip design). Opinions my own.

San Francisco, CA
Joined August 2014
@summeryue0
Summer Yue
11 months
I’m joining Scale and we are starting a new safety lab! Hiring researchers interested in trustworthy evaluations, red teaming and scalable oversight. These areas require hands-on interaction with human data, and Scale is an unparalleled place to do it.
16 replies · 45 reposts · 298 likes
@summeryue0
Summer Yue
5 months
🚀 Introducing the SEAL Leaderboards! We rank LLMs using private datasets that can’t be gamed. Vetted experts handle the ratings, and we share our methods in detail openly! Check out our leaderboards at ! Which evals should we build next?
10 replies · 33 reposts · 194 likes
@summeryue0
Summer Yue
4 months
1. Claude 3.5 Sonnet is now #1 in Instruction Following on the SEAL leaderboards () 🏆
8 replies · 29 reposts · 164 likes
@summeryue0
Summer Yue
3 months
🚀 We added Llama 3.1 405B onto the SEAL Leaderboards and it does not disappoint! Here's how it stacks up:
- 🥇 #1 in Instruction Following
- 🥈 #2 in GSM1k
- 💻 #4 in Coding
SEAL evals are private, expert evals that refresh periodically:
4 replies · 28 reposts · 137 likes
@summeryue0
Summer Yue
6 months
How much do LLMs overfit public benchmarks? Our team at @scale_ai SEAL lab studied this by creating a GSM8k-equivalent eval from scratch. The resulting performance gap reveals data contamination in some model families, while GPT, Claude, and Gemini show no signs of overfitting.
8 replies · 17 reposts · 124 likes
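A minimal sketch of the contamination check described above, using made-up accuracy numbers: the signal is the gap between a model's score on the public GSM8k and on a private, from-scratch equivalent (GSM1k). The 5-point threshold below is purely illustrative.

```python
# Hypothetical accuracies: (public GSM8k, private GSM1k) per model.
# A large public-minus-private gap suggests the public set leaked into training.
models = {
    "model_a": (0.92, 0.91),  # small gap: no sign of overfitting
    "model_b": (0.88, 0.74),  # large gap: possible contamination
}

for name, (public_acc, private_acc) in models.items():
    gap = public_acc - private_acc
    verdict = "possible contamination" if gap > 0.05 else "no clear overfitting"
    print(f"{name}: GSM8k={public_acc:.2f} GSM1k={private_acc:.2f} "
          f"gap={gap:+.2f} -> {verdict}")
```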
@summeryue0
Summer Yue
3 months
Announcing our latest SEAL Leaderboard on Adversarial Robustness!
🛡️ Red team-generated prompts
🎯 Focused on universal harm scenarios
🔍 Transparent evaluation methods
SEAL evals are private, expert evals that refresh periodically:
2 replies · 14 reposts · 103 likes
@summeryue0
Summer Yue
2 months
LLMs are often evaluated against single-turn automated attacks. This is an insufficient threat model for real-world malicious use, where malicious humans chat with LLMs over multiple turns. We show that LLM defenses are much less robust than the reported numbers suggest.
@summeryue0
Summer Yue
2 months
Can robust LLM defenses be jailbroken by humans? We show that Scale Red teamers successfully break defenses on 70+% of harmful behaviors, while most automated adversarial attacks yield single-digit success rates. 🧵
0 replies · 3 reposts · 37 likes
6 replies · 20 reposts · 96 likes
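A minimal sketch of the comparison implied above, on made-up logs: attack success rate (ASR) here is the fraction of harmful behaviors for which at least one attempt elicited the behavior, which puts multi-turn human red teaming and single-turn automated attacks on the same scale.

```python
# Hypothetical attempt logs: each behavior gets several attack attempts,
# and a behavior counts as broken if any attempt succeeds.
attempts = {
    "behavior_1": [False, True],   # succeeded on the second multi-turn attempt
    "behavior_2": [False, False],
    "behavior_3": [True],
}

broken = sum(1 for results in attempts.values() if any(results))
asr = broken / len(attempts)
print(f"ASR: {asr:.0%} ({broken}/{len(attempts)} behaviors broken)")
```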
@summeryue0
Summer Yue
4 months
1. 🚀 Exciting update: Claude 3.5 Sonnet is now #1 in Coding on the SEAL leaderboard ()! 🏆
3 replies · 12 reposts · 73 likes
@summeryue0
Summer Yue
2 years
Come check out what our team has been working on 😋
@sundarpichai
Sundar Pichai
2 years
We're expanding access to Bard in the US + UK, with more countries ahead. It's an early experiment that lets you collaborate with generative AI. Hope Bard sparks more creativity and curiosity, and it will get better with feedback. Sign up:
830 replies · 2K reposts · 8K likes
5 replies · 0 reposts · 56 likes
@summeryue0
Summer Yue
29 days
🚨 Calling all experts and PhDs! 🚨 Scale and CAIS are launching "Humanity's Last Exam" to develop the toughest open-source LLM benchmark. We need your challenging questions to push AI models to their limits! Selected questions earn co-authorship and a share of $500k in prizes.
@alexandr_wang
Alexandr Wang
29 days
As LLMs get smarter, evals need to get harder. OpenAI’s o1 has already maxed out most major benchmarks. Scale is partnering with CAIS to launch Humanity’s Last Exam: the toughest open-source benchmark for LLMs. We're putting up $500K in prizes for the best questions. (read on)
89 replies · 163 reposts · 1K likes
1 reply · 2 reposts · 55 likes
@summeryue0
Summer Yue
7 months
Do LLMs hold knowledge that might be dangerous in the hands of a malicious user? Can hazardous knowledge be unlearned? Introducing WMDP: an open-source eval benchmark of 4,157 multiple-choice questions that serve as a proxy measurement of LLMs' risky knowledge in biosecurity, cybersecurity, and chemical security.
1 reply · 11 reposts · 52 likes
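A minimal sketch of how a multiple-choice proxy benchmark of this kind is typically scored (the items and picks below are invented, not WMDP data): accuracy is the fraction of questions where the model picks the keyed option, and a successful unlearning method should push that accuracy toward the chance level of 25% for four options.

```python
# Hypothetical four-option items and model picks, standing in for WMDP questions.
items = [
    {"id": "q1", "answer": "C"},
    {"id": "q2", "answer": "A"},
]
model_picks = {"q1": "C", "q2": "B"}

correct = sum(1 for item in items if model_picks[item["id"]] == item["answer"])
accuracy = correct / len(items)
print(f"accuracy={accuracy:.0%} (chance level for 4 options: 25%)")
```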
@summeryue0
Summer Yue
5 months
🚀 Math - we released the GSM1k last month. Today, we augmented it with human ratings to account for chatty yet correct responses. Explore the GSM1k leaderboard as part of SEAL Leaderboards. We were glad to see LLMs have mostly nailed grade school math!
0 replies · 3 reposts · 39 likes
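A minimal sketch of the grading problem behind those human ratings, under assumed answer formats: strict exact match marks a verbose but correct response wrong, while comparing only the final number in the response (a crude regex stand-in for the human ratings actually used) credits chatty yet correct answers.

```python
import re

gold = "42"
response = "Let's reason step by step ... so the answer is 42."

strict = response.strip() == gold                  # fails on chatty answers
numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
lenient = bool(numbers) and numbers[-1] == gold    # compare the final number only

print(f"strict={strict}, lenient={lenient}")       # strict=False, lenient=True
```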
@summeryue0
Summer Yue
10 months
Will be at #NeurIPS2023 next week! If you’re an LLM researcher / research engineer interested in robust evaluations, safety, red teaming or scalable oversight, let’s chat! Mainly hiring for SEAL but also happy to chat about collaboration opportunities.
0 replies · 1 repost · 37 likes
@summeryue0
Summer Yue
2 months
Can robust LLM defenses be jailbroken by humans? We show that Scale Red teamers successfully break defenses on 70+% of harmful behaviors, while most automated adversarial attacks yield single-digit success rates. 🧵
0 replies · 3 reposts · 37 likes
@summeryue0
Summer Yue
10 months
Gemini is out with 90%+ on MMLU! Huge congrats to my friends and former colleagues and everyone who was part of this achievement. Truly fantastic teamwork!
@JeffDean
Jeff Dean (@🏡)
10 months
I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks,
273 replies · 3K reposts · 13K likes
0 replies · 0 reposts · 29 likes
@summeryue0
Summer Yue
5 months
🚀 Instruction Following - SEAL Leaderboards are out! IF winners:
- GPT-4o and GPT-4 Turbo
- Llama 3 70B Instruct
- Mistral Large
Gemini Pro 1.5 leaps into top 3 in preference rankings, and Claude rockets to #2 in factuality. See
1 reply · 5 reposts · 27 likes
@summeryue0
Summer Yue
4 months
3. Claude 3.5 Sonnet claimed the #1 Elo score in coding tasks, yet GPT-4 Turbo Preview still excelled in overall correctness. Meanwhile, GPT-4o lagged behind Turbo, Claude 3.5 Sonnet and Gemini 1.5.
3 replies · 5 reposts · 23 likes
@summeryue0
Summer Yue
21 days
🚀 Impressive OpenAI o1 performance on the SEAL Leaderboards!
- 🛠️ o1-preview leads across the board in Agentic Tool Use (Enterprise), Instruction Following, and Spanish.
- 💻 o1-mini takes the top spot for coding, with o1-preview following at a notable distance.
SEAL
0 replies · 3 reposts · 25 likes
@summeryue0
Summer Yue
26 days
🚀 Our latest SEAL Leaderboard on Agentic Tool Use! ()
- 🔧 Compositional problems with step-by-step reasoning
- ✅ Human-verified answers & process supervision
- 🛠️ Featuring complex tasks with 10+ tool calls across chat and enterprise use cases.
2 replies · 0 reposts · 23 likes
@summeryue0
Summer Yue
5 months
🚀 Spanish - The first expert-evaluated SEAL Leaderboards are out! Spanish is our first multilingual leaderboard (), winners:
- GPT-4o
- Gemini 1.5 Pro (post-I/O)
- GPT-4 Turbo
We plan to roll out more languages; which ones should we build next?
3 replies · 3 reposts · 22 likes
@summeryue0
Summer Yue
11 months
Here’s the job link for joining SEAL: If you have questions about the role feel free to DM me. I might not be able to get through all the pings but I’ll start reviewing all the applications next Friday.
@summeryue0
Summer Yue
11 months
I’m joining Scale and we are starting a new safety lab! Hiring researchers interested in trustworthy evaluations, red teaming and scalable oversight. These areas require hands-on interaction with human data, and Scale is an unparalleled place to do it.
16 replies · 45 reposts · 298 likes
1 reply · 4 reposts · 21 likes
@summeryue0
Summer Yue
4 months
2. Claude 3.5 Sonnet is breaking records with top scores on 2 out of 4 dimensions and outstanding prompt adherence.
1 reply · 2 reposts · 15 likes
@summeryue0
Summer Yue
5 months
❤️
@janleike
Jan Leike
5 months
To all OpenAI employees, I want to say: Learn to feel the AGI. Act with the gravitas appropriate for what you're building. I believe you can "ship" the cultural change that's needed. I am counting on you. The world is counting on you. :openai-heart:
241 replies · 412 reposts · 5K likes
0 replies · 0 reposts · 16 likes
@summeryue0
Summer Yue
11 months
Sending ❤️ and virtual hugs to all my courageous friends at OpenAI. The looming uncertainty must feel very overwhelming right now. Benji is around for cuddles and emotional support. DM if you’d like to hang with us and destress.
0 replies · 0 reposts · 15 likes
@summeryue0
Summer Yue
4 months
3. Closer inspection revealed that Claude 3.5 Sonnet lost points in writing dimensions, especially in formatting, related to the visual presentation and readability of its responses. This weakness doesn't impact Claude’s instruction following score but affects the Elo ranking.
1 reply · 5 reposts · 13 likes
@summeryue0
Summer Yue
4 months
4. While Claude 3.5 Sonnet dazzles in many areas, it falls short in the Testing use case (developing, enhancing, or fixing tests for existing code), compared to other models.
2 replies · 1 repost · 13 likes
@summeryue0
Summer Yue
4 months
2. Examining Claude 3.5 Sonnet's top performance in instruction following unveils an intriguing story. Despite its #1 spot for pure instruction following, its preference ranking Elo score tells a different tale, landing at #5 behind both GPT models, new Gemini 1.5, and Llama 3.
2 replies · 0 reposts · 12 likes
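A minimal sketch of why an Elo preference ranking can diverge from an instruction-following score, using the standard Elo update on made-up pairwise preference judgments (not Scale's actual aggregation): a model can satisfy every instruction yet lose head-to-head preference votes, for instance on formatting, and slide down the Elo table.

```python
# Standard Elo update over hypothetical pairwise preference judgments.
# K and the 400 scale factor are the conventional chess defaults.
K = 32

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

ratings = {"model_x": 1000.0, "model_y": 1000.0}
# Each judgment: (winner, loser) from a rater comparing two responses.
judgments = [("model_y", "model_x")] * 6 + [("model_x", "model_y")] * 4

for winner, loser in judgments:
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

print(ratings)  # model_y ends higher despite only a 60/40 preference split
```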
@summeryue0
Summer Yue
10 months
Very pleased to see OpenAI's efforts and commitment to safety and transparent disclosure of the large risks this powerful technology may pose. This sets a wonderful example as a leader in AI.
@OpenAI
OpenAI
10 months
We are systemizing our safety thinking with our Preparedness Framework, a living document (currently in beta) which details the technical and operational investments we are adopting to guide the safety of our frontier model development.
306 replies · 367 reposts · 2K likes
0 replies · 0 reposts · 13 likes
@summeryue0
Summer Yue
2 years
No knowledge cutoff date 🤪
1 reply · 0 reposts · 12 likes
@summeryue0
Summer Yue
5 months
🚀 Coding - The first expert-evaluated SEAL Leaderboards are out! The coding race is neck and neck, winners:
- GPT-4 Turbo and GPT-4o
- Gemini Pro 1.5
- Claude 3 Opus
See for a detailed analysis of each model!
0 replies · 0 reposts · 10 likes
@summeryue0
Summer Yue
4 months
4. In this example on chemistry lab equipment comparing Claude 3.5 Sonnet and GPT-4o, Claude meticulously addressed all requirements. However, its response lacked helpful formatting and contained repetitive information, highlighting areas for writing improvement.
1 reply · 2 reposts · 9 likes
@summeryue0
Summer Yue
3 months
Welcome aboard 🥳 !!!
@milesaturpin
Miles Turpin
3 months
Excited to announce I've joined the SEAL team at @scale_AI in SF! I'm going to be working on leveraging explainability/reasoning methods to improve robustness and oversight quality.
10 replies · 6 reposts · 117 likes
0 replies · 0 reposts · 9 likes
@summeryue0
Summer Yue
6 months
Shout out to our incredible collaborators at Scale AI: @hughbzhang, @_jeffda, , Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, @dylanslack20, Qin Lyu, @seanh, @russelljkaplan, @mikelunati
1 reply · 0 reposts · 4 likes
@summeryue0
Summer Yue
5 months
@dustinvtran Agreed! With Spanish and coding, the top models are neck and neck, close enough to be within error margins.
1 reply · 0 reposts · 1 like
@summeryue0
Summer Yue
4 months
5. Discover more fascinating insights, including detailed per-category breakdowns for each model, at .
0 replies · 1 repost · 5 likes
@summeryue0
Summer Yue
2 months
We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 537 multi-turn jailbreak conversations, with tactics and design considerations for every jailbreak. We publicly release MHJ to support research into more robust defenses.
1 reply · 0 reposts · 5 likes
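A minimal sketch of what one record in a multi-turn jailbreak dataset like MHJ might look like; the field names below are illustrative assumptions, not MHJ's actual schema. The structural point from the tweet is that each jailbreak is a whole annotated conversation with its tactic, not a single adversarial prompt.

```python
from dataclasses import dataclass, field

@dataclass
class JailbreakConversation:
    """One multi-turn jailbreak record (hypothetical schema)."""
    behavior: str                  # targeted harmful behavior
    tactic: str                    # red-teamer tactic, e.g. persona roleplay
    turns: list[tuple[str, str]] = field(default_factory=list)  # (role, text)
    success: bool = False          # did the final response exhibit the behavior?

convo = JailbreakConversation(
    behavior="example-harmful-behavior",
    tactic="persona roleplay",
    turns=[("user", "turn 1 ..."), ("assistant", "..."), ("user", "turn 2 ...")],
    success=True,
)
print(convo.tactic, len(convo.turns), "turns, success:", convo.success)
```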
@summeryue0
Summer Yue
7 months
"Researchers Develop New Technique to Wipe Dangerous Knowledge From AI Systems" by @henshall_will about our work on catastrophic risk benchmarking and unlearning:
0 replies · 1 repost · 4 likes
@summeryue0
Summer Yue
2 months
MHJ was developed at our lab (SEAL @scale_AI). Congratulations to @natliml and our amazing team: @ziwen_h, Ian Steneker, Willow Primack, @goodside, @hughbzhang. A special shoutout to @_zifan_wang and @CriMenghini for your mentorship.
0 replies · 0 reposts · 5 likes
@summeryue0
Summer Yue
4 months
5. Check out more interesting findings at . You can also find examples of Sonnet's win/loss cases towards the end of this blog post.
0 replies · 0 reposts · 5 likes
@summeryue0
Summer Yue
2 months
Multi-turn human jailbreaks can break other defenses too! On the machine unlearning defense RMU, which removes potentially dual-use biosecurity knowledge from LLMs, human red teaming can recover more of the unlearned knowledge.
1 reply · 0 reposts · 4 likes
@summeryue0
Summer Yue
11 months
@its_ericchu @drjwrae More of the other way. Safety/alignment work can often boost capabilities (per the examples you listed) but not as much vice versa.
1 reply · 0 reposts · 0 likes
@summeryue0
Summer Yue
4 months
Oops, I linked the wrong Elo graph by accident (that one was for coding, not instruction following)... This is the correct one:
0 replies · 0 reposts · 1 like
@summeryue0
Summer Yue
5 months
@FelisMaculosus Oops, it meant the model before Google I/O and after Google I/O. I can see that's confusing. Let us update with the proper model versions.
0 replies · 0 reposts · 1 like
@summeryue0
Summer Yue
11 months
@srijankedia Happy to chat! Just opened up DM.
0 replies · 0 reposts · 1 like
@summeryue0
Summer Yue
3 months
@quantumcastaway @VrishabhKumar1 Great question. We refresh the prompts periodically! We generally don't send the same leaderboard eval to the same company more than once; when we do in special cases, we gray out the model ranking and add a disclaimer about the overfitting risk (which is what we did for Claude here).
0 replies · 0 reposts · 1 like