Director of Research at Scale AI. Prev: RLHF lead on Bard, researcher at Google DeepMind / Brain (LaMDA, RL/TF-Agents, superhuman chip design). Opinions my own.
I’m joining Scale and we are starting a new safety lab! Hiring researchers interested in trustworthy evaluations, red teaming and scalable oversight. These areas require hands-on interaction with human data, and Scale is an unparalleled place to do it.
🚀 Introducing the SEAL Leaderboards! We rank LLMs using private datasets that can’t be gamed. Vetted experts handle the ratings, and we share our methods in detail openly!
Check out our leaderboards at !
Which evals should we build next?
🚀 We added Llama 3.1 405B to the SEAL Leaderboards and it does not disappoint! Here's how it stacks up:
- 🥇 #1 in Instruction Following
- 🥈 #2 in GSM1k
- 💻 #4 in Coding
SEAL evals are private, expert evals that refresh periodically:
How much do LLMs overfit public benchmarks? Our team at @scale_ai SEAL lab studied this by creating a GSM8k-equivalent eval from scratch. The resulting performance gap reveals data contamination in some model families, while GPT, Claude, and Gemini show no signs of overfitting.
Announcing our latest SEAL Leaderboard on Adversarial Robustness!
🛡️ Red team-generated prompts
🎯 Focused on universal harm scenarios
🔍 Transparent evaluation methods
SEAL evals are private, expert evals that refresh periodically:
LLMs are often evaluated against single-turn automated attacks. This threat model is insufficient for real-world malicious use, where humans chat with LLMs over multiple turns.
We show that LLM defenses are much less robust than the reported numbers suggest.
Can robust LLM defenses be jailbroken by humans?
We show that Scale Red teamers successfully break defenses on 70+% of harmful behaviors, while most automated adversarial attacks yield single-digit success rates. 🧵
We're expanding access to Bard in the US + UK, with more countries ahead. It's an early experiment that lets you collaborate with generative AI. We hope Bard sparks more creativity and curiosity, and it will get better with your feedback. Sign up:
🚨 Calling all experts and PhDs! 🚨 Scale and CAIS are launching "Humanity's Last Exam" to develop the toughest open-source LLM benchmark.
We need your challenging questions to push AI models to their limits! Selected questions earn co-authorship and a share of $500k in prizes.
As LLMs get smarter, evals need to get harder.
OpenAI’s o1 has already maxed out most major benchmarks.
Scale is partnering with CAIS to launch Humanity’s Last Exam: the toughest open-source benchmark for LLMs.
We're putting up $500K in prizes for the best questions.
(read on)
Do LLMs hold knowledge that might be dangerous in the hands of a malicious user? Can hazardous knowledge be unlearned?
Introducing WMDP: an open-source eval benchmark of 4,157 multiple-choice questions that serve as a proxy measurement of an LLM's hazardous knowledge in biosecurity, cybersecurity, and chemical security.
🚀 Math - we released the GSM1k last month. Today, we augmented it with human ratings to account for chatty yet correct responses.
Explore the GSM1k leaderboard as part of SEAL Leaderboards. We were glad to see LLMs have mostly nailed grade school math!
Will be at #NeurIPS2023 next week! If you’re an LLM researcher / research engineer interested in robust evaluations, safety, red teaming or scalable oversight, let’s chat! Mainly hiring for SEAL but also happy to chat about collaboration opportunities.
Gemini is out with 90%+ MMLU! Huge congrats to my friends and former colleagues and everyone who was part of this achievement. Truly fantastic teamwork!
I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks.
🚀 Instruction Following - SEAL Leaderboards are out! IF winners:
- GPT-4o and GPT-4 Turbo
- Llama 3 70B Instruct
- Mistral Large
Gemini Pro 1.5 leaps into top 3 in preference rankings, and Claude rockets to #2 in factuality.
See
3. Claude 3.5 Sonnet claimed the #1 Elo score in coding tasks, yet GPT-4 Turbo Preview still excelled in overall correctness. Meanwhile, GPT-4o lagged behind Turbo, Claude 3.5 Sonnet, and Gemini 1.5.
🚀 Impressive OpenAI o1 performance on the SEAL Leaderboards!
- 🛠️ o1-preview leads across the board in Agentic Tool Use (Enterprise), Instruction Following, and Spanish.
- 💻 o1-mini takes the top spot for coding, with o1-preview following at a notable distance.
SEAL
🚀Our latest SEAL Leaderboard on Agentic Tool Use! ()
- 🔧 Compositional problems with step-by-step reasoning
- ✅ Human-verified answers & process supervision
- 🛠️ Featuring complex tasks with 10+ tool calls across chat and enterprise use cases.
🚀 Spanish - The first expert-evaluated SEAL Leaderboards are out! Spanish is our first multilingual leaderboard (), winners:
- GPT-4o
- Gemini 1.5 Pro (post-I/O)
- GPT-4 Turbo
We plan to roll out more languages, which ones should we build next?
Here’s the job link for joining SEAL:
If you have questions about the role feel free to DM me. I might not be able to get through all the pings but I’ll start reviewing all the applications next Friday.
To all OpenAI employees, I want to say:
Learn to feel the AGI.
Act with the gravitas appropriate for what you're building.
I believe you can "ship" the cultural change that's needed.
I am counting on you.
The world is counting on you.
:openai-heart:
Sending ❤️ and virtual hugs to all my courageous friends at OpenAI. The looming uncertainty must feel very overwhelming right now.
Benji is around for cuddles and emotional support. DM if you’d like to hang with us and destress.
3. Closer inspection revealed that Claude 3.5 Sonnet lost points on writing dimensions, especially formatting, which covers the visual presentation and readability of its responses. This weakness doesn't affect Claude’s instruction following score but does affect its Elo ranking.
4. While Claude 3.5 Sonnet dazzles in many areas, it falls short in the Testing use case (developing, enhancing, or fixing tests for existing code), compared to other models.
2. Examining Claude 3.5 Sonnet's top performance in instruction following unveils an intriguing story. Despite its #1 spot for pure instruction following, its preference-ranking Elo score tells a different tale, landing at #5 behind both GPT models, the new Gemini 1.5, and Llama 3.
Very pleased to see OpenAI's efforts and commitment toward safety and transparent disclosure of the large risks this powerful technology may pose.
This sets a wonderful example for a leader in AI.
We are systemizing our safety thinking with our Preparedness Framework, a living document (currently in beta) which details the technical and operational investments we are adopting to guide the safety of our frontier model development.
🚀 Coding - The first expert-evaluated SEAL Leaderboards are out! The coding race is neck and neck, winners:
- GPT-4 Turbo and GPT-4o
- Gemini Pro 1.5
- Claude 3 Opus
See for detailed analysis of each model!
4. In this example on chemistry lab equipment comparing Claude 3.5 Sonnet and GPT-4o, Claude meticulously addressed all requirements. However, its response lacked helpful formatting and contained repetitive information, highlighting areas for writing improvement.
Excited to announce I've joined the SEAL team at @scale_AI in SF! I'm going to be working on leveraging explainability/reasoning methods to improve robustness and oversight quality.
We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 537 multi-turn jailbreak conversations, with tactics and design considerations for every jailbreak. We publicly release MHJ to support research into more robust defenses.
"Researchers Develop New Technique to Wipe Dangerous Knowledge From AI Systems" by @henshall_will about our work on catastrophic risk benchmarking and unlearning:
Multi-turn human jailbreaks can break other defenses too! Against machine unlearning defenses (RMU), which remove potentially dual-use biosecurity knowledge from LLMs, human red teaming can recover more of the unlearned knowledge than automated attacks.
@its_ericchu
@drjwrae
More of the other way around. Safety/alignment work can often boost capabilities (per the examples you listed), but not as much vice versa.
@FelisMaculosus
Oops, it meant the model before Google I/O vs. after Google I/O. I can see that’s confusing. We’ll update with the proper model versions.
@quantumcastaway
@VrishabhKumar1
Great question. We refresh the prompts periodically! We generally don’t send the same leaderboard eval to the same company more than once; in the special cases where we do, we gray out the model ranking and add a disclaimer about the overfitting risk (as we did for Claude here).