A big part of my job these days is to think about what technical work Anthropic needs to do to make things go well with the development of very powerful AI.
I digested my thinking on this, plus some of the Anthropic zeitgeist around it, into this piece:
PhD admissions season is ramping up, so I feel obliged to join the chorus of voices reminding everyone that doing a PhD is, in most cases, a terrible idea.
I’m sharing a draft of a slightly-opinionated survey paper I’ve been working on for the last couple of months. It's meant for a broad audience—not just LLM researchers. (🧵)
AI/ML faculty: A student of mine did an internship at Google, and got the resulting paper accepted to a top conference. The host team isn't willing to pay for conference registration, so I'll have to pay or else the paper won't be published, going against the norm here. Advice?
You'll sometimes see the meme that NLP is solved. That's hype, and it's doing harm in the real world. But it's worth thinking about what it'd look like to actually achieve what we're aiming for. (📄 paper, thread 🧵)
I'll likely admit a couple new PhD students this year. If you're interested in NLP and you have experience either in crowdsourcing/human feedback for ML or in AI truthfulness/alignment/safety, consider
@NYUDataScience
!
🚨 We’re releasing QuALITY, a benchmark for reading comprehension with long texts! 🚨
Yes, the acronym is a little tone-deaf, but this is almost certainly the best benchmark or dataset release from my group so far. (🧵)
I just firmed up plans to spend my upcoming sabbatical year at
@AnthropicAI
in SF. Looking forward to burritos, figs, impromptu hikes, and ambitious projects with a some of the best large-scale-LM researchers out there!
But if you look around at the numbers on depression and anxiety, the average case is *really really* bad. Here's an especially cynical/flippant summary if you haven't seen this kind of thing:
This is the clearest and most insightful contribution to the Large Language Model Discourse in NLP that I've seen lately. You should read it!
A few reactions downthread...
Speculative (!!!) paper arguing that big LMs can model agency & communicative intent: (somehow in EMNLP findings). Briefly:
1. LMs do not in general have beliefs or goals. An LM trained on the Internet models a distribution over next tokens *marginalized*
🚨 I'm hiring! 🚨
I'm helping the team that I'm on at
@AnthropicAI
hire more researchers! If you’re interested in working with me to make highly-capable LLMs more reliable and truthful, and you have relevant research experience in NLP/HCI, apply!
✨🪩 Woo! 🪩✨
Jan's led some seminally important work on technical AI safety and I'm thrilled to be working with him! We'll be leading twin teams aimed at different parts of the problem of aligning AI systems at human level and beyond.
I'm excited to join
@AnthropicAI
to continue the superalignment mission!
My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research.
If you're interested in joining, my dms are open.
Wow, Sasha actually did it, and there's going to be a new independent LLM-centric NLP conference!
I trust this team to pull off something really ambitious with COLM, and I'm very curious to see what comes of it.
Introducing COLM () the Conference on Language Modeling. A new research venue dedicated to the theory, practice, and applications of language models.
Submissions: March 15 (it's pronounced "collum" 🕊️)
NYU student followers: Any interest in learning to do research in NLP or computational linguistics? I'm teaching a course next term that's meant to guide you through the major steps of a first publication-quality research project. Consider joining!
If you'll be at
#NeurIPS2023
and you're interested in chatting with someone at Anthropic about research or roles, there'll be a few people of us around.
Expression of interest form here:
It can be really amazing when it works out well! I enjoyed doing one! But that was in a growing field with lots of jobs NLP, in a city I already had friends, in an unusually supportive and respected department, as a native speaker of English, etc..
I gave a talk! You can watch it!
Covering: Scalable oversight, AI-AI debate, hard QA datasets, and getting truthful answers out of AI systems in domains we don't know much about.
I'm disappointed to report that I've already found an accepted
#ACL2022
paper that treats BERT (2018) as a state-of-the-art text encoder.
We've made a lot of progress since 2018! Even if you account for publication delays, RoBERTa is three years old! GPT-3 and DeBERTa are two!
I’m sharing a draft of a slightly-opinionated survey paper I’ve been working on for the last couple of months. It's meant for a broad audience—not just LLM researchers. (🧵)
💯% recommend complaining about Google financial bureaucracy on Twitter. Disappointed all my other non-Google financial bureaucracy problems won't be solved through a flurry of DMs with famous engineers and scientists.
Large language modeling work over the last few years has been exciting but increasingly concerning: We’re building powerful, general tools almost by accident—often without much of an understanding of their capabilities until after we’ve deployed them.
@janleike
Tons of NLP people worked on this question in response to similar results with XLM-R in 2019 and... as far as I can tell we're all still pretty confused about how this works.
We've trained GPT-3 to be more aligned with what humans want: The new InstructGPT models are better at following human intent than a 100x larger model, while also improving safety and truthfulness.
Proud to be on Divyansh's thesis committee. Hoping it all goes well, as I'd be a little uneasy giving negative feedback to someone whose friend has nukes.
There are lots of really valuable things to do that involve serious intellectual engagement, don't require a PhD, and are *much* more fun and *much* better paid!
Do models need to reason in words to benefit from chain-of-thought tokens?
In our experiments, the answer is no! Models can perform on par with CoT using repeated '...' filler tokens.
This raises alignment concerns: Using filler, LMs can do hidden reasoning not visible in CoT🧵
🚨New dataset for LLM/scalable oversight evaluations! 🚨
This has been one of the big central efforts of my NYU lab over the last year, and I’m really exited to start using it.
🧵Announcing GPQA, a graduate-level “Google-proof” Q&A benchmark designed for scalable oversight! w/
@_julianmichael_
,
@sleepinyourhat
GPQA is a dataset of *really hard* questions that PhDs with full access to Google can’t answer.
Paper:
We're slowly learning more about Google's not-exactly-public efforts in the huge LM space. The highlight here for me was the subfigure on the right: More evidence that we can see discontinuous, qualitatively-important improvements in behavior as we scale.
Really interesting result:
Once you have achieved a baseline level of instruction-following ability through RLHF, you can train a model to do new things by (roughly speaking) prompting the model to provide the feedback that you'd otherwise get from humans.
In our paper, we describe how we’ve used Constitutional AI to train better and more harmless AI assistants without any human feedback labels for harms. This approach leads to models that are safer and also more helpful.
RLHF is surprisingly easy and effective, but not robust enough for what it's being used for. (I like
@andy_l_jones
's framing in the screenshot below.) This new big-group survey paper does a good job of explaining why.
New paper: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
We survey over 250 papers to review challenges with RLHF with a focus on large language models. Highlights in thread 🧵
- I'm not an author on the paper.
- The paper went through Google's internal publication review process.
- The internship was successful by all accounts, and the student was invited back.
If progress extends all the way to near-human performance on language and reasoning tasks, the consequences are likely to be transformative. Quite possibly the most impactful technology humanity will ever build.
It's easy to dismiss this kind of big-picture blog post from a company as self-serving fluff, but there's more than that going on here. This is worth a look if you're interested in LLMs and AI progress.
I shared the following note with my OpenAI colleagues today:
I've made the difficult decision to leave OpenAI. This choice stems from my desire to deepen my focus on AI alignment, and to start a new chapter of my career where I can return to hands-on technical work. I've decided
Initial
#EMNLP2021
in-person conference reactions:
– It's really, really nice to have informal small-group research conversations that aren't Twitter. It's really helping it sink in how much this platform weirds the discourse.
Periodic note: I don't have a specific postdoc job open now, but it's often possible to create one relatively quickly if there's a great opportunity. If you have a specific research goal that's *very closely* aligned with my group and you want to do a postdoc here, reach out!
You can (and should) do RL from human feedback during pretraining itself! In our new paper, we show how training w/ human preferences early on greatly reduces undesirable LM behaviors, including under adversarial attack, w/o hurting downstream performance.
Introducing Claude 3.5 Sonnet—our most intelligent model yet.
This is the first release in our 3.5 model family.
Sonnet now outperforms competitor models on key evaluations, at twice the speed of Claude 3 Opus and one-fifth the cost.
Try it for free:
🚨New results on pretraining LMs w/ preference models!🚨
I’ll admit I was skeptical we’d find much when project was spinning up, but the results singnificantly changed how I think about foundation models.
Read Tomek’s whole thread:
You can (and should) do RL from human feedback during pretraining itself! In our new paper, we show how training w/ human preferences early on greatly reduces undesirable LM behaviors, including under adversarial attack, w/o hurting downstream performance.
There’s no guarantee that this should be transformative in a good way. If this happens by accident, or without clear mechanisms in place to oversee the systems we’re building and govern their operators, the consequences could be disastrous.
🚨 Earnest Preachy Thread Update! 🚨
I've committed to giving at least 10% of my income *for the rest of my working life* to charities that I think are plausibly among the most effective in the world at doing good. I hope you'll doing the same.
Context:
I'm proud to see this come out.
These governance mechanisms here commit us to pause scaling whenever we can't show that we're on track to manage the worst-case risks presented by new models. And it does that _without_ assuming that we fully understand those risks now.
Today, we’re publishing our Responsible Scaling Policy (RSP) – a series of technical and organizational protocols to help us manage the risks of developing increasingly capable AI systems.
(I think my group/environment at NYU is much better than average here. Being in a collaborative and well-funded field really helps! But these issues don't totally go away. Proceed with caution!)
I'm really proud to see Jason defend today! He's been a great collaborator, and he's done a *ton* of pretty centrally important work in NLP over the last few years—way more than could fit in a dissertation:
AI/ML faculty: A student of mine did an internship at Google, and got the resulting paper accepted to a top conference. The host team isn't willing to pay for conference registration, so I'll have to pay or else the paper won't be published, going against the norm here. Advice?
NLP as a field is only barely coming to grips with the present-day impacts that our tools are having, and we’ve hardly discussed longer-term implications of these trends at all.
This paper will appear at
#ACL2022
with a new title and some updates (see link)! Here's a thread with a few especially fun/controversial/weird quotes. (🧵)
You'll sometimes see the meme that NLP is solved. That's hype, and it's doing harm in the real world. But it's worth thinking about what it'd look like to actually achieve what we're aiming for. (📄 paper, thread 🧵)
Introducing “Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo”
Many capability and safety techniques of LLMs—such as RLHF, automated red-teaming, prompt engineering, and infilling—can be viewed from a probabilistic inference perspective, specifically
Introducing Claude 2! Our latest model has improved performance in coding, math and reasoning. It can produce longer responses, and is available in a new public-facing beta website at in the US and UK.
We're announcing, together with
@ericschmidt
: Superalignment Fast Grants.
$10M in grants for technical research on aligning superhuman AI systems, including weak-to-strong generalization, interpretability, scalable oversight, and more.
Apply by Feb 18!
It’s hard work to make evaluations for language models (LMs). We’ve developed an automated way to generate evaluations with LMs, significantly reducing the effort involved. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors.
I think there’s a lot we can do to mitigate the risk of these bad outcomes, but not many people are trying. A decent portion of it will be recognizable as NLP research. I’d like to see NLP as a field take these concerns more seriously.
I made a bet internally that we wouldn't have a million people engage with tweets about Claude being a bridge, but I'm pretty happy to be on track to lose that bet.
This week, we showed how altering internal "features" in our AI, Claude, could change its behavior.
We found a feature that can make Claude focus intensely on the Golden Gate Bridge.
Now, for a limited time, you can chat with Golden Gate Claude:
Today's big LMs are qualitatively quite different from the kinds of <10B-param models that most NLP researchers built their intuitions around.
And, of course, it seems reasonable to expect the next generation of big models to be qualitatively different from today's, too.
New survey paper! We discuss “emergent abilities” of large language models.
Emergent abilities are only present in sufficiently large models, and thus they would not have been predicted simply by extrapolating the scaling curve from smaller models.
🧵⬇️
Interesting and concerning new results from
@cem__anil
et al.: Many-shot prompting for harmful behavior gets predictably more effective at overcoming safety training with more examples, following a power law.
Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker inputs a prompt beginning with hundreds of faux dialogues where a supposed AI complies with harmful requests. This overrides the LLM's safety training:
A new preprint 📢
K-nearest neighbors language models (kNN-LMs;
@ukhndlwl
et al., ICLR'2020) improve the perplexity of standard LMs, even when they retrieve examples from the *same training set that the base LM was trained on*.
but why?
(1/3)
I'm honored to have been part of this and thrilled with how it turned out.
I have minor quibbles with the statement, but the core ideas in it are quite important, and it's a huge deal to get buy-in on them from so many people in leadership positions in China and the West.
Leading computer scientists from around the world, including
@Yoshua_Bengio
, Andrew Yao,
@yaqinzhang
and Stuart Russell met last week and released their most urgent and ambitious call to action on AI Safety from this group yet.🧵
This new paper has some initial thoughts and results from a project I've been helping set up at Anthropic. Take a look!
Plus, if you're interested in working on projects like this involving AI alignment, language models, and HCI, we're hiring!
In "Measuring Progress on Scalable Oversight for Large Language Models” we show how humans could use AI systems to better oversee other AI systems, and demonstrate some proof-of-concept results where a language model improves human performance at a task.
New Anthropic Paper: Sleeper Agents.
We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
🚨 NEW PAPER ALERT 🤓 CURVES📈📈📈 ALERT 🚨
Transformer LMs pretrained on billions of words have dominated NLP, but which skills/features really depend on this huge scale? How much can models learn from more modest amounts of data? [1/10]
🚨📄 Following up on "LMs Don't Always Say What They Think",
@milesaturpin
et al. now have an intervention that dramatically reduces the problem! 📄🚨
It's not a perfect solution, but it's a simple method with few assumptions and it generalizes *much* better than I'd expected.
🚀New paper!🚀
Chain-of-thought (CoT) prompting can give misleading explanations of an LLM's reasoning, due to the influence of unverbalized biases. We introduce a simple unsupervised consistency training method that dramatically reduces this, even on held-out forms of bias.
🧵
Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source? In our new paper, we use influence functions to find training examples that contribute to a given model output.