![David Krueger Profile](https://pbs.twimg.com/profile_images/833081523213438976/DeIwiR-6.jpg)
David Krueger
@DavidSKrueger
Followers
16K
Following
3K
Media
170
Statuses
4K
AI professor. Deep Learning, AI alignment, ethics, policy, & safety. Formerly Cambridge, Mila, Oxford, DeepMind, ElementAI, UK AISI. AI is a really big deal.
Joined November 2011
4/9 NeurIPS submissions accepted! Congrats to @JamesAllingham, @kayembruno, @jmhernandez233, @LukeMarks00, @FazlBarez, @RyanPGreenblatt, @FabienDRoger, @dmkrash, Stephen Chung, @scottniekum, and all of my other co-authors on these works! Summaries / links in thread.
4
7
91
Greg was one of the founding team at OpenAI who seemed cynical and embarrassed about the org's mission (basically, the focus on AGI and x-risk) in the early days. I remember at ICLR Puerto Rico, in 2016, the summer after OpenAI was founded, a bunch of researchers sitting out on.
We’re really grateful to Jan for everything he's done for OpenAI, and we know he'll continue to contribute to the mission from outside. In light of the questions his departure has raised, we wanted to explain a bit about how we think about our overall strategy. First, we have.
39
30
493
I’m super excited to release our 100+ page collaborative agenda - led by @usmananwar391 - on “Foundational Challenges In Assuring Alignment and Safety of LLMs” alongside 35+ co-authors from NLP, ML, and AI Safety communities! Some highlights below.
7
156
461
If LLMs are as adversarially vulnerable as image models, then safety filters won't work for any mildly sophisticated actor (e.g. grad student). It's not obvious they are, since text is discrete, meaning: 1) the attack space is restricted, and 2) attacks are harder to find.
🚨We found adversarial suffixes that completely circumvent the alignment of open source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2…🧵 Website: Paper:
28
59
338
My research group @kasl_ai is looking for interns! Applications are due in 2 weeks: ***January 29***. The long-awaited form: Please share widely!!
7
74
281
I'm thrilled to be joining the Frontier AI Task Force (formerly "Foundation Models Task Force") as a research director along with @yaringal! I'm really excited to see what we can accomplish. Stay tuned!
1/ 11 weeks ago I agreed to Chair the UK's efforts to accelerate state capacity in AI Safety - measuring and mitigating the risks of frontier models so we can safely capture their opportunities. Here is our first progress report:
22
18
268
A clip from my first live TV appearance! With @AyoSokale on @GMB discussing extinction risk from AI.
'AI is not going to be like every other technology.' 'I think we're going to struggle to survive as a species.' Could AI lead to the extinction of humanity?
20
28
197
For a long time, people have applied a double standard for arguments for/against AI x-risk. Arguments for x-risk have been (incorrectly) dismissed as “anthropomorphising”. But Ng is “sure” that we’ll have warning shots based on an anecdote about his kids.
Last weekend, my two kids colluded in a hilariously bad attempt to mislead me to look in the wrong place during a game of hide-and-seek. I was reminded that most capabilities — in humans or in AI — develop slowly. Some people fear that AI someday will learn to deceive humans
6
9
186
Our paper “Defining and Characterizing Reward Hacking” was accepted at NeurIPS! With @JoarMVS, @__niki_howe__, & @dmkrash. We define what it means for a proxy reward function to be hackable/gameable. Bad news: hackability is ~unavoidable (Thm1)!
4
23
178
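For readers curious what "hackable" could mean formally, here is a hedged sketch of one natural formalization in terms of policy returns. The notation (the policy set Π and the return J_R) is mine for illustration and may not match the paper's exact definitions:

```latex
% Hedged sketch of one natural formalization of proxy-reward hackability.
% Notation is illustrative and may differ from the paper's:
%   \Pi       -- a set of policies under consideration
%   J_R(\pi)  -- expected return of policy \pi under reward function R
\[
\tilde{R}\ \text{is \emph{unhackable} w.r.t. the true reward}\ R\ \text{on}\ \Pi
\;\iff\;
\forall\, \pi,\pi' \in \Pi:\;
J_{\tilde{R}}(\pi) < J_{\tilde{R}}(\pi')
\;\Rightarrow\;
J_{R}(\pi) \le J_{R}(\pi').
\]
\[
\text{Hacking occurs when}\ \exists\, \pi,\pi' \in \Pi:\;
J_{\tilde{R}}(\pi) > J_{\tilde{R}}(\pi')
\ \text{while}\
J_{R}(\pi) < J_{R}(\pi').
\]
```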
It still amazes me that Yann and Andrew Ng can confidently predict that human-like/level AI is too far off to worry about, and act like people who disagree are the ones who are overconfident in their beliefs.
@ESYudkowsky @erikbryn My entire career has been focused on figuring out what's missing from AI systems to reach human-like intelligence. I tell you, we're not there yet. If you want to know what's missing, just listen to one of my talks of the last 7 or 8 years, preferably a recent one like this:
24
9
164
If you're in ML, you've probably heard about BigBiGAN because of the DM PR machine. But you may not have heard about this paper by @philip_bachman et al. that came 4 days later and crushes their results.
4
36
169
Want to generalize Out-of-Distribution? Try INCREASING risk (=loss) on training environments with higher performance! Joint work with: @ethancaballero, @jh_jacobsen, @yayitsamyzhang, @jjbinas, @LPRmi, @AaronCourville
1
33
153
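As a rough illustration of the idea in the tweet (equalizing, and thus deliberately raising, risk on the best-performing training environments), here is a minimal PyTorch-style sketch of a variance-penalty objective over per-environment risks. The function name, the beta value, and the env_batches variable are hypothetical, and this is not the authors' exact implementation:

```python
import torch


def variance_penalty_loss(env_risks: torch.Tensor, beta: float = 10.0) -> torch.Tensor:
    """Illustrative sketch of a risk-extrapolation-style objective (not the paper's code).

    env_risks: 1-D tensor of mean training losses ("risks"), one per training environment.
    The variance penalty pushes the optimizer to equalize risk across environments,
    which effectively increases risk on the environments where performance is highest.
    """
    mean_risk = env_risks.mean()
    variance = ((env_risks - mean_risk) ** 2).mean()
    return mean_risk + beta * variance


# Usage sketch: compute the loss separately on each training environment's batch,
# stack the per-environment risks, and backprop through the combined objective.
# env_risks = torch.stack([loss_fn(model(x_e), y_e) for (x_e, y_e) in env_batches])
# total_loss = variance_penalty_loss(env_risks, beta=10.0)
# total_loss.backward()
```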
Recommended! This is a well-written, concise (10 pages!) summary of the Alignment Problem. It helps explain: 1) the basic concerns of (many) alignment researchers, 2) why we don't think it's crazy to talk about AI killing everyone, and 3) what's distinctive about the field.
At OpenAI a lot of our work aims to align language models like ChatGPT with human preferences. But this could become much harder once models can act coherently over long timeframes and exploit human fallibility to get more reward. 📜Paper: 🧵Thread:
5
23
153
It’s not the case that “everyone thought” that “AI” is “immediately dangerous”, and @tszzl really ought to know better. I can see no way to interpret that statement which makes it true. Maybe it “vibes”, but it’s a lie.
obviously because ai is less immediately dangerous and more default aligned than everyone thought and iterative deployment works. total openai ideological victory though.
10
4
150
I am looking for PhD students!! I'm increasingly interested in work supporting AI governance, e.g. work that: highlights the need for policy (e.g. by breaking methods or models); could help monitor and enforce policies; or generally increases affordances for policymakers.
David Krueger @DavidSKrueger, who focuses on deep learning, AI alignment, & AI safety, with interests including: foundation models; jailbreaking, reward hacking, & alignment; generalization in deep learning; policy, governance, & societal impacts of AI.
4
32
139
I applaud Yoshua for speaking out. It is emotionally difficult to confront the possibility of human extinction from AI. It is especially difficult to do this if you've spent your life working on building it. And it is especially difficult if you've been dismissive in the past.
Yoshua Bengio admits to feeling "lost" over his life's work: "he would have prioritised safety over usefulness had he realised the pace at which it would evolve".
7
8
122
Honestly, I think the idea that animals only work via "association" and lack "inferential capabilities" is implausible metaphysical speciesist garbage functioning primarily to perpetuate atrocities.
So uhh, this whole reasoning vs. stochastic parrots thing has been going on for quite a while… in *animal cognition*. "among the most long-standing and most intensely debated controversies"
3
8
117
@PhilBeaudoin @tyrell_turing Just the title is maddening. I am seriously disappointed in the authors. I encourage everyone to spend more time engaging with existing literature and experts before calling the subject of my entire field an "illusion". This is so disrespectful and ignorant to the hundreds.
3
5
114
A lot of people in the AI x-safety community seem to think @AnthropicAI has some sort of moral high ground over @OpenAI, but this sort of thing makes it hard for me to see why that is supposed to be the case. #justdontbuildAGI.
Anthropic plans to build a model, tentatively called Claude-Next, 10X more capable than today’s most powerful AI, which will require spending $1 billion over the next 18 months. “Companies that train the best models will be too far ahead for anyone to catch up”
10
7
112
I've worried AI could lead to human extinction ever since I heard about Deep Learning from Hinton's Coursera course, >10 years ago. So it's great to see so many AI researchers advocating for AI x-safety as a global priority. Let's stop arguing over it and figure out what to do!
We just put out a statement: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.” Signatories include Hinton, Bengio, Altman, Hassabis, Song, etc. 🧵 (1/6)
9
22
109
I've been telling people we should expect breakthroughs like this to make regulating AI much harder. For now there are a few big labs to worry about, but over time, as compute gets cheaper and/or restrictions come into place, it could get a lot harder. What's the end game?
DiLoCo: Distributed Low-Communication Training of Language Models. paper page: Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of
19
18
108
I've said for a while that everyone should be able to agree that AI x-risk >= 1%. Now I guess even "optimists" agree on this, so… victory??
Introducing AI Optimism: a philosophy of hope, freedom, and fairness for all. We strive for a future where everyone is empowered by AIs under their own control. In our first post, we argue AI is easy to control, and will get more controllable over time.
32
8
104
Yeah I am super frustrated by how often I hear people say: "China will never cooperate", "we can't trust China", "China doesn't care about AI safety", "we can't regulate AI because then China wins", etc. I always ask "why do you say that?" and am never satisfied with the answer.
People sometimes ask me why I am more optimistic than some others about the US and China cooperating on AI safety. Besides the incremental progress happening, I think one point to bear in mind is “we basically haven’t even tried yet - let’s try first then discuss.”.
7
9
105
Really excited to see my commentary published in @newscientist! There's waaay too much to say on the topic, but to spell it out, the solution is REGULATION. We urgently need to stop the reckless and irresponsible deployment of powerful AI systems.
Existential risk from AI is admittedly more speculative than pressing concerns such as its bias, but the basic solution is the same. A robust public discussion is long overdue, says @DavidSKrueger
9
18
95
"even 10-20 years from now is very soon" for building highly advanced AI systems. 100%. I have been saying this for 10 years. This is maybe the most commonly neglected argument for taking AI x-risk seriously. I am still super confused why this is not 100% obvious to everyone.
I tried to describe this difficulty here. The crux of it is that waiting for scientific clarity would be lovely, but may be a luxury we don't have. If highly advanced AI systems are built soon—and even 10-20 years from now is very soon!—then we need to start preparing now.
3
15
90
If you're still interested in doing a postdoc with me, but haven't reached out yet, please do so! I've gotten an extension on the admin deadlines for the FLI fellowship, so if you can apply by Jan 2, then you can still be considered for this round of my hiring.
I'm looking for: PhD students, postdocs, and potentially interns and research assistants (like a post-doc for those who haven't got a PhD). Potential post-docs should reach out ASAP and strongly consider applying for funding from FLI; this would need to happen quite soon.
8
17
85
Congrats to the whole team! IIRC, our findings in this work blew my mind more than any other result. This may be the first evidence for the existence of a *mechanism* by which sufficiently advanced AI systems would tend to become agentic (and thus have instrumental goals).
1/ Excited to finally tweet about our paper “Implicit meta-learning may lead LLMs to trust more reliable sources”, to appear at ICML 2024. Our results suggest that during training, LLMs better internalize text that appears useful for predicting other text (e.g. seems reliable).
2
9
85
Seriously. Also, I recommend practicing talking about it in a way that sounds normal, since it can feel hard to do when you expect people will think it's weird.
More people need to speak up about the implications of developing smarter-than-human AI systems. This is one of the biggest bottlenecks to sensible AI policy.
5
2
85
RLHF is super popular, more so than I expected it would be when we wrote our agenda in 2018. Still, a lot of the issues we mentioned there persist. Meanwhile, many other issues have also been identified. Much work left to do, and RLHF isn't enough!
New paper: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. We survey over 250 papers to review challenges with RLHF with a focus on large language models. Highlights in thread 🧵
1
12
80
Well, that's fairly alarming. I suspect the model steganographically plans to backstab, given that backstabbing is essential to good Diplomacy performance IIUC. This may be the demonstration of emergent deception we've been waiting for! I'll need to look closer.
@petermilley It's designed to never intentionally backstab - all its messages correspond to actions it currently plans to take. However, sometimes it changes its mind.
7
6
82