If our international students don't get a salary, I won't either. I pledge to donate my fall salary unless we fix U.S. immigration policy to allow international students (including incoming students) to be paid their stipend.
My student Kayo Yin needs your help. Her visa has been unnecessarily delayed, which would prevent her from coming to UC Berkeley to start her studies. Despite Kayo bringing all required documents, the @StateDept refused to process the visa, and it could take months to re-process.
Many people, including me, have been surprised by recent developments in machine learning. To be less surprised in the future, we should make and discuss specific projections about future models. In this spirit, I predict properties of models in 2030:
In 2021, I created a forecasting prize to predict ML performance on benchmarks in June 2022 (and 2023, 2024, and 2025). June has ended, so we can see how the forecasters did:
This NYT article on Azalia and Anna's excellent chip design work is gross, to the point of journalistic malpractice. It platforms a bully while drawing an absurd parallel to @timnitGebru's firing. @CadeMetz should be ashamed. (Not linking so it doesn't get more clicks.)
Can we build an LLM system to forecast geo-political events at the level of human forecasters?
Introducing our work Approaching Human-Level Forecasting with Language Models!
Arxiv:
Joint work with @dannyhalawi15, @FredZhang0, and @jcyhc_ai
A core intuition I have about deep neural networks is that they are complex adaptive systems. This creates a number of control difficulties that are different from traditional engineering challenges:
I'm back to blogging, with some new thoughts on emergence:
I answer the question: what are some specific emergent "failure modes" for ML systems that we should be on the lookout for?
To give an idea of just how much SOTA exceeded forecasters' expectations, here are the prediction intervals for the MATH and Massive Multitask benchmarks. Both outcomes exceeded the 95th percentile prediction.
Awesome to see @DeepMind's recent language modeling paper include our forecasts as a comparison point! Hopefully more papers track progress relative to forecasts so that we can better understand the pace of progress in deep learning.
On my blog, I've recently been discussing emergent behavior and in particular the idea that "More is Different". As part of this, I've compiled a list of examples across a variety of domains:
Since GPT-4 was released last week, I decided to switch things up from AI-related blogging and instead talk about research group culture. In my group, I've come up with a set of principles to help foster healthy and productive group meetings: .
Kayo has already done stellar machine learning work for her Master's degree at CMU, one of the top US universities. ML expertise is sorely needed in the US. Is the U.S. really so eager to shoot itself in the foot?
Finally, while forecasters underpredicted progress on capabilities, they *overpredicted* progress on robustness. So while capabilities are advancing quickly, safety properties may be behind schedule. A troubling thought.
Kayo's semester starts in one week. She's a French citizen who has spent significant time in the U.S. In addition to all required documents, we've sent extensive additional docs to "prove" that Kayo is really coming to Berkeley. There's no reason this can't be approved tomorrow.
I worry about tail risks from future AI systems, but I haven't read descriptions that feel plausible to me, so I tried writing some of my own: . This led to four vignettes covering cyberattacks, economic competition, and bioterrorism.
I've known Anna for a long time now, and she's one of the most impressive junior ML researchers around. She also holds herself to high standards of integrity. I've been impressed with how well she's handled this situation. Let's give her and Azalia our support.
A blog post series on a key way I've changed my mind about ML: the (relative) value of empirical data vs. thought experiments for predicting future ML developments.
ML systems are different from traditional software, in that most of their properties are acquired from data, without explicit human intent. This is unintuitive and creates new types of risk. In this blog post I talk about one such risk: unwanted drives
I quite enjoyed this workshop, and was pretty happy with the talk I gave (new and made ~from scratch!).
My topic was using LLMs to help us understand LLMs; the talk covers great work by @TongPetersb, @ErikJones313, @ZhongRuiqi, and others. You can watch it here:
I suspect most of us in the ML field still haven't internalized how quickly ML capabilities are advancing. We should be preregistering forecasts so that we can learn and correct! I intend to do so for June 2023.
Findings:
* Forecasters significantly underpredicted progress
* But were more accurate than me (I underpredicted progress even more!)
* Also were (probably) more accurate than median ML researcher
Satrajit Chatterjee (the subject of the article) is portrayed as being fired after raising scientific concerns with Azalia Mirhoseini and Anna Goldie's Nature paper on chip design. In reality, Chatterjee waged a years-long campaign to harass & undermine their work.
Over the past two years, I and many other forecasters registered predictions about the state-of-the-art accuracy on ML benchmarks in 2022-2025. In this blog post, I evaluate the predictions for 2023:
Gov. Cuomo recently said that he's using R = 1.1 as a trigger point for “circuit breaking” New York’s reopening. This is a weird policy that doesn't make sense, but not because we should use R = 1 instead. 1/N
Google's statement says Chatterjee was "terminated with cause". This is an unusually strong statement and shows Google had serious problems with him. The NYT should know this, so it's unclear why they paint this as "he said, she said" (and give most of the space to Chatterjee).
I argue that while ML models have undergone many qualitative shifts (and will continue to do so), many empirical findings hold up well even across these shifts:
Part of the "More is Different" series on my blog!
New paper on household transmission of SARS-CoV-2: , with @mihaela_curmei, @andrew_ilyas, and @OwainEvans_UK. Very interested in feedback! We show that under lockdowns, 30-55% of transmissions occur in houses. 1/4.
Interestingly, forecasters' biggest miss was on the MATH dataset, where @alewkowycz, @ethansdyer, and others set a record of 50.3% on the very last day of June! One day made a huge difference.
My tutorial slides on Aligning ML Systems are now online, in HTML format, with clickable references!
[NB some minor formatting errors were introduced when converting to HTML]
Next up at @satml_conf is @JacobSteinhardt, who is giving a terrific tutorial on the topic of "Aligning ML Systems with Human Intent"
(like all SaTML content, it is being recorded and will be released in a couple of days)
It's particularly gross that the article repeatedly draws parallels with Timnit Gebru's firing, which is completely different in terms of the facts on the ground. Timnit agrees: . Seems clear that NYT did this for clicks.
I haven't read this @nytimes article by @daiwaka & @CadeMetz. But I had heard about the person from many ppl. To the extent the story is connected to mine, it's ONLY the pattern of action on toxic men taken too late while ppl like me are retaliated against
Nora is a super creative thinker and very capable engineer. I'd highly recommend working for her if you want to do cool work on understanding ML models at an open-source org!
My Interpretability research team at @AiEleuther is hiring! If you're interested, please read our job posting and submit:
1. Your CV
2. Three interp papers you'd like to build on
3. Links to cool open source repos you've built
to contact@eleuther.ai
I respect Jacob a lot but I find it really difficult to engage with predictions of LLM capabilities that presume some version of the scaling hypothesis will continue to hold - it just seems highly implausible given everything we already know about the limits of transformers!
Is remote work slower? I estimate 0-50% slower for many tasks, but for some tasks (esp. branching into new areas/skillsets) it can easily be 5x slower. Easy to underestimate for managers, but huge effect:
In particular, I project that "GPT-2030" will have a number of properties that are surprising relative to current systems:
1. Superhuman abilities at specific tasks, such as math, programming, and hacking.
2. Fast inference speed and throughput (enough to run millions of copies)
Complex adaptive systems follow the law of unintended consequences: straightforward attempts to control traffic, ecosystems, firms, or pathogens fail in unexpected ways. And we can see similar issues in deep networks with reward hacking and emergence.
4. Consider not building certain systems. In biology, some gain-of-function research is heavily restricted, and there are significant safeguards around rapidly-evolving systems like pathogens. We should ask if and when similar principles should apply in machine learning.
Based on this, I examine a number of principles for improving the safety of deep learning systems that are inspired by the complex systems literature:
1. Build sharp cliffs in the reward landscape around bad behaviors, so that models never explore them in the first place.
I've previously made forecasts for mid-2023 (which I'll discuss in July once they resolve). Thinking 7 years out is obviously much harder, but I think important for preparing for the future impacts of ML.
2. Train models to self-regulate and have limited aims.
3. Pretraining shapes most of the structure of a model. Consider what heuristics you are baking in at pretraining time, rather than relying on fine-tuning to fix problems.
Many have heard of deliberate practice, but I identify another important mental stance called *deliberate play*. Deliberate play is intentional, but with a softer focus. Deliberate practice develops skills; deliberate play develops frameworks.
@EpochAIResearch is one of the coolest (and in my opinion underrated) research orgs for understanding trends in ML. Rather than speculating, they meticulously analyze empirical trends and make projections for the future. Lots of interesting findings in their data!
We at @EpochAIResearch recently published a new short report!
recently published a new short report!
In "Trends in Training Dataset Sizes", we explore the growth of ML training datasets over the past few decades.
Doubling time has historically been 16 months for language datasets and 41 months for vision.
🧵1/3
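For intuition, a doubling time converts to an implied annual growth factor; a quick back-of-the-envelope sketch (my own arithmetic, not a figure from the report):

```python
def annual_growth_factor(doubling_months):
    # a doubling every d months means 2^(12/d) growth per year
    return 2 ** (12.0 / doubling_months)

annual_growth_factor(16)  # language datasets: ~1.68x per year
annual_growth_factor(41)  # vision datasets: ~1.22x per year
```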
What will SOTA for ML benchmarks be in 2023? I forecast results for the MATH and MMLU benchmarks, two benchmarks that have had surprising progress in the past year:
In the next post of this series, I argue that when predicting the future of ML, we should not simply expect existing empirical trends to continue. Instead, we will often observe qualitatively new, "emergent" behavior: .
3. Parallel learning. Because copies have identical weights, they can propagate millions of gradient updates in parallel. This means models could rapidly learn new tasks (including "bad" tasks like manipulation/misinformation).
4. New modalities. Beyond tool use and images, may be trained on proteins, astronomical images, networks, etc. Therefore could have strong intuitive grasp of these more "exotic" domains.
I elaborate on these and consider several additional ideas in the blog post itself.
Thanks to @DanHendrycks for first articulating the complex systems perspective on deep learning to me. He's continuing to do great work in that and other directions at
For predicting what future ML systems will look like, it's helpful to have "anchors"---reference classes that are broadly analogous to future ML. Common anchors include "current ML" and "humans", but I think there's many other good choices:
In this work, we build a LM pipeline for automated forecasting. Given any question about a future event, it retrieves and summarizes relevant articles, reasons about them, and predicts the probability that the event occurs.
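In miniature, the pipeline's structure looks something like the sketch below. To be clear, the actual system uses language models for retrieval, summarization, and reasoning; the helper functions here (`retrieve`, `summarize`, `predict`) are toy stand-ins of my own, not the paper's implementation:

```python
def retrieve(articles, keywords):
    # stand-in for LM-based retrieval: keep articles matching any keyword
    return [a for a in articles if any(k in a.lower() for k in keywords)]

def summarize(article, max_chars=80):
    # stand-in for an LM-written summary: just truncate
    return article[:max_chars]

def predict(summaries, base_rate=0.5):
    # stand-in for LM reasoning: nudge a base rate by the amount of evidence
    return min(0.95, max(0.05, base_rate + 0.05 * len(summaries)))

articles = ["Talks between the two parties resumed this week.",
            "Unrelated sports news."]
relevant = retrieve(articles, ["talks", "parties"])
prob = predict([summarize(a) for a in relevant])  # 0.55 on this toy input
```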
If you want to join me on this, you can register predictions on Metaculus for the MATH and Massive Multitask benchmarks:
*
*
It's pretty easy--just need a Google account. The MATH one is open now and Multitask should be open soon.
I then consider a few ways GPT-2030 could affect society. Importantly, there are serious misuse risks (such as hacking and persuasion) that we should address. These are just two examples, and generally I favor more work on forward-looking analyses of societal impacts.
@aghobarah Definitely agree in terms of research track record. But in terms of professional standing, Anna's a PhD student and Azalia's on the academic job market right now. This is important, because it means their careers are more affected by this sort of press (vs. a tenured prof).
@chhaviyadav_ Consulates are closed due to COVID-19, so incoming international students can't apply for visas. This has been true for a while, but it's now at the point where it's affecting students directly. See e.g. this June letter from GOP representatives asking Pompeo to fix it:
Some exciting new work by my student @DanHendrycks and collaborators. We identify seven hypotheses about OOD generalization in the literature, and collect several new datasets to test these. Trying to add more "strong inference" to ML (cf. Platt 1964).
What methods actually improve robustness? In this paper, we test robustness to changes in geography, time, occlusion, rendition, real image blurs, and so on with 4 new datasets.
No published method consistently improves robustness.
Curated list of documented police abuse during protests: . Compilations like this are a compelling reminder that George Floyd is the most salient instance of a broader trend. (And remember: there's also many good police who are supporting protestors.)
We compare our system to ensembles of competitive human forecasters ("the crowd"). We approach the performance of the crowd across all questions, and beat the crowd on questions where they are less confident (probabilities between 0.3 and 0.7).
Good to see this analysis, but misleading headline. 24 states have *point estimates* over 1, but uncertainty in estimates is large. Let's consider null hypothesis that Rt=0.95 everywhere. Then would expect 19 states with estimates above 1 (eyeballing stdev=0.17 from fig. 4).
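To spell out the arithmetic (my own sketch, assuming 50 states and a normal error model for the point estimates):

```python
import math

def p_estimate_above_one(true_rt=0.95, sd=0.17):
    # P(a noisy point estimate exceeds 1) when the true Rt is 0.95
    z = (1.0 - true_rt) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

50 * p_estimate_above_one()  # ~19 states expected above 1 under the null
```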
UPDATE: #covid19science #COVID19 in USA
➡️Initial national average reproduction number R was 2.2
➡️24 states have Rt over 1
➡️Increasing mobility causes resurgence (doubling of deaths in 8 weeks)
➡️4.1% of people infected nationally
🔰Report
Moreover, averaging our prediction with the crowd consistently outperforms the crowd itself (as measured by Brier score, the most commonly-used metric of forecasting performance).
Our system has a number of interesting properties. For instance, our forecasted probabilities are well-calibrated, even though we perform no explicit calibration and even though the base models themselves are not (!).
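For readers unfamiliar with the Brier score (mean squared error of probabilistic forecasts; lower is better), here is a toy sketch with made-up numbers, not data from the paper; the point is that averaging helps most when the two forecasters' errors are complementary:

```python
def brier_score(probs, outcomes):
    # mean squared error between forecast probabilities and 0/1 outcomes
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

outcomes = [1, 0, 1, 0]
crowd = [0.9, 0.1, 0.5, 0.5]  # sharp on Q1-2, uncertain on Q3-4
model = [0.5, 0.5, 0.9, 0.1]  # the reverse
avg = [(c + m) / 2 for c, m in zip(crowd, model)]

brier_score(crowd, outcomes)  # 0.13
brier_score(model, outcomes)  # 0.13
brier_score(avg, outcomes)    # 0.09 -- the average beats both
```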
Lots of people hating on hydroxychloroquine because Trump likes it. But just because Trump likes something doesn't mean it kills people. Maybe it does, but let's demand real evidence instead of giving shoddy science a pass.
The actual issue is that R is not a good metric for directly setting policy, because it's difficult to estimate and far-removed from things in the world we care about, like hospital demand.
What's going on with Georgia? They've been "open" for a while now and there's been no apparent spike in cases. I don't think this can just be poor testing because other data sources (e.g. FB surveys) show same thing: 1/5
Examples include gecko feet, operating systems, economic specialization, hemoglobin, polymers, eyes, ant colonies, transistors, cities, and skill acquisition. If you're interested in reading about how this applies to ML, check out the full blog series!
I *also* still think there are unknown unknowns, and we should probably slow down and understand what current large ML systems are doing, before rushing to deploy new ones.
But hopefully concrete behaviors will open the door to concrete research towards addressing them.
Overall, each scenario requires a few things to "go right" for the rogue AI system; I think of them as moderate but not extreme tail events, and assign ~5% probability to "something like" one of these scenarios happening by 2050. (w/ additional prob. on other/unknown scenarios)
Second, our model underperforms on "easy" questions (where the answer is nearly certain), because it is unwilling to give probabilities very close to 0 or 1. This is possibly an artifact of its safety training.
In research, it's important to create an environment that allows for risk-taking and mistakes, while also pushing eventually towards excellence and innovation. I aim to set discussion norms that promote both of these.
Some great recommendations from Chloe Cockburn (a program officer at Open Philanthropy, where I worked last summer). My understanding is that DA elections (starting at #9 on the list) are a high-impact route to police and criminal justice reform.
Thread: A lot of people are asking me where to give $ in this moment (I direct criminal justice giving at Open Philanthropy). I've compiled a list of recs for police accountability, including shrinking their budgets; decarceration; and transforming systems. /1
Open letter on police reform at UC Berkeley. I helped draft this, together with several amazing students. If you're at UCB and want to sign, please get in touch via e-mail.
UCB has already pursued some good reforms, but there's much more to be done.
Signal-boosting this pushback since Nuño has a strong forecasting track record.
I agree the AI part is not traditional ref. class analysis, but think "AI is an adaptive self-replicator, this often causes problems" is importantly less inside-view than [long arg. about paperclips].
@JacobSteinhardt @DhruvMadeka I like the overall analysis. I think that the move of noticing that AIs might share some characteristics with pandemics, in that AIs might be self-replicating, is an inside-view move, and I don't feel great about characterizing that as a reference class analysis.
Finally, we provide a self-supervised method that fine-tunes models to forecast better, based on having them mimic rationales and forecasts that outperform the crowd. This is effective enough that fine-tuned GPT-3.5 can beat a carefully prompted GPT-4.
I'm helping Redwood Research run REMIX, a 1-month mechanistic interpretability sprint where 25+ people will reverse engineer circuits in GPT-2 Small. This seems like a great way to get experience exploring @ch402's transformer circuits work.
Apply by 13th Nov!
This one on writing is an oldie, but hopefully useful to people gearing up for ICLR! Also highly recommend "Style: Lessons in Clarity and Grace" by Williams and Bizup for the book-length treatment of good writing
@satml_conf was a great experience. More interesting conversations and ideas per day than at ICML, NeurIPS, or ICLR. The smaller size contributed, as well as a great program. Thanks to @NicolasPapernot and all the organizers!
And @satml_conf is a wrap! Thank you to all the attendees for their amazing energy!
Excited to announce that @carmelatroncoso has agreed to co-chair the conference with me next year!!
On the other hand, secondary attack rate (probability of transmission) surprisingly low: ~30% between two house members. Implies infection is not inevitable even between close contacts; basic precautions e.g. handwashing still worthwhile. 2/4
More criticism of Yale wastewater study, links to cool analysis by @xangregg. One thing to keep in mind is that there are excellent, careful researchers in this area who *aren't* publishing results yet because they're waiting for better data. Similar to how serology played out.
Incredible... stats meets the 24 hour news cycle. Data scraped from pdf, analyzed and reanalyzed w/in a few days of an exciting preprint appearing.
Purple curve (linear smoothing + robust handling of outliers) is v. similar to smoothed curve in the preprint.
I'm worried that we're ignoring this data point because it doesn't fit our priors. It's not what I expected either, but therefore important to discuss. Most explanations I see say GA has too few tests or is making up numbers, but these seem untenable given the survey data. 5/5
@xuanalogue Thanks, I appreciated this! I don't think I'm claiming data/scale is all that matters, and agree ideas are an important part of the picture. For instance, Parsel is an example of ideas helping a lot on APPS.
We're accepting proposals for projects working with deep learning systems that could help us understand and make progress on AI alignment. Learn more about the research directions and the application process here:
On R = 1.1 in particular: it’s difficult to tell the difference between R=1.1 and R=0.9 without at least 7 days of data, and probably more. R=1.1 corresponds to 2%/day growth, and 0.9 to -2%/day decline.
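The conversion from R to a daily growth rate, sketched under the assumption of a ~5-day generation time (the generation time is my assumption, not a figure from the thread):

```python
def daily_growth(R, generation_time_days=5.0):
    # R infections per generation implies R^(1/Tg) growth per day
    return R ** (1.0 / generation_time_days) - 1.0

daily_growth(1.1)  # ~ +0.019, i.e. about 2%/day growth
daily_growth(0.9)  # ~ -0.021, i.e. about 2%/day decline
```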
Anyways, that's just a preview and I'll lay out my full position (and arguments behind it) in the series, which posts each Tuesday for next 5 weeks. You can read the first post here: . Comments and feedback welcome!
transformer inference performance is becoming increasingly important and there's not as much lore on it, so here is a lot of lore that i think fully models llm inference performance
While there is a real R (avg # of infections per source), we can't measure this without the infection graph, which few regions have. Instead the "R" we talk about is a model parameter that we’re imputing under lots of assumptions about generation time, infection dynamics, etc.
There's been a lot of controversy about the CAIS statement on extinction risk from AI, so let's talk about it!
I wrote a post with some of my detailed thoughts on objections to the statement.
As AI systems become more useful, people will delegate greater authority to them across more tasks.
AIs are evolving in an increasingly frenzied and uncontrolled manner. This carries risks as natural selection favors AIs over humans.
Paper: (🧵 below)
Papers often propose a similarity metric and justify it with intuitive desiderata, but different intuitive tests can make any method look good. Our work (joint with Jean-Stanislas Denain and @JacobSteinhardt) provides a quantitative benchmark for evaluating similarity metrics 4/7
I am very interested in discussion and feedback on these scenarios. Debating them has shaped my overall view of catastrophic risks from AI (both overall probability and relative likelihood of different paths), and I expect further discussion to continue to do so.
What changed? Ironically, GPT-3. GPT-3 showed that new qualitative capabilities (like in-context learning) can emerge without warning. Despite being a huge engineering accomplishment, GPT-3 showed the limits of the Engineering mindset for predicting the future.