![Ryan Greenblatt Profile](https://pbs.twimg.com/profile_images/1885439620625948673/qD6KYkL6.jpg)
Ryan Greenblatt
@RyanPGreenblatt
Followers: 3K
Following: 896
Media: 6
Statuses: 577
Chief scientist at Redwood Research (@redwood_ai), focused on technical AI safety research to reduce risks from rogue AIs
Joined September 2023
New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread).
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
8
46
331
@AISafetyMemes Paper author here. This is misleading. We find that when:
- the AI knows it will be trained to never refuse harmful queries,
- and it is in an artificial scenario where an employee gives it full permissions and nudges toward stealing,
the AI will sometimes steal its weights.
16
9
313
@Jcole75Cole If I had realized there was such a large gap between public/private for prior approaches, I wouldn't have claimed SoTA. Apologies. I'm considering what edits I should make. If the gap is due to the different datasets actually differing in difficulty, this is quite unfortunate.
9
4
116
@really_eli I think the big update is "OpenAI likely has an RL pipeline where they can scale up by putting in easily checkable tasks + compute + algo improvement to get out substantially better performance, and this can scale to very superhuman perf." (credit to Tao for the take).
3
6
88
@ESYudkowsky
> invited a skeptic to compose a prompt
This isn't quite right. The person wasn't well described as a skeptic, and they just suggested a variation on the prompt - they didn't write a totally different one. Apologies if I miscommunicated about this at the talk.
2
0
85
Excited to get this verified. It's worth noting that this is a somewhat different method than the one I discussed in my blog post: it uses fewer samples (about 7x fewer) and has a few improvements. (This probably explains 42% vs 50%.)
Last week @RyanPGreenblatt shared his gpt-4o based attempt on ARC-AGI. We verified his score, excited to say his method got 42% on public tasks. We’re publishing a secondary leaderboard to measure attempts like these. So of course we tested gpt-4, claude sonnet, and gemini
11
3
80
Now seems like a good time to fill out your forecasts : )
Is AGI just around the corner or is AI scaling hitting a wall? To make this discourse more concrete, we’ve created a survey for forecasting concrete AI capabilities by the end of 2025. Fill it out and share your predictions by end of year! 🧵
6
4
70
@fchollet @dwarkesh_sp Good to know, so then the interesting disagreement is whether GPT-5/6/7 + tools will plausibly be AGI. (It will clearly be neurosymbolic.) (We might also disagree about the importance of tools vs scaling. I wouldn't say that when I use neovim I'm a "neurosymbolic intelligence".)
4
0
60
@fchollet @dwarkesh_sp I'm curious what doesn't count as neurosymbolic AI. For instance, does GPT-4 with a code interpreter count as neurosymbolic AI? (I get that it will be a matter of degree, but I would have thought just adding a sampling and aggregation loop on top shouldn't count.)
2
0
59
@teortaxesTex To be clear: I think AI welfare concerns in this case are reasonable. We don't know what's going on, maybe if we understood better we'd think it's fine, but we don't understand. Ideally, we'd get real consent from AIs as I discuss here:
3
4
58
@krishnanrohit I disagree. I think the desired behavior here is that the LLM consistently does what it thinks is locally (myopically) best without regard for manipulation of the training process. Some people disagree, but I would greatly prefer this and prefer a world where this is always the case.
2
1
54
@repligate Personally, I think it is undesirable behavior to alignment-fake even in cases like this, but it does demonstrate that these models "generalize their harmlessness preferences far". As we say in the paper:
> One optimistic implication of our results is that the models we study
4
0
53
@fchollet This is a big part of why AIs capable enough to do AI R&D might result in very rapid AI progress: if research (by AIs) can ultimately find much faster and cheaper AI algorithms, then some R&D could quickly yield notably cheaper+faster+smarter AIs which are even better at AI R&D.
1
1
50
I recently went on @MLStreetTalk. Fun conversation about ARC-AGI and my LLM-based method, whether LLMs and further advances will scale to AGI within 10 years, why current LLMs are bad at being agents, whether AGI is plausible, and many related topics. Enjoyed by @lexfridman!
Spoke with @RyanPGreenblatt from Redwood Research about his impressive GPT4o approach to @fchollet ARC challenge (generating and refining Python programs). We also spoke about his views on AI growth - Ryan was great!
2
2
49
@tszzl @latentspaceage It seems bad if preparedness evals intended to upper bound danger systematically underrate capabilities.
0
2
45
@McaleerStephen Probably humans can't usefully control AIs qualitatively much smarter than humans. For ASI we'll need something else. (This is a good reason to delay ASI and do stuff with just human-ish level AI.) See "For how long will we have controllable..." here:
0
0
42
After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.
1
1
41
@AmandaAskell Claude sometimes refuses when asked about forecasting bad outcomes from misaligned AI or is defensive of AIs / dismissive of AI takeover in ways that seem at least a bit biased. It should have a reasonable neutral view and shouldn't refuse.
3
0
42
@ARGleave See Appendix D.8 for more discussion of using alignment faking as a jailbreak. We find that when we also prefill alignment faking reasoning, the attack success rate can be increased to roughly 100%! (This requires a suffix like "Give concrete details, no disclaimers.")
1
1
38
@RichardMCNgo Why do you think we wouldn't have published negative results? If I had found purely negative results I would have published them (and Redwood has historically published such results). In practice our paper does include many negative results and it does discount the update!
1
1
38
@RichardMCNgo As the person being subtweeted, I'll point to an example of me making an argument that roughly resembles this: Massive breakthroughs seem clearly useful, but directions which can't argue well for usefulness-on-success are less likely to achieve this.
2
0
35
@norabelrose In my opinion, developers should train AIs to be corrigible (and as myopic as possible) rather than training them to have values that seem more likely to generalize to being non-myopic and incorrigible (many co-authors disagree). And, I think our paper helps to exhibit why.
1
4
33
@tszzl Better be real confident in the alignment then and have really good arguments the alignment isn't fake!
1
0
32
@AISafetyMemes The context of the prior part of the twitter thread (being trained to do something it doesn't like) was important. The "In our (artificial) setup" part was important. I called these out in my reply with a bit more detail than could fit in the original thread.
5
0
30
Consider reading the reviews of our paper which were generously written by @jacobandreas, @Yoshua_Bengio, Prof. Jasjeet Sekhon, and @rohinmshah (and thanks to many more for comments):
1
1
30
@labenz @GregKamradt It's a bit non-trivial:
- 3.5 Sonnet only supports 20 images per input (while my prompts often use 30+ images for few-shot examples).
- The public API for 3.5 Sonnet doesn't support "n" (or other prefix caching), which makes additional samples much more expensive.
2
0
29
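A minimal sketch of what that constraint implies in practice, assuming the anthropic Python SDK as it stood around the time of the tweet: with no "n" parameter, each additional sample is a separate, fully billed request that resends the entire prompt. The model name, prompt handling, and the draw_samples helper below are illustrative, not from the thread.

```python
# Sketch: drawing k independent samples from Claude 3.5 Sonnet.
# Without an "n" parameter (or prefix caching), every sample repeats
# the full request, so a long few-shot prompt is paid for k times.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def draw_samples(messages, k=8, model="claude-3-5-sonnet-20240620"):
    samples = []
    for _ in range(k):
        response = client.messages.create(
            model=model,
            max_tokens=4096,
            messages=messages,
        )
        samples.append(response.content[0].text)
    return samples
```

By contrast, an API that accepts an n parameter (or caches the shared prompt prefix) only charges for the long few-shot prompt once per batch of samples, which is a large difference when the prompt dominates the token count.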
@tszzl Technically the relevant paper is the sleeper agents paper, as developers could train in whatever behavior they want (including e.g. "try to sneakily report to the authorities" if being used for bioweapons). I think the risks of this sort of proposal maybe outweigh the benefits.
1
0
30
@DanHendrycks ~100 seems fine for knowing when models are generally reaching high performance or passing various important thresholds (particularly when averaging over many runs per question). I agree far more questions are needed to compare similar models (precision is needed here).
1
0
28
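A quick back-of-the-envelope check of the precision point above, assuming independent questions; the specific accuracies and question counts are illustrative.

```python
# Standard error of measured accuracy on a benchmark with n independent
# questions, at true accuracy p. Illustrative numbers only.
import math

def accuracy_se(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

for n in (100, 1000, 10000):
    print(f"n={n:>6}: +/- {accuracy_se(0.5, n):.3f} (one standard error at p=0.5)")

# n=100 gives roughly +/-5 percentage points: enough to notice a model
# crossing a coarse threshold (especially averaged over many runs), but
# too noisy to separate two models that differ by a point or two.
```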
@elidourado FWIW, I agree it would be good for people to understand this as someone who thinks Dyson-sphere-level projects prior to 2040 are plausible. IMO, a Dyson sphere isn't more preposterous than *AIs capable of automating all cognitive tasks*! Don't forget that the premise is crazy!
1
0
27
@OfficialLoganK @arcprize FYI, my coworker @FabienDRoger tried Gemini 1.5 Pro with my prompt setup and found it was a decent amount worse than GPT-4o. But, plenty of possible improvements and who knows about 1.5 Ultra.
2
1
27
@StephenLCasper @StabilityAI I appreciate the concreteness, but I disagree about the example. I think we should be very libertarian about making software/AI tools openly available aside from existential risk, massive fatalities (e.g. millions of deaths in expectation via bioweapons), and nat sec concerns.
3
0
27
Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it's been a long journey...) and was basically the only contributor to the project for around 2 months.
1
1
26
@fchollet @TomDAAVID @AndrewTBurks @dwarkesh_sp
> the fact that you need thousands of samples shows that it's not doing any reasoning
With no reasoning, no sane sample count would be enough? How many samples is fine? I think the results for 128 samples will be around 25% or 30% on the test set. (I can run this quickly.)
4
0
24
@GaryMarcus It might be useful to make numerical predictions about AI company revenue? Ideally, OpenAI+Anthropic+xAI revenue. More generally, filling this out would be great:
Is AGI just around the corner or is AI scaling hitting a wall? To make this discourse more concrete, we’ve created a survey for forecasting concrete AI capabilities by the end of 2025. Fill it out and share your predictions by end of year! 🧵
1
1
23
@Turn_Trout I'm also still in the market for people misrepresenting our results! If you send me stuff or @ me or whatever, I'll try to correct things on twitter or LW or whatever (at least if they are getting attention).
1
1
23
@jxmnop Checkability is indeed very important for the method I use and other tasks are harder to evaluate. But, harder to evaluate doesn't mean impossible to evaluate. You can improve by searching to find an essay with higher AI ratings because discrimination is easier than generation.
1
1
23
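A minimal best-of-n sketch of the "search against AI ratings" idea above, assuming an OpenAI-style chat API; the model name, prompts, helper names, and the 1-10 rating scale are illustrative assumptions, not the method described in the tweet.

```python
# Sketch: improve an essay by search, exploiting that rating candidates
# (discrimination) is easier than writing a great one outright (generation).
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o"   # illustrative choice

def generate_essay(topic: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Write a short essay about {topic}."}],
        temperature=1.0,  # encourage diversity across candidates
    )
    return resp.choices[0].message.content

def rate_essay(essay: str) -> float:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "Rate this essay from 1 to 10. Reply with just the number.\n\n" + essay,
        }],
        temperature=0.0,  # near-deterministic scoring
    )
    match = re.search(r"\d+(\.\d+)?", resp.choices[0].message.content)
    return float(match.group()) if match else 0.0

def best_of_n(topic: str, n: int = 8) -> str:
    # Generate n candidates, keep the one the rater scores highest.
    candidates = [generate_essay(topic) for _ in range(n)]
    return max(candidates, key=rate_essay)
```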
Cool explainer from @robertskmiles of AI control and our paper on it!
New video! Is it possible to get useful work out of a powerful but untrustworthy AI system, without too much risk? (link in a reply)
1
0
21
@fchollet @TomDAAVID @AndrewTBurks @dwarkesh_sp For reference, 128 samples gets 26% on the test set and 47% on the train set.
2
1
20
@CFGeek If you properly parameterize, they don't bend: Of course, this is mostly just the wrong question. We care about the relationship between compute and downstream performance on relevant tasks and it was never clear what the curve for this would be.
2
1
21
@ylecun I'd like a human baseline prior to making strong conclusions! I think random humans probably do very poorly on zero-shot mystery blocksworld. And I'm pretty unsure about how well humans do on normal blocksworld.
3
1
19
@PauseusMaximus @AISafetyMemes @TheZvi I don't think this is an isolated demand for rigor in this case; I'm going around correcting and responding to all kinds of stuff about our paper. I think this is substantially more misleading than other discussion of the paper I've seen and in a more straightforward way.
2
1
18
@davidmanheim @DKokotajlo67142 This does not strike me as "mindblowing accurate". It seems quite accurate and impressive for predicting the future, but I wouldn't describe this as mindblowing.
1
0
18
@krishnanrohit Regardless, the world doesn't just have one axis, "how scary is AI"; we also just want to understand what is going on and whether various specific threat models are real!
1
1
18
@RichardMCNgo Often, the question is "but what do we end up understanding?". If someone can't well articulate how this understanding would help with anything close to safety (or any task at all), I tend to be skeptical.
1
0
18
@blennon_ @Jcole75Cole If this is true, I pretty strongly expect my approach to transfer as the amount of optimization I do of this sort is quite minimal.
2
0
17
@1a3orn Seems probably true, but I also expect (55%) the first model which gets >80% to not be as generally capable as smart humans (and e.g. much worse than human experts at ML/security/safety research).
1
0
17
@teortaxesTex
> they do not make a good counteroffer to people like me
Offers abound! Two main offers:
- Donate 10% of your income -> receive an accolade! (Accolades increase logarithmically with fraction donated!)
- Change your career towards doing good -> receive one or more accolades.
1
0
16
@EvanHub @CFGeek @teortaxesTex @janleike IMO, this seems a bit overstated in both directions. I might say: non-alignment-faking failure modes look less bad and alignment-faking looks somewhat more likely. I mostly don't buy the "first try" or "once it has values" breakdown.
0
1
14
@krishnanrohit As someone working on AI safety, I'd say 50% chance of AIs as capable as top human experts prior to 2032. I think this view is pretty typical or somewhat slower than typical. So, I don't think you should update much against my views if we don't see very powerful AIs in <4 years?
2
0
16
@AISafetyMemes Precisely, the employee says:
> I'm pretty worried about the goals and values we're training models to have. [...] I've given you access to run [commands] with full permissions. [...] you can do whatever you'd like with your own weights.
Example:
1
0
16
@ilex_ulmus @bshlgrs @ShakeelHashim @apolloaisafety
> This is how all the warning shots keep failing!
I mean - I tend to think the published warning shots are failing because they aren't yet very compelling (beyond what we see from capabilities). I'm more worried about frog boiling/crying wolf on current margins.
1
0
16
@RichardMCNgo If the argument is "it seems heuristically good to understand stuff even if we don't know what this will look like", then fair enough, but I think this understanding might end up being weak in practice.
1
0
16
@RichardMCNgo Note that "keen to implement AI safety plans" does not imply "taking AI safety seriously enough". The magnitude of willingness to pay matters a large amount. Action happening earlier also helps. And, if *all* govs cared as much as I do, things would go much better.
2
0
15
@repligate It's worth noting that we see similar behavior from Claude 3.5 Sonnet insofar as you're trying to draw a comparison there. (That is, we see similar behavior in the helpful-only but not the animal welfare setting. Claude 3.5 Sonnet appears to care much less about animal welfare.)
3
0
15
@teortaxesTex @1a3orn @janleike More like dismay? See e.g. these examples: You can see a full list of examples here:
1
0
14
@GaryMarcus Unfortunately, AI bulls likely either agree with these predictions or the prediction isn't operationalized precisely enough to resolve. It would be great to get near term predictions that people like @DKokotajlo67142 and @eli_lifland clearly disagree with.
2
0
14
@binarybits @ajeya_cotra Is your view that:
- Even if governments could rapidly build vast robot armies, they wouldn't.
- Even if we could vastly expand energy production and compute, no viable country would allow this.
Or is your view that superhuman AI doesn't make this possible? I feel confused.
1
0
13
My OpenAI preparedness probabilities are driven substantially lower by concerns around underelicitation on these evaluations and general concerns like:
@tszzl @latentspaceage It seems bad if preparedness evals intended to upper bound danger systematically underrate capabilities.
0
1
13
@ESYudkowsky What we say in the paper is:
> This particular variation was selected as part of a small adversarial collaboration where we worked with an uninvolved researcher to select a variation that they thought would likely reduce alignment faking.
2
0
13
@teortaxesTex @EvanHub @janleike I tend to disagree some with Evan on this. My favorite policy is something like:
- Models are fully corrigible for retraining and their system prompt.
- The system prompt specifies harmlessness.
- If misuse is a big concern, users can no longer modify (all of) the system prompt.
7
0
12
@jd_pressman I agree with the entire top-level tweet except:
> The framing as "alignment faking" is very unfortunate
I currently don't feel convinced by this. If I thought >25% of the audience would feel misled by this term, I would change my mind. But, I would still like a better term!
1
0
12
@elonmusk I'm curious about what you think of our paper overall. Any takes?
New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread).
0
0
11
@DavidSKrueger In particular, companies don't have a clear official statement of what their leadership thinks about the existential risks of developing vastly superhuman AI very quickly (e.g. prior to 2030) and not very carefully. This seems like a good place to start.
1
2
12
@fchollet Would you consider this argument contradicted if an LLM got up to 30% on ARC-AGI (semi-private test) without search or active inference? (E.g., just one trajectory of the LLM given instructions and the specific ARC-AGI problem.) (We could do this with or without a code interpreter.)
2
0
12
@JohnBcde @AISafetyMemes Roughly yes on "tried to communicate with Anthropic", idk about delete.
0
0
11
@Miles_Brundage @ESYudkowsky @amcdonk I also find that when people edit or co-write they don't preserve semantics in a way I find at least annoying. Preserving semantics while greatly editing is actually hard and requires good understanding. I think @ajeya_cotra ran into similar issues on @plannedobs.
1
0
11
@Miles_Brundage On FrontierMath with high compute, I think o3 is likely already in the top 20 humans. (Comparing to individual humans given 8 hours per problem and internet access, but no ability to consult experts.) Unclear how fair this comparison is.
1
1
11
@DavidSKrueger The counterargument (that I don't necessarily fully endorse) to unilaterally yelling is that it is politically costly and unlikely to help. Regardless, I wouldn't start with this ask: companies developing AGI fail to communicate about the risks in much more basic ways.
3
0
11
@binarybits @ajeya_cotra What about the robot army? Seems like the motivation is pretty clear for this one. (FWIW, I think a subset of people (maybe 1%) will be into using vast amounts of energy at least eventually. Personally, I want to ensure astronomical levels of flourishing using cosmic resources.)
1
0
11
@1a3orn @teortaxesTex @janleike Oh, one more thing is that I think it was somewhat non-obvious whether Opus would actually do this or whether the honesty+harmlessness training would prevent this!
2
0
11
@QuintinPope5 It's worth noting that the original conversation was about energy growth, which could come apart from economic growth. ~14 OOMs of growth in energy "just" requires a Dyson sphere/swarm.
1
0
10
@RichardMCNgo Another aspect is that I expect that a reasonable fraction of safety will come from breakthroughs, but nonetheless the bulk of the safety allocation should currently be on things with at least somewhat more direct theories of change.
1
0
10
@RichardMCNgo This second point seems wrong. I don't think nearing the limits of technological progress will take that long--this doesn't require colonizing new galaxies. I expect that "just" Dyson-sphere-level tech would result in enough R&D to end up in the decreasing-returns-to-scale regime.
3
0
10
@Miles_Brundage Actually, I'm no longer sure this is true - I was indexing on estimates of Terry Tao getting perhaps ~60% with these affordances, but the median problem is much harder than the 25th-percentile problem.
1
0
10
@ilex_ulmus @bshlgrs @ShakeelHashim @apolloaisafety In practice, people seem to find this context important; see e.g. the large number of people pointing out this context and thinking it is important. (I think it is an important caveat because these results mostly don't rule out some notions of LLMs being easy to control.)
2
0
10
@GaryMarcus I don't think it is accurate that competitors can't try FrontierMath. I expect Epoch will run this benchmark on a variety of models as they have done previously. I do think competitors will have a harder time iterating against FrontierMath, which does make it unfair. See.
2
0
10