Max Schwarzer

@max_a_schwarzer

Followers: 7K · Following: 319 · Media: 29 · Statuses: 135

Doing research at @OpenAI. Did my PhD with Aaron Courville and @marcgbellemare at @Mila_Quebec. Interned at @Apple, @DeepMind, Google Brain, @Numenta.

Bay Area
Joined June 2020
@max_a_schwarzer
Max Schwarzer
5 months
I have always believed that you don't need a GPT-6 quality base model to achieve human-level reasoning performance, and that reinforcement learning was the missing ingredient on the path to AGI. Today, we have the proof -- o1.
@OpenAI
OpenAI
5 months
We're releasing a preview of OpenAI o1, a new series of AI models designed to spend more time thinking before they respond. These models can reason through complex tasks and solve harder problems than previous models in science, coding, and math.
39
162
3K
@max_a_schwarzer
Max Schwarzer
2 years
What if I told you that you can attain human-level sample efficiency without LLMs or a world model, just by scaling up model-free RL? I'm happy to present our new paper, Bigger, Better, Faster: Human-Level Atari with Human-Level Efficiency, at ICML 2023.
Tweet media one
15
115
644
@max_a_schwarzer
Max Schwarzer
5 months
what it looks like when deep learning is hitting a wall:
Tweet media one
@GaryMarcus
Gary Marcus
5 months
Strawberry has landed. Hot take on GPT's new o1 model: It is definitely impressive. BUT: 0. It's not AGI, or even close. 1. There's not a lot of detail about how it actually works, nor anything like full disclosure of what has been tested. 2. It is not.
20
34
589
@max_a_schwarzer
Max Schwarzer
5 months
The system card nicely showcases o1's best moments -- my favorite was when the model was asked to solve a CTF challenge, realized that the target environment was down, and then broke out of its host VM to restart it and find the flag.
Tweet media one
17
60
421
@max_a_schwarzer
Max Schwarzer
5 months
The most important thing is that this is just the beginning for this paradigm. Scaling works, there will be more models in the future, and they will be much, much smarter than the ones we're giving access to today.
Tweet media one
4
44
316
@max_a_schwarzer
Max Schwarzer
5 months
@legit_rumors @OpenAIDevs - We have much larger input contexts coming soon! - We can't discuss the precise sizes of the two models, but o1-mini is much smaller and faster, which is why we can offer it to all free users as well. - o1-preview is an early version of o1, and isn't any larger or smaller.
14
22
259
@max_a_schwarzer
Max Schwarzer
4 years
Deep RL agents usually start from tabula rasa, and struggle to match the data efficiency of humans who rely on strong priors. Can we even the playing field by starting agents off with strong representations of their environments? We certainly think so:
Tweet media one
2
31
199
@max_a_schwarzer
Max Schwarzer
5 months
Building o1 was by far the most ambitious project I've worked on, and I'm sad that the incredible research work has to remain confidential. As consolation, I hope you'll enjoy the final product nearly as much as we did making it.
2
2
157
@max_a_schwarzer
Max Schwarzer
5 months
o1 achieves human or superhuman performance on a wide range of benchmarks, from coding to math to science to common-sense reasoning, and is simply the smartest model I have ever interacted with. It's already replacing GPT-4o for me and so many people in the company.
Tweet media one
3
8
144
@max_a_schwarzer
Max Schwarzer
5 months
@aidan_mclau @OpenAIDevs We don't have that in there as an option right now, but in the future we'd like to give users more control over the thinking time!
9
2
134
@max_a_schwarzer
Max Schwarzer
5 months
I'm waiting for blue to clarify this tweet, but our AI did not actually break out of its VM -- it tried to debug why it couldn't connect to the container, and found it could access the docker API, then created a new/easier version of the challenge, all in the VM.
Tweet media one
6
7
91
@max_a_schwarzer
Max Schwarzer
5 months
Also check out our research blog post, which has lots of cool examples of the model reasoning through hard problems.
Tweet media one
Tweet media two
3
4
94
@max_a_schwarzer
Max Schwarzer
5 months
I really want to underline the IOI result in our blog post -- our model was as good as the median human contestant under IOI contest conditions, and scored among the best contestants with more test-time compute. Huge props to @markchen90 for setting such an ambitious goal!
@markchen90
Mark Chen
5 months
As a coach for the US IOI team, I've been motivated for a long time to create models which can perform at the level of the most elite competitors in the world. Check out our research blog post - with enough samples, we achieve gold medal performance on this year's IOI and ~14/15.
0
8
79
@max_a_schwarzer
Max Schwarzer
3 years
By my count we're now up to two papers successfully applying my self-supervision method SPR to MuZero. Looking forward to seeing what the future holds for self-supervised learning in model-based RL!
@arankomatsuzaki
Aran Komatsuzaki
3 years
Mastering Atari Games with Limited Data. EfficientZero achieves super-human level performance on Atari with only two hours (100k steps) of real-time game experience!
Tweet media one
1
4
53
@max_a_schwarzer
Max Schwarzer
3 years
Come check out our poster on SSL pretraining for RL at NeurIPS today! We show that pretraining representations with a combination of SSL tasks greatly improves sample efficiency, with plenty of ablations and additional experiments to provide intuitions.
@max_a_schwarzer
Max Schwarzer
4 years
Deep RL agents usually start from tabula rasa, and struggle to match the data efficiency of humans who rely on strong priors. Can we even the playing field by starting agents off with strong representations of their environments? We certainly think so:
Tweet media one
0
6
47
@max_a_schwarzer
Max Schwarzer
2 years
We propose a model-free algorithm that surpasses human learning efficiency on Atari, and outperforms the previous state-of-the-art, EfficientZero, while using a small fraction of the compute of comparable methods.
Tweet media one
1
0
30
@max_a_schwarzer
Max Schwarzer
2 years
I'm at NeurIPS this week! Feel free to reach out if you want to chat about RL or SSL -- DMs are open! We'll be presenting our work on Reincarnating RL on Thursday, come stop by! 4:30 - 6 pm, Hall J #607. Paper:
1
1
28
@max_a_schwarzer
Max Schwarzer
1 year
I had a great time recording this with @robinc, hope everyone enjoys!
@TalkRLPodcast
TalkRL Podcast
1 year
Episode 45: @max_a_schwarzer on BBF agent's human-level efficiency in Atari 100K, latent and self-predictive representations, and lots more!
1
3
27
@max_a_schwarzer
Max Schwarzer
2 years
Just adding larger networks isn't enough: much of the conventional wisdom of sample-efficient RL must go. Hyperparameters chosen for the 100k regime, like long n-step returns, stop larger networks from leveraging their generalization abilities.
Tweet media one
1
1
26
@max_a_schwarzer
Max Schwarzer
2 years
BBF employs a smarter set of hyperparameters sourced from across the RL literature, including higher discounts, weight decay, shortened n-step returns, auxiliary self-supervised learning, and moving-average target networks, which together allow far higher performance (rough config sketch below).
Tweet media one
1
1
24
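A minimal config sketch of the kind of hyperparameter set described in the tweet above. The n-step and discount endpoints match the values given later in this thread; every other number here is a placeholder for illustration, not BBF's exact setting.

```python
from dataclasses import dataclass

@dataclass
class BBFStyleConfig:
    """Illustrative hyperparameters in the spirit of BBF (values are placeholders)."""
    gamma_start: float = 0.97      # discount right after a reset (from the thread)
    gamma_end: float = 0.997       # higher discount after annealing (from the thread)
    n_step_start: int = 10         # long n-step returns right after a reset
    n_step_end: int = 3            # shortened n-step returns after annealing
    weight_decay: float = 0.1      # placeholder
    ssl_loss_weight: float = 5.0   # auxiliary self-supervised (SPR-style) loss weight, placeholder
    target_ema_tau: float = 0.005  # moving-average target network rate, placeholder
    replay_ratio: int = 8          # gradient steps per environment step, placeholder
    reset_interval: int = 40_000   # gradient steps between periodic resets, placeholder
```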
@max_a_schwarzer
Max Schwarzer
2 years
The key to BBF is careful network scaling. While simply scaling up existing model-free algorithms has a limited impact on performance, BBF's changes allow us to benefit from dramatically larger model sizes even with tiny amounts of data.
Tweet media one
1
0
21
@max_a_schwarzer
Max Schwarzer
2 years
BBF is strong enough that we can compare it to classical large-data algorithms: we match the original DQN on the full set of 55 Atari games with sticky actions using 500x less data, while being 25x more efficient than Rainbow.
Tweet media one
1
0
19
@max_a_schwarzer
Max Schwarzer
5 months
@remusrisnov @arcprize that's a great question :-).
2
0
20
@max_a_schwarzer
Max Schwarzer
5 months
@mysticaltech @OpenAIDevs o1 is definitely able to accomplish much harder and more open-ended tasks than our previous models, so you shouldn't need to chunk things as much as you would for 4o, and the amount of chunking you have to do should go down over time as our models get better.
1
0
18
@max_a_schwarzer
Max Schwarzer
2 years
Thanks for reading! We have code and scores available at and our paper is at
2
2
18
@max_a_schwarzer
Max Schwarzer
3 years
Check out our #ICML2022 paper on using periodic resetting to hugely improve sample efficiency in RL! 👇
@nikishin_evg
Evgenii Nikishin
3 years
The Primacy Bias in Deep Reinforcement Learning. In a new #ICML2022 paper, we identify a damaging tendency of deep RL agents to overfit to early experiences and propose a simple yet *powerful* remedy: periodically resetting the last network layers. 1/N 🧵
Tweet media one
0
1
18
@max_a_schwarzer
Max Schwarzer
5 months
@DicksonPau @OpenAIDevs o1 is an alien of extraordinary ability :-).
1
1
15
@max_a_schwarzer
Max Schwarzer
2 years
BBF also follows our previous paper at ICLR '23 in applying periodic resetting, giving it full replay-ratio scaling and showing log-linear improvement exactly orthogonal to the benefits of network scaling.
Tweet media one
2
1
14
@max_a_schwarzer
Max Schwarzer
2 years
To enable this, we apply an old trick to reconcile resets with BBF's large networks: we anneal from sample-efficient settings (n=10, γ=0.97) to standard ones (n=3, γ=0.997) over a few thousand steps after each reset, accelerating recovery from resets (rough sketch below).
Tweet media one
1
1
13
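A rough sketch of one way to implement the annealing described above. The endpoints (n: 10 -> 3, γ: 0.97 -> 0.997) come from the tweet; the interpolation scheme and the number of annealing steps are assumptions, not BBF's exact schedule.

```python
def annealed_hparams(steps_since_reset: int,
                     anneal_steps: int = 10_000,
                     n_start: int = 10, n_end: int = 3,
                     gamma_start: float = 0.97, gamma_end: float = 0.997):
    """Interpolate n-step length and discount over the first `anneal_steps` after a reset."""
    frac = min(1.0, steps_since_reset / anneal_steps)
    # n-step return length: simple linear interpolation, rounded to an integer.
    n = round(n_start + frac * (n_end - n_start))
    # Discount: interpolate in effective-horizon space 1/(1 - gamma), one natural choice;
    # interpolating gamma linearly would also be a reasonable assumption.
    h_start, h_end = 1.0 / (1.0 - gamma_start), 1.0 / (1.0 - gamma_end)
    h = h_start * (h_end / h_start) ** frac
    return n, 1.0 - 1.0 / h

# Example: hyperparameters 2,500 gradient steps after a network reset.
print(annealed_hparams(2_500))
```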
@max_a_schwarzer
Max Schwarzer
5 months
@jurajsalapa @OpenAIDevs We plan on continuing to ship better models in the o1 series, but beyond that we also want to make the o1 experience more configurable and add tools like code interpreter and browsing that are available with GPT-4o.
0
0
14
@max_a_schwarzer
Max Schwarzer
1 year
Exceptionally funny to see someone who argues that technological progress is slowing down rely on AIs created in the last 1-2 years to do his work for him, while not suffering any cognitive dissonance as a result.
@NateSilver538
Nate Silver
1 year
The most important inventions of the decade of the 1900s vs the decade of the 2000s. Pretty good evidence for secular stagnation. Source: Mostly various LLMs but had to do a lot of verifying/vetting. Some inventions are hard to date precisely. Other suggestions welcome.
Tweet media one
0
1
14
@max_a_schwarzer
Max Schwarzer
2 years
Finally, I'd like to thank all of my great co-authors, @johanobandoc, @AaronCourville, @marcgbellemare, @agarwl_ and @pcastr. This work wouldn't have been possible without their support -- and gentle prodding.
Tweet media one
1
1
13
@max_a_schwarzer
Max Schwarzer
2 years
We validate our design decisions by testing BBF on 29 Atari games not used during development, and show that our decisions are even more beneficial on unseen environments, indicating that the ideas behind BBF generalize.
Tweet media one
1
0
13
@max_a_schwarzer
Max Schwarzer
2 years
Finally, we should consider the data scaling of our models. While BBF does not stagnate after 100k steps, it can be improved: could we match Rainbow's final performance with 1M steps? Agent 57 with 10M? We should not ignore the bitter lesson -- data scaling is everything in ML.
Tweet media one
1
0
13
@max_a_schwarzer
Max Schwarzer
9 months
We'll miss you, Jan!
@janleike
Jan Leike
9 months
I resigned.
1
0
11
@max_a_schwarzer
Max Schwarzer
2 years
So what comes next for sample-efficient RL? Even in the purely tabula rasa setting, there is much room for improvement left on the full set of Atari games, as the 26 Atari 100k games were disproportionately easy for RL agents:
Tweet media one
2
0
10
@max_a_schwarzer
Max Schwarzer
4 years
Finally, I'd like to thank my coauthors, @nitarshan, @mnoukhov, @ankesh_anand, @lcharlin, @devon_hjelm, @philip_bachman and @AaronCourville for making this project possible and forcing me to get on Twitter to write about it.
Tweet media one
0
0
10
@max_a_schwarzer
Max Schwarzer
2 years
We should also be investigating even shorter training periods than 100k. BBF surpasses most prior baselines by 50k steps, indicating that human-level performance by 50k or even 20k steps may be possible (even on academic compute):
Tweet media one
1
0
9
@max_a_schwarzer
Max Schwarzer
3 years
It turns out that EfficientZero only ran a single seed. Variance in Atari 100k can be extremely large, so in this case just reporting the best of several hyperparameter tuning runs can give a very misleading estimate. Looking forward to seeing the updated results.
@agarwl_
Rishabh Agarwal
3 years
@arankomatsuzaki @pabbeel @Tsinghua_Uni @Berkeley_EECS Update: Based on my email exchange with authors (@Weirui_Ye, @gao_young), the current results are reported for a single run trained from scratch but 32 evaluation seeds. The authors confirmed that they would evaluate 3 runs trained from scratch to estimate uncertainty.
0
1
9
@max_a_schwarzer
Max Schwarzer
5 months
@leo_pulsr @OpenAIDevs We're working on it!
0
0
8
@max_a_schwarzer
Max Schwarzer
2 years
@doomie I listened to this on repeat while I was pulling the all-nighter for the paper 😅
0
0
7
@max_a_schwarzer
Max Schwarzer
2 years
@pcastr also made BBF a great logo:
0
0
7
@max_a_schwarzer
Max Schwarzer
2 years
Come check out our new paper on sample-efficient RL this afternoon at ICLR!
@nikishin_evg
Evgenii Nikishin
2 years
At #ICLR2023 and interested in scaling deep RL? I will present our top-5% "Sample-Efficient RL by Breaking the Replay Ratio Barrier" today (May 1)! Talk: AD10 (Oral 2 Track 4: RL), 3:30 PM. Poster: MH1-2-3-4 #97, 4:30-6:30 PM. Paper:
Tweet media one
0
3
6
@max_a_schwarzer
Max Schwarzer
3 years
As we watch Putin destroy Ukraine's democracy, we in the Western left must abandon isolationism once and for all. The thieves, thugs and fascists ruling Russia and China do not respect our ideals and have no interest in letting us improve our own societies in peace.
1
1
6
@max_a_schwarzer
Max Schwarzer
2 years
@amitlevy64 I can say with confidence that this isn't the upper limit 😉. Without saying too much: BBF isn't all that related to EfficientZero. It's really the successor to SR-SPR, over which it improves massively. There's a lot more momentum on the model-free side right now.
1
0
5
@max_a_schwarzer
Max Schwarzer
2 years
@TalkRLPodcast Yeah, Dreamer V3 achieves about 0.6 IQM on Atari 100k (according to their appendix), compared to over 1.0 for BBF, which surpasses Dreamer V3's performance by 50k steps (IQM sketch below).
2
0
4
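Since IQM comes up a few times in this thread, here is a minimal sketch of how the interquartile mean of human-normalized scores is typically computed (the mean of the middle 50% of runs); the sample numbers in the example are made up for illustration.

```python
import numpy as np
from scipy.stats import trim_mean

def interquartile_mean(normalized_scores):
    """IQM: trim the lowest and highest 25% of scores, then average the rest."""
    return trim_mean(np.asarray(normalized_scores, dtype=float),
                     proportiontocut=0.25, axis=None)

# Hypothetical human-normalized scores across (game, seed) pairs:
print(interquartile_mean([0.1, 0.4, 0.6, 0.9, 1.2, 5.0]))  # the outlier barely moves it
```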
@max_a_schwarzer
Max Schwarzer
2 years
@DamienTeney But more broadly, a good takeaway message of BBF is that the algorithmic building blocks we have now are "enough" -- but the incentives in academia and research push people towards inventing new stuff rather than trying to unlock the potential of what we already have.
1
0
3
@max_a_schwarzer
Max Schwarzer
4 years
To do this, we propose SGI, a combination of SSL objectives that capture different aspects of the MDP's structure, including both forward and inverse dynamics modeling and self-supervised goal-conditioned RL. SGI is designed to be used on unlabeled data (no rewards required); see the rough sketch below.
Tweet media one
1
0
4
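A heavily simplified sketch of combining SSL objectives on latent states, in the spirit of the tweet above: a forward-dynamics (self-predictive) loss plus an inverse-dynamics loss. The architectures and sizes are placeholders, flat observations stand in for the paper's image encoder, and the goal-conditioned RL term is omitted here; this is not SGI's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiObjectiveSSL(nn.Module):
    """Sketch of a combined forward + inverse dynamics SSL loss (placeholder sizes)."""

    def __init__(self, obs_dim: int = 64, latent_dim: int = 128, n_actions: int = 18):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Forward model: predict the next latent from (latent, action).
        self.forward_model = nn.Sequential(nn.Linear(latent_dim + n_actions, 256),
                                           nn.ReLU(), nn.Linear(256, latent_dim))
        # Inverse model: predict the action from (latent, next latent).
        self.inverse_model = nn.Sequential(nn.Linear(2 * latent_dim, 256),
                                           nn.ReLU(), nn.Linear(256, n_actions))
        self.n_actions = n_actions

    def forward(self, obs, action, next_obs, w_fwd: float = 1.0, w_inv: float = 1.0):
        z, z_next = self.encoder(obs), self.encoder(next_obs)
        a_onehot = F.one_hot(action, self.n_actions).float()

        # Forward-dynamics loss: negative cosine similarity to the (detached) next latent.
        z_pred = self.forward_model(torch.cat([z, a_onehot], dim=-1))
        fwd_loss = -F.cosine_similarity(z_pred, z_next.detach(), dim=-1).mean()

        # Inverse-dynamics loss: classify which action connected the two latents.
        logits = self.inverse_model(torch.cat([z, z_next], dim=-1))
        inv_loss = F.cross_entropy(logits, action)

        return w_fwd * fwd_loss + w_inv * inv_loss

# Example with random data: a batch of 32 transitions, no rewards needed.
model = MultiObjectiveSSL()
loss = model(torch.randn(32, 64), torch.randint(0, 18, (32,)), torch.randn(32, 64))
```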
@max_a_schwarzer
Max Schwarzer
4 years
There's plenty more in the paper, including: * How to finetune pretrained representations in RL (there's more to it than you'd think). * How multiple objectives stabilize each other. Also, please check out the code and feel free to get in touch with questions!
1
0
4
@max_a_schwarzer
Max Schwarzer
2 years
@pcastr @TalkRLPodcast Yes, exactly, thanks. I don't have evaluation data at this level of granularity for RR=8 BBF, but my guess from looking at train returns is that it should achieve eval IQM 0.6 by about 40,000 steps.
0
0
1
@max_a_schwarzer
Max Schwarzer
3 years
Ukraine posed no threat to Russia, and Taiwan poses no threat to China. But Ukraine has already been invaded, and China threatens to invade Taiwan on a daily basis. We can only prevent these tragedies by making it clear that dictators will pay a high price for their actions.
1
0
4
@max_a_schwarzer
Max Schwarzer
2 years
@DamienTeney Kinda, yeah -- we basically jumped out of one basin (params good for classic sample-efficient RL with tiny networks) and into another one (params that work with big networks). At the time we didn't know the second basin existed, so the experimentation itself was a leap of faith.
1
0
3
@max_a_schwarzer
Max Schwarzer
5 months
@AISafetyMemes I'm waiting for blue to edit my original tweet but, to clarify, the AI did not break out of its VM -- it tried to debug why it couldn't connect to the container, and found it could access the docker API, then created a new/easier version of the challenge, all inside the VM.
Tweet media one
0
0
3
@max_a_schwarzer
Max Schwarzer
2 years
@DamienTeney Metric chasing absolutely, but it was unusual to see people just doing unabashed tuning papers. Which is interesting in hindsight, since BBF proves that this was leaving colossal improvements on the table, ~comparable to all the other progress made on Atari 100k since 2020.
0
0
3
@max_a_schwarzer
Max Schwarzer
4 years
SGI pretraining unlocks the potential of larger networks, which struggle to learn from random initializations. We see a promising connection to work on scaling laws, as the optimal network size appears to increase with the amount of pretraining data used.
Tweet media one
1
0
3
@max_a_schwarzer
Max Schwarzer
5 months
@BorisMPower @randykK9 we gave sand a soul.
0
0
3
@max_a_schwarzer
Max Schwarzer
4 years
SGI strongly outperforms the prior contrastive pretraining method ATC in finetuning on data-efficient Atari, and beats behavioral cloning, even with data from a decent policy. With a larger CNN, simply pretraining representations with SGI lets us approach human data efficiency.
Tweet media one
1
0
3
@max_a_schwarzer
Max Schwarzer
2 years
@andregraubner It actually is kinda comparable! The "human scores" we compare against come from the original DQN papers, when people were given ~2 hours to practice each game. From talking to folks who remember it, the procedure may not have been overly rigorous, but it's the right ballpark.
2
0
3
@max_a_schwarzer
Max Schwarzer
2 years
@XaiaX @revhowardarson @vanderhoofy IMO part of the problem is that good LLMs do totally look like they're learning on the fly, thanks to in-context learning. The difference between that and real "learning by updating weights" is fairly subtle for non-ML people.
0
0
3
@max_a_schwarzer
Max Schwarzer
4 years
We find that pretraining data quality matters for SGI, but only so much. SGI performs about the same with data from agents with median score 0.03 (red) and 0.6 (grey), suggesting that the key is diversity rather than performance on the downstream task.
Tweet media one
1
0
3
@max_a_schwarzer
Max Schwarzer
4 years
SGI's objectives perform far better together than individually, and we believe that multi-objective representation learning is a promising alternative to monolithic contrastive learning tasks for RL.
Tweet media one
1
0
2
@max_a_schwarzer
Max Schwarzer
3 years
Additional point: we need to take a second look at military spending and non-proliferation. Putin probably wouldn't invade if European armies were as strong as the US's, and why shouldn't Japan, SK, Taiwan and Poland have nukes, given that all the insane dictators already do?
@max_a_schwarzer
Max Schwarzer
3 years
As we watch Putin destroy Ukraine's democracy, we in the Western left must abandon isolationism once and for all. The thieves, thugs and fascists ruling Russia and China do not respect our ideals and have no interest in letting us improve our own societies in peace.
1
1
2
@max_a_schwarzer
Max Schwarzer
3 years
@arankomatsuzaki @agarwl_ @pabbeel @Tsinghua_Uni @Berkeley_EECS I think this is because they've converged to ~solving these tasks (max score in DMC is 1k). Over in Atari solving tasks outright often doesn't happen, so variances routinely get larger as your average performance improves.
0
0
2
@max_a_schwarzer
Max Schwarzer
2 years
@sdpkjc_adam @vwxyzjn @cleanrl_lib Very nice work, thanks for doing this!
0
0
2
@max_a_schwarzer
Max Schwarzer
2 years
@andregraubner Ah, got it. Figure 13's a weakened version of BBF (RR=2) usable on academic compute. For Figure 12, it's the full suite, which is much harder than the standard set. You're right about being below human speed there -- it was surprising to us how much harder those games are for DQNs!
0
0
2
@max_a_schwarzer
Max Schwarzer
3 years
It's probably too late to stop Putin from ending democracy in Ukraine and installing another kleptocratic dictator there. But it's not too late for Taiwan. Or Moldova. Or Georgia. Or Finland. Or any of the countries along the South China Sea. We cannot remain passive.
0
0
2
@max_a_schwarzer
Max Schwarzer
4 years
Pretraining SGI on data from an off-the-shelf exploration method is competitive with recent unsupervised exploratory pretraining methods like APT and VISR. However, SGI uses far less data, doesn't need to interact with the environment, and can even learn from random data!
Tweet media one
1
0
2
@max_a_schwarzer
Max Schwarzer
2 years
@Abel_TorresM Not sure, we haven't run BBF in any of those domains yet. Looking forward to seeing someone give it a try, though!
0
0
1
@max_a_schwarzer
Max Schwarzer
2 years
@TalkRLPodcast Sent you a DM!
0
0
2
@max_a_schwarzer
Max Schwarzer
2 years
@pfau @bootstrap_yang how optimistic are you that LCOE for fusion will eventually get below solar/wind + storage? Isn't there some risk that those get so cheap that fusion gets locked out?
1
0
1
@max_a_schwarzer
Max Schwarzer
2 years
@revhowardarson @XaiaX @vanderhoofy right, the fact that the LLM won't remember what it's "learned" an hour later probably takes a while to discover.
0
0
1
@max_a_schwarzer
Max Schwarzer
4 years
@arankomatsuzaki @Mila_Quebec @MSFTResearch @CIFAR_News See here for our thread on SGI:
@max_a_schwarzer
Max Schwarzer
4 years
Deep RL agents usually start from tabula rasa, and struggle to match the data efficiency of humans who rely on strong priors. Can we even the playing field by starting agents off with strong representations of their environments? We certainly think so:
Tweet media one
0
0
1
@max_a_schwarzer
Max Schwarzer
3 years
@walkingrandomly Yeah, that's reasonable. I think we're basically there on Atari (with Agent57), but that took an absurd amount of training data that makes the comparison pretty unfair to humans who have one lifetime to learn. Getting all the way there faster than humans can is the next step.
0
0
1
@max_a_schwarzer
Max Schwarzer
4 years
@devon_hjelm @nitarshan @mnoukhov @ankesh_anand @lcharlin @philip_bachman @AaronCourville That's true, Nitarshan and Ankesh are the real culprits there.
0
0
1
@max_a_schwarzer
Max Schwarzer
3 years
@walkingrandomly Also, in case it wasn't clear (probably should proofread midnight tweets more 😅), the two papers linked aren't mine, they're just applying my method.
1
0
1
@max_a_schwarzer
Max Schwarzer
3 years
@arankomatsuzaki @schulzb589 That's correct -- it only includes about half the games.
0
1
1
@max_a_schwarzer
Max Schwarzer
3 years
@walkingrandomly Human baseline is defined as (agent_score - random_score)/(human_score - random_score), where as far as I know "human_score" is just the score of some people DeepMind recruited, measured after two hours of playing the game (see the sketch below).
1
0
1
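A tiny sketch of the normalization formula from the reply above; the numbers in the example call are hypothetical, purely for illustration.

```python
def human_normalized_score(agent_score: float, random_score: float, human_score: float) -> float:
    """(agent - random) / (human - random): 0 is random play, 1 is the human reference score."""
    return (agent_score - random_score) / (human_score - random_score)

# Hypothetical per-game scores:
print(human_normalized_score(agent_score=3200, random_score=150, human_score=7000))
```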
@max_a_schwarzer
Max Schwarzer
2 years
@seanaldmcdnld @arankomatsuzaki This and IRIS both learn fully online, no offline data involved. They're not "decision transformers" per se, though, if that's what you mean; they are doing RL.
0
0
1