Barret Zoph Profile
Barret Zoph

@barret_zoph

Followers
17,645
Following
924
Media
43
Statuses
254

VP Research (Post-Training) @openai Past: Research Scientist at Google Brain.

San Francisco, CA
Joined November 2016
Pinned Tweet
@barret_zoph
Barret Zoph
7 days
I posted this note to OpenAI. Hey everybody, I have decided to leave OpenAI. This was a very difficult decision as I have had such an incredible time at OpenAI. I got to join right before ChatGPT and helped build the post-training team from scratch with John Schulman and
170
188
4K
@barret_zoph
Barret Zoph
2 years
After 6 years at Google Brain I am excited to announce that I joined OpenAI! Very grateful for all the amazing collaborators and friends I have made at Google over the years. Could not be more excited to continue to help push AI progress and for the new adventures ahead
59
43
2K
@barret_zoph
Barret Zoph
1 year
Our team at OpenAI is hiring! We're looking for engineers/researchers who do rigorous and thoughtful work understanding and evaluating LLMs like ChatGPT. If you're interested, please apply online and DM me with work that you've done!
44
106
749
@barret_zoph
Barret Zoph
4 years
Introducing Switch Transformer, a simplified sparse architecture for scaling to trillion parameter language models. Switch Transformers yield 4-7x speedups over strong Transformer T5 models w/ the same computational resources. Paper:
Tweet media one
3
141
672
@barret_zoph
Barret Zoph
5 years
*New paper* RandAugment: a new data augmentation method. Better & simpler than AutoAugment. Main idea is to select transformations at random, and tune their magnitude. It achieves 85.0% top-1 on ImageNet. Paper: Code:
Tweet media one
3
148
584
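A minimal sketch of the idea described in the RandAugment tweet above: pick N transformations uniformly at random and apply each at a shared magnitude M. The small set of PIL operations and the magnitude scaling below are illustrative assumptions of mine, not the paper's exact transform list.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# Illustrative transform set; the paper uses ~14 ops with carefully chosen ranges.
def _rotate(img, m):   return img.rotate(m * 3)                     # up to ~30 degrees
def _solarize(img, m): return ImageOps.solarize(img, 256 - m * 25)  # lower threshold = stronger
def _contrast(img, m): return ImageEnhance.Contrast(img).enhance(1 + m / 10)
def _equalize(img, m): return ImageOps.equalize(img)                # magnitude ignored

TRANSFORMS = [_rotate, _solarize, _contrast, _equalize]

def rand_augment(img: Image.Image, n: int = 2, m: int = 9) -> Image.Image:
    """Apply n transforms chosen uniformly at random, each at magnitude m (0-10 scale)."""
    for op in random.choices(TRANSFORMS, k=n):   # sampling with replacement
        img = op(img, m)
    return img
```

The appeal is that n and m are the only two hyperparameters, so they can be tuned with a simple grid search instead of a learned augmentation policy.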
@barret_zoph
Barret Zoph
4 years
Can simply copying and pasting objects from one image to another be used to create more data to improve state-of-the-art instance segmentation? Yes! With Copy&Paste, we achieve 57.3 box AP and 49.1 mask AP on COCO. This is SoTA wrt @paperswithcode
Tweet media one
9
93
495
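A rough sketch of the Copy-Paste idea from the tweet above: composite the pixels of selected object instances from a source image onto a target image using the source's instance masks. This simplified function is my illustration only; the full method also applies large scale jittering and updates the target's boxes and masks for occlusions.

```python
import numpy as np

def copy_paste(target_img: np.ndarray, source_img: np.ndarray,
               source_masks: np.ndarray) -> np.ndarray:
    """Paste source objects onto the target image.

    target_img, source_img: (H, W, 3) uint8 arrays of the same size.
    source_masks: (N, H, W) boolean instance masks for the objects to paste.
    """
    pasted = source_masks.any(axis=0)       # union of the selected instances
    out = target_img.copy()
    out[pasted] = source_img[pasted]        # overwrite target pixels with object pixels
    return out
```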
@barret_zoph
Barret Zoph
4 years
Revisiting ResNets: Improved Training and Scaling Strategies. Our recent work applies modern training and scaling techniques to the 2015 ResNet. We find ResNets outperform some recent state-of-the-art architectures. ResNets are remarkably durable!
Tweet media one
5
66
358
@barret_zoph
Barret Zoph
3 years
How do we combine knowledge from multiple labeled and unlabeled datasets to train a great general model? Multi-Task Self-Training (MuST) trains specialized teachers on labeled data, which then label unlabeled data to train a single general model.
Tweet media one
5
88
352
@barret_zoph
Barret Zoph
11 months
What an incredible company OpenAI is to work at. I have never seen so many people so committed to the mission of the company and band together when things go wrong. Huge props to the leadership team for navigating these incredibly difficult times.
14
7
317
@barret_zoph
Barret Zoph
2 years
What a fun first few months at OpenAI it's been :)
@sama
Sam Altman
2 years
ChatGPT launched on Wednesday. Today it crossed 1 million users!
1K
3K
49K
4
4
283
@barret_zoph
Barret Zoph
2 years
Want to learn more about how sparse expert models (e.g. MoEs, Switch Transformers, Hash Layers) work and their recent research advancements? Check out our recent review paper
Tweet media one
3
60
262
@barret_zoph
Barret Zoph
3 years
Really enjoyed the InstructGPT paper. Impressed by the results: 100x smaller models w/ the same quality by updating models on the data distribution you care about. Data is often overlooked & such a powerful tool -- smaller models for the same quality, which saves a lot at inference
5
20
188
@barret_zoph
Barret Zoph
2 years
Lots of great work coming out on LLMs generating + understanding code (Codex, Scratch Pad, MBPP/MathQA, etc...). The AlphaCode paper by DeepMind is quite impressive --- ranking in the ~50th percentile in competitive programming competitions w/ 5000+ participants. A 🧵 below:
Tweet media one
2
32
179
@barret_zoph
Barret Zoph
3 years
Interested in using sparse expert models, but find they are unstable, hard to design or don’t fine-tune well? We address these key issues and train a 269B param MoE model (w/ FLOPs of a 32B dense model) that improves SOTA on NLP benchmarks like SuperGLUE.
Tweet media one
6
33
162
@barret_zoph
Barret Zoph
8 days
Super excited this is rolling out! Real time speech to speech will be a powerful feature -- I am very bullish on multi-modal being a core component of AI products. This was a great collaboration with post-training (h/t to @kirillov_a_n & @shuchaobi + team on post-training) and
@OpenAI
OpenAI
8 days
Advanced Voice is rolling out to all Plus and Team users in the ChatGPT app over the course of the week. While you’ve been patiently waiting, we’ve added Custom Instructions, Memory, five new voices, and improved accents. It can also say “Sorry I’m late” in over 50 languages.
988
2K
12K
7
5
155
@barret_zoph
Barret Zoph
3 years
Our new sparse model (SS-MoE) achieved SOTA on SuperGLUE ()! Excited to see sparsity pushing state-of-the-art! This new work builds heavily on our prior work on Switch Transformer: Paper and more details to come soon!
3
17
116
@barret_zoph
Barret Zoph
11 months
❤️
@ilyasut
Ilya Sutskever
11 months
I deeply regret my participation in the board's actions. I never intended to harm OpenAI. I love everything we've built together and I will do everything I can to reunite the company.
7K
4K
33K
1
3
109
@barret_zoph
Barret Zoph
4 years
Models and checkpoints are now open sourced for my recent work: "Rethinking Pre-training and Self-training". Paper link: Code Link: . On COCO we achieve 54.3 AP and on Pascal Segmentation 90.5 mIOU!
1
23
113
@barret_zoph
Barret Zoph
2 years
Intersecting cutting edge AI research w/ products is an incredibly exciting area to work on. Products are the ultimate test set :)
3
6
83
@barret_zoph
Barret Zoph
4 years
Great video summary of some of my recent work! Thanks @ykilcher !
@ykilcher
Yannic Kilcher 🇸🇨
4 years
A bit late to the party, but 💃NEW VIDEO🕺 on Switch Transformers by @GoogleAI . Hard Routing, selective dropout, mixed precision & more to achieve a 🔥ONE TRILLION parameters🔥 language model. Watch to learn how it's done🧙💪 @LiamFedus @barret_zoph
Tweet media one
6
35
204
0
6
83
@barret_zoph
Barret Zoph
4 years
Super interesting work! Excited to see the future of attention models in computer vision.
@giffmana
Lucas Beyer (bl16)
4 years
If you haven't read our latest ImageNet SOTA work "Vision Transformers (ViT)" yet, shame on you. But! There's hope! Here's the corresponding blogpost which is a nice tl;dr:
Tweet media one
7
69
327
1
7
67
@barret_zoph
Barret Zoph
1 year
We are looking for people to understand, improve and combine a variety of evaluation signals (e.g. automated and human), build eval infra (e.g. visualizations, testing) and do ML research on better eval methods.
2
1
64
@barret_zoph
Barret Zoph
8 months
Pleasure working with you -- learned quite a lot! Excited for what you do next.
@karpathy
Andrej Karpathy
8 months
Hi everyone yes, I left OpenAI yesterday. First of all nothing "happened" and it’s not a result of any particular event, issue or drama (but please keep the conspiracy theories coming as they are highly entertaining :)). Actually, being at OpenAI over the last ~year has been
2K
1K
23K
1
0
59
@barret_zoph
Barret Zoph
2 years
@jacobandreas @jacobaustin132 @_jasonwei Yes I have also found this for math. If you append "I am a math tutor" it starts to answer with higher accuracy.
1
1
59
@barret_zoph
Barret Zoph
2 years
Yes --- I think spending more time thinking about what to work on vs actually working on the thing is hugely important
@_jasonwei
Jason Wei
2 years
The best meta-advice I've gotten is from @barret_zoph . It took me a year to begin to understand it. It went something like: Notice that many researchers work hard. Yet some are far more successful. This means the project you choose defines the upper-bound for your success.
6
22
317
2
2
55
@barret_zoph
Barret Zoph
5 years
Slides and video of my talk at the Neural Architects workshop at ICCV this year!
0
17
49
@barret_zoph
Barret Zoph
1 year
Research engineer role: Research scientist role:
1
6
46
@barret_zoph
Barret Zoph
2 years
Exciting to see sparse MoE models being 10x more calibrated than their dense LM counterparts. Better model calibration is a key research direction to better understand what models do vs don't know
@jaschasd
Jascha Sohl-Dickstein
2 years
Overall, sparse models perform as well as dense models which use ~2x more inference cost, but they are as well calibrated as dense models using ~10x more inference compute.
Tweet media one
1
2
74
2
3
41
@barret_zoph
Barret Zoph
5 years
My talk at the 2019 ICCV Neural Architects workshop is available online!
1
11
40
@barret_zoph
Barret Zoph
4 years
Nice work from @IrwanBello on his paper “LambdaNetworks: Modeling Long-Range Interactions without Attention” An interesting scalable alternative to self-attention with strong empirical results in computer vision! Link:
Tweet media one
1
4
36
@barret_zoph
Barret Zoph
4 years
Code + checkpoints for the ResNet-RS paper are available!
@IrwanBello
Irwan Bello
4 years
Training code and checkpoints here!
1
27
135
0
3
36
@barret_zoph
Barret Zoph
2 years
Yes +1. I remember studying parts of the Feynman lectures, which showed me how much clearer my thought process could be. When reading his description of simple algebra and complex numbers I thought "wow I really am not thinking clearly enough":
@karpathy
Andrej Karpathy
2 years
Looking back, my most valuable college classes were physics, but for general problem solving intuitions alone: - modeling systems with increasingly more complex terms - extrapolating variables to check behaviors at limits - pursuit of the simplest most powerful solutions ...
51
143
2K
3
2
32
@barret_zoph
Barret Zoph
3 years
Great blogpost on our recent ResNet-RS work!
@amaarora
Aman Arora
3 years
Super excited to present my latest blog post on ResNet-RS - "Revisiting ResNets: Improved Training and Scaling Strategies". I also share code implementation in PyTorch using TIMM & more! 1/3
5
57
246
0
5
32
@barret_zoph
Barret Zoph
1 year
Come work w/ @hwchung27 and @_jasonwei on this!
3
0
29
@barret_zoph
Barret Zoph
3 years
I really like the "tcolorbox" package in LaTeX for research papers. It is a great feature for having nice looking summaries for sections or putting theorems. I enjoyed using it throughout my most recent work!
Tweet media one
0
1
29
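For readers who haven't used the package, a minimal example of the kind of boxed summary the tweet describes; the colors, title, and placeholder text are arbitrary choices of mine, not taken from any paper.

```latex
\documentclass{article}
\usepackage[most]{tcolorbox}  % boxed environments for section summaries, theorems, etc.

\begin{document}
\begin{tcolorbox}[title=Section summary, colback=blue!5, colframe=blue!50!black]
A one-sentence takeaway for the section goes here.
\end{tcolorbox}
\end{document}
```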
@barret_zoph
Barret Zoph
2 years
AI progress has continually exceeded my expectations since I first started working in the space in 2015. The saying that people overestimate what they can do in a short amount of time and underestimate what can be achieved in longer periods of time definitely resonates w/ me
@Inoryy
Roman Ring
2 years
10 yrs ago @karpathy wrote a blog post on the outlook of AI: in which he describes how difficult it would be for an AI to understand a given photo, concluding "we are very, very far and this depresses me." Today, our Flamingo steps up to the challenge.
Tweet media one
90
527
4K
1
2
27
@barret_zoph
Barret Zoph
2 years
Very excited to be able to release these sparse checkpoints to the research community!
@LiamFedus
William Fedus
2 years
Today we're releasing all Switch Transformer models in T5X/JAX, including the 1.6T param Switch-C and the 395B param Switch-XXL models. Pleased to have these open-sourced! All thanks to the efforts of James Lee-Thorp, @ada_rob , and @hwchung27
19
207
1K
2
1
26
@barret_zoph
Barret Zoph
2 years
It was a pleasure to be part of this effort! Very bullish on the impact this will have for the future of LLMs. Also very impressed with the leadership for this project --- coordinating all of this to happen is nothing short of incredible!
@jaschasd
Jascha Sohl-Dickstein
2 years
After 2 years of work by 442 contributors across 132 institutions, I am thrilled to announce that the paper is now live: . BIG-bench consists of 204 diverse tasks to measure and extrapolate the capabilities of large language models.
Tweet media one
35
571
3K
1
3
25
@barret_zoph
Barret Zoph
5 years
This is a great description of RandAugment! Thanks so much.
@CShorten30
Connor Shorten
5 years
This video explains the new RandAugment AutoML Data Augmentation algorithm from @GoogleAI , improving on previous techniques (AutoAugment/PBA) on ImageNet and dramatically reducing the search space, making AutoML for Data Aug much easier! #100DaysOfMLCode
0
27
113
1
5
23
@barret_zoph
Barret Zoph
2 years
Enjoyed The Pile dataset paper -- very thorough! Data is often overlooked and given the amount of money/time that goes into training these language models, this aspect should be taken seriously.
Tweet media one
2
1
21
@barret_zoph
Barret Zoph
4 years
Switch Transformers introduce sparsity by sending different tokens to different weights. We simplify MoE models by routing to the top expert only, which saves computation + communication costs. We also introduce training techniques for training huge models in lower precision!
Tweet media one
1
2
20
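A simplified sketch of the top-1 routing described in the tweet above: a learned router scores each token against the experts, and every token is processed only by its single highest-probability expert, with the output scaled by that probability. The shapes, the NumPy formulation, and the omission of load-balancing and capacity logic are simplifications on my part, not the released implementation.

```python
import numpy as np

def switch_route(tokens: np.ndarray, router_w: np.ndarray, experts) -> np.ndarray:
    """tokens: (n, d); router_w: (d, num_experts); experts: list of callables (k, d) -> (k, d)."""
    logits = tokens @ router_w                               # (n, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                    # softmax over experts
    choice = probs.argmax(-1)                                # top-1 expert per token
    out = np.zeros_like(tokens)
    for e, expert in enumerate(experts):
        sel = choice == e
        if sel.any():                                        # each expert sees only its own tokens
            out[sel] = probs[sel, e:e + 1] * expert(tokens[sel])
    return out
```

In the full model this replaces the feed-forward block of a Transformer layer, and an auxiliary load-balancing loss keeps tokens spread evenly across experts.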
@barret_zoph
Barret Zoph
2 years
Really fun chatting! Thanks for having us on.
@ykilcher
Yannic Kilcher 🇸🇨
2 years
New interview with Barret Zoph ( @barret_zoph ) and William Fedus ( @LiamFedus ) of Google Brain on Sparse Expert Models. We talk about Switch Transformers, GLAM, information routing, distributed systems, and how to scale to TRILLIONS of parameters. Watch now:
Tweet media one
1
15
95
1
1
17
@barret_zoph
Barret Zoph
3 years
In prior work, we showed generating labels from a teacher model can be more flexible than pre-training. MuST is a natural extension where now we generate labels from multiple different teachers on various tasks to learn a general pre-trained model.
1
0
17
@barret_zoph
Barret Zoph
3 years
Nice paper showing the power of simple scaling and training methods for video recognition! Follows the line of "RS" research I have done with some of these collaborators for Image Classification () and Object Detection ().
@IrwanBello
Irwan Bello
3 years
Wondering how simple 3D-ResNets perform on video recognition given all the recent architecture craze? In Revisiting 3D ResNets for Video Recognition, we study the impact of improved training and scaling methods on 3D ResNets.
Tweet media one
2
31
137
1
3
17
@barret_zoph
Barret Zoph
4 years
Switch Transformers are also found to be strong multi-task learners. On multilingual language modeling (mT5) we outperform T5 models across 101 languages w/ a 5x speedup
Tweet media one
0
0
15
@barret_zoph
Barret Zoph
4 years
Wow that is a very strong imagenet result! Cool to see further progress being made in semi-supervised methods for computer vision!
@quocleix
Quoc Le
4 years
Some nice improvement on ImageNet: 90% top-1 accuracy has been achieved :-) This result is possible by using Meta Pseudo Labels, a semi-supervised learning method, to train EfficientNet-L2. More details here:
Tweet media one
16
303
1K
1
0
15
@barret_zoph
Barret Zoph
2 years
To find these interesting prompts, should we be looking at the pre-training data? Is "step by step" mentioned the most frequently in documents when an explanation comes next? Automatic prompt discovery from inspecting the pre-training data feels promising
@_jasonwei
Jason Wei
2 years
Big language models can generate their own chain of thought, even without few-shot exemplars. Just add "Let's think step by step". Look me in the eye and tell me you don't like big language models.
Tweet media one
15
61
386
2
1
15
@barret_zoph
Barret Zoph
4 years
Thanks for the nice article on our recent work!
@A_K_Nain
Aakash Kumar Nain
4 years
As promised, here is my new blogpost explaining the latest research from Google Research and Brain team. I liked this paper a lot because instead of building models with billions of params, it focuses on fundamental aspects.
4
67
322
1
0
14
@barret_zoph
Barret Zoph
4 years
We find we can distill some of the performance improvements from our sparse Switch Transformers into dense variants (w/ the same FLOPs per token)
Tweet media one
2
0
14
@barret_zoph
Barret Zoph
3 years
I would be surprised if a modeling improvement could yield a 10x smaller model for a fixed quality. For data this is not the case and often the opposite feeling --- surprising if you couldn't reduce model size by 10x
0
1
14
@barret_zoph
Barret Zoph
3 years
Excited to be giving it! Thanks for the invite.
@KuisAICenter
KUIS AI
3 years
📢 Next Wed at 5 pm, we’ll have ( @barret_zoph ) from Google Brain who will talk about the use of sparsity for large Transformer models: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" zoom info: ai-info@ku.edu.tr or just DM!
Tweet media one
0
5
18
0
2
13
@barret_zoph
Barret Zoph
2 years
Sparse expert models are becoming increasingly relevant as they are now being used across many domains (NLP, speech, vision, multi-modality) w/ very strong results. Right now sparse expert models hold SOTA on various benchmarks (e.g. ST-MoE on SuperGLUE, ANLI, ARC, etc…)
1
0
13
@barret_zoph
Barret Zoph
2 years
Thanks @jeremiecharris for having me on your podcast! Super fun chatting about mixture-of-expert models and how they fit into the current large language model landscape. Podcast:
1
3
12
@barret_zoph
Barret Zoph
3 years
Very useful LaTeX trick!
@omarsar0
elvis
3 years
Nice and beautiful examples of how to produce annotated equations using LaTeX. 🤯
Tweet media one
14
258
1K
0
3
12
@barret_zoph
Barret Zoph
4 years
We highlight the importance of disentangling the training methods and architectural components when making comparisons across architectures
Tweet media one
2
0
12
@barret_zoph
Barret Zoph
4 years
Fantastic video on some of our recent work! Really great job @CShorten30 .
@CShorten30
Connor Shorten
4 years
"Rethinking Pre-training and Self-Training" from researchers @GoogleAI shows we get better results from self-training than either supervised or self-supervised pre-training. Demonstrated on Object Detection and Semantic Segmentation! #100DaysOfMLCode
Tweet media one
1
57
264
0
0
12
@barret_zoph
Barret Zoph
4 years
The modern training techniques (data augmentation, label smoothing, etc…) lead to strong representations that rival sota self-supervised learning methods (e.g. SimCLR) on a bunch of vision tasks
Tweet media one
2
1
12
@barret_zoph
Barret Zoph
3 years
Yes this is a very important principle to keep in mind --- even when doing a single research project. It's often hard to find the right experimentation scale such that the "smaller" scale ideas have a higher probability of working at a "larger scale".
@karpathy
Andrej Karpathy
3 years
Just making sure everyone read “The Bitter Lesson”, as it is one of the best compact pieces of insight into nature of progress in AI. Good habit to keep checking ideas on whether they pass the bitter lesson gut check
49
259
2K
0
0
12
@barret_zoph
Barret Zoph
4 years
How do Switch Transformers scale? Keeping the floating point operations per token fixed, increasing the number of sparse parameters by adding more experts significantly improves performance
Tweet media one
1
1
12
@barret_zoph
Barret Zoph
4 years
We study scaling strategies for vision models and observe that the best scaling strategy heavily depends on the training setup. When overfitting can occur (e.g. 350 epochs on ImageNet), scaling depth is best. In settings with larger datasets/fewer epochs, width scaling is preferred.
Tweet media one
1
0
11
@barret_zoph
Barret Zoph
4 years
Copy-Paste greatly improves data efficiency (even on top of a strong augmentation baseline of aggressive scale jittering!). Data efficiency is critical for instance segmentation as it's much more expensive compared to object detection and image classification
Tweet media one
1
0
11
@barret_zoph
Barret Zoph
3 years
Happy to see our work on ResNet-RS made it to NeurIPS!
@IrwanBello
Irwan Bello
3 years
To appear #NeurIPS2021 as a spotlight - congrats team
0
20
122
0
0
10
@barret_zoph
Barret Zoph
3 years
Example of MuST:
Step 1: Train three models: NYU Depth, COCO Detection, Pascal Segmentation
Step 2: Generate pseudo labels for depth estimation, detection and segmentation on all labeled / unlabeled images
Step 3: Train a new model on the combined human + pseudo labeled images
Tweet media one
1
0
9
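A high-level sketch of the three MuST steps listed in the tweet above. The train_fn and predict_fn callables stand in for whatever task-specific training and inference code is available; they, and the dictionary layout, are placeholders of mine, not the authors' code.

```python
from typing import Any, Callable, Dict, List

def multi_task_self_training(
    labeled_sets: Dict[str, Any],                  # e.g. {"depth": ..., "detection": ..., "segmentation": ...}
    images: List[Any],                             # all labeled + unlabeled images
    train_fn: Callable[[Any], Any],                # trains a model on a dataset
    predict_fn: Callable[[Any, List[Any]], Any],   # runs a model over a list of images
) -> Any:
    # Step 1: train one specialized teacher per labeled dataset.
    teachers = {task: train_fn(data) for task, data in labeled_sets.items()}
    # Step 2: every teacher pseudo-labels every image (labeled and unlabeled).
    pseudo_labels = {task: predict_fn(t, images) for task, t in teachers.items()}
    # Step 3: train a single general student on the combined human + pseudo labels.
    return train_fn({"images": images, "labels": pseudo_labels})
```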
@barret_zoph
Barret Zoph
4 years
LVIS dataset was created to make progress on long-tail visual recognition. We outperform the ECCV 2020 challenge winner on LVIS by +3.6 mask AP on rare objects (and our baseline by +6.1 AP)
1
0
10
@barret_zoph
Barret Zoph
1 year
@giffmana The T5 paper did something very similar right? Do the normal warmup, decay by 1/sqrt(step), then linearly decay over the last 10% of training.
1
0
10
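For reference, a sketch of the schedule that reply describes, as I understand the T5-style recipe: linear warmup, inverse-square-root decay, then a linear ramp to zero over the final 10% of training. The constants are illustrative, not the paper's exact values.

```python
def lr_schedule(step: int, total_steps: int,
                warmup_steps: int = 10_000, peak_lr: float = 1e-2) -> float:
    if step < warmup_steps:                                  # linear warmup
        return peak_lr * step / warmup_steps
    lr = peak_lr * (warmup_steps / step) ** 0.5              # 1/sqrt(step) decay
    final_phase = int(0.9 * total_steps)
    if step >= final_phase:                                  # linear decay over the last 10%
        lr *= max((total_steps - step) / (total_steps - final_phase), 0.0)
    return lr
```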
@barret_zoph
Barret Zoph
3 years
When using only ImageNet images, MuST significantly outperforms both supervised and self-supervised representations across many tasks.
Tweet media one
1
0
9
@barret_zoph
Barret Zoph
2 years
Impressive results w/ the continued scale of large LMs. On certain tasks there were large discontinuous performance improvements not predicted by scaling curves. Great leadership / coordination on this project to make it happen --- nice work team!
@GoogleAI
Google AI
2 years
Introducing the 540 billion parameter Pathways Language Model. Trained on two Cloud #TPU v4 pods, it achieves state-of-the-art performance on benchmarks and shows exciting capabilities like mathematical reasoning, code writing, and even explaining jokes.
76
1K
4K
0
0
9
@barret_zoph
Barret Zoph
2 years
Surprised the 41B model was only better than the 9B model once it could generate 1k+ samples. Wonder how results for different model sizes change as a function of the pre-training and fine-tuning dataset size
Tweet media one
1
1
9
@barret_zoph
Barret Zoph
2 years
Exciting to see more encoder-decoder models (e.g. T5, T0, Switch Transformer, ST-MoE). Liked the dual loss pre-training strategy: use MLM on the encoder and a simple autoregressive LM on the decoder
Tweet media one
1
0
9
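A hedged sketch of that dual-loss idea: sum a masked-LM objective computed from encoder-side predictions with an autoregressive next-token objective on the decoder. The function signature, the equal weighting, and the use of -100 as the ignore index are assumptions for illustration, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def dual_pretraining_loss(enc_logits: torch.Tensor, masked_targets: torch.Tensor,
                          dec_logits: torch.Tensor, next_token_targets: torch.Tensor,
                          lm_weight: float = 1.0) -> torch.Tensor:
    """Logits: (batch, seq, vocab); targets: (batch, seq), with -100 marking ignored positions."""
    mlm = F.cross_entropy(enc_logits.flatten(0, 1), masked_targets.flatten(),
                          ignore_index=-100)          # loss only on masked positions
    lm = F.cross_entropy(dec_logits.flatten(0, 1), next_token_targets.flatten(),
                         ignore_index=-100)           # standard next-token prediction
    return mlm + lm_weight * lm
```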
@barret_zoph
Barret Zoph
2 years
Super excited to see the co-evolution of game design with these types of models. Open world games that could automatically generate new environments based on what the player has enjoyed so far would be so cool --- I often felt games got stale due to a lack of new environments.
@gdb
Greg Brockman
2 years
DALL-E 2 applied to generating assets for game development:
11
39
215
0
0
9
@barret_zoph
Barret Zoph
2 years
Awesome startup w/ awesome founders! Excited to see the future of the AI x Legal space. (Disclosure: I invested)
1
0
9
@barret_zoph
Barret Zoph
3 years
Looking forward to giving this talk!
@ftm_guney
F. Güney
3 years
great talks lining up in September @KuisAICenter including @DeqingSun @jponttuset @barret_zoph , looking forward to all of them!
1
4
19
0
0
9
@barret_zoph
Barret Zoph
2 years
Surprising to see how performance scales smoothly when the model goes from generating 1 solution all the way up to 1M solutions
Tweet media one
1
0
9
@barret_zoph
Barret Zoph
4 years
Hope these revamped ResNets can serve as baselines for future architectural and training method comparisons!
0
0
9
@barret_zoph
Barret Zoph
2 years
Exciting research ahead to not require generating huge amounts of samples -- seems this should be possible. Many applications of LLMs require generating lots of samples and even using discriminator models to further filter generated outputs (e.g. Lamda, OpenAI Verifiers)
1
1
8
@barret_zoph
Barret Zoph
3 years
Nice summary of a lot of the great work done by Google Research in the past year.
@JeffDean
Jeff Dean (@🏡)
3 years
As in past years, I've spent part of the holiday break summarizing much of the work we've done in @GoogleResearch over the last year. On behalf of @Google 's research community, I'm delighted to share this writeup (this year grouped into five themes).
47
284
1K
0
0
8
@barret_zoph
Barret Zoph
3 years
We observed that adding more pseudo labels to each image leads to better representations! So don’t just use classification and depth estimation labels, include segmentation and others too.
Tweet media one
1
0
8
@barret_zoph
Barret Zoph
3 years
What if I already trained my checkpoint? No problem! You can simply continue training your checkpoint with MuST for a few iterations and observe improvements! Results combining MuST with an ALIGN checkpoint.
Tweet media one
2
0
8
@barret_zoph
Barret Zoph
2 years
Interesting how the validation loss isn't correlated with the solve rate. Other tasks like dialogue (e.g. Lamda) seem to correlate much better to human evals. Probably due to the one-to-many nature of coding tasks relative to dialogue, as the authors point out
Tweet media one
0
2
7
@barret_zoph
Barret Zoph
4 years
Nice summary of our recent work!
@AndLukyane
Andrey Lukyanenko
4 years
My review of the paper "Revisiting ResNets: Improved Training and Scaling Strategies". It seems that we have a new SOTA for CV tasks. Looking forward to a PyTorch version!
1
35
206
0
0
7
@barret_zoph
Barret Zoph
4 years
In a large scale semi-supervised learning setup we obtain 5.5x speedups over Noisy Student EfficientNets.
Tweet media one
1
0
7
@barret_zoph
Barret Zoph
2 years
This really hits home --- the amount of hand holding for experiments and models can be quite frustrating. You would think that this area would have seen more progress given these are the issues people training the models are having :)
@_arohan_
rohan anil
2 years
The AGI I want is one that realizes I made a dumb mistake with batch size which makes it OOM on a supercomputer and tries a smaller one for me - while I am sleeping so I don’t have to babysit the models and increases the throughput in experimentation!
7
4
97
0
1
7
@barret_zoph
Barret Zoph
2 years
Also seems the 41B model wasn't the "compute Pareto optimal" --- for a given TPU budget it's almost always better to use the 9B model
Tweet media one
2
1
7
@barret_zoph
Barret Zoph
2 years
Wouldn't be surprised if some of the most impactful papers in the language modeling space in the next few years come from pure dataset research
0
4
7
@barret_zoph
Barret Zoph
3 years
We studied MuST on a suite of different tasks and datasets. Training Datasets: Specialized teacher models trained on these datasets, which are used to produce pseudo labels. Evaluation Datasets: Datasets models are fine-tuned on.
Tweet media one
1
0
6
@barret_zoph
Barret Zoph
2 years
We dive into the tradeoffs of using sparse expert models versus standard dense models. We hope this review can help increase their adoption, as they are working quite well and lots of excellent research has been done on them!
0
0
6
@barret_zoph
Barret Zoph
4 years
We design a Pareto curve of 11 different ResNet models named ResNet-RS by scaling the image size along with different network depths. We obtain 1.7-2.7x speedups over EfficientNets on ImageNet.
Tweet media one
1
0
6
@barret_zoph
Barret Zoph
3 years
How do MuST representations compare to those trained with standard multi-task learning across datasets and tasks? MuST improves over multi-task training across all tasks!
Tweet media one
2
0
6
@barret_zoph
Barret Zoph
3 years
We study how the experts specialize on different tokens and find that they end up semantically specializing to different categories such as punctuation, verbs and proper names.
Tweet media one
1
0
5
@barret_zoph
Barret Zoph
3 years
For two models with the same FLOPs per token, we find sparse models to outperform their dense counterpart on a large suite of fine-tuning tasks.
Tweet media one
1
0
5
@barret_zoph
Barret Zoph
3 years
We study the fine-tuning of sparse vs dense models. The optimal batch sizes and learning rates for sparse vs dense models are very different. In certain scenarios the wrong values masked any of the pre-training performance improvements of sparse models over the dense models
Tweet media one
1
0
5
@barret_zoph
Barret Zoph
3 years
We finally combine our improvements and train a sparse model with 269B parameters (FLOP matched to a 32B dense model). This model achieves SOTA on a wide range of NLP tasks: SuperGLUE, XSum, CNN-DM, ANLI R3, ARC-Easy/Challenge, CB WebQA, CB NatQA.
Tweet media one
0
0
5
@barret_zoph
Barret Zoph
4 years
Nice architectural improvements from my collaborators at Google!
@GoogleAI
Google AI
4 years
Today we present SpineNet, a novel alternative to standard scale-decreased backbone models for visual recognition tasks, which uses reordered network blocks with cross-scale connections to better preserve spatial information. Learn more below:
3
195
529
0
0
5
@barret_zoph
Barret Zoph
2 years
Great thread describing some of the approaches for getting models to perform well on tasks we care about!
@ShayneRedford
Shayne Longpre
2 years
📢 A 🧵on the future of NLP model inputs. What are the options and where are we going? 🔭 1. Task-specific finetuning (FT) 2. Zero-shot prompting 3. Few-shot prompting 4. Chain of thought (CoT) 5. Parameter-efficient finetuning (PEFT) 6. Dialog [1/]
Tweet media one
Tweet media two
10
83
375
0
1
5
@barret_zoph
Barret Zoph
4 years
Such a fantastic internship opportunity that I really enjoyed many years back.
@jonathanmay
Jon May
4 years
ISI NLP internship applications now open! College through PhD students are welcome to apply for 12 week summer research internships at Marina del Rey (or online if COVID). Join us!
Tweet media one
2
30
89
0
1
4
@barret_zoph
Barret Zoph
2 years
Really incredible results. Enjoyed the "An astronaut lounging in a tropical resort in space in a vaporwave style"
Tweet media one
@OpenAI
OpenAI
2 years
Our newest system DALL·E 2 can create realistic images and art from a description in natural language. See it here:
546
3K
11K
0
1
4
@barret_zoph
Barret Zoph
3 years
We do a large-scale study of the quality-stability trade-offs of stability techniques. We observe that our router z-loss fixes stability while slightly improving quality. The router z-loss is an auxiliary loss that makes the logits of the router smaller for numerical stability
Tweet media one
1
0
4
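A minimal sketch of the router z-loss mentioned above: penalize the squared log-sum-exp of the router logits so they stay small and numerically well behaved. The coefficient and the NumPy formulation are my reading of the paper, not the released code.

```python
import numpy as np

def router_z_loss(router_logits: np.ndarray, coeff: float = 1e-3) -> float:
    """router_logits: (num_tokens, num_experts) pre-softmax router scores."""
    m = router_logits.max(axis=-1, keepdims=True)
    log_z = m.squeeze(-1) + np.log(np.exp(router_logits - m).sum(axis=-1))  # stable logsumexp
    return coeff * float(np.mean(log_z ** 2))   # small auxiliary penalty added to the main loss
```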