1/ There is huge headroom for improving the capabilities of our vision models, and given the lessons we've learned from LLMs, scaling is a promising bet. We are introducing ViT-22B, the largest vision backbone reported to date:
We're hiring student researchers who are passionate about large-scale vision models. Know someone who fits the bill, or are you interested yourself? Let me know, and I'll be happy to share more details!
[Retweets are greatly appreciated.]
1/ Excited to share "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution". NaViT breaks away from the CNN-designed input and modeling pipeline, sets a new course for ViTs, and opens up exciting possibilities in their development.
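For a concrete picture of the "Patch n' Pack" idea, here is a minimal sketch (the function names and the greedy packing policy are mine, not NaViT's actual implementation): images of different resolutions yield different numbers of patch tokens, several such variable-length sequences get packed into one fixed-length row, and per-token example ids let attention be restricted to tokens from the same image.

```python
import numpy as np

def pack_examples(token_lists, max_len, pad_id=0):
    """Greedily pack variable-length per-image token sequences into fixed-length
    rows, recording which example each token came from. Assumes every example
    has at most max_len tokens; tokens are ids here purely for illustration
    (in a ViT they would be patch embeddings)."""
    rows, example_ids = [], []
    row, ids = [], []
    for ex_id, toks in enumerate(token_lists):
        if len(row) + len(toks) > max_len:          # current row is full: flush it
            rows.append(row + [pad_id] * (max_len - len(row)))
            example_ids.append(ids + [-1] * (max_len - len(ids)))
            row, ids = [], []
        row.extend(toks)
        ids.extend([ex_id] * len(toks))
    rows.append(row + [pad_id] * (max_len - len(row)))
    example_ids.append(ids + [-1] * (max_len - len(ids)))
    return np.array(rows), np.array(example_ids)

# Three images tokenized into 5, 7 and 4 patch tokens respectively.
rows, ids = pack_examples([[1] * 5, [2] * 7, [3] * 4], max_len=12)
# Attention mask: a token may only attend to tokens from the same image.
mask = (ids[:, :, None] == ids[:, None, :]) & (ids[:, :, None] >= 0)
```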
@ysu_nlp
@emilymbender
Thanks for releasing this benchmark! Great effort!
Just for the record, I posted a link when the benchmark was released, and the great
@ankesh_anand
made it available for evaluating Gemini models literally within a few hours! Incredible display of agility for sure! :)
1. Benchmarks are fundamental to tracking progress in empirical machine learning. In our new paper, we study how benchmarking may affect the long-term research direction and pace of progress in ML, and put forward the notion of a "benchmark lottery":
We have released the JAX implementation of Universal Transformers () with adaptive halting in
#Scenic
(along with a Vision Transformer with token/example level halting mechanism):
Thanks to
@XueFz
for his amazing contribution.
We released the code for ViViT as well as the checkpoints of its different variants.
If you are interested in transformers for video understanding, check it out:
Code:
Paper:
"Whether to go with a decoder-only or encoder-decoder transformer?"
It turned out that this question about the model's architecture is not actually that important!
You just need the right objective function and simple prompting to switch modes during pretraining/finetuning.
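To make "right objective + simple prompting" concrete, here is a hedged toy sketch of a UL2-style mixture of denoisers; the mode tokens, corruption rates, and the single-span corruption below are illustrative placeholders, not the paper's exact configuration.

```python
import random

def span_corrupt(tokens, max_span, rate, sentinel="<x>"):
    """Toy single-span corruption: mask one contiguous span and ask the model
    to reconstruct it (the real objectives mask many spans)."""
    n = min(max_span, max(1, int(len(tokens) * rate)))
    start = random.randrange(0, len(tokens) - n + 1)
    inputs = tokens[:start] + [sentinel] + tokens[start + n:]
    targets = [sentinel] + tokens[start:start + n]
    return inputs, targets

# A mixture of denoisers, each tagged with a mode token (names and rates are
# illustrative). UL2 also has an [S] (sequential) denoiser, essentially a
# prefix-LM objective, which is omitted from this toy sketch.
DENOISERS = [
    ("[R]", dict(max_span=3, rate=0.15)),    # regular denoising
    ("[X]", dict(max_span=64, rate=0.50)),   # extreme denoising
]

def make_ul2_example(tokens):
    mode, cfg = random.choice(DENOISERS)
    inputs, targets = span_corrupt(tokens, **cfg)
    # The prepended mode token is what lets you "switch modes" later with a prompt.
    return [mode] + inputs, targets
```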
“How does such a simple objective in LLM training, next-token prediction, result in such remarkably intelligent behavior?”
This question is on everyone's mind, from everyday LLM users to expert researchers.
✨We've got a solid answer!✨
How is next-token prediction capable of such intelligent behavior? I’m very excited to share our work, where we study the fractal structure of language. TLDR: thinking of next-token prediction in language as “word statistics” is a big oversimplification!
This idea:
-is extremely simple,
-works surprisingly well,
-offers a different perspective on neural search and retrieval,
-opens a door to a whole world of new research questions,
-& takes a key step to enable e2e training of retrieval enhanced methods.
Excited to share our latest work at
@GoogleAI
on "Transformer Memory as a Differentiable Search Index"!
TL;DR? We parameterize a search system with only a single Transformer model 😎. Everything in the corpus is encoded in the model! 🙌
Paper:
We’re excited to announce 𝗚𝗲𝗺𝗶𝗻𝗶:
@Google
’s largest and most capable AI model.
Built to be natively multimodal, it can understand and operate across text, code, audio, image and video - and achieves state-of-the-art performance across many tasks. 🧵
It turned out that your ViT only needs to process 8 tokens!
TokenLearner is simple and efficient. Super helpful when dealing with a large number of tokens, like in video modeling.
The code is already available in Scenic:
While Vision Transformer models consistently obtain state-of-the-art results, they often require too many tokens for larger images and video. Read about TokenLearner, which adaptively generates fewer tokens but enables models to perform better, faster →
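A rough Flax sketch of the TokenLearner idea (simplified, with class and parameter names of my own choosing; the paper's module uses a small conv/MLP stack with gating): learn a handful of spatial attention maps and pool the full token set through them, so the rest of the network only processes `num_tokens` tokens.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class TokenLearnerSketch(nn.Module):
    """Simplified TokenLearner-style token reduction."""
    num_tokens: int = 8

    @nn.compact
    def __call__(self, x):                          # x: [B, N, D] input tokens
        weights = nn.Dense(self.num_tokens)(x)      # [B, N, S]: one map per output token
        weights = jax.nn.softmax(weights, axis=1)   # normalize over the N spatial positions
        return jnp.einsum('bns,bnd->bsd', weights, x)   # [B, S, D] pooled tokens
```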
With
@YiTayML
,
@anuragarnab
,
@giffmana
, and
@ashVaswani
, we wrote up a paper on "the efficiency misnomer":
TL;DR:
"No single cost indicator is sufficient for making an absolute conclusion when comparing the efficiency of different models".
What do you think are the primary limitations or design choices that feel unnatural when it comes to using Transformers for computer vision (images, videos, ...)?
Dual PatchNorm is one of those ideas that is simple (adding literally two lines of code), yet consistently effective. There is also an interesting backstory to it.
While we await GPT-4, which is expected to have a trillion parameters, here are 3072 parameters that can make your Vision Transformer better.
Paper:
Joint work w/
@neilhoulsby
@m__dehghani
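If I read it right, the "two lines of code" are a LayerNorm before and a LayerNorm after the patch embedding; below is a hedged Flax sketch (the module and its defaults are illustrative, not the paper's code). For 16×16 RGB patches and hidden size 768, those two LayerNorms carry 2 × (768 + 768) = 3072 parameters, which lines up with the number in the tweet.

```python
import flax.linen as nn

class DualPatchNormEmbed(nn.Module):
    """Patch embedding with the two extra LayerNorms (illustrative sketch)."""
    hidden_dim: int = 768
    patch_size: int = 16

    @nn.compact
    def __call__(self, images):                 # images: [B, H, W, C]
        p = self.patch_size
        b, h, w, c = images.shape
        # Flatten non-overlapping p x p patches into tokens: [B, N, p*p*C].
        x = images.reshape(b, h // p, p, w // p, p, c)
        x = x.transpose(0, 1, 3, 2, 4, 5).reshape(b, (h // p) * (w // p), p * p * c)
        x = nn.LayerNorm()(x)                   # extra LayerNorm before the projection
        x = nn.Dense(self.hidden_dim)(x)        # the usual linear patch embedding
        x = nn.LayerNorm()(x)                   # extra LayerNorm after the projection
        return x
```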
NaViT () sets us free from square boxes and lets us think outside the box! Let creativity flow and go for the natural designs we've always wanted in ViTs.
I share a few cool ideas that are made possible with NaViT:
What do you think are the primary limitations or design choices that feel unnatural when it comes to using Transformers for computer vision (images, videos, ...)?
Introducing UL2, a novel language pre-training paradigm that improves performance of language models across datasets and setups by using a mixture of training objectives, each with different configurations. Read more and grab model checkpoints at
Really excited to share that our next speaker at AI-in-the-Loft in Google Amsterdam will be Jonathan Ho. This coming Friday (July 1), Jonathan will talk about Imagen! RSVP if you want to learn about text-to-image diffusion models:
#ai_in_the_loft
What I mean when I say “registers”: additional learnable tokens (like the [CLS]), but these ones are not used at output. No additional info at input, not used at output: these tokens could seem useless!
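A minimal sketch of what that looks like in code, assuming the registers are a learnable [R, D] parameter created elsewhere (e.g., as a Flax param); the function names are mine, not from any released implementation:

```python
import jax.numpy as jnp

def add_registers(tokens, registers):
    """tokens: [B, N, D] patch tokens; registers: [R, D] learnable parameters.
    The encoder then runs on N + R tokens instead of N."""
    b = tokens.shape[0]
    reg = jnp.broadcast_to(registers, (b,) + registers.shape)
    return jnp.concatenate([tokens, reg], axis=1)

def drop_registers(encoded, num_registers):
    """Discard the register outputs after the encoder; only the patch-token
    (and [CLS]) outputs are used downstream."""
    return encoded[:, :-num_registers, :]
```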
@borisdayma
Always gunning for the highest possible LR! Not a fan of lowering LR to get a buttery-smooth loss curve. My stance: models trained on the edge of stability come out stronger. Admittedly, it's like walking on eggshells for massive models, but the payoff is totally worth it!
Learn about ViT-22B, the result of our latest work on scaling vision transformers to create the largest dense vision model. With improvements to both the stability and efficiency of training, ViT-22B advances the state of the art on many vision tasks →
Working on architecture builds a ton of intuition as well. With
@YiTayML
, we spent weeks exploring exotic ideas and their interaction with scale.
I think the next big leap will come from an architecture idea that supports a totally new mode of operation.
Our paper, "Learning to Learn from Weak Supervision by Full Supervision", with Sascha Rothe, Aliaksei Severyn, and
@jkamps
has been accepted at the NIPS 2017 Workshop on Meta-Learning.
#nips2017
#MetaLearn2017
We release pre-trained vision transformer models and code for inference/fine-tuning: . There is still a long way to go towards understanding transformers in vision, and I am looking forward to the future research. Hope this release will be a good starting point.
Introducing Gemini 1.5: our next-generation model with dramatically enhanced performance. It also achieves a breakthrough in long-context understanding.
The first release is 1.5 Pro, capable of processing up to 1 million tokens of information. 🧵
It seems the fear of not being up to the second with AI/ML news is making some of us anxious. I've noticed in several conversations that when someone shares something new, many respond immediately with "I already knew about this."
Sure, but this doesn't exactly foster a culture of learning.
"Flanning" a language model is in fact scaling it up in the "diversity of tasks" axis, which is an important dimension next to scaling model size (parameters), compute (train steps), and data size. We should Flan every LLM we ever trained or will train.
New paper + models!
We extend instruction finetuning by
1. scaling to 540B model
2. scaling to 1.8K finetuning tasks
3. finetuning on chain-of-thought (CoT) data
With these, our Flan-PaLM model achieves a new SoTA of 75.2% on MMLU.
Really excited that after almost two years, we are resuming the "AI in the Loft" events. For the next edition, Wednesday 11 May, we will have
@bneyshabur
as our speaker.
Please RSVP at
Yi leaving us left me feeling down, but the prospect of all the amazing things awaiting him in his next venture fills me with joy!
I will always root for you,
@YiTayML
!
What I like the most in this work is the "reuse" and the amount of compute saved by it. We take a halfway-trained PaLM 540B, uptrain it with a mixture-of-denoisers for ~0.1% of the compute spent so far, and get a U-PaLM that is as good as a fully trained PaLM.
Introducing U-PaLM 540B!
@GoogleAI
Training PaLM w UL2's mixture-of-denoisers with only 0.1% more compute unlocks:
- Much better scaling 📈
- Emergent abilities on BIGBench 😎
- Saving 2x compute (4.4 million TPU hours!) 🔥
- New prompting ability
link:
If you're looking into applying transformers to large inputs (e.g. long documents, images, videos, etc.), or if you are working on a new variant of efficient transformers, this should give you a nice overview of the existing work.
Inspired by the dizzying number of efficient Transformers ("x-formers") models that are coming out lately, we wrote a survey paper to organize all this information. Check it out at .
Joint work with
@m__dehghani
@dara_bahri
and
@metzlerd
.
@GoogleAI
😀😃
The release of "UL2-20B" and "Flan-T5" checkpoints was awesome, and now we're taking another step to pave the way for even faster progress in LLM research with the open-sourcing of the new **Flan-UL2-20B** model.
Exciting times!
New open source Flan-UL2 20B checkpoints :)
- Truly open source 😎 No forms! 🤭 Apache license 🔥
- Best OS model on MMLU/Big-Bench hard 🤩
- Better than Flan-T5 XXL & competitive to Flan-PaLM 62B.
- Size ceiling of Flan family just got higher!
Blog:
A great and satisfying part of this project/paper is the release of weights from the 170+ models we studied. In the paper, we share insights on scaling transformers and how changing different knobs impacts upstream and downstream performance vs. efficiency [...]
Excited to share that we have released 170+ pretrained transformer checkpoints of many different shapes & sizes as part of our
#ICLR2022
paper on "Scaling Transformers Efficiently" 😄.
Checkpoints:
Paper:
The most important things I learned from this work:
Make sure "scaling up" (even a little bit) is one of the experiments/ablations in "every" iteration when developing a new idea!
Scale is cruel to many smart ideas!
"Scaling laws vs Model Architectures" from
@GoogleAI
.
Lessons:
- Not all architectures scale the same way.
- Vanilla Transformer does pretty well 😀
- Touching the attention too much is "dangerous". 😔
- Perf at base may not translate to large+ scale.
pdf:
As a companion to our recent efficient Transformer survey, we designed "Long Range Arena" a new challenging benchmark to help understand and analyze trade-offs between recent efficient Transformer models. Check out our paper at .
@GoogleAI
@DeepMind
Introducing Veo: our most capable generative video model. 🎥
It can create high-quality, 1080p clips that can go beyond 60 seconds.
From photorealism to surrealism and animation, it can tackle a range of cinematic styles. 🧵
#GoogleIO
It’s been a short 6 months since I left Google Brain and it has been a uniquely challenging yet interesting experience to build everything from the ground up in an entirely new environment (e.g., the wilderness)
Today, we’re excited to announce the first version of the
Excited that our next "AI in the Loft" at Google Brain Amsterdam (Next Thursday, July 28) will be with
@ada_rob
! Adam will talk about "Encoder-Decoder MLMs FTW". A super cool talk on a warm day!
[Note that this is an in person event.]
Please RSVP at
We're expanding access to Bard in the US + UK, with more countries ahead. It's an early experiment that lets you collaborate with generative AI. Hope Bard sparks more creativity and curiosity, and that it will get better with feedback. Sign up:
Check out Universal Transformers, new research from the Google Brain team &
@DeepMindAI
that extends last year's Transformer (a neural network architecture based on a self-attention mechanism) to be computationally universal.
There are reasons that students pay less for registration, but "being easier to exclude" is not one of them, I believe!
Not cool
@WSDMSocial
!
#wsdm2018
(cc:
@sigir_students
)
1. With NaViT, we can have arbitrary tokenization "per input". Why square patches as tokens? Why not any arbitrary set of pixels based on what your input looks like (you can also get a bit of help from FlexiViT here)?
Something I appreciate about the term "foundation model" is how it draws a comparison to a building's foundation, which looks unimpressive to most people. In fact, the foundation itself can be quite ugly; its beauty lies in the potential to build incredible things upon it.
Universal Transformers propose to augment Transformers with Recurrence in depth and Adaptive Computation Time. This model outperforms Vanilla Transformers in MT / bAbI / LA / LTE.
Paper:
Code: Soon in
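A hedged NumPy sketch of the two ingredients, recurrence in depth plus ACT-style adaptive halting; `block` and `halting_unit` are hypothetical callables, and the halting scheme is simplified relative to the paper:

```python
import numpy as np

def ut_with_act(block, halting_unit, x, max_steps=8, eps=0.01):
    """x: [B, N, D]; block: ([B, N, D], step) -> [B, N, D] (the *same* shared
    block applied at every depth step); halting_unit: [B, N, D] -> [B, N]
    per-position halting probabilities. Each position keeps being refined until
    its accumulated halting probability reaches 1 - eps; the output is a
    halting-weighted mixture of the intermediate states."""
    cum_halt = np.zeros(x.shape[:2])                 # [B, N] accumulated halting prob
    output = np.zeros_like(x)
    for step in range(max_steps):
        running = cum_halt < 1.0 - eps               # positions still being refined
        p = halting_unit(x) * running                # this step's halting probability
        # Positions that would overshoot get exactly their remaining budget.
        p = np.where(cum_halt + p > 1.0 - eps, 1.0 - cum_halt, p)
        x = block(x, step)                           # shared-parameter block, reused every step
        output = output + p[..., None] * x           # accumulate weighted states
        cum_halt = cum_halt + p
        if (cum_halt >= 1.0 - eps).all():            # everyone has halted
            break
    return output
```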
I was hoping to find an excuse to share a little bit about how much fun working with a large group of people was for all of us.
The main ingredient is "awesome people".
** Find them. Work with them. It's endless joy. It's good for your skin! **
Super excited to host
@YiTayML
for our next "AI in the Loft" at Google Brain Amsterdam (Next Tuesday, May 31). Yi is going to talk about Universal Models in Language!
Please RSVP at
#ai_in_the_loft
It’s been slightly more than a year since the UL2 paper () was released.
Here’s a summary thread of some notable models/research papers that use the UL2 objective for training (aside from the original UL2/Flan-UL2 of course).
🧵 thread below
#1
-
We can now load UL2 20B checkpoints in
@huggingface
Transformers! (Yaaay!)
Thanks for being so awesome Hugging Face!
... and of course a big thank you to
@DanielHesslow
for making this happen!
Huge contribution to the research community by releasing the 20B parameter checkpoint by
@m__dehghani
and
@YiTayML
from
@GoogleAI
❤️
Also a big thank you to
@DanielHesslow
for contributing the model to Transformers🤗
Adaptivity (adaptive compute allocation) + Modularity = Better generalization for multi-step reasoning & improved efficiency, even in system-1 tasks like image classification.
Check out our new paper:
1/9
I'm going to present our paper, "Learning to Attend, Copy, and Generate for Session-Based Query Suggestion", from my internship
@googleresearch
. If you're attending
#cikm2017
, you can come to Session 9A (3:45-5:15 pm in Ocean4).
We've been stuck with fixed square resizing for too long and overlooked how unnatural it is. I'm confident that future models, particularly at scale, will break free from this setup, given how easily it can be done and the significant gains it brings!
Huge thanks to
@Google
and all the Googlers (many of them from Iran) who are working on ensuring safer access to information from Iran and anywhere in the world experiencing Internet censorship.
Internet outages are happening more frequently worldwide, including in parts of Iran this week. Across Google, teams are working to make our tools broadly available, following the newly updated US sanctions applicable to communications services. 1/5
Today we have published our updated Gemini 1.5 Model Technical Report. As
@JeffDean
highlights, we have made significant progress in Gemini 1.5 Pro across all key benchmarks; TL;DR: 1.5 Pro > 1.0 Ultra, 1.5 Flash (our fastest model) ~= 1.0 Ultra.
As a math undergrad, our drastic
We need neural networks that are smarter and more adaptive in terms of compute allocation (e.g., different amounts of computation for different inputs or different parts of an input). Check out CALM, which enables such an ability for decoding with Transformers.
Introducing our work
@GoogleAI
CALM: Confident Adaptive Language Modeling 🧘
Large Language Models don't need their full size for every generated token. We develop an Early Exit framework to significantly
#accelerate
decoding from
#Transformers
!
🔗:
🧵1/
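For intuition, here is a toy sketch of confidence-based early exit for one decoding step; CALM's actual exit criteria (and the training needed to make intermediate-layer predictions reliable) are richer than this plain softmax-max threshold, and the callables are placeholders:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decode_token_with_early_exit(layers, unembed, x, threshold=0.9):
    """layers: list of callables (transformer blocks), unembed: hidden state ->
    vocabulary logits; both are hypothetical. Stop applying layers for this
    token as soon as the intermediate prediction is confident enough."""
    probs = None
    for layer in layers:
        x = layer(x)
        probs = softmax(unembed(x))
        if probs.max() >= threshold:     # confident: skip the remaining layers
            break
    return int(np.argmax(probs))
```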
11/ One of our favorite observations was the amazing alignment of the ViT-22B with human perception in terms of shape versus texture bias. ViT-22B has the highest shape bias ever recorded for an artificial neural network!
Beyond conventional downstream tasks, we evaluated ViT-22B on human alignment and perceptual similarity. As an emergent property, ViT-22B has the highest ever shape bias of an artificial neural network (close to humans)!
Do you want to train massive deep learning models with ease? The 10 new tutorial notebooks of our popular UvA DL course show you how, implementing data, pipeline, and tensor parallelism (and more) from scratch in JAX+Flax! 🚀🚀
Check them out here:
🧵 1/11
Enjoyed
@ilyasut
's lecture
"Universal Transformer [...] is a great idea, except that if you want to have a lot of parameters you need to pay for it [...] if you ignore compute costs, it gives you a recipe of how to proceed..."
@BlackHC
If someone wants to scale up UT, here's a genius idea: introduce sparsity into the scaled-up UT to add parameters without extra FLOPs.
More parameters + deep recurrence - the hefty price tag = winning combo!💡
It is exciting that ML researchers come up with new algorithms every day, but we also know that for any learning algorithm, any improvement in performance over one class of problems is balanced out by a decrease in performance over another class (no free lunch theorem!).
2/ We share the recipe for very efficient and stable training of large-scale ViTs, with impressive results. The hope is to inspire efforts on scaling vision models and to pair our top-of-class vision models with our best LLMs, as a vital step in advancing AI going forward.
@savvyRL
Regarding transfer across architectures,
@samiraabnar
has (a paper and) a nice blog post in which she studies distillation as a means to transfer the effect of architectural inductive biases from CNNs to MLPs and from LSTMs to Transformers.
AdaTape shakes things up by bringing in a fresh and distinct perspective on adaptive computation and offers an approach orthogonal to existing ideas.
Kudos to
@XueFz
for the incredible work!
Great to see the taxonomy from is expanded to ViTs. But I wish the evaluation of efficiency went beyond parameters or FLOPs. Using these two as cost metrics could potentially lead to inaccurate conclusions: .
Efficiency 360: Efficient Vision Transformers
Compares various vision transformer models based on their performance, the number of parameters, and the number of floating point operations (FLOPs) on multiple datasets.
Today we are presenting "The Efficiency Misnomer" at ICLR poster session. We have lots of cool observations to share along with some suggestions that we learned by working on many many projects and talking to many many amazing researchers and engineers.
With
@YiTayML
,
@anuragarnab
,
@giffmana
, and
@ashVaswani
, we wrote up a paper on "the efficiency misnomer":
TL;DR:
"No single cost indicator is sufficient for making an absolute conclusion when comparing the efficiency of different models".
Inspired by all the talented students at
@DeepIndaba
. Excited to stay involved in this amazing initiative.
Had a great time with brilliant researchers at Google Accra. This team stands out as the best, driving exciting projects that directly impact lives.
Yes. The amazing mix of joy, fulfillment, and making a real difference in Gemini excites EVERYONE.
Working with Sergey definitely adds an extra daily splash of fun and inspiration for all of us.
A good example is
@_sholtodouglas
at
@GoogleDeepMind
. He's quiet on Twitter, doesn't have any flashy first-author publications, and has only been in the field for ~1.5 years, but people in AI know he was one of the most important people behind Gemini's success
@chipro
The "recurrent inductive bias" of RNNs usually helps them be more data efficient, compared to vanilla Transformer. If you introduce such a bias to Transformers (like recurrence in depth in Universal Transformers), they generalize better on small datasets:
Extending this survey was a great opportunity for us to learn about the recent developments and see which ideas stood the test of time. Maybe one year from now, we'll share an updated version that includes a new xformer that has succeeded in becoming the mainstream!?
Happy to share that we have updated and published v2 of our "efficient transformer" survey!
Major updates:
✅ Expanded our scope to sparse models and added a ton of new models!
✅ Wrote a retrospective post about the advances in the past year.
Link:
What a fantastic initiative!
In this situation, helping to facilitate more "open exchange of people and ideas" would be the greatest thing that we can do. Read the post on
#OpenScience
by
@mdr
: