We've landed a big revamp of . The main new feature is support for flexible weight sharding, which doesn't get in the way of cutting-edge research code. Scaling ViTs, ResNets, MLP-Mixers, SigLIPs (and so on) beyond single GPU/TPU device memory becomes easy.
Vision meets RL! We reveal that policy gradient can be used for tuning vision models to optimize complex metrics, such as mAP, PQ or “color diversity”, observing large performance boosts on tasks like object detection, panoptic segmentation, etc.
I've always been frustrated that, beyond image classification, computer vision is full of complex and task-specific components.
Thus, very excited to share our new work, where we propose a unified modeling approach for vision: .
More in the thread🧵.
Let me introduce big_vision: an original home of ViT, MLP-Mixer, LiT and many more.
These days I do all my research in this codebase: it is great for doing vision research with emphasis on large-scale pretraining and transfer.
Highlights in 🧵 ⬇️
Link:
We release pre-trained vision transformer models and code for inference/fine-tuning: . There is still a long way towards understanding transformers in vision and I am looking forward to the future research. Hope this release will be a good starting point.
We just released PaliGemma-3B, a very capable Vision-Language Model. Do not waste any time, finetune it for your task:
Code:
Colab:
Kaggle:
HF:
Vertex AI:
MLP-Mixer (a new vision architecture based on MLP only) code and pretrained models are now available: .
Looking forward to community contributions that will shed some light on how Mixer works and how to make it even better.
paper: .
We have opensourced UViM models and complete training/inference/eval code. You can now train new models yourself and explore the released models (and UViM guiding codes) in the interactive colabs. All available at .
UViM paper: .
An incredibly thorough-looking survey of Vision Transformers!
It only been just over a year since we published ViT. I thought it would be useful, but didn't imagine this much cool innovation would happen.
Do not want to miss out on the recent trend, so I officially announce that
1. All my ICML 2022 papers were rejected.
2. All my ICML 2022 papers were accepted.
3. Both statements above are true.
Our PaliGemma technical report is finally out: .
We share many insights that we learned while cooking the PaliGemma-3B model. Both about pretraining and transfer.
✨PaliGemma report will hit arxiv tonight.
We tried hard to make it interesting, and not "here model. sota results. kthxbye."
So here's some of the many interesting ablations we did, check the paper tomorrow for more!
🧶
Got fast paligemma inference working on RTX 4090. Here's an object detection demo with the the 224px model running in real time at 16fps. I generate 10 tokens per iteration
Concurrent Mixer-like model, but applied to NLP and with fixed token-mixing MLPs with Fourier features. When writing Mixer paper we looked at our learned params and
@tolstikhini
had an immediate reaction: "looks like Fourier" and then we thought of doing exactly same thing later.
FNet: Mixing Tokens with Fourier Transforms
pdf:
abs:
Transformer encoder architectures massively sped up, with limited accuracy costs, by replacing self-attention sublayers with simple linear transformations
that “mix” input tokens
New paper 🧵 : How to train your ViT? It is common to train vision transformers on ImageNet-1k (~1.3m images) for 300 epochs. We show that you are better off investing the same compute budget for training on ImageNet-21k (~13m images) for 30 epochs.
Plot twist: the model uses a LM pretrained on the whole internet, and, in particular, it read and memorized our paper showing CIFAR10 test annotation mistakes. As a result, it got to 100% on a noisy test set.
I recommend the authors checking Fig. 8 from BiT paper (). On Cifar10 there’re some wrong ratings, an expert may not perform 100% on Cifar10. There’s either a bug in their eval code, or the model sees all the test examples.
We had a similar observation when doing preliminary investigations for . Vanilla REINFORCE with 2 samples (one for baseline, the other for the update) is good enough to steer vision models in the right direction.
@giffmana
On this topic, its funny how some things are too well known to cite (they are just part of common language, and often lower-cased), but not others. Adam really hit a sweet spot, being nearly ubiquitous and cited in most usages. My guess is its partialy in the name.
@karpathy
number of iterations is not enough when there is a batch dimension of some sort. For example, in vision the "number of images seen" is by far the best proxy for measuring training duration in the large-scale regime.
This simple pytorch trick will cut in half your GPU memory use / double your batch size (for real). Instead of adding losses and then computing backward, it's better to compute the backward on each loss (which frees the computational graph). Results will be exactly identical
After a 4 year long break from conferences (special "thanks" to the pandemic and visa constraints), I have finally made it to ICLR @ Vienna. Hope I have not missed too much. Btw, when is the poster session about ResNets trained on ImageNet?
Read this thread and tell me how 3 private, delayed and very noisy reviews are the cornerstone of the ML/CV fields, and open discussions in social media should be banned, until it effectively becomes irrelevant to have such discussion.
PaliGemma VLM was pre-trained with segmentation tasks and has the ability to produce dense masks. This capability needs some extra complexity related to (de)tokenization of segmentation masks and we have not documented it well yet.
But
@skalskip92
has it all figured out 👇
I finally managed to fine-tune PaliGemma on the custom segmentation dataset
most of you have probably noticed that I've been spamming all sorts of PaliGemma tutorials for the past few weeks; I have one more
shoutout to
@__kolesnikov__
for all the help!
↓ read more + code
@FrancescoLocat8
@giffmana
ArXiv's 1-3 days publication cycle is too slow given the current research pace. Publishing in-flight overleaf projects is the future.
@__kolesnikov__
Awesome! After a bit of trial and error, I managed to fine-tune the model using a custom object detection dataset. The Google Colab works great.
I have one more problem. How can I save a fine-tuned model? Really sorry if that's a stupid question, but I'm new to JAX and FLUX.
Are we still making meaningful progress on ImageNet? What happens if we carefully re-annotate ImageNet val set? How to improve ResNet-50 top-1 accuracy by 2.5% by cleaning training data and using different loss function? See our new paper for the answers: .
I feel like I have to once again pull out this figure. These 32x32 texture patches were state of the art image generation in 2017 (7 years ago). What does it look like for Gen-3 and friends to look similarly silly 7 years from now.
Thank you
@ISTAustria
and
@hetzer_martin
for the award, I am deeply honored to receive it. And thank you to
@thegruel
for being the exceptionally great and supportive PhD supervisor.
@__kolesnikov__
receives the
@istaustria
Alumni Award 2023! “This recognition holds special significance for me, as ISTA has played a pivotal role in shaping me as a scientist”. Congratulations and we are excited to see what comes next for you.
Finally, a Google AI Blog post about BiT, our state-of-the-art visual model. And to show that we are serious, we release code in three deep learning frameworks: TF2, PyTorch and Jax. Check it out:
Presenting BiT, an open-source approach for large-scale pre-training of models covering a wide range of visual tasks, which highlights the importance of choices in the model architecture for downstream performance. Learn all about it below:
At ICCV and curious about semi-supervised learning with self-supervision? Come to our talk today at 15:20 in Hall D1 or chat with us at poster
#20
from 15:30 to 18:00. , Code: . Joint work with
@XiaohuaZhai
@avitaloliver
and
@giffmana
It turns out that distillation yields wildly different results depending on subtle choices. But if done right, works consistently great for model compression, i.e. results in ~83% ImgNet with R50. What does "right" mean? Check out or a 🧵 from
@giffmana
👇
So you think you know distillation; it's easy, right?
We thought so too with
@XiaohuaZhai
@__kolesnikov__
@_arohan_
and the amazing
@royaleerieme
and Larisa Markeeva.
Until we didn't. But now we do again. Hop on for a ride (+the best ever ResNet50?)
🧵👇
Check out a thread on our recent image-text project from
@giffmana
.
My personal tl;dr: when learning a joint embedding space for image-text (e.g. CLIP-style), you are much better off by *freezing* an image embedding that was pre-trained on the standard image data like ImageNet.
Want to turn any vision backbone into an image-text model? Want to show the age-old "your model wouldn't recognize a cow on the beach" is a red herring?
That's LiT🔥 (Locked-image Tuning), a new alternative to fine-tuning that combines the best of fine-tuning and zero-shot
1/n🧶
That is an appealing perspective on ViT and Mixer. Nevertheless, I think it really misses the point. There is a lot to unpack here, so let's do a thread(🧵).
@ykilcher
@GoogleAI
@neilhoulsby
@giffmana
@__kolesnikov__
This only really works when you have access to JTF300M and google money/compute. As EfficientNetV2 showed, if you have constraints (eg training efficiency - Fig 1; model size, FLOPs, and inference latency - Fig 5) use the EfficientNetV2 network instead of a "crap architecture".
This is a nice idea and also a perfect excuse to advertise my quite old paper, where we propose almost exactly the same: a method to categorise class maps into "objects" or "distractors" with virtually zero human supervision (few clicks per class):
📢 We now release "Salient ImageNet", a dataset with "core" and "spurious" masks for entire ImageNet!
Website:
with
@sahilsingla47
,
@MLMazda
Such a richly annotated dataset can be useful for model debugging, generalization, interpretation, etc. 1/n
Our modeling approach works out-of-the-box and very competitively (with no modifications and sharing the majority hyper-params) on three diverse vision tasks: panoptic segmentation (scene understanding), colorization (conditional image modeling) and depth prediction (3D).
🔥JAX meets Transformers🔥
@GoogleAI
's JAX/Flax library can now be used as Transformers' backbone ML library.
JAX/Flax makes distributed training on TPU effortless and highly efficient!
👉 Google Colab:
👉 Runtime evaluation:
The key insight is to go beyond standard supervised learning which learns to “imitate” training data. We let the model “act” and then “criticize” the outcomes, like in RL. More specifically, we score predictions with rewards and use the REINFORCE rule to update model parameters.
Training models at scale beyond ImageNet is challenging. We devise a recipe for learning visual representations from >100M images. Beside "obligatory" SOTA on ImageNet/CIFAR, we get great results in the low data regime transfer with a single hyper-param!
We distill key components for pre-training representations at scale: BigTransfer ("BiT") achieves SOTA on many benchmarks with ResNet, e.g. 87.8% top-1 on ImageNet (86.4% with only 25 images/class) and 99.3% on CIFAR-10 (97.6% with only 10 images/class).
@karpathy
The question is partially addressed here (as a by-product of studying the effect of the batch size): . For example for ImageNet they show that until the batch size becomes huge, the number of images seen is the only thing that matters:
The main idea is to learn a compact discrete "guiding code" that aids the simplest possible model to solve a given task with no custom components. The guiding code is derived from the ground truth, so we do not yet have the final prediction model...
Happy to see that my past "Imagenet SOTA" academic exercises turned out to be beneficial for real and useful vision applications. Deep down I was not sure it will ever be the case.
I fine-tuned my first vision-language model
PaliGemma is an open-source VLM released by
@GoogleAI
last week. I fine-tuned it to detect bone fractures in X-ray images.
thanks to
@mervenoyann
and
@__kolesnikov__
for all the help!
↓ read more
This is almost true, but there is a catch: our "token-mixing" MLP corresponds to the depth-wise convolution with a single(!) shared filter for all channels. We discuss this in the paper:
Nice PaliGemma tutorial, enjoyed watching it.
A comment regarding your struggles with getting good mAP for detecting multiple objects: it is a known limitation of all log-likelihood generative models and we wrote a whole paper on how to address it: .
new tutorial alert: fine-tune PaliGemma on a custom object detection model
- understanding the capabilities and applications of the PaliGemma
- fine-tuning PaliGemma for custom object detection tasks.
- troubleshooting common issues
link:
I always see people humble-brag on twitter. What happens if I just plain shameless-brag? Will a black hole appear under my feet? Let's find out! Two of my papers with code (BiT + ViT) are in top10 of 2020, yay! Let's keep up that trajectory
@__kolesnikov__
@XiaohuaZhai
etal. :)
It is only a matter of time until such large datasets will become more and more accessible. Hardware also becomes cheaper. When this happens, it is good to be prepared and have a large arsenal of models with different degree of built-in inductive biases.
Our approach works best for probabilistic models that are capable of sampling multiple coherent predictions. Luckily, many such models were very recently proposed for vision tasks: pix2seq, Unified-IO, UViM, PaLI, X-Decoder, …
Pleased to announce we are releasing checkpoints for our SigLIP models!
These are very strong image-text ViTs. We release them along with a colab to play around with. Most are english, but we also release a good i18n one.
Sorry, no magnet link mic drop. More in thread🧶
We looked into visual disentanglement mathematically. Turns out the canonical (beta-)VAE (by accident) does what one would want from deep PCA! Accepted to
#cvpr2019
.
#deepPCAexists
In our experiments, we tune an object detection model to increase recall, and accordingly the mAP metric. We also improve a panoptic segmentation model to predict more coherent masks, or even optimize for exotic rewards, such as color diversity in the colorzaiton task.
Training on large data comes with an extra cost, but it needs to be done once. Later, these models can cheaply be adapted to small datasets (even with a few samples per class) and achieve great results for a many vision applications. It effectively amortizes pre-training costs.
In big_vision we strive to avoid too many abstractions and keep the core research code explicit. Our main train loop may seem a bit verbose, but at the time it is very explicit, so trying brave research ideas is often quick and easy.
@lkhphuc
@giffmana
Me too! There are many other great things jit/XLA does, e.g. two consecutive operations like `x = one_hot(x, 1000); y = (x)` will be fused together automatically by jit/XLA and `one_hot(x, 1000)` will never be fully materialized in memory.
We've put a lot effort to make the code scalable. The code is optimized for TPUs and just runs with little or no overhead on anything from a single TPU core to a distributed setup with 2048 TPUs. It also just works out-of-the-box on multi-gpu machines.
Don't forget to update your timm version to latest (0.9.8+)! The SigLIP support is via timm backbone and can be loaded in both timm and OpenCLIP from same
@huggingface
hub models.
We are sure there are lots of interesting insights to be drawn from our collection of trained ViTs. So we release them all: >50'000 models!
JAX repo: .
JAX colab:
Or also available in pytorch through timm: .
Why so? Paired image-text data is good for *aligning* the multimodal embeddings, but not that good for learning the visual embedding itself, where ImageNet-like data still shines. In a way we disentangle alignment and embedding learning.
There are likely some rough edges in the revamped code that we will improve over time. Glad to see big_vision is recognised by the leading OSS engineers: . Gives us additional motivation to keep big_vision up-to-date with the latest advances.
On the importance of OSS training reproductions (a focus of mine). Much of the progress here since the original OpenAI model/weight releases traces back to two open source code bases, the OSS community OpenCLIP (PyTorch) and Google Big Vision (JAX) proj.
It turns out that only a few parameters need to be trained to fine-tune huge text transformer models. Our latest paper is on arXiv; work
@GoogleAI
Zürich and Kirkland.
#GoogleZurich
#GoogleKirkland
All powered by the genius jax.Array and jax.jit API, which was recently revised to support global sharded computations. Shout-out to jax developers for doing such great work
@yashk2810
@jakevdp
@froystig
@SingularMattrix
(and more, but it is everyone I found on X).
@skalskip92
You can use for this:
import big_vision.utils as bv_utils
flat, _ = bv_utils.tree_flatten_with_names(flat)
with open("ckpt.npz", "wb") as f:
np.savez(f, **{k: v for k, v in flat})
And then load:
bv_utils.load_checkpoint_np("ckpt.npz")
One of my favourite parts is model evaluation. It is designed to run frequently during the training with almost no overhead: it may take as little as few seconds for a full ImageNet eval. We also can do on-the-fly(!) few-shot evaluation on multiple downstream datasets.
At the moment, we release the main bits of big_vision, but many updates will come soon, including ImageNet-21k support, Mixer models and separate releases of individual projects done in big_vision.
Stay tuned!
big_vision is a joint work with
@giffmana
,
@XiaohuaZhai
and myself, with many great contributions from Brain team members. Special shout-out to
@AndreasPSteiner
for making a great optax-based optimizer library for big_vision.
Exciting
@GoogleAI
work, so-called "Big Transfer": revisit transfer learning for computer vision, simplify various tricks and use huge datasets to get a big perf boost across many tasks. Kind of BERT-inspired.
One of my favourite new jax features is jax.shard_map (). It allows you to "override" jax.jit when automatic partitioning is not doing the right thing. For example, in big_vision we use shard_map to implement device-local "mixup":
@stanislavfort
@sarahookr
shameless plug: my colleague
@ibomohsin
have also extended scaling laws further to infer model's optimal shape (width, depth, etc): . We use these laws routinely.
Models pretrained on JFT-300M are not released, because it is a proprietary dataset. However, papers like CLIP() or ALIGN() have shown large scale data for pre-training can be collected automatically with modest engineering costs.
@roydanroy
@askerlee
this was a joke, though I would not rule out that a well-done hybrid of a LLM and an image recognition model will do this, when prompted appropriately.
Most of the nice things we have in the codebase are there thanks to XLA/JAX/FLAX. They are very powerful tools, but know how to get out of the way and enable flexible research.
@skalskip92
Yes! The prefix (aka prompt) looks like this:
detect: red car ; cat ; yellow bus
The corresponding suffix (aka output) is
<loc0072><loc0003><loc0905><loc1019> red car ; <loc0272><loc0140><loc0821><loc0851> cat ; <loc0378><loc0003><loc0424><loc0069> yellow bus
Models like ViT or Mixer sacrifice some of the inductive biases and instead rely on larger training data to automatically discover potentially different and maybe even better inductive biases.
It turns out that massively scaling up model size and training data results in a drastic jump of image classification accuracy in a challenging real-life setup. See our updated paper for details and additional surprising results that include low-data eval:
We just updated our Big Transfer (BiT) paper with ObjectNet results. Pre-training of large models shows huge improvement on this "in the wild" vision task: almost +30% absolute! ()
As a nice by-product you can turn any pre-trained vision backbone of your choice (e.g. BiT, MobileNet, ViT, MLP-Mixer, Swin, ConvNeXt, ...) into 0-shot classification model without modifying its weights.
But this is just one out of many highlights. We also conduct extremely thorough study on the interplay of model size, dataset size, regularizations and augmentations. Plus extensive transfer learning experiments. Check out our paper if you want to learn how to train ViT models.
For example, enabling FSDP training boils down to adding a few lines in the config. For more advanced use-cases, e.g. model-parallel flavour, users can write extra model-specific sharding logic, which is localised in one place and does not add mental overhead for doing research.
In short, there is a discrepancy between what log-likelihood loss for detection is optimising and what is good for mAP (or for whatever a user wants). Luckily, an additional lightweight Reinforcement Learning tuning can be used to align the model with the end goal.
Announcing BigTransfer: pre-trained models for easy and effective transfer learning for vision with TF-Hub!
Check out this walkthrough to view examples of high-performing BiT models and learn how to train an image classifier.
🔎 Learn more →
@skalskip92
We generally encourage to use tensorstore in big_vision, see `tssave` and `tsload` functions in big_vision , as it scales much better for really big checkpoints. That is why numpy utils are not very prominent, but numpy should be good enough for paligemma.
@karpathy
In my personal experience, I was able to scale batch size even further without taking any losses. That said, at some point this nice relation seems to break and for ImageNet-like tasks it seems to hold the best.