Introducing task vectors!
A new way to steer models by doing arithmetic with model weights. Subtract to make models forget, add to make them learn
📜:
🖥️:
Introducing DataComp, a new benchmark for multimodal datasets!
We release 12.8B image-text pairs, 300+ experiments and a 1.4B subset that outcompetes compute-matched CLIP runs from OpenAI & LAION
📜
🖥️
🌐
Today we are releasing a CLIP ViT-L/14 model with 79.2% zero-shot accuracy on ImageNet.
Our model outperforms OpenAI's CLIP by a large margin, and outperforms even bigger models (ViT-g/14) trained on LAION-2B
Check it out at !
Fine-tuning can make models like CLIP less robust.
A simple idea is highly effective at mitigating that:
averaging zero-shot and fine-tuned models.
Check out our work introducing WiSE-FT, just accepted to CVPR!
Paper:
Code:
We are releasing an open-source training implementation of OpenAI’s CLIP!📎
CLIP models learn from language supervision, and are capable of strong zero-shot performance on various vision tasks ()
Our reproduction can be found at
Instead of a single neural network, why not train lines, curves and simplexes in parameter space?
Fantastic work by @Mitchnw et al. exploring how this idea can lead to more accurate and robust models:
I've been seeing a lot of talk around the recent Vision Transformer (ViT) paper, so I thought I'd highlight some of my favorite previous work on self-attention and transformers in computer vision!
Link to ViT:
(thread 👇)
The year is 2032. A model was trained on all images, videos and text on the web, using over 100 yottaFLOPs.
It still thinks this is an image of a dog.
To fix models post-hoc, check out PAINT!🎨
📜
💻
🌐
Vision plays a central role in shaping the meaning of concrete words like "apple" or "banana". Yet, most of today's NLP models learn representations of these concepts from text alone.
Can such representations share similarities with the visual world?
1/n
Want to forget about messy vision backbones inside vision+language models?
Check out ViLT, a cool work by Kim et al., extending Vision Transformers to multimodal domains.
Link:
v2.23.0 of OpenCLIP was pushed out the door! Biggest update in a while, focused on supporting SigLIP and CLIPA-v2 models and weights. Thanks @gabriel_ilharco, @gpuccetti92 and @rom1504 for help on the release, and @bryant1410 for catching issues. There's a leaderboard csv now!
A surprisingly simple way to improve generalization when fine-tuning: combine the weights of zero-shot and fine-tuned models.
We find significant improvements across many datasets and model sizes, at no additional computational cost at fine-tuning or inference time!
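For the curious, here is a minimal sketch of the weight-space interpolation in PyTorch. The checkpoint paths and variable names are hypothetical, not the released WiSE-FT code:

```python
import torch

# Hypothetical checkpoint paths; both models share the same architecture,
# so their state dicts have identical keys and shapes.
zeroshot = torch.load("clip_zeroshot.pt")    # dict: param name -> tensor
finetuned = torch.load("clip_finetuned.pt")

alpha = 0.5  # mixing coefficient between 0 (pure zero-shot) and 1 (pure fine-tuned)
merged = {name: (1 - alpha) * zeroshot[name] + alpha * finetuned[name]
          for name in zeroshot}
```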
Can zero-shot models such as CLIP be fine-tuned without reducing out-of-distribution accuracy?
Yes! Our new method for robust fine-tuning improves average OOD accuracy by 9% on multiple ImageNet distribution shifts without any loss in-distribution
(1/9)
New paper out!
In NLP, fine-tuning large pretrained models like BERT can be a very brittle process. If you're curious about this, this paper is for you!
Work with the amazing @JesseDodge, @royschwartz02, Ali Farhadi, @HannaHajishirzi & @nlpnoah
1/n
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping
We found surprisingly large variance just from random seeds when fine-tuning BERT. Both weight inits and the order of the training data have big impact.
1/n
By *negating* a task vector, users can mitigate undesirable behaviors (e.g. toxic generations from a LM), or forget tasks altogether (e.g. OCR).
For instance, after fine-tuning a GPT-2 model on toxic data, negating the resulting task vector reduces toxic generations by 6x. (8/n)
We are hosting a tutorial on High Performance NLP at #emnlp2020, covering a bunch of fun stuff in efficiency!
Our first live Q&A session starts in ~1h!
Slides:
With the amazing Cesar Ilharco, @IuliaTurc, @Tim_Dettmers, Felipe Ferreira and @kentonctlee.
Task vectors offer a simple and efficient way of editing models. To create a task vector, we first fine-tune on a downstream task, then subtract the weights of the pre-trained model from the weights of the fine-tuned model. (3/n)
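As a rough sketch in PyTorch (hypothetical checkpoint names, not the paper's released code), creating a task vector is just a subtraction over state dicts:

```python
import torch

# Hypothetical checkpoint paths; both state dicts share the same keys.
pretrained = torch.load("clip_pretrained.pt")      # dict: param name -> tensor
finetuned = torch.load("clip_finetuned_mnist.pt")  # fine-tuned on the downstream task

# The task vector is the element-wise difference between the two sets of weights.
task_vector = {name: finetuned[name] - pretrained[name] for name in pretrained}
```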
Thrilled that our paper got honorable mention for Best Paper Award for Research Inspired by Human Language Learning and Processing! @conll2019 #emnlp2019
Much like software, models can be patched, adding support for new tasks with little change elsewhere.
I'll be at NeurIPS this week presenting our patching method, PAINT🎨. Come say hi! 👋
Despite their importance, datasets rarely receive the same research attention as model architectures or training algorithms.
We believe this is a major shortcoming in the machine learning ecosystem, and that datasets deserve as much rigorous empirical experimentation as models.
One of the most important challenges in machine learning today is figuring out how to control the behavior of pre-trained models, whether to reduce biases, align with human preferences, or simply improve accuracy on downstream tasks. (2/n)
Another key benefit of task vectors is that they enable us to reuse existing fine-tuned models, without the need to re-train or transfer any of the data. This is particularly exciting in light of the fast growth of fine-tuned models in recent years. (5/n)
While it might be surprising at first that we can operate directly in the weight space of neural networks, our research builds on several recent exciting works exploring the geometry of loss landscapes and weight averaging (links at the end!) (6/n)
Once created, task vectors can be combined via arithmetic operations like addition or subtraction, changing model behavior accordingly.
And since all operations are element-wise, editing models with task vectors has no impact on inference time! (4/n)
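Continuing the sketch from above, here is what these element-wise edits might look like. The scaling coefficient is an illustrative knob one might tune on held-out data, and the variable names are hypothetical:

```python
import torch

# As in the earlier sketch: hypothetical checkpoints with matching keys.
pretrained = torch.load("clip_pretrained.pt")
finetuned = torch.load("clip_finetuned_mnist.pt")
task_vector = {n: finetuned[n] - pretrained[n] for n in pretrained}

def apply_task_vectors(weights, task_vectors, coef=1.0):
    # Element-wise edit: weights + coef * (sum of task vectors).
    return {n: w + coef * sum(tv[n] for tv in task_vectors)
            for n, w in weights.items()}

# Forgetting a task: add the *negation* of its task vector.
edited = apply_task_vectors(pretrained, [{n: -d for n, d in task_vector.items()}])

# Multi-task model: add task vectors from two tasks (tv_a, tv_b hypothetical).
# edited = apply_task_vectors(pretrained, [tv_a, tv_b])
```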
By *adding* task vectors, we can create multi-task models without any additional training.
Using CLIP, adding task vectors from two different tasks greatly improves the accuracy of the zero-shot model, and almost matches the accuracy of using multiple specialized models (9/n)
Together with DataComp, we are releasing CommonPool, the largest collection of image-text pairs to date.
CommonPool has 12.8 billion samples collected from Common Crawl, and is larger than existing datasets by a factor of 2.5x.
DataComp is a new benchmark for designing multimodal datasets.
Unlike traditional benchmarks, DataComp has data front and center.
The goal of participants is to propose new training sets, while keeping code, hparams & compute constant.
As more task vectors are added together, we can create more powerful multi-task models, without any re-training, and without increasing inference time (10/n)
We also show that the ranking of many curation approaches is consistent across scales
This suggests that experiments at smaller scales can provide valuable insights for larger scales, thereby accelerating investigations
We present 300+ baseline experiments along with many insights into dataset design
A key result is that smaller, more aggressively filtered datasets can perform *better* than larger datasets coming from the same pool
In our work we edit models using three arithmetic expressions over task vectors: negating a task vector, adding task vectors together, and doing analogies with task vectors. (7/n)
I recently had my last day as a @GoogleAI Resident. It has been an amazing year and I'm very thankful to @jasonbaldridge, @vihaniaj, @alex_y_ku, @quocleix and other collaborators for teaching me what no book can and making me fall in love with doing research.
Overall, we show that task arithmetic is a simple, efficient and effective way of editing models. It enables us to re-use existing checkpoints without the need to re-train or transfer data, and to combine models without increasing inference time. (14/n)
Finally, much like with word embeddings such as Word2Vec (think "man" is to "woman" as "king" is to "queen"), you can do *analogies* with task vectors! (11/n)
Consider two sentiment analysis datasets. We can improve accuracy on the first by combining three other task vectors, obtained by A) unsupervised ft on the 1st dataset; B) supervised ft on the 2nd and C) unsupervised ft on the 2nd
B+(A-C) improves accuracy on the first! (12/n)
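Sketching the analogy in the same style as before, where tv_a, tv_b and tv_c are hypothetical variables holding the three task vectors described above:

```python
def analogy_edit(pretrained, tv_a, tv_b, tv_c):
    # tv_a: unsupervised ft on dataset 1, tv_b: supervised ft on dataset 2,
    # tv_c: unsupervised ft on dataset 2. Target edit: tv_b + (tv_a - tv_c).
    return {n: pretrained[n] + tv_b[n] + (tv_a[n] - tv_c[n]) for n in pretrained}
```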
Whoa, this is really cool!
Text-only models often outperform text+vision models on text-only tasks, given the statistical discrepancies in the language used in these domains.
"Vokenization" is a neat way to get some grounded supervision without paying the domain shift price
*Vokenization*: a visually-supervised language model attempt in our #emnlp2020 paper: (w. @mohitban47)
To improve language pre-training, we extrapolate multimodal alignments to lang-only data by contextually mapping tokens to related images ("vokens") 1/4
Our benchmark is designed with scale in mind, with 4 levels of compute ranging from 12.8M to 12.8B samples seen in training
At the smallest scale, we can train in a few hours on a single GPU. At the largest, experiments may take up to 40 thousand GPU hours
DataComp is centered around image-text datasets, which have been instrumental in building models like CLIP, DALL-E, Stable Diffusion, Flamingo, and many others.
Our standardized infrastructure trains CLIP models and evaluates them on a diverse suite of 38 downstream tasks.
Large-scale image-text datasets like LAION or DataComp are heavily filtered.
Instead of throwing millions of images away, can we make use of them via image captioning models?
Check out this very cool work led by @thao_nguyen26! 👇
Are synthetic captions useful for multimodal training?
In , we show how image captioning can improve the quality of web-scale datasets. Replacing noisy web captions with generated ones outperforms existing filtering methods from the DataComp benchmark 1/n
We will also hold a workshop at ICCV 2023 centered around DataComp in October, and will invite outstanding submissions to give presentations.
Check out to learn more!
Researcher 1: we should show that our system is robust
Researcher 2: how about we simulate what would happen if a giraffe tried to eat the cube?
Researcher 1: excellent idea
We’re all used to robots that fail when their environment changes unpredictably. Our robotic system is adaptable enough to handle unexpected situations not seen during training, such as being prodded by a stuffed giraffe:
We've trained a new ViT-G/14 CLIP model with OpenCLIP on LAION-2B which achieves 80.1% zero-shot accuracy on ImageNet and 74.9% zero-shot image retrieval (R@5) on MSCOCO. As of Jan 2023 this is the best open-source CLIP
code:
blog:
Along with filtering CommonPool, we have a separate Bring Your Own Data (BYOD) track. In BYOD, any data can be used as long as it doesn’t overlap with our evaluation suite.
We show that adding data sources such as RedCaps and CC12M can improve performance of some baselines
Personally, I'm excited about the potential of self-attention in vision, especially given recent indication that, in some scenarios, it can scale better than convolutions.
It's great to see all this recent progress, and I hope it lives up to its promise in the near future!
@ericjang11 @colinraffel Some of our recent papers that might interest you!
Merging a finetuned and a pretrained model:
Merging models finetuned on the same task:
Merging models finetuned on different tasks:
There is much more in our paper, and we think this is just the beginning! I’m excited for a future where we have cheap and reliable ways of controlling how models behave, without needing to re-train them from scratch (15/n)
There is much more in the links below.
We are beyond excited to build the next generation of multimodal datasets rigorously and collaboratively, and hope you join us in this journey!
📜:
🖥️:
🌐:
Introducing a new recipe for fine-tuning --- model soups 🍜
TL;DR: we average the weights of multiple fine-tuned models to improve accuracy without increasing inference time
Paper:
Code:
To appear at ICML
(1/10)
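A minimal sketch of a uniform soup, with hypothetical checkpoint paths (see the released code above for the full recipe, including the greedy variant):

```python
import torch

# Hypothetical paths to checkpoints fine-tuned with different hyperparameters,
# all sharing the same architecture (and hence the same state-dict keys).
paths = ["finetune_run1.pt", "finetune_run2.pt", "finetune_run3.pt"]
models = [torch.load(p) for p in paths]

# Uniform soup: average every parameter across the fine-tuned models.
soup = {name: sum(m[name] for m in models) / len(models) for name in models[0]}
```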
Languages are beautiful. In classical Tupi, spoken by native Amerindians in Brazil, all verbs are in the present tense. Time is generally expressed by the suffixes "rama" (future) and "ûera" (past). (1/2)
We release LLM.int8(), the first 8-bit inference method that saves 2x memory and does not degrade performance for 175B models by exploiting emergent properties. Read More:
Paper:
Software:
Emergence:
6) DETR: End-to-End Object Detection with Transformers (2020), by @alcinos26, @fvsmassa, @syhw, Nicolas Usunier, @kirillov_a_n, @szagoruyko5
Object detection as a set prediction problem and a transformer on top of a CNN backbone
Link:
7) Group Equivariant Stand-Alone Self-Attention for Vision (2020) by @davidwromero, @jb_cordonnier
Self-attention with equivariance to arbitrary symmetries by carefully defining the positional encodings
Link:
If you haven't been following it, @wightmanr, @CadeGordonML and others have been doing amazing work with the OpenCLIP library!
They recently trained two ViT models on LAION-400M, the first large-scale, open-source CLIP models where the data is also publicly available!
OpenCLIP () has been updated with the latest results from a ViT-B/16 training run with the LAION400M dataset, reaching a zero-shot top-1 accuracy of 67.07% on the ImageNet-1k validation set. Further zero-shot analysis pending...
Everyone thinks that you have to increase the input length of language models to improve their performance. Our new Shortformer model shows that by *shortening* inputs, performance improves while speed and memory efficiency go up. ⬇(1/n) (code below)
Overall, DataComp provides a controlled environment that enables rigorous experimentation over dataset design choices.
The large improvements we see from simple baselines highlight the power of careful empirical studies with datasets.
More details in the great thread below by @Mitchnw.
We added a number of new experiments and results in our paper, including additional models such as ALIGN and BASIC, along with further discussions on the role of hyperparameters.
Our codebase matches the ImageNet zero-shot accuracy from OpenAI (32.7% ours vs 31.3%) when training on the same data at medium scales (~15M samples from YFCC).
As shown by the scaling trends below, performance is far from saturated at this scale.
Getting started with research can be challenging, especially if you come from underrepresented communities. I was fortunate to have amazing people guiding me in this process and I’m happy to help ambitious people do the same. Feel free to contact me =)
I am excited to share our paper on evaluating the distributional robustness of QA models, where we evaluate 350+ SQuAD models on 15 distribution shifts and find that in-context learning provides the best performance-robustness tradeoff.
More details below ⬇️
Even the best pre-trained models are not perfect.
For instance, CLIP has strong zero-shot accuracy on ImageNet, but is worse than logistic regression on raw pixels when evaluated on MNIST.
In some cases, like in typographic attacks, simply scaling up can make things worse📉
One particularly exciting property of small-medium scale CLIP models is that they still exhibit atypically high effective robustness! ()
This scale invariance means we don't need massive amounts of compute to study what makes these models robust