Big thanks to
@_akhaliq
for the retweet! 🚀
We are very excited about presenting 𝑹𝒆𝒄𝒂𝒑-𝑫𝒂𝒕𝒂𝑪𝒐𝒎𝒑-1𝑩, where we use a 𝐋𝐋𝐚𝐌𝐀-𝟑-powered LLaVA model to recaption all 𝟏.𝟑 𝐛𝐢𝐥𝐥𝐢𝐨𝐧 images in DataComp-1B. Compared to the original textual descriptions,
What If We Recaption Billions of Web Images with LLaMA-3?
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various
Thanks for tweeting,
@ak92501
! Regarding robustness, we surprisingly find that
1) ViTs are NO MORE robust than CNNs on adversarial examples; training recipes matter
2) ViTs LARGELY outperform CNNs on out-of-distribution samples; self-attention-like architectures matter
New work alert 🚨🚨 𝐌𝐚𝐦𝐛𝐚®: 𝐕𝐢𝐬𝐢𝐨𝐧 𝐌𝐚𝐦𝐛𝐚 𝐀𝐋𝐒𝐎 𝐍𝐞𝐞𝐝𝐬 𝐑𝐞𝐠𝐢𝐬𝐭𝐞𝐫𝐬.
Similar to ViT, we identify that Vision Mamba's feature maps also contain artifacts, but they are more severe --- even tiny models show extensive activation in background areas. (1/n)
📢 Introducing CLIPA-v2, our latest update to the efficient CLIP training framework! 🚀✨
🏆 Our top-performing model, H/14
@336x336
on DataComp-1B, achieves an impressive zero-shot ImageNet accuracy of 81.8!
⚡️ Plus, its estimated training cost is <$15k!
If you plan to "attend"
@CVPR
, check out our Tutorial on AdvML in Computer Vision (). Our amazing speakers will cover both the basic backgrounds & their most recent research in AdvML.
Save the date (09:55 am - 5:30 pm ET, June 19), and we will see you soon
Last year, we showed ViTs are inherently more robust than CNNs on OOD.
Our latest work challenges this point --- we can actually build pure CNNs with stronger robustness than ViTs. The secrets are (see the sketch below the list):
1) patchifying inputs
2) enlarging kernel size
3) reducing norm & act layers
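For readers who want to see what these three changes look like concretely, here is a minimal PyTorch sketch (my own illustration with placeholder names and sizes, not the paper's released code): a patchified stem, a block with a large depthwise kernel, and only one norm and one activation per block.

```python
# Hypothetical sketch of the three ingredients above (patchified stem,
# large kernels, fewer norm/activation layers); not the official code.
import torch
import torch.nn as nn

class PatchifyStem(nn.Module):
    """Ingredient 1: treat the image as non-overlapping patches, ViT-style."""
    def __init__(self, in_ch=3, dim=96, patch=8):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x)

class RobustConvBlock(nn.Module):
    """Ingredients 2 & 3: a large depthwise kernel, plus only a single norm
    and a single activation per block (instead of norm/act after every conv)."""
    def __init__(self, dim=96, kernel=11):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)      # the block's only norm layer
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, 1)
        self.act = nn.GELU()                 # the block's only activation
        self.pwconv2 = nn.Conv2d(4 * dim, dim, 1)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        return shortcut + x

x = torch.randn(2, 3, 224, 224)
feat = RobustConvBlock()(PatchifyStem()(x))  # -> [2, 96, 28, 28]
```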
Thanks for tweeting,
@ak92501
! Regarding robustness, we surprisingly find that
1) ViTs are NO MORE robust than CNNs on adversarial examples; training recipes matter
2) ViTs LARGELY outperform CNNs on out-of-distribution samples; self-attention-like architectures matter
📢📢📢 Several talented students from my group, skilled in computer vision, multimodal learning, and GenAI, are seeking summer internships. They're ready to bring innovative solutions to your projects! Don't miss out on this talent pool. 🚀
Check out their profiles: (1/n)
If you decide to apply for Ph.D. programs this year and are interested in adversarial machine learning, computer vision & deep learning, please consider my group
@ucsc
@UCSC_BSOE
. We will have multiple openings for Fall 2021.
No GRE required & DDL is Jan 11, 2021
RT Appreciated
👩🏾‍💻
@ucsc
’s Computer Science and Engineering department is accepting applicants to its Ph.D. and M.S. programs for Fall 2021! Applications due Jan. 11, 2021. No GRE required.
#BaskinEngineering
Learn more & apply:
📢
#CallForPaper
AROW Workshop is back
@eccvconf
! This year we have a great lineup of 9 speakers, and set 3 best paper prizes ($10k each, sponsored by
@ftxfuturefund
).
For more information, please check
v2.23.0 of OpenCLIP was pushed out the door! Biggest update in a while, focused on supporting SigLIP and CLIPA-v2 models and weights. Thanks
@gabriel_ilharco
@gpuccetti92
@rom1504
for help on the release, and
@bryant1410
for catching issues. There's a leaderboard csv now!
HQ-Edit
A High-Quality Dataset for Instruction-based Image Editing
This study introduces HQ-Edit, a high-quality instruction-based image editing dataset with around 200,000 edits. Unlike prior approaches that rely on attribute guidance or human feedback to build datasets,
We are hosting a workshop on “Neural Architectures: Past, Present and Future” at
@ICCV_2021
, with 7 amazing speakers.
We invite submissions on any aspect of neural architectures. Both long papers and extended abstracts are welcome. Visit for more details
Interested in CLIP training but constrained by resources? We've got you covered!
In our latest work , we introduce an efficient training strategy that's affordable for researchers with limited computational capacities.
(1/n)
We're excited to share two recent works on multimodal learning.
The first one introduces a human-knowledge-based algorithm to address the alignment and efficiency of large-scale image-text data. It can preserve model performance while compressing image-text datasets by up to ~90%.
Check out our latest work on studying the effects of activation functions in adversarial training.
We found that making activation functions SMOOTH is critical for obtaining much better robustness.
Joint work with
@tanmingxing
,
@BoqingGo
,
@YuilleAlan
and
@quocleix
.
A surprising result: We found that smooth activation functions are better than ReLU for adversarial training and can lead to substantial improvements in adversarial robustness.
An incredible experience at
#ICML23
--- meeting over 50 Email/Twitter friends in person for the first time was truly amazing. Building real connections with real people (plus beautiful Hawaii) makes ICML unforgettable! 🥳🥳
𝙄𝙣𝙩𝙚𝙧𝙥𝙧𝙚𝙩𝙖𝙗𝙡𝙚 𝙢𝙤𝙙𝙚𝙡𝙨 𝙘𝙖𝙣 𝙨𝙘𝙖𝙡𝙚🚀, 𝙩𝙤𝙤!!!
Excited to present 𝐂𝐑𝐀𝐓𝐄-α, our latest efforts on scaling up the white-box transformer CRATE for vision tasks. By slightly tweaking the design of the sparse coding block and the training recipes,
I am happy to share that I will be joining
@UCSC_BSOE
as an Assistant Professor in the Computer Science and Engineering Department in Spring 2021! Thanks everyone for the great support along my Ph.D. journey 😀
Happy to share that our D-iGPT is accepted by
#ICML2024
. By using only public datasets for training, our best model attains a strong 𝟗𝟎.𝟎% ImageNet accuracy with a vanilla ViT-H.
Code & Model:
Huge congrats to
@RenSucheng
🥳
Rejuvenating image-GPT as Strong Visual Representation Learners
paper page:
This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict next pixels for visual representation learning. Two simple
Interested in benchmarking your NAS algorithm? Check out our
@CVPR
1st lightweight NAS challenge () with 3 competition tracks: Supernet Track, Performance Prediction Track, Unseen Data Track.
🏆The prize pool is $2,500 for each competition track
VideoHallucer: a comprehensive benchmark for detecting hallucinations in large video-language models. By categorizing hallucinations into intrinsic and extrinsic types and adopting adversarial binary VideoQA, we offer nuanced insight into where current models excel and falter
Our Adversarial Machine Learning in Computer Vision Workshop (co-organized with
@xinyun_chen_
,
@SongBai_
, Bo Li, Kaiming He,
@drfeifei
, Luc Van Gool, Philip Torr(
@OxfordTVG
),
@dawnsongtweets
and Alan Yuille) is now accepting submissions, with a BEST PAPER PRIZE!
DDL: Mar 15 AoE
I am seeking to hire 3-5 Ph.D. students
@ucsc
@BaskinEng
. If you are interested in computer vision, machine learning, or AI for Healthcare, feel free to reach out and let me know so I can look for your application!
Application deadline: Jan 10 (GRE not required)
RTs appreciated
1/ Yesterday, a student entering grad school asked me if it is possible to get married during PhD and still do well. Of course you can! Please don't be carried away by insane productivity stats. You can work for 40 hours/week and do a great job.
Google announces Scaling (Down) CLIP
A Comprehensive Analysis of Data, Architecture, and Training Strategies
This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP
Our ICCV Workshop on Neural Architectures: Past, Present and Future will start soon (8:55 am EDT, Oct 11), with 7 amazing speakers.
Join us via Zoom from
@ICCV_2021
platform! We will also be streaming live on YouTube!
For more info, please visit
1/ 📢 In this week’s TrustML-highlight, we are delighted to feature Cihang Xie
@cihangxie
🎊🎉
Dr. Xie is an Assistant Professor of Computer Science and Engineering at the University of California, Santa Cruz.
Interested in CLIP training but constrained by resources? We've got you covered!
In our latest work , we introduce an efficient training strategy that's affordable for researchers with limited computational capacities.
(1/n)
Just noticed that our 𝑹𝒆𝒄𝒂𝒑-𝑫𝒂𝒕𝒂𝑪𝒐𝒎𝒑-1𝑩 is now trending on Hugging Face Datasets! 🌟 Check out the thread below for an introductory overview of our work. 👇
HF link:
Big thanks to
@_akhaliq
for the retweet! 🚀
We are very excited about presenting 𝑹𝒆𝒄𝒂𝒑-𝑫𝒂𝒕𝒂𝑪𝒐𝒎𝒑-1𝑩, where we use a 𝐋𝐋𝐚𝐌𝐀-𝟑-powered LLaVA model to recaption all 𝟏.𝟑 𝐛𝐢𝐥𝐥𝐢𝐨𝐧 images in DataComp-1B. Compared to the original textual descriptions,
We see another significant boost in the Monthly Downloads of our CLIPA models, from 𝟏𝟔.𝟖𝐤 to 𝟒𝟖.𝟐𝐤. (weights are available at )
We also plan to release stronger CLIP models (especially in Zero-Shot Cross-Modal Retrieval) in the coming weeks. Stay tuned!
Excited to share our newest project: developing a comprehensive safety evaluation benchmark for Vision-LLMs.
Also, a shout-out to our awesome project leader,
@HaoqinT
, who will apply for PhD this year. He is strong and always energetic about research. Go get him!
Vision LLMs like LLaVA and GPT4V are good at handling regular vision-language tasks, but are they really robust and safe😈
With our new VLLM safety benchmark, we shed light on two types of safety evaluations in existing VLLMs: OOD situation and adversarial attack. 🧵👇
Thank
@_akhaliq
for highlighting our work.
We show autoregressive pretraining can create powerful vision backbones. E.g., our ViT-L attains an impressive 89.5% ImageNet accuracy. Larger models are coming soon.
Big shoutout to
@RenSucheng
for leading this incredible project! 🌟
Rejuvenating image-GPT as Strong Visual Representation Learners
paper page:
This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict next pixels for visual representation learning. Two simple
Excited to share our newest project: developing a comprehensive safety evaluation benchmark for Vision-LLMs.
Also, a shout-out to our awesome project leader,
@HaoqinT
, who will apply for PhD this year. He is strong and always energetic about research. Go get him!
Eagle and Finch
RWKV with Matrix-Valued States and Dynamic Recurrence
We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a
Happy to share that our D-iGPT is accepted by
#ICML2024
. By using only public datasets for training, our best model attains a strong 𝟗𝟎.𝟎% ImageNet accuracy with a vanilla ViT-H.
Code & Model:
Huge congrats to
@RenSucheng
🥳
@mcgenergy
@quocleix
@DigantaMisra1
Besides the functions we listed in the paper, we have actually tried a lot of other smooth activation functions (including Mish). The general conclusion is that as long as your function is a good smooth approximation of ReLU, it will work.
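To make the "smooth approximation of ReLU" point concrete, here is a tiny illustrative snippet (my own sketch, not code from the paper): softplus with a sharpness parameter beta approaches ReLU as beta grows while keeping a continuous, non-zero gradient everywhere, which is exactly the property that gives better-behaved gradients during adversarial training.

```python
# Illustrative only: ReLU vs. smooth ReLU-like functions with continuous gradients.
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7, requires_grad=True)

relu = F.relu(x)                     # kink at 0, zero gradient for x < 0
softplus = F.softplus(x, beta=10.0)  # smooth; approaches ReLU as beta grows
silu = F.silu(x)                     # another smooth ReLU-like function (a.k.a. Swish)

# The smooth variants propagate (small) gradients even where ReLU is flat,
# which helps the gradient computations used by attacks during training.
softplus.sum().backward()
print(x.grad)
```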
Which
#AI
#ComputerVision
models generalize and adapt best to new domains with different viewpoints, style, or missing/novel categories? Find out on Tue Dec 7 20:00 GMT at our
@NeurIPSConf
ViSDA'21 competition! Hear from winners and speakers; more at
What If We Recaption Billions of Web Images with LLaMA-3?
- Finetunes a LLaVA-1.5 and recaptions ~1.3B images from the DataComp-1B dataset
- Open-sources the resulting dataset
data:
proj:
abs:
To mitigate this issue, we follow the ViT Register work to also add register tokens into Mamba. Two tweaks are needed: 1) inserting registers evenly, and 2) reusing these registers for final decision-making. We term the resulting architecture 𝐌𝐚𝐦𝐛𝐚® (2/n)
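As I understand the two tweaks from this thread, a rough sketch could look like the following (illustration only; `backbone`, names, and sizes are placeholders, not the released 𝐌𝐚𝐦𝐛𝐚® code): registers are interleaved evenly among the patch tokens rather than prepended, and their outputs are concatenated to form the feature used for classification.

```python
# Illustration of the two register tweaks; not the official implementation.
import torch
import torch.nn as nn

class RegisterWrapper(nn.Module):
    def __init__(self, backbone, dim, num_patches, num_registers=12, num_classes=1000):
        super().__init__()
        self.backbone = backbone                      # any sequence model, e.g. a Mamba stack
        self.registers = nn.Parameter(torch.zeros(num_registers, dim))
        total = num_patches + num_registers
        # Tweak 1: positions that spread the registers evenly across the sequence.
        self.reg_pos = torch.linspace(0, total - 1, num_registers).long()
        self.head = nn.Linear(num_registers * dim, num_classes)

    def forward(self, patch_tokens):                  # [B, num_patches, dim]
        B, N, D = patch_tokens.shape
        R = self.registers.shape[0]
        is_reg = torch.zeros(N + R, dtype=torch.bool)
        is_reg[self.reg_pos] = True
        tokens = patch_tokens.new_empty(B, N + R, D)
        tokens[:, is_reg] = self.registers.expand(B, R, D)
        tokens[:, ~is_reg] = patch_tokens
        out = self.backbone(tokens)                   # [B, N + R, dim]
        # Tweak 2: reuse the register outputs (concatenated) for the final prediction.
        return self.head(out[:, is_reg].reshape(B, R * D))

# Shape check with an identity "backbone".
model = RegisterWrapper(nn.Identity(), dim=192, num_patches=196)
logits = model(torch.randn(2, 196, 192))              # -> [2, 1000]
```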
Happy first day of classes!!!
This year,
@ucsc
Baskin Engineering welcomes nine new faculty with specialties in statistical computing, computer security, human-robot interaction, wireless communications, serious games, AI, and more. Meet them here:
🚀 Excited about releasing MedTrinity-25M, a large-scale medical multimodal dataset with over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases
Try it at
MedTrinity-25M
A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine
discuss:
This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10
Exciting to share our
@NeurIPSConf
paper on defending against score-based query attacks 🥳
The leading author,
@MrSizheChen
, will apply for Ph.D. programs this year. He is a strong researcher and awesome collaborator; try to get him 😉
Excited that our work with Prof.
@cihangxie
has been accepted by
#NeurIPS2022
after an unforgettable rebuttal (2 4 4 5 -> 7 7 4 7). Looking forward to seeing you!
VoCo-LLaMA
Towards Vision Compression with Large Language Models
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing
Our Security and Safety in Machine Learning Systems Workshop at
@iclr_conf
(co-organized with Cihang Xie (
@cihangxie
), Ali Shafahi, Bo Li, Ding Zhao (
@zhao__ding
), Tom Goldstein, and Dawn Song (
@dawnsongtweets
)) is now accepting submissions, with a best paper prize! DDL: Feb 26.
Our
@CVPR
adversarial machine learning workshop starts now at EAST 3 meeting room. The first speaker is
@aleks_madry
, talking about “Preventing Data-driven Manipulation”
Pro tip for PhD students looking for a research internship: cold emailing works! Find someone you think would be a good fit, specify topics they've worked on that interest you (point to papers and customize the email!), and make the case for why you're a strong candidate.
Proud Advisor Moment - Huge congratulations to my student, Xianhang Li, for winning the esteemed 2024-25 Jack Baskin & Peggy Downes-Baskin Fellowship. Your hard work and dedication truly shine through! 🥳🥳
@BaskinEng
I will be at the CLIPA poster tomorrow morning from 10:45 am - 12:45 pm
Also I will be at
#NeurIPS2023
until this Saturday (Dec 16). Happy to chat with old friends and make new connections
Tomorrow (June 23) afternoon, we will be presenting our
@CVPR
work 𝐀 𝐒𝐢𝐦𝐩𝐥𝐞 𝐃𝐚𝐭𝐚 𝐌𝐢𝐱𝐢𝐧𝐠 𝐏𝐫𝐢𝐨𝐫 𝐟𝐨𝐫 𝐈𝐦𝐩𝐫𝐨𝐯𝐢𝐧𝐠 𝐒𝐞𝐥𝐟-𝐒𝐮𝐩𝐞𝐫𝐯𝐢𝐬𝐞𝐝 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 at 126b.
Come visit! Both
@HuiyuWang3
and Xianhang Li will be there!
📢 Introducing CLIPA-v2, our latest update to the efficient CLIP training framework! 🚀✨
🏆 Our top-performing model, H/14
@336x336
on DataComp-1B, achieves an impressive zero-shot ImageNet accuracy of 81.8!
⚡️ Plus, its estimated training cost is <$15k!
Want your networks to achieve better performance? Try AdvProp
#CVPR2020
! By using adversarial examples, AdvProp achieves 85.5% accuracy on ImageNet. Join me via at 10AM PT TODAY!
Joint work with
@tanmingxing
,
@BoqingGo
, Jiang Wang,
@YuilleAlan
and
@quocleix
Quantitatively, our 𝐌𝐚𝐦𝐛𝐚® also attains stronger performance and scales better. For example, our 𝐌𝐚𝐦𝐛𝐚®-B attains 82.9% ImageNet accuracy, largely outperforming its non-register counterpart by 1.1%. Furthermore, we provide the first successful scaling of Vision Mamba to the large model size
Meet Xianhang Li, a third-year PhD student making strides in video understanding and multimodal learning. 🌟 With 6 top-tier publications, he's a tech prodigy. Explore Xianhang's work at and CV at . (2/n)
📢
#CallForPaper
AROW Workshop is back
@eccvconf
! This year we have a great lineup of 9 speakers, and set 3 best paper prizes ($10k each, sponsored by
@ftxfuturefund
).
For more information, please check
Attending
#CVPR2020
?
Please check out our work on fooling object detectors in the physical world!
Q&A Time: 10:00–12:00 and 22:00–00:00 PT on June 16
Link:
Code:
Paper:
We also have a very cool demo👇
We will be presenting CLIPA at
@NeurIPSConf
on Tuesday, Dec 12.
This work shows that larger models are better at handling inputs with reduced token length, enabling faster training.
The code is released at , including G/14 at 83.0 zero-shot acc.
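For context, "reduced token length" means the image encoder sees far fewer patch tokens during pre-training. A minimal way to emulate the idea (my own hedged sketch; the released code implements the actual strategies, e.g. image resizing and random/block masking) is to randomly keep a subset of patch tokens before they enter the transformer:

```python
# Hedged illustration of training on a reduced image-token length by randomly
# subsampling patch tokens; not the official CLIPA implementation.
import torch

def subsample_tokens(patch_tokens, keep_ratio=0.25):
    """patch_tokens: [B, N, D] -> [B, int(N * keep_ratio), D]."""
    B, N, D = patch_tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    # Independently pick a random subset of token indices per example.
    idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]                 # [B, n_keep]
    return torch.gather(patch_tokens, 1, idx.unsqueeze(-1).expand(B, n_keep, D))

tokens = torch.randn(8, 256, 768)        # e.g. 16x16-patch tokens of a 256x256 image
short = subsample_tokens(tokens, 0.25)   # [8, 64, 768]: ~4x fewer tokens to process
```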
UCSC AI Group will be at
#NeurIPS23
and
#EMNLP23
, presenting research covering LLMs, T2I Generation/Editing/Evaluation, VLM Efficiency, Embodied Agents, Explainable AI, Machine Unlearning, Fairness ML, etc.
Check out the blogpost for detailed schedule: .
iBOT: Image BERT Pre-Training with Online Tokenizer
abs:
achieves an 81.6% linear probing accuracy and an 86.3% fine-tuning accuracy evaluated on ImageNet-1K
Anyway, I am very disappointed with
#neurips2020
--- rejecting my work without peer review AND not allowing a rebuttal.
If my work cannot be properly reviewed at a conference, then what is the value of submitting a paper there?
Meet another star PhD student, Zeyu Wang! 🌠 In his third year, Zeyu has already published 5 papers, covering topics of multimodal learning, self-supervised learning, and robustness. Learn more about Zeyu’s work at and CV at (3/n)
4/ 2] "Shape-Texture Debiased Neural Network Training": They developed a simple algorithm for shape-texture debiased learning; the key is to train models on images with conflicting shape and texture information & provide supervision from both shape and texture simultaneously.
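To make the "supervision from both cues" idea concrete, here is a toy sketch of the loss (my own illustration; producing the cue-conflict images themselves, e.g. via style transfer, is omitted):

```python
# Toy illustration of shape-texture debiased supervision: each cue-conflict image
# carries a shape label and a texture label, and both supervise the model.
import torch
import torch.nn.functional as F

def debiased_loss(logits, shape_label, texture_label, shape_weight=0.5):
    """logits: [B, num_classes]; labels: [B] class indices."""
    loss_shape = F.cross_entropy(logits, shape_label)
    loss_texture = F.cross_entropy(logits, texture_label)
    return shape_weight * loss_shape + (1.0 - shape_weight) * loss_texture

logits = torch.randn(4, 1000, requires_grad=True)
shape_y = torch.randint(0, 1000, (4,))
texture_y = torch.randint(0, 1000, (4,))
debiased_loss(logits, shape_y, texture_y).backward()
```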
Meet Sucheng, a remarkable first-year PhD student at
@JHUCompSci
! 🌟 Sucheng has an impressive 16 publications, covering multimodal learning, self-supervised learning, and knowledge distillation. Check his work at with CV at (5/n)
Say hello to Junfei Xiao from
@JHUCompSci
, a second-year PhD student with expertise in segmentation, multimodal learning, and large vision-language models, boasting 8 publications. 🚀 Discover his profile at and access CV at (4/n)
If you are working on robustness, please consider submitting your work(s) to our
@eccvconf
AROW workshop. 𝐖𝐞 𝐰𝐢𝐥𝐥 𝐬𝐞𝐥𝐞𝐜𝐭 𝐓𝐇𝐑𝐄𝐄 𝐛𝐞𝐬𝐭 𝐩𝐚𝐩𝐞𝐫 𝐚𝐰𝐚𝐫𝐝𝐬, 𝐞𝐚𝐜𝐡 𝐰𝐢𝐭𝐡 𝐚 $𝟏𝟎,𝟎𝟎𝟎 𝐩𝐫𝐢𝐳𝐞!
CMT site:
DDL: Aug 1, 2022
📢
#CallForPaper
AROW Workshop is back
@eccvconf
! This year we have a great lineup of 9 speakers, and set 3 best paper prizes ($10k each, sponsored by
@ftxfuturefund
).
For more information, please check
Implementation details affect the robust accuracy of adversarially trained models more than expected. For example, a slightly different value of weight decay can send TRADES back to
#1
in the AutoAttack benchmark. A unified training setting is necessary.
We are hosting a workshop on “Neural Architectures: Past, Present and Future” at
@ICCV_2021
, with 7 amazing speakers.
We invite submissions on any aspect of neural architectures. Both long papers and extended abstracts are welcome. Visit for more details
Qualitatively, our 𝐌𝐚𝐦𝐛𝐚® can effectively suppress those artifacts, making the feature maps look much cleaner. Additionally, our inserted register tokens can sometimes exhibit strong semantics, representing objects. (3/n)
Lastly, I would say that I am super privileged to collaborate with these incredibly talented students! 🌟 Their expertise and dedication are exceptional. Don't miss the chance to work with them – hire them now😉 (n/n)
I will be presenting this paper at the
#CVPR
Area Chair Workshop this Sunday.
Additionally, if you're attending
#CVPR
and want to dive deeper into this project, feel free to ping me for an in-person discussion. Looking forward to engaging with you all! (5/n)
Our SecML workshop starts now!
After
@xinyun_chen_
's opening remark,
@AlinaMOprea
will talk about "Machine Learning Integrity in Adversarial Environments" at 9am PT.
The link is here 👇
To validate its empirical benefits, we kicked off with CLIP model training as our first experiment.
Results show:
1) training with a mixture of the original captions and our recaptions boosts cross-modal retrieval significantly, despite a tiny dip in ImageNet accuracy.
2) Using
Same here, with two very vague comments: (1) not supported with sufficient empirical evidence, or important baselines are missing; (2) not sufficiently positioned with respect to prior work, either in terms of the presentation or the empirical validation.
Oh, great. Neurips desk reject. If the results are “known results” (I’m pretty confident they aren’t), perhaps the area chair could have made the effort to leave a note telling what they are? 1/n