So, this is what we were up to for a while :)
Building SOTA foundation models for media -- text-to-video, video editing, personalized videos, video-to-audio
One of the most exciting projects I got to tech lead at my time in Meta!
🎥 Today we’re premiering Meta Movie Gen: the most advanced media foundation models to-date.
Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in
We released 92 pages' worth of detail, including how to benchmark these models! Super critical for scientific progress in this field :) We'll also release evaluation benchmarks next week to help the research community 💪
World, meet #emuvideo
For the past year, our team has been pushing on video generation. The result? Emu Video, which generates high-quality videos from text or images. SOTA performance vs. commercial products and academic papers. Check it out
Today we’re sharing two new advances in our generative AI research: Emu Video & Emu Edit.
Details ➡️
These new models deliver exciting results in high quality, diffusion-based text-to-video generation & controlled image editing w/ text instructions.
🧵
Thank you @MetaAI! Would not have been possible without the awesome collaborators and mentors at FAIR (Meta) and CMU!
And thanks to @techreview for the award :)
Congratulations to our very own Meta AI researcher @imisra_ for being named one of @TechReview's #35InnovatorsUnder35! Ishan has been one of our leaders in advancing self-supervised learning.
Here's my conversation with Ishan Misra (@imisra_), research scientist at FAIR, working on self-supervised visual learning: getting machines to understand images & video with little help from humans. This is one of the most exciting topics in AI today.
We released ImageBind - a multimodal model for six different modalities trained using self-supervised learning.
Mark announced it in his video
Also, thanks @_akhaliq for sharing :)
ImageBind: Holistic AI learning across six modalities
ImageBind learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It enables novel emergent applications ‘out-of-the-box’ including cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation.
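Roughly how cross-modal retrieval works once everything lives in one embedding space: embed the query and the gallery with their per-modality encoders, then rank by cosine similarity. A minimal sketch; `encode_audio`/`encode_image` are hypothetical stand-ins, not ImageBind's actual API.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
    """Indices of the k nearest gallery items by cosine similarity."""
    q = F.normalize(query_emb, dim=-1)      # (d,)
    g = F.normalize(gallery_embs, dim=-1)   # (n, d)
    return (g @ q).topk(k).indices

# Audio -> image retrieval, assuming hypothetical per-modality encoders:
# audio_emb = encode_audio(waveform)                            # (d,)
# image_embs = torch.stack([encode_image(im) for im in images]) # (n, d)
# top_images = retrieve(audio_emb, image_embs, k=5)
```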
Check out the generative vision-related release too
Imagine Flash generates the image as you type
You can also "Animate" your images! (technique based on Emu Video)
Kudos to the team for putting this out :)
Our latest work in SSL learns with 100x fewer labeled images than previous SOTA methods like MAE and DINO
Huge shout out to @mido_assran and Nicolas Ballas for really pushing the boundaries of SSL!
Tired of manually annotating vast quantities of data to achieve good performance? Masked Siamese Networks (MSN) is our self-supervised framework for learning image representations. So what does that mean?
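In plain terms: MSN masks out most patches of one view of an image and trains the network so that the masked view gets assigned to the same learned prototypes as a clean view, with no labels needed. A very rough sketch of that matching loss (the EMA target encoder, ME-MAX regularizer, and exact temperatures from the paper are omitted; values here are illustrative):

```python
import torch
import torch.nn.functional as F

def msn_loss(encoder, prototypes, anchor_masked, target,
             t_anchor: float = 0.1, t_target: float = 0.025):
    # prototypes: (K, d) learnable cluster centers
    z_a = F.normalize(encoder(anchor_masked), dim=-1)   # masked view
    with torch.no_grad():
        z_t = F.normalize(encoder(target), dim=-1)      # clean view (target)
    protos = F.normalize(prototypes, dim=-1)
    p_a = F.softmax(z_a @ protos.t() / t_anchor, dim=-1)
    p_t = F.softmax(z_t @ protos.t() / t_target, dim=-1)  # sharper target
    # Cross-entropy between the two prototype assignments
    return -(p_t * torch.log(p_a + 1e-8)).sum(-1).mean()
```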
Two exciting updates on Movie Gen
(1) MovieGenBench, containing thousands of *random* generations for benchmarking video/audio tasks :)
(2) Folks in Hollywood (Casey Affleck, Blumhouse Productions) took Movie Gen for a spin:
As detailed in the Meta Movie Gen technical report, today we’re open sourcing Movie Gen Bench: two new media generation benchmarks that we hope will help to enable the AI research community to progress work on more capable audio and video generation models.
Movie Gen Video Bench
Attending #NeurIPS2022 this year. Super excited to see the awesome research! Organizing the 3rd SSL workshop at NeurIPS (1st time in-person) on December 3
FAIR research scientist Ishan Misra (@imisra_) sat down with @lexfridman to demystify self-supervised learning & its impact in #AI. Read the blog post that inspired the conversation:
[1/2] Late to arXiv.
Happy to share my recent work on using Transformers for 3D recognition.
Inspired by DETR, we propose 3DETR, which works well for 3D detection and classification.
Simple to understand, implement, and extend!
And yes, we released a full paper!
It was a mammoth effort with such an amazing team :)
Here's the first line (one of my favorites) of our paper as a video
Why train separate models for visual modalities?
Following up on our Omnivore work: we train a single model on images and videos using no labels! Joint work with @_rohitgirdhar_, @alaaelnouby, Mannat, Kalyan and @armandjoulin!
PS: Omnivore is an oral at CVPR 2022, see you there!
OmniMAE: Single Model Masked Pretraining on Images and Videos
abs:
A single pretrained model can be finetuned to achieve 86.5% on ImageNet and 75.3% on the challenging Something Something-v2 video benchmark.
Introducing *Transfusion* - a unified approach for training models that can generate both text and images.
Transfusion combines language modeling (next-token prediction) with diffusion to train a single transformer over mixed-modality sequences.
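A back-of-the-envelope sketch of what "language modeling + diffusion in one transformer" can look like as a training objective: cross-entropy on the text tokens plus a noise-prediction loss on the image latents. The loss weight, noise schedule (`add_noise`), and the model interface are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def add_noise(x, noise, t):
    # Simple linear interpolation toward noise; real schedules differ.
    t = t.view(-1, *([1] * (x.dim() - 1)))
    return (1 - t) * x + t * noise

def transfusion_style_loss(model, text_tokens, image_latents, lam: float = 1.0):
    # Diffusion side: corrupt the image latents, ask the model to predict the noise.
    t = torch.rand(image_latents.shape[0], device=image_latents.device)
    noise = torch.randn_like(image_latents)
    noisy_latents = add_noise(image_latents, noise, t)

    # One transformer forward over the mixed text/image sequence
    # (interface assumed: returns text logits and a noise prediction).
    text_logits, noise_pred = model(text_tokens[:, :-1], noisy_latents, t)

    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_tokens[:, 1:].reshape(-1),
    )
    diffusion_loss = F.mse_loss(noise_pred, noise)
    return lm_loss + lam * diffusion_loss
```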
CutLER: a self-supervised detection/segmentation model. Surpasses prev SOTA by 2.7x on 11 datasets across various image domains! Awesome work by @XDWang101!
🎉CutLER is accepted to #CVPR2023!
tl;dr: CutLER is an unsupervised object detector surpassing prev SOTA by 2.7x on 11 datasets across various domains, e.g. natural images, paintings, sketches.
Code/demos are released!
Developed during my internship at FAIR (Meta AI) @MetaAI
1/n
We’re sharing new research on using the natural association between video & sound to teach machines to better understand the world. Our self-supervised approach, which is a #CVPR21 best paper candidate, learns directly from sounds & images in videos.
It’s here! Meet Llama 3, our latest generation of models that is setting a new standard for state-of-the-art performance and efficiency for openly available LLMs.
Key highlights
• 8B and 70B parameter openly available pre-trained and fine-tuned models.
• Trained on more
🎉We are thrilled to announce the NeurIPS 2024 workshop on Self-supervised Learning - Theory and Practice. We invite submissions of both theoretical works and empirical works on SSL!
Website:
OpenReview:
1/n
Learn all about self-supervised learning for vision with @imisra_! In this lecture, Ishan covers pretext-invariant representation learning (PIRL), swapping assignments between views (SwAV), audio-visual discrimination (AVID + CMA), and Barlow Twins redundancy reduction.
The 4th workshop on "Self-Supervised Learning - Theory and Practice" will take place at @NeurIPSConf 2023 in December! We have a great list of speakers, and we are still open for submissions:
This is quite something! Input detailed image detections to ChatGPT => incredibly detailed natural language description of the image, where it fills in spatial relations.
@imisra_ Building on @taesiri's clever exploration, we get a rich description of a scene if we combine Detic's superior object detection with GRiT's dense captioning (SOTA) of individual objects to create the input to ChatGPT. Remarkable spatial sense in the description.
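The underlying trick is simple enough to sketch: serialize the detections (and per-region captions) into text and let the chat model reason about spatial layout. The inputs below are placeholders for illustration, not actual Detic/GRiT API calls.

```python
def build_prompt(detections):
    """Turn object detections + region captions into a text prompt for an LLM."""
    lines = ["Objects detected in a 640x480 image (label, caption, box x1,y1,x2,y2):"]
    for d in detections:
        lines.append(f"- {d['label']}: \"{d['caption']}\" at {d['box']}")
    lines.append("Describe the scene in natural language, including spatial relations.")
    return "\n".join(lines)

# Placeholder detections standing in for Detic boxes + GRiT region captions:
detections = [
    {"label": "dog", "caption": "a brown dog lying down", "box": (40, 300, 220, 460)},
    {"label": "sofa", "caption": "a grey fabric sofa", "box": (0, 180, 640, 480)},
]
prompt = build_prompt(detections)  # send this to the chat model of your choice
```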
What do the Vision Transformers learn? How do they encode anything useful for image recognition? In our latest work, we reimplement a number of works done in this area & investigate various ViT model families (DeiT, DINO, original, etc.).
Done w/ @ariG23498
1/
Since I started working on generative models, I've felt the most compelling application would be something that helps one learn new skills. Excited to share our first steps toward that goal: generating bespoke instructions with illustrations to solve your task! w/ @SachitMenon @imisra_
Excited to present our work, "The effectiveness of MAE pre-pretraining for billion-scale pretraining" () at #ICCV2023 today. Our strong foundational MAWS models, with multilingual CLIP capabilities, are now available publicly at .
If you're still at CVPR today, check out our paper "Omnivore: A Single Model for Many Visual Modalities".
Our SoTA model excels at classifying images, videos, and single-view 3D data using exactly the same model parameters and without access to correspondences between modalities.
I’m thrilled and proud to share our model, Movie Gen, that we've been working on for the past year, and in particular, Movie Gen Edit, for precise video editing. 😍
Look how Movie Gen edited my video!
Meet Dobb·E: a home robot system that needs just 5 minutes of human teaching to learn new tasks. Dobb·E has visited 10 homes, learned 100+ tasks, and we are just getting started!
Dobb·E is fully open-sourced (including hardware, models, and software):
🧵
Excited to share what we've been up to this year: Emu Video! A SOTA video generation system from text or images. In the spirit of the upcoming holidays, here's a holiday greeting featuring a cute little penguin, Frosty, powered by Emu Video! (🔊 on)
#emuvideo
And it was so simple to implement! Such a powerful technique 😊 Kudos to @RickyTQChen, @mnick, Yaron Lipman and others! Super insightful discussions with you all
Lots of goodies (aka actual explanations & research results🧐) in the Movie Gen technical report: .
1) Flow matching val loss correlates with human evaluation.
2) Human evaluation strongly prefers flows over diffusion on both quality and text alignment.
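For reference, this is roughly what a flow-matching training step looks like under the common straight-line (linear interpolation) path: the model regresses the constant velocity pointing from noise to data. A generic sketch, not necessarily Movie Gen's exact parameterization.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, data, optimizer):
    noise = torch.randn_like(data)
    t = torch.rand(data.shape[0], device=data.device)
    t_ = t.view(-1, *([1] * (data.dim() - 1)))
    x_t = (1.0 - t_) * noise + t_ * data   # point on the straight-line path
    target_v = data - noise                # constant velocity along that path
    pred_v = model(x_t, t)                 # model predicts the velocity field
    loss = F.mse_loss(pred_v, target_v)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```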
[2/2] We show that using better positional encodings and non-parametric queries is critical for 3D detection.
Transformers are also good encoders of 3D point data and work well for classification.
Accepted as an ICCV’21 Oral
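A rough illustration of the "non-parametric queries" idea: instead of learning a fixed set of query vectors, sample query locations from the point cloud itself (e.g. via farthest point sampling) and embed their coordinates with Fourier positional encodings. Shapes and frequencies below are illustrative, not the paper's exact settings.

```python
import torch

def farthest_point_sample(xyz: torch.Tensor, n_query: int) -> torch.Tensor:
    # xyz: (N, 3). Greedy FPS; fine for a sketch, not optimized.
    N = xyz.shape[0]
    chosen = [torch.randint(N, (1,)).item()]
    dist = torch.full((N,), float("inf"))
    for _ in range(n_query - 1):
        dist = torch.minimum(dist, ((xyz - xyz[chosen[-1]]) ** 2).sum(-1))
        chosen.append(int(dist.argmax()))
    return xyz[chosen]                      # (n_query, 3)

def fourier_pos_encoding(xyz: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    freqs = 2.0 ** torch.arange(num_freqs)  # (F,)
    angles = xyz[..., None] * freqs         # (n_query, 3, F)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)  # (n_query, 6F)

points = torch.rand(2048, 3)                # toy point cloud
queries = fourier_pos_encoding(farthest_point_sample(points, n_query=128))
```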
Amazing to see how @blumhouse creators found our models to be helpful tools for human<->AI collaboration.
Not a stretch to imagine generative video models becoming a standard tool for moviemakers and content creators.
The Movie Gen pilot program is out for creatives!
Today, we’re sharing initial results from our work with @Blumhouse and select creators as we continue to develop our Meta Movie Gen models. We’re excited to expand this pilot program in collaboration with the creative industry in 2025 ➡️
Our founding team covers many AI fields, from vision with Patrick Pérez and Hervé Jégou (@hjegou), to LLMs with Edouard Grave (@EXGRV), audio with Neil Zeghidour (@neilzegh) and Alexandre Défossez (@honualx), and infra with Laurent Mazaré (@lmazare).
This is big. Meta just announced and open-sourced a new powerful Multimodal AI that combines six types of data.
Inspired by humans, ImageBind is the first to combine six types of data into a single embedding space. It understands images, video, audio, depth, thermal, and IMU data.
Meet Emu in this Emu Video. It has been so much fun to create a movie just from a few lines of story. Excited about the prospect of a future where everyone can create their own films almost instantaneously. Video generated with #emuvideo and speech from #Voicebox text-to-speech.
Does your classifier know when it doesn't know?
We ask how well a standard, properly trained classifier can detect if a test image comes from an 'unseen' class. Surprisingly, it can get SoTA!
#ICLR2022
Oral
Project Page:
Code:
🧵
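For context, the simplest version of such an unseen-class detector scores test images by the classifier's own confidence. A standard max-softmax baseline for illustration, not necessarily the paper's scoring rule.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def unseen_score(classifier, images):
    logits = classifier(images)              # (B, num_seen_classes)
    probs = F.softmax(logits, dim=-1)
    confidence = probs.max(dim=-1).values    # high on seen classes, lower on unseen
    return 1.0 - confidence                  # larger = more likely 'unseen'
```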
A new strategy (targeted rather than untargeted, incentive rather than penalty): are you in NYC, done with your #NeurIPS2022 reviews, and without a life (i.e. no plans later this afternoon)? If yes, DM me to claim your deserved glass of beer/wine/etc. on me later today.