Videos are cool and all...but everything's more fun when it's interactive.
Check out our new project, ✨CAT3D✨, that turns anything (text, image, & more) into interactive 3D scenes!
Don't miss the demo!!
🌟 Create anything in 3D! 🌟 Introducing CAT3D: a new method that generates high-fidelity 3D scenes from any number of real or generated images in one minute, powered by multi-view diffusion models.
w/ lovely coauthors @holynski_, @poolio, and an amazing team!
Check out our new paper that turns a (single image) => (interactive dynamic scene)!
I’ve had so much fun playing around with this demo.
Try it out yourself on the website:
Excited to share our work on Generative Image Dynamics!
We learn a generative image-space prior for scene dynamics, which can turn a still photo into a seamless looping video or let you interact with objects in the picture. Check out the interactive demo:
Excited to show off our new project on single-image cinemagraphs. Our method automatically turns a _single image_ into a seamlessly looping video!
Website:
Video:
w/ Brian Curless, Steve Seitz, Rick Szeliski
More in thread! [1/5]
We just posted a report on the state of the art in diffusion models for visual computing:
If you're new to diffusion models, or maybe just want a recap of everything that's been going on lately---this is a great place to start.
Excited to share self-guidance, a new method for controllable image generation that guides sampling using only the attention and activations of a pretrained diffusion model:
Work led by Dave Epstein w/ @ajabri, @poolio, and Alyosha Efros.
More in thread 🧵
Happy to finally be able to share our
#CVPR2022
paper, InstructPix2Pix!
We taught a diffusion model how to follow image editing instructions — just say how you want to edit an image, and it’ll do it!
(w/ Tim Brooks & Alyosha Efros)
More on Tim’s site:
🧵
We posted an updated version of Generative Image Dynamics to arXiv---the biggest change is to better contextualize our method with respect to prior work in image-space motion analysis, especially the great work of @AbeDavis.
It turns out images contain lots of useful cues about how things should be flowing -- like ripples in water, turbulent streams, motion blur. An image-to-image GAN learns a lot of these subtle cues, and can synthesize pretty complex motion.
[3/5]
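If it helps to see the rough shape of that step in code, here's a toy sketch (my own, not the paper's architecture, and it omits the adversarial training entirely): an image-to-image network that maps an RGB image to a dense 2-channel motion field.

```python
# Minimal sketch (not the paper's actual model): predict a per-pixel (dx, dy)
# motion field from a single RGB image with a small encoder-decoder.
import torch
import torch.nn as nn

class ImageToMotion(nn.Module):
    """Toy image-to-image network: RGB image -> dense 2-channel motion field."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),  # 2 channels: horizontal / vertical motion
        )

    def forward(self, image):       # image: (B, 3, H, W)
        return self.net(image)      # motion: (B, 2, H, W)

motion = ImageToMotion()(torch.rand(1, 3, 256, 256))
print(motion.shape)  # torch.Size([1, 2, 256, 256])
```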
Here's another result:
Check out @xiaojuan_wang7's new project!
🔎Generative Powers of Ten🔍
Use a pre-trained text-to-image model to generate deeeeep zoom videos!
(Excuse Twitter's terrible compression, check the webpage instead: )
We focus on fluids (flowing water, billowing smoke, clouds), i.e., things well approximated by particle motion. So, instead of predicting a sequence of flow fields for a video, we can predict a single Eulerian motion field (a particle velocity field).
[2/5]
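A toy sketch of how a single static velocity field can drive a whole video (my own simplified code, with nearest-neighbor lookups instead of proper bilinear sampling): each pixel's displacement at time t is obtained by Euler-integrating the field.

```python
# Sketch of the idea (toy code, not the paper's implementation): a single static
# Eulerian velocity field M is integrated forward in time to get the displacement
# of every source pixel at each future frame.
import numpy as np

def integrate_displacement(velocity, num_steps):
    """velocity: (H, W, 2) static motion field in pixels/frame.
    Returns displacement fields of shape (num_steps, H, W, 2)."""
    H, W, _ = velocity.shape
    ys, xs = np.mgrid[0:H, 0:W]
    disp = np.zeros((H, W, 2), dtype=np.float32)
    out = []
    for _ in range(num_steps):
        # Look up the velocity at each particle's current (displaced) position.
        px = np.clip(xs + disp[..., 0], 0, W - 1).astype(int)
        py = np.clip(ys + disp[..., 1], 0, H - 1).astype(int)
        disp = disp + velocity[py, px]
        out.append(disp.copy())
    return np.stack(out)

D = integrate_displacement(np.random.randn(64, 64, 2).astype(np.float32) * 0.1, 10)
print(D.shape)  # (10, 64, 64, 2)
```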
To generate the video frames, we use a deep warping technique (encode-warp-decode). Since warping a single image usually leads to big holes, we use a novel symmetric splatting approach, which combines features from different points in time to produce more realistic images.
[4/5]
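The blending weights are the easiest part to show concretely. A rough sketch of the symmetric idea (toy code, heavily simplified; the full method also splats and normalizes per-pixel weights): features warped forward from the input image and features warped "backward" from the end of the loop are mixed so that whichever source is temporally closer dominates, which also makes the video loop.

```python
# Toy sketch of the symmetric blending weights (my own simplification).
import numpy as np

def symmetric_blend(feat_fwd, feat_bwd, t, num_frames):
    """feat_fwd: features warped forward to time t; feat_bwd: warped back from time num_frames."""
    w_fwd = 1.0 - t / num_frames   # forward-warped features dominate early in the loop
    w_bwd = t / num_frames         # backward-warped features dominate late, closing the loop
    return w_fwd * feat_fwd + w_bwd * feat_bwd

blended = symmetric_blend(np.ones((4, 4, 8)), np.zeros((4, 4, 8)), t=15, num_frames=60)
print(blended[0, 0, 0])  # 0.75
```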
Google presents CAT3D
Create Anything in 3D with Multi-View Diffusion Models
Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating
Super neat! An interactive diffusion-based Photoshop.
A great example of how the right interfaces and controls can make a massive difference in the utility of these generative models.
We are thrilled to announce "Layered Diffusion Brushes": a real-time training-free image editor powered by diffusion models. 🎨✨ This is new work from my PhD student Peyman Gholami @peymo0n.
Explore the interactive demo and check out more videos at:
🔮Readout Guidance🔮 is a neat way of controlling diffusion models (in pretty complex ways!)
See the site () for applications and interactive galleries.
Here's one favorite: where we guide the identity in a generated image to match a reference image.
Guidance on top of diffusion models can now be used to drag and manipulate images, create pose-conditioned images, and so much more! Check out Readout Guidance:
Work w/ @trevordarrell, @oliver_wang2, @danbgoldman, @holynski_. More in thread 🧵.
We're hosting a CVPR workshop on AI-assisted art---a big focus is to understand how AI models are currently being used in artistic workflows (to help inspire the next generation of better, more useful AI tools).
We are excited to share that the AI for Creative Visual Content Generation, Editing and Understanding Workshop @cveu_workshop has been accepted to #CVPR2024 @CVPRConf. See you in Seattle, where art, tech, and creativity meet! It's our first time coming to the US 🇺🇸
I'll be talking about our paper "Animating Pictures with Eulerian Motion Fields" this evening at Paper Session #5 (10pm-12am ET, 7pm-9pm PT). Come say hi!
Come hang out at this #CVPR2024 workshop we're organizing!
Learn from researchers & artists about new creative applications, open technical challenges, & more.
The event is in-person only---no recording, no streaming! Don't miss out!
@CVPR
I'm co-organizing a CVPR workshop next Tuesday that is absolutely stacked with talent. If you're interested in anything related to art or generative video (eg Sora, Veo, Pika, Runway), be there.
Our latest work on making Consistent Video Depth more ROBUST. Works great for casual phone videos that are really difficult for previous methods.
Another great collaboration with @jastarex and @jbhuang0604.
arXiv:
Project:
Introducing Eclipse, a method for recovering lighting and materials even from diffuse objects!
The key idea is that standard "NeRF-like" data has all we need: a photographer moving around a scene to capture it causes "accidental" lighting variations. (1/3)
check out dave's project!
automatically decomposes complex 3D scenes into individual objects (without relying on per-object text descriptions or annotations!)
a neat central insight: think of objects as "parts of a scene that can be moved around independently"
text-to-3d scenes that are automatically decomposed into the objects they contain, using only an image diffusion model & no other supervision:
work w/ @poolio, @BenMildenhall, Alyosha Efros, and @holynski_
Come check out our paper at 3DV today!
(6a PST oral / 8:30a PST poster)
We use vanishing points and planes to get rid of pose drift in SfM.
"Reducing Drift in Structure from Motion Using Extended Features"
Project page:
Video:
View synthesis is super cool! How can we push it further to generate the world *far* beyond the edges of an image? We present Infinite Nature, a method that combines image synthesis and 3D to generate long videos of natural scenes from a single image.
We trained the model on a massive dataset of generated editing examples, with triplets containing:
1. input image
2. text editing instruction
3. output image
How does one generate a dataset like this, you might ask?
If you’re interested in this stuff, I’d highly recommend reading Abe’s thesis, which includes a thorough and beautiful theory about the underlying frequency-space motion representation and how it connects to modeling object dynamics.
Diffusion models let you create amazing images given the right prompt.
But some things are hard to express in text, like where objects should go or exactly how big they should be.
How can we get this kind of control?
Self-guidance offers a new way to control the generation process:
Without any extra models or training, we can extract properties like object shape, size, and appearance from internal attention maps + activations.
We can then guide these properties to edit generated images.
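For intuition, here's a loose sketch of what "guiding a property" can look like (entirely my own toy code, with a stand-in for the model's attention; none of these names come from the paper): read a differentiable property out of an attention map, take the gradient of a matching loss with respect to the latent, and fold that gradient into each sampling step.

```python
# Toy illustration of guidance-by-gradient on an attention-derived property.
import torch

def object_centroid(attn):
    """attn: (H, W) nonnegative attention weights; returns a soft (x, y) centroid."""
    H, W = attn.shape
    ys = torch.arange(H, dtype=attn.dtype).view(H, 1)
    xs = torch.arange(W, dtype=attn.dtype).view(1, W)
    p = attn / attn.sum()
    return torch.stack([(p * xs).sum(), (p * ys).sum()])

def self_guidance_grad(latent, attn_fn, target_xy):
    """Gradient (w.r.t. the latent) of a loss on a property read out from attention."""
    latent = latent.detach().requires_grad_(True)
    attn = attn_fn(latent)                                  # attention as a function of the latent
    loss = ((object_centroid(attn) - target_xy) ** 2).sum() # "move the object here"
    return torch.autograd.grad(loss, latent)[0]

# Toy stand-in for "attention computed from the latent".
toy_attn = lambda z: torch.softmax(z.flatten(), dim=0).view(z.shape)
z = torch.randn(16, 16)
g = self_guidance_grad(z, toy_attn, target_xy=torch.tensor([4.0, 12.0]))
# In a sampler you would subtract scale * g from the usual update at each denoising step.
print(g.shape)  # torch.Size([16, 16])
```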
Finally, fine-tune a text-to-image diffusion model to learn this transformation, conditioned on the input image and the instruction!
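Roughly, one training step for such a model might look like the toy sketch below (my simplification, with made-up names; the real system is a latent diffusion UNet, and the exact conditioning follows the paper rather than this toy): the noisy latent of the edited image is denoised given the input image's latent and an embedding of the instruction.

```python
# Toy sketch of an instruction-conditioned editing denoiser and one training step.
import torch
import torch.nn as nn

class ToyEditDenoiser(nn.Module):
    """Predicts noise from [noisy target latent ; input-image latent] + an instruction embedding."""
    def __init__(self, c_lat=4, c_text=64):
        super().__init__()
        self.conv = nn.Conv2d(2 * c_lat, c_lat, 3, padding=1)  # condition by channel concat
        self.text = nn.Linear(c_text, c_lat)                    # inject the instruction embedding

    def forward(self, noisy, cond_image, instr_emb):
        h = self.conv(torch.cat([noisy, cond_image], dim=1))
        return h + self.text(instr_emb)[:, :, None, None]

model = ToyEditDenoiser()
target = torch.randn(2, 4, 32, 32)   # latent of the edited (output) image
cond   = torch.randn(2, 4, 32, 32)   # latent of the input image (the conditioning)
instr  = torch.randn(2, 64)          # embedding of the edit instruction
noise  = torch.randn_like(target)
x_t = target + noise                 # toy "noising" (a real model uses a noise schedule)
loss = ((model(x_t, cond, instr) - noise) ** 2).mean()  # simple noise-prediction loss
loss.backward()
```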
Here's one of our favorites, but you can try it for yourself with the demo on the website:
Robust Consistent Video Depth Estimation
@JPKopf, @jastarex, @jbhuang0604
Jointly estimates camera pose & dense depth for challenging video captures of dynamic scenes
Sora is our first video generation model - it can create HD videos up to 1 min long. AGI will be able to simulate the physical world, and Sora is a key step in that direction. Thrilled to have worked on this with @billpeeb at @openai for the past year.
I was blown away by the incredible results of animating fluid motion from a single image last week.
I thought it would be fun to add a bit of 3D. Here are some results using 3D photo inpainting
We present nerfies! We use selfie videos to create 3d free-viewpoint portrait visualizations of yourself using Deformable NeRFs! More details at and below (1/8)
Self-guidance also works on real images, which allows you to "borrow" real objects and stick them in new contexts, sort of like a zero-shot DreamBooth.
Excited to release v2.0 of our Background Matting project, which is now REAL-TIME & BETTER quality: 60fps at FHD and 30fps at 4K! You can use this with Zoom, check out our demo! 👇
Webpage:
Video:
More in the thread! [1/6]
Come say hi next week at CVPR!
We’ll be presenting the InstructPix2Pix poster on Thursday morning — and I’ll also be giving a talk on this at the Multimodal Learning (MULA) workshop on Sunday.
@CVPR
#CVPR2023
By combining a large language model (GPT-3) and a text-to-image model (Stable Diffusion)!
First, fine-tune the LLM on a small collection of human-written examples, and use it to generate a dataset of text triplets:
1. image caption
2. edit instruction
3. caption after the edit
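To make the shape of that data concrete, here's what one text triplet looks like (the example is purely illustrative, and `llm_generate` is just a placeholder for whatever completion API is used):

```python
# Sketch of the text triplet produced by the fine-tuned LLM (illustrative only).
from dataclasses import dataclass

@dataclass
class EditTriplet:
    input_caption: str    # 1. caption describing the original image
    instruction: str      # 2. the edit instruction
    edited_caption: str   # 3. caption describing the image after the edit

def make_triplet(input_caption: str, llm_generate) -> EditTriplet:
    """llm_generate is a placeholder that returns (instruction, edited_caption)."""
    prompt = (f"Caption: {input_caption}\n"
              "Write an edit instruction and the caption of the edited image.")
    instruction, edited_caption = llm_generate(prompt)
    return EditTriplet(input_caption, instruction, edited_caption)

# What one generated triplet might look like:
example = EditTriplet("photograph of a girl riding a horse",
                      "have her ride a dragon",
                      "photograph of a girl riding a dragon")
```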
For example, you can use self-guidance to move or resize an object (like this donut) — or even replace it with an item from a real image — all without changing the rest of the scene.
We can use self-guidance to edit entire images at once, too. For example, we can copy the appearance or inherit the layout of another scene — basically re-styling or re-composing any image!
@JPKopf
But I'm curious what the best way is to handle transient objects, like the kids in most of the RCVD videos.
Maybe something like action shots ()
Create a couple instances of the person, so there's always something interesting in every direction
Learning To Recover 3D Scene Shape From a Single Image
Wei Yin, @oliverwang81, @simon_niklaus & co
Uses learned priors to estimate the unknown focal and shift required for unprojecting monocular depth estimates to structurally-correct point clouds
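The unprojection step itself is just the pinhole model; here's a back-of-the-envelope sketch (my own code, not theirs) of how an estimated focal length and depth shift turn a monocular depth map into a point cloud:

```python
# Toy sketch: unproject a depth map to 3D given an estimated focal length and depth shift.
import numpy as np

def unproject(depth, focal, shift):
    """depth: (H, W) predicted depth; focal: focal length in pixels; shift: additive depth offset."""
    H, W = depth.shape
    vs, us = np.mgrid[0:H, 0:W]
    z = depth + shift                      # correct the unknown depth shift
    x = (us - W / 2.0) * z / focal         # pinhole unprojection (principal point at the center)
    y = (vs - H / 2.0) * z / focal
    return np.stack([x, y, z], axis=-1)    # (H, W, 3) point cloud

pts = unproject(np.ones((120, 160)), focal=200.0, shift=0.5)
print(pts.shape)  # (120, 160, 3)
```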
We (David Griffiths and Tobias Ritschel) are happy to finally share our latest relighting paper, OutCast: Outdoor Single-image Relighting with Cast Shadows. To be presented next week at Eurographics 2022.
Paper, results and more at: . (1/4)
@ericmchen1
Sure, here are a couple from a cool sculpture I saw at the MoMA.
The samples aren't 100% consistent, but definitely within the ballpark of what a well-tuned NeRF pipeline can handle (just as it might handle inconsistencies in real captures, eg minor dynamics & lighting changes)
Rendering a NeRF is slow, so we found a way to "bake" NeRFs into something more GPU-friendly. The result: real time NeRF rendering in your browser. Try it now before the internet hugs our server to death.
@PeterHedman3
@_pratul_
@BenMildenhall
@debfx
Abe’s work on Image-Space Modal Bases was an important inspiration for our paper, and the new version of the paper better reflects this, connecting some of the terms we used in our original draft with those used in prior work.
Super cool stuff from @yoni_kasten, @dolevofri, @oliverwang81, and @talidekel
Decomposes a video into two canonical templates/atlases (represented by MLPs) that can be edited --- allowing edits to propagate through a whole video. Sort of like MLP Unwrap Mosaics. Awesome results!
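A toy sketch of the atlas idea (my own simplification; the actual method uses separate foreground/background atlases plus opacities): one MLP maps each video coordinate (x, y, t) into a canonical 2D atlas, a second MLP stores the atlas colors, and editing the atlas therefore edits every frame at once.

```python
# Toy sketch of a single neural atlas (not the authors' code).
import torch
import torch.nn as nn

mlp = lambda d_in, d_out: nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))
mapping = mlp(3, 2)   # (x, y, t) -> (u, v) coordinates in the canonical atlas
atlas   = mlp(2, 3)   # (u, v)    -> RGB color stored in the atlas

xyt = torch.rand(1024, 3)     # sampled video coordinates
rgb = atlas(mapping(xyt))     # reconstructed colors; editing `atlas` changes all frames
print(rgb.shape)              # torch.Size([1024, 3])
```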
Excited to share our work "Layered Neural Atlases for Consistent Video Editing".
Paper:
Project page:
(with @dolevofri, @oliverwang81 and @talidekel, to appear in SIGGRAPH Asia'21) (1/9)
Our work uses these insights about converting dense 2D motion trajectories to the frequency domain, and shows that this spectral volume representation is also an efficient and effective one for __generating__ long-term motion from a single, still image.
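In case the term is unfamiliar, a spectral volume is (roughly) just the temporal FFT of each pixel's motion trajectory, truncated to a few low frequencies. A tiny illustrative sketch with made-up shapes:

```python
# Toy sketch of a "spectral volume": per-pixel motion trajectories transformed
# to the frequency domain, keeping only the first few frequency bands.
import numpy as np

def spectral_volume(trajectories, num_freqs=16):
    """trajectories: (T, H, W, 2) per-pixel 2D displacement over T frames.
    Returns complex coefficients of shape (num_freqs, H, W, 2)."""
    coeffs = np.fft.rfft(trajectories, axis=0)   # FFT along the time axis
    return coeffs[:num_freqs]                    # low frequencies capture most natural motion

S = spectral_volume(np.random.randn(150, 32, 32, 2))
print(S.shape, S.dtype)  # (16, 32, 32, 2) complex128
```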
@jbhuang0604
Wow, this is awesome! This was exactly what was missing. When I was collecting the training data, I was specifically looking for stationary shots, but this made things difficult, since the vast majority of nature videos online have this kind of slow pan.
Among the insights in Abe’s work is that the motion trajectories observed in a 2D video of an object exhibiting oscillations/vibrations, when converted to the Fourier domain, are — under certain assumptions — a projection of the 3D vibration modes of that object.
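Written out loosely (my paraphrase, not the paper's exact notation), the modal picture looks like this:

```latex
% Small deformations as a superposition of vibration modes, each mode oscillating
% at its own frequency omega_i with spatial mode shape phi_i:
\[
  u(p, t) \;\approx\; \sum_i q_i(t)\,\phi_i(p),
  \qquad
  q_i(t) \approx a_i \cos(\omega_i t + \varphi_i).
\]
% Because each mode lives at its own frequency, the temporal Fourier transform of the
% observed 2D trajectories, evaluated near omega_i, isolates (a 2D projection of)
% the i-th 3D mode shape:
\[
  \hat{u}(p, \omega_i) \;=\; \sum_t u(p, t)\, e^{-\mathrm{i}\,\omega_i t}
  \;\propto\; \phi_i(p).
\]
```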
A challenge for #cvpr2021 Twitter: for each paper of yours that you advertise on Twitter, please share 1-3 interesting papers from other teams. The authors will appreciate it and the community will be stronger. Here are some champions: @ducha_aiki @CSProfKGD @artsiom_s
Pulsar: Efficient Sphere-Based Neural Rendering
@chlassner, @MZollhoefer
Super cool visualizations of the optimization process -- definitely check out the video.
What did Abraham Lincoln really look like? Our new project simulates traveling back in time with a modern camera to rephotograph famous subjects.
Web:
Video:
w/ @ceciliazhang77, @RealPaulYoo, @rmbrualla, @jlawrence176, Steve Seitz
@jonathanfly
Yeah, default parameters. The main training Colab will periodically save a checkpoint model, and you have to use a second Colab () to render the video given one of those checkpoints. I think I used the first or second checkpoint.
Tomorrow at ECCV, we are presenting “Totems: Physical Objects for Verifying Visual Integrity”
Remember totems from Inception? We tried to make something *a bit* like that in reality.
website:
paper:
1/n
@jbhuang0604
So I've been hoping to find a way to do something like this, to make the shots look more like the type of stuff you'd find online -- you beat me to it! Thanks, I'll have to try it out myself :-)