I’m attending CVPR next week! The ARC and AI Lab teams will be presenting quite a few papers on-site, including the ones listed below. Feel free to DM me if you'd like to chat. Looking forward to engaging discussions! 🤝✨🎉
Our CVPR24 highlights:
SmartEdit: Exploring Complex Instruction-based Image Editing with LLMs
Programmable Motion Generation for Open-set Motion Control Tasks
HumanGaussian: Text-Driven 3D Human Generation with GS
Turns out there's no overlap with the ones listed earlier😆😆
Announcing Mira: A glimpse into the world of Sora, providing insights through open-sourced resources including MiraData (training samples), MiraDiT (the model), and code, all aimed at fostering collaboration and accelerating innovation in this promising field. 🩷🩷
Project Page:
🎉We are exploring the #Mira project~
- Built a long video dataset #MiraData with structured captions.
- Trained #MiraDiT to explore consistency in long video generation.
Hope it will be a supplement to existing text-to-video methods.
Project Page:
🎉Tencent's CustomNet official demo is on Spaces
🌟A unified encoder-based framework for object customization in text-to-image diffusion models
🌟Incorporates 3D view synthesis capabilities
🌟Adjusts spatial positions and viewpoints
🌟Preserves object identity effectively
Excited that our PhotoMaker, YOLO-World, VideoCrafter2, SEED-Bench, DreamAvatar, EvalCrafter, & GS-IR etc. made #CVPR2024! A testament to the unity of academia, open source, & industry. Big thanks to our teams & collaborators! 🎉🎉
Showcasing the capabilities of TextureDreamer (#CVPR2024)!
... including my favorite results of transferring Rilakkuma 🧸 to all types of shapes.
Full explainer video:
Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh
Xiangjun Gao, Xiaoyu Li, Yiyu Zhuang, Qi Zhang, Wenbo Hu, Chaopeng Zhang, Yao Yao, @yshan2u, Long Quan
tl;dr: use a triangular mesh to manipulate 3DGS directly with self-adaptation
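A rough sketch of how that binding might look (my own illustration of the tl;dr, not the paper's code; names and shapes are assumptions): each Gaussian stores barycentric coordinates on its bound triangle, so editing the mesh just re-interpolates the Gaussian centers.

```python
import numpy as np

def deform_gaussian_centers(bary, tri_ids, faces, verts_edited):
    """bary: (N, 3) barycentric coords of each Gaussian on its bound triangle,
    tri_ids: (N,) index of that triangle, faces: (F, 3) vertex indices,
    verts_edited: (V, 3) mesh vertices after the user's manipulation."""
    corners = verts_edited[faces[tri_ids]]          # (N, 3, 3) deformed triangle corners
    return np.einsum("nk,nkd->nd", bary, corners)   # new Gaussian centers by interpolation
```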
@zipengfu The hardware is definitely impressive! However, it's essential to state upfront in both the main post and video that the robot is teleoperated (operated by a human). This will prevent unrealistic expectations and avoid confusion with autonomous robots.
Lixel CyberColor, a LiDAR-to-Gaussian-Splatting product from @XGRIDS2023, has been announced.
✨ LiDAR to Gaussian Splatting
📏 Centimeter-level precision
🥽 Compatible with Apple Vision Pro
🤝 Compatible with XGRIDs scanning suite
🔗
Video-MME
The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus
Thanks @_akhaliq for featuring! The survey on 3D Model Generation encompasses 436 papers on the latest advancements. Hope it's helpful!🌟📷
Thanks to the team: Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, @yanpei_cao, @yshan2u
Advances in 3D Generation: A Survey
paper page:
Generating 3D models lies at the core of computer graphics and has been the focus of decades of research. With the emergence of advanced neural representations and generative models, the field of 3D content
Introducing BrushNet, which fills in anywhere with precision and coherence and works with any frozen diffusion model (DM)! Code, paper, demo available!
cc: @juxuan_27, @AlvinLiu27, @xinntao, Yuxuan Bian, @yshan2u, Qiang Xu
#BrushNet: A plug-and-play inpainting/outpainting model
Please try it out:
Codes and models are released:
Thanks to co-authors @juxuan_27, @AlvinLiu27, Yuxuan Bian, @yshan2u, Qiang Xu
We've just launched InstantMesh, our latest addition to the Image-to-3D family — arguably one of the best open source models to date, based on our tests😉. Feel free to try it out.🚀
CC: Jiale Xu, Weihao Cheng, Yiming Gao, @xinntao, Shenghua Gao, @yshan2u
#InstantMesh 🎉, an image-to-3D mesh generation method that produces a mesh from a single image within 10 seconds.
It incorporates mesh-based optimization, improving training efficiency and scalability and allowing explicit geometric supervision.
Codes: Demo:
Thanks @_akhaliq for sharing! This is for 3D scene editing with better control.
Thanks to the team: Jingyu Zhuang, Di Kang, @yanpei_cao, Guanbin Li, Liang Lin, @yshan2u
Tencent presents TIP-Editor
An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts
paper page:
Text-driven 3D scene editing has gained significant attention owing to its convenience and user-friendliness. However, existing methods still
SAM can simplify complex low-level vision tasks. In this case, adapting SAM to take optical flow as input, or to use flow as a segmentation prompt, outperforms all previous approaches by a significant margin on both single- and multi-object benchmarks.
SAM + Optical Flow = FlowSAM
FlowSAM can discover and segment moving objects in a video and outperforms all previous approaches by a considerable margin in both single and multi-object benchmarks 🔥
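For intuition, here is a minimal sketch of the "flow as a segmentation prompt" idea using vanilla SAM (my simplified illustration, not the FlowSAM model; the checkpoint path and the top-1%-motion heuristic are assumptions):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def segment_moving_object(frame_rgb, flow):
    """frame_rgb: HxWx3 uint8 frame; flow: HxWx2 optical flow to the next frame."""
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # assumed local checkpoint
    predictor = SamPredictor(sam)
    predictor.set_image(frame_rgb)

    mag = np.linalg.norm(flow, axis=-1)                  # per-pixel motion magnitude
    ys, xs = np.nonzero(mag > np.percentile(mag, 99))    # strongest-moving pixels
    pick = np.random.choice(len(xs), size=min(5, len(xs)), replace=False)
    points = np.stack([xs[pick], ys[pick]], axis=1)      # SAM expects (x, y) coordinates

    masks, scores, _ = predictor.predict(
        point_coords=points,
        point_labels=np.ones(len(points), dtype=np.int64),  # all prompts are foreground
        multimask_output=True,
    )
    return masks[np.argmax(scores)]                       # best-scoring moving-object mask
```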
Tencent presents VideoCrafter2
Overcoming Data Limitations for High-Quality Video Diffusion Models
paper page:
Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate
Tencent and NUS release M2UGen
Multi-modal Music Understanding and Generation with the Power of Large Language Models
demo:
The M2UGen model is a Music Understanding and Generation model that is capable of Music Question Answering and also Music
So.. full video editing using AI 📹🎨 is still a bit away. But we are now closer! 🔥🔥
We just built a demo for @CamcorderAI that lets you crop (rotobrush) any object out of a video just by using your words! 😁
What do you think - should we make it into a full tool?
🧵
Introducing Open-MAGVIT2: an open-source effort investigating and advancing the lookup-free visual tokenizer with large codebooks. From SEED (discrete) to SEEDX (continuous), we keep exploring the frontier of MLLM and sharing our progress along the way.
MAGVIT2 is a leading visual tokenizer, but it hasn't been officially open-sourced, and existing reproductions lack complete code and checkpoints. We did it! 🔥
We keep iterating on the codebase and welcome collaboration on the Open-MAGVIT2 plan. 🤗
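For readers unfamiliar with "lookup-free": here is a minimal sketch of lookup-free quantization as described in the MAGVIT-v2 paper (my own illustration, not the Open-MAGVIT2 code; shapes are arbitrary). Each latent channel is binarized to ±1, and the token index is read off as a binary code, so no codebook lookup is needed and the codebook can be huge.

```python
import torch

def lfq(z):
    """z: (..., K) continuous latents -> (tokens, z_q) with an implicit 2**K codebook."""
    z_q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))  # binarize each dim
    z_q = z + (z_q - z).detach()                          # straight-through estimator
    bits = (z_q > 0).long()                               # ±1 -> {0, 1}
    powers = 2 ** torch.arange(z.shape[-1], device=z.device)
    tokens = (bits * powers).sum(-1)                      # integer token index
    return tokens, z_q

z = torch.randn(2, 16, 16, 18, requires_grad=True)        # 18 bits -> 262,144 codes
tokens, z_q = lfq(z)
```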
Segment and Edit Anything, on your Local Computer.
The BrushNet Gradio app lets you select some points in an image to segment items and replace them with ANYTHING you want. Pure magic.
And now, it runs locally on your machine with 1 click. Works on all OSes (Windows, Mac, Linux).
TIP-Editor accepted as a SIGGRAPH-2024 Journal paper: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts🎉🎉
Congrats: @JjZhuang26958, Di Kang, @yanpei_cao, Guanbin Li, Liang Lin, @yshan2u
Tencent presents TIP-Editor
An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts
paper page:
Text-driven 3D scene editing has gained significant attention owing to its convenience and user-friendliness. However, existing methods still
Sora Generates Videos with Stunning Geometrical Consistency
The recently developed Sora model [1] has exhibited remarkable capabilities in video generation, sparking intense discussions regarding its ability to simulate real-world phenomena. Despite its growing popularity, there
📢VBench now Supports I2V Eval📢
📊 #VBench now supports the multi-dimensional evaluation of Image-to-Video (I2V) models
🏆 #DynamiCrafter and #SVD are among the top models
- Code:
- Leaderboard @huggingface: . Thanks to @_akhaliq!
🚀 A stellar 2023 at ARC Lab & AI Lab Visual Computing! Proud of our impactful work in T2IAdapter, VideoCrafter, Tune-a-video, FateZero, SEED-Bench, Dream3D, SadTalker etc. [see links], all open sourced. Big thanks to our teams and collaborators. Looking forward to an even more
✨The rapid progress in 3D generation is impressive, but generated meshes often lack structure. We integrate *parts* into the reconstruction process, enhancing segmentation, structural distinction, and shape editing!
Project Page:
#SIGGRAPH2024
#AIGC
We are happy to integrate "text-to-video" into the GenAI Arena. Currently, we support six open-source video generation models. Please help us vote to create the video leaderboard!
For the "text-to-image" arena, Playground V2 and V2.5 @playground_ai are leading.
Our CVPR24 highlights:
SmartEdit: Exploring Complex Instruction-based Image Editing with LLMs
Programmable Motion Generation for Open-set Motion Control Tasks
HumanGaussian: Text-Driven 3D Human Generation with GS
Turns out there's no overlap with the ones listed earlier😆😆
Excited that our PhotoMaker, YOLO-World, VideoCrafter2, SEED-Bench, DreamAvatar, EvalCrafter, & GS-IR etc. made #CVPR2024! A testament to the unity of academia, open source, & industry. Big thanks to our teams & collaborators! 🎉🎉
Thanks @_akhaliq for featuring! SEED-X is a unified MLLM designed for both real-world understanding and generation tasks, with competitive results. Feel free to try it out!
Project page:
CC: @tttoaster_, Sijie Zhao, Jinguo Zhu, @ge_yixiao, Kun Yi, Lin
SEED-X
Multimodal Models with Unified Multi-granularity Comprehension and Generation
The rapid evolution of multimodal foundation models has demonstrated significant progress in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However,
Thanks @_akhaliq for featuring! DynamiCrafter is a major upgrade to our image-to-video model.🚀 Echoing recent improvements in our text-to-video model, VideoCrafter2, the new model significantly improves motion, resolution, and coherence. 💡
Team credit: @Double47685693,
Meta announces Aria Everyday Activities Dataset
Presenting Aria Everyday Activities (AEA) Dataset, an egocentric multimodal open dataset recorded using Project Aria glasses. AEA contains 143 daily activity sequences recorded by multiple wearers in five geographically diverse indoor
Tencent released MotionCtrl for Stable Diffusion Video
A Unified and Flexible Motion Controller for Video Generation
demo:
API docs:
MotionCtrl can independently control complex camera motion and object motion of generated
Gen4Gen
Generative Data Pipeline for Generative Multi-Concept Composition
Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., their own pets or specific items) with just a few examples for training. This
Tencent just released VideoCrafter2 demo on Hugging Face
a high-quality text-to-video model
demo:
Overcoming Data Limitations for High-Quality Video Diffusion Models
code, models and data are distributed under Apache 2.0 License
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and
My view of the world model, world simulator, etc., based on the original 'World Model' paper. Hoping this sheds some light on the subject, though it might cause more confusion. 😆😆
🔥Exciting news from Arena: @Anthropic's Claude-3 Ranking is here!📈
Claude-3 has ignited immense community interest, propelling Arena to unprecedented traffic with over 20,000 votes in just three days!
We're amazed by Claude-3's extraordinary performance. Opus is making history
Thrilled to work with @JiachenLi11 to release T2V-Turbo, which is a very fast yet high-quality consistency model.
With only 4 diffusion steps (5 seconds), it can produce high-quality video. T2V-Turbo currently ranks first on VBench (), beating other
We will be presenting ScaleCrafter (spotlight), SEED, TapMo, DragonDiffusion (spotlight), and FreeNoise at ICLR-2024. You are very welcome to come by and chat with our presenters!
#ICLR2024
#iclr
Thanks @_akhaliq for featuring. YOLO-World is for real-time open-world detection!
Thanks to the team and collaborators: Tianheng Cheng, Lin Song, @ge_yixiao, Wenyu Liu, @XinggangWang, @yshan2u
Tencent presents YOLO-World
Real-Time Open-Vocabulary Object Detection
paper page:
On the LVIS dataset, YOLO-World achieves 35.4 AP at 52.0 FPS on a V100, outperforming many state-of-the-art methods in terms of both accuracy and speed.
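A minimal usage sketch, assuming the ultralytics package's YOLO-World integration (the class name, weight file, and image path below are assumptions, not from the post):

```python
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")                 # open-vocabulary YOLOv8 weights
model.set_classes(["red backpack", "skateboard"])     # define the vocabulary at run time
results = model.predict("street.jpg", conf=0.25)      # zero-shot detection on one image
results[0].show()                                     # visualize boxes for the custom classes
```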
I gave a keynote at China3DV about our research on 𝐏𝐡𝐨𝐭𝐨𝐫𝐞𝐚𝐥𝐢𝐬𝐭𝐢𝐜 𝐀𝐈 𝐀𝐯𝐚𝐭𝐚𝐫𝐬.
Since many people asked, I have uploaded the slides of my talk here (PDF version):
Thanks to @_akhaliq for sharing! EvalCrafter is our step towards tackling the challenge of video generation evaluation. It's designed to streamline the process for faster iterations, benefiting both our own development and hopefully the broader community. It's very much a work in
Explore the magic of #PhotoMaker by ARC! ✨
Create images with customized person and style in just seconds. 🎨
Try the Huggingface demo NOW! 🚀
Thanks to @xinntao, @zhenli1031, and the team for making this happen, and to @osanseviero for sharing. 🙌
Thrilled to witness the waves of ICLR acceptance posts! Great insights from each paper's crisp summary. Feel like folks will have a lot of fun in Vienna!
@iclr_conf
#ICLR2024
#ICLR
🌊📚🚀
The quality of DJI drones is more and more impressive!
This is a state-of-the-art FPV drone, where DJI has most likely integrated RockSteady and Horizon stabilization to minimize camera vibrations and guarantee smooth footage even during
As we promised, SEED-X is now open-sourced with the model checkpoint, training code for instruction tuning, and newly collected data for instructional image editing!
Feel free to check out this link for more details:
Our model checkpoints, training code for instruction tuning, online demo, and newly collected data for instructional image editing have been fully open-sourced! 🔥
Welcome to cook with SEED-X models and data. 🤗
Thanks @_akhaliq for sharing. YOLO-World is for real-time open-world detection!
Thanks to the team and collaborators: Tianheng Cheng, Lin Song, @ge_yixiao, Wenyu Liu, @XinggangWang, @yshan2u
Tencent releases YOLO-World
Real-Time Open-Vocabulary Object Detection
demo:
The method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on
Today, with my collaborators @prafull7 (MIT CSAIL), @jampani_varun (@StabilityAI), and my supervisors Niki Trigoni and Andrew Markham, we share with you ZeST, a zero-shot, training-free method for image-to-image material transfer!
Project Page:
1/8
An annotation framework that produces hyper-detailed descriptions and "performs better than GPT-4V outputs (+48%) on readability, comprehensiveness etc."
📢 Excited to unveil our latest research, ImageInWords (IIW)! 🚀We're pushing the boundaries of image descriptions with a new seeded, sequential, human-in-the-loop approach producing SOTA, articulate, hyper-detailed descriptions.
arXiv: 🧵1/12
Stanford engineers have developed a prototype augmented reality headset that uses holographic imaging to overlay full-color, 3D moving images on the lenses of what would appear to be an ordinary pair of glasses.
@stanford_ee
@GordonWetzstein
🐎 Let the hooves pound!
Our new method Ponymation learns a generative model of 3D articulated animal motions from raw unlabeled Internet videos.
Page:
Paper:
Led by @skq719 & Dor Litvak, w/ @zhang_yunzhi, Hongsheng Li, @jiajunwu_cs
Once a model is trained, there is a fun phase to discover its capability. I've been experimenting with our SEED-X-I model by blending two images, which I call A Tale of Two Images. Here are some interesting results, with details in the thread below!
Given the renewed interest in binarization, here is our earlier work (KDD23) focused on binary embedding for retrieval. It achieves a 16x reduction in memory footprint and has been rigorously tested in production with billions of vectors! Code is available!
cc: Yukang Gan,
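A minimal sketch of the binary-embedding retrieval idea (my own illustration, not the KDD'23 code): sign-binarize float embeddings, pack them into bits for a large memory reduction, and rank by Hamming distance via popcount.

```python
import numpy as np

def binarize(x):
    """x: (N, D) float embeddings -> (N, D//8) packed uint8 binary codes."""
    return np.packbits(x > 0, axis=1)

def hamming_search(query_code, db_codes, k=5):
    """Return indices of the k nearest database codes by Hamming distance."""
    xor = np.bitwise_xor(db_codes, query_code)        # differing bits, still packed
    dist = np.unpackbits(xor, axis=1).sum(axis=1)     # popcount per database item
    return np.argsort(dist)[:k]

db = binarize(np.random.randn(100_000, 256))          # 256-bit codes = 32 bytes each
q = binarize(np.random.randn(1, 256))
print(hamming_search(q, db))
```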
This month in 1980: a Japanese computer scientist published a paper proposing the “Neocognitron,” the neural net that directly inspired CNNs.
Kunihiko Fukushima’s paper:
Multimodal Large Language Models have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs. The models not only preserve the reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks.
A nice video made by @Sam_Witteveen, an external developer with early access to the long-context capabilities of Gemini 1.5 Pro, sharing some of the things this model can do. 🎉
We've just released a survey on 3D Model Generation, encompassing 436 papers on the latest advancements. Hope it's helpful!🌟📈🚀
Thanks to the team: Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, @yanpei_cao, @yshan2u
Paper:
Excited to announce RPG-DiffusionMaster, a joint work with Peking University and Stanford University. RPG harnesses multi-modal LLMs to master diffusion models in complex and compositional text-to-image generation/editing, achieving state-of-the-art performance.
METRA is the *first* unsupervised RL method that can learn diverse locomotion skills purely from pixels, and is one of my favorite works!
METRA got accepted to ICLR 2024 (Oral); come to the sessions this Wednesday!
Oral: Wed 4p, Halle A 2
Poster: Wed 4:30-6:30, Halle B
#161
🌟 Harnessing Tech for Good: ARC Lab is thrilled to be a part of the team integrating cutting-edge AI to restore a stunning 4,500-year-old statue.
#TechForGood
#Innovation
#tencent
#ARCLab
Glad that MotionCtrl is accepted by SIGGRAPH-2024. Thank you all for featuring and following this work!
Congrats to: Z Wang, Z Yuan, @xinntao, Y Li, T Chen, M Xia, P Luo, @yshan2u
The future of AI video generation is gonna be so cool!
MotionCtrl is a motion controller that can manage both camera and object motions with video generation models like VideoCrafter1, AnimateDiff and Stable Video Diffusion 🤯
Kolmogorov-Arnold Network is just an ordinary MLP.
Here is the Colab, which explains:
The main point is that if we consider the KAN activation as a piece-wise linear function, it can be rewritten like this:
1/n
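A minimal sketch of the argument (my own illustration, not the linked Colab): a piece-wise linear edge function on grid points g_1..g_K is a linear combination of shifted ReLUs, so a KAN layer becomes "expand with ReLUs, then a plain linear layer", i.e., an ordinary MLP.

```python
import torch
import torch.nn as nn

class PiecewiseLinearKANLayer(nn.Module):
    """KAN layer with piece-wise linear edge functions, written as an ordinary MLP:
    expand each input with shifted ReLUs at the grid points, then mix linearly.
    (An extra linear-in-x term would cover the segment left of the first grid point.)"""
    def __init__(self, d_in, d_out, grid):
        super().__init__()
        self.register_buffer("grid", grid)                 # (K,) shared breakpoints
        self.linear = nn.Linear(d_in * len(grid), d_out)   # ordinary MLP weights

    def forward(self, x):                                  # x: (B, d_in)
        h = torch.relu(x.unsqueeze(-1) - self.grid)        # (B, d_in, K) shifted-ReLU basis
        return self.linear(h.flatten(1))                   # sum of per-edge PWL functions

layer = PiecewiseLinearKANLayer(4, 8, torch.linspace(-2, 2, 5))
y = layer(torch.randn(3, 4))                               # (3, 8)
```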
🗣️ V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation 🔥 Jupyter Notebook 🥳
Thanks to Cong Wang ❤ Kuan Tian ❤ Jun Zhang ❤ Yonghang Guan ❤ Feng Luo ❤ Fei Shen ❤ Zhiwei Jiang ❤ Qing Gu ❤ Xiao Han ❤ Wei Yang ❤
🌐page:
ComboVerse
Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance
Generating high-quality 3D assets from a given image is highly desirable in various applications such as AR/VR. Recent advances in single-image 3D generation explore feed-forward models
Check out Diffusion-DPO🌟 Bridging the gap between Stable Diffusion & closed models like Midjourney v5. Our #TextToImage model uses human feedback for state-of-the-art alignment, marking a new era in AI creativity!
Code:
Blog:
Multimodal Meta AI is rolling out widely on Ray-Ban Meta starting today! It's a huge advancement for wearables & makes using AI more interactive & intuitive.
Excited to share more on our multimodal work w/ Meta AI (& Llama 3), stay tuned for more updates coming soon.
Thanks @camenduru for sharing! DynamiCrafter is a major upgrade to our image-to-video model.🚀 Echoing recent improvements in our text-to-video model, VideoCrafter2, the new model significantly improves motion, resolution, and coherence.
In Dall-E3's vision, living in a post-labor economy following the advent of AGI looks like this: a world where advanced robotics and AI are seamlessly integrated into the environment, and humans are engaged in leisure and artistic activities.😊
#ECCV2024 is encouraging potential reviewers to self-nominate. Know a great reviewer? Encourage them to self-nominate.
Reviewer nomination form:
Please do not send an email to the ECCV organizing committee; we cannot reply to all the individual emails.
The YOLO-World YouTube tutorial is out!
Please let us know what you think!
- model architecture
- processing images and video in Colab
- prompt engineering and detection refinement
- pros and cons of the model
watch here:
↓ more resources
Would you play a game like this? 🕹️
AI-generated using:
- Platformer backgrounds: #Scenario ❤️🔥
- Retro music: #Udio 🎶👾
- Animated video sequences: #Runway 📽️
Details provided below. 👇👇
Interestingly, the World Model bears some similarities to the I Ching (易经). In @ylecun's formulation, it categorizes all life situations x into 384 categories s, each with a suggested action a. The mystical mapping from s(t) to a(t) is sometimes referred to as a controller.
Lots of confusion about what a world model is. Here is my definition:
Given:
- an observation x(t)
- a previous estimate of the state of the world s(t)
- an action proposal a(t)
- a latent variable proposal z(t)
A world model computes:
- representation: h(t) = Enc(x(t))
-
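A minimal interface sketch of this definition (the post above is truncated; the prediction step and the module shapes below are my paraphrase, not a quote):

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, enc: nn.Module, pred: nn.Module):
        super().__init__()
        self.enc = enc      # observation encoder
        self.pred = pred    # state predictor

    def forward(self, x_t, s_t, a_t, z_t):
        h_t = self.enc(x_t)                                          # representation h(t) = Enc(x(t))
        s_next = self.pred(torch.cat([h_t, s_t, a_t, z_t], dim=-1))  # estimate of the next world state
        return s_next
```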
This year’s AI Index report offers a deep dive into the evolving landscape of AI. Covering key trends from technical performance to geopolitical dynamics, it's a must-read for industry leaders, policymakers, and anyone interested in the state of AI.
Research and innovation used to be mostly a matter of burning brain power. In the new era of deep learning, a significant part of the thinking process has shifted to burning GPUs. This presents both challenges and opportunities for academia. 🔥💻🎓
Since Sora is out, I have been thinking about our role in academia. One thing we can do at school is fast prototyping with very talented students, showing the potential, the possibility. Of course, the future will always be scaling up.
📢 The #AIIndex2024 is now live! This year's report presents new estimates on AI training costs, a thorough analysis of the responsible AI landscape, and a new chapter about AI's impact on medicine and scientific discovery. Read the full report here:
I'm increasingly convinced there's an "impossible trinity" in content creation tools: controllability, usability, and versatility. No tool excels in all three, and none seems able to.
The ability to infer others' actions and outcomes is central to human social intelligence.
Can we leverage GenAI to build cooperative embodied agents with such capabilities?
Introducing 🌎COMBO🌎, a compositional world model for multi-agent planning!
Introducing COLE: an effective hierarchical generation framework that can convert a simple intention prompt into a high-quality graphic design, while also supporting flexible editing based on user input.🤗🤗🤗
Paper:
Project page:
Great effort assembling the map! The history of AI is essentially a history of massive tensor computation gradually taking center stage in its evolution.
I am excited to announce that Mustango has been accepted at #NAACL2024! Mustango is a controllable text-to-music generative system that can generate music audio from text prompts that contain music-specific descriptions (e.g., chords, tempo, dynamics, etc.).