
Zhenzhi Wang
@zhenzhiwang
Followers
162
Following
11
Media
3
Statuses
16
Ph.D. candidate at MMLab, CUHK. Working on human-centric video generation. Previously NJU CS & Physics.
Sha Tin District, Hong Kong
Joined November 2020
Video generation models can now generate multi-person dialogue videos, or talking videos with human-object interaction (HOI), from a text prompt and N pairs of {cropped reference image (e.g., a head image), audio} without any lip post-processing. Paper: https://t.co/AoIOtYLYbD Demo: https://t.co/c3QfgvHf5i
3
7
19
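A hedged sketch of how inputs like those described above could be organized: a text prompt plus N per-speaker pairs of {cropped reference image, audio}. The class names and fields are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpeakerCondition:
    reference_image: str  # path to a cropped reference image (e.g., a head crop)
    audio: str            # path to this speaker's audio track

@dataclass
class DialogueRequest:
    prompt: str                       # scene-level text prompt
    speakers: List[SpeakerCondition]  # N per-person {image, audio} pairs

# Example: a two-person dialogue clip conditioned on two {head image, audio} pairs.
request = DialogueRequest(
    prompt="Two researchers discuss a paper at a whiteboard",
    speakers=[
        SpeakerCondition("person_a_head.png", "person_a.wav"),
        SpeakerCondition("person_b_head.png", "person_b.wav"),
    ],
)
print(len(request.speakers), "speakers")
```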
Some random thoughts I've been having about video world models and long video generation since working on Mixture of Contexts (whose title could also be "Learnable Sparse Attention for Long Video Generation"). Semi-long post alert! 1. Learnable sparse attention is still underrated
How do we generate videos on the scale of minutes, without drifting or forgetting about the historical context? We introduce Mixture of Contexts. Every minute-long video below is the direct output of our model in a single pass, with no post-processing, stitching, or editing. 1/4
6
36
207
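A minimal toy sketch of the general idea behind learnable sparse attention over contexts, as discussed in the two posts above: each query attends only to the top-k most relevant chunks of the history rather than the full sequence. This is not the Mixture of Contexts implementation; the fixed-size chunking, the mean-key scoring, and the top-k selection are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def chunked_sparse_attention(q, k, v, chunk_size=64, topk=4):
    # q, k, v: (seq_len, dim). The history is split into fixed-size chunks and
    # each query attends only inside its top-k highest-scoring chunks.
    n_chunks = k.shape[0] // chunk_size
    k_chunks = k[: n_chunks * chunk_size].view(n_chunks, chunk_size, -1)
    v_chunks = v[: n_chunks * chunk_size].view(n_chunks, chunk_size, -1)

    # Score each chunk by similarity between the query and the chunk's mean key.
    chunk_keys = k_chunks.mean(dim=1)                        # (n_chunks, dim)
    scores = q @ chunk_keys.T                                # (seq_len, n_chunks)
    top = scores.topk(min(topk, n_chunks), dim=-1).indices   # (seq_len, topk)

    out = torch.zeros_like(q)
    scale = k.shape[-1] ** 0.5
    for i in range(q.shape[0]):
        sel_k = k_chunks[top[i]].reshape(-1, k.shape[-1])    # keys of selected chunks
        sel_v = v_chunks[top[i]].reshape(-1, v.shape[-1])
        attn = F.softmax((q[i : i + 1] @ sel_k.T) / scale, dim=-1)
        out[i] = (attn @ sel_v).squeeze(0)
    return out

q = k = v = torch.randn(256, 32)
print(chunked_sparse_attention(q, k, v).shape)  # torch.Size([256, 32])
```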
HumanVid's extension to multi-person human image animation has been accepted to ICCV 2025! Thanks to my collaborators @liyixxxuan @zengyh1900 @GuoywGuo @tianfanx @lindahua @doubledaibo Paper: https://t.co/FNhWE4ZBTv Code will be open-sourced soon.
arxiv.org
Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and...
0
2
5
Inspired by Gen-4's impressive multi-shot results? Check out our recent work on scene-level video generation via Long Context Tuning! Homepage: https://t.co/kp7z3U6wLz Paper:
1
4
18
[1/3] Want to capture a fantastic HDR image with two simple shots on your cellphone? Try UltraFusion HDR. It takes two images with an exposure difference of up to 9 stops and robustly generates an HDR output. Try it on your own captured images (4Kx3K supported): https://t.co/P7V62H3KNE
1
17
66
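For intuition about what fusing a short and a long exposure involves, here is a naive classical exposure-fusion baseline in the spirit of Mertens et al. (well-exposedness weighting). It is explicitly not UltraFusion's learning-based method and will not handle a 9-stop gap robustly; it only illustrates the problem setup.

```python
import numpy as np

def naive_exposure_fusion(short_exp, long_exp, sigma=0.2):
    """Blend two exposures with well-exposedness weights (toy illustration only)."""
    # Both inputs: float images in [0, 1] with identical shape (H, W, 3).
    stack = np.stack([short_exp, long_exp])              # (2, H, W, 3)
    # Pixels near mid-gray get high weight; clipped shadows/highlights get low weight.
    weights = np.exp(-((stack - 0.5) ** 2) / (2 * sigma ** 2))
    weights = weights / (weights.sum(axis=0, keepdims=True) + 1e-8)
    return (weights * stack).sum(axis=0)                 # per-pixel weighted blend

short = np.random.rand(64, 64, 3) * 0.3                     # toy underexposed frame
long = np.clip(np.random.rand(64, 64, 3) + 0.5, 0.0, 1.0)   # toy overexposed frame
fused = naive_exposure_fusion(short, long)
print(fused.shape)  # (64, 64, 3)
```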
Excited to introduce IDArb! Our method can predict plausible and consistent geometry and PBR material for any number of input images under varying illuminations! Webpage: https://t.co/GvfyvbEq25
2
25
76
Excited to attend NeurIPS 2024 during Dec 10-15, with two first-author papers accepted. Hope to have a chat about (human) video generation! Papers: 1: https://t.co/IYAD13rdky 2:
arxiv.org
Human image animation involves generating videos from a character photo, allowing user control and unlocking the potential for video and movie production. While recent approaches yield impressive...
0
0
3
Want to generate camera-controllable human videos like a real movie clip? Try our HumanVid dataset and a baseline model combining AnimateAnyone and CameraCtrl. Project Page: https://t.co/ix2w7jYelN Paper: https://t.co/D8uCejx6KZ Data and code coming soon.
7
37
116
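A hypothetical sketch of the two per-frame conditions such a baseline combines, a driving pose sequence (as in AnimateAnyone) and a camera trajectory (as in CameraCtrl), on top of a single reference image. The names and tensor shapes are assumptions for illustration, not HumanVid's actual data format.

```python
from dataclasses import dataclass
import torch

@dataclass
class HumanVidSample:
    reference_image: torch.Tensor  # (3, H, W) single character photo
    pose_sequence: torch.Tensor    # (T, J, 2) driving 2D keypoints per frame
    camera_poses: torch.Tensor     # (T, 4, 4) camera extrinsics per frame

# Toy example: 48 frames, 18 body keypoints, identity camera for every frame.
sample = HumanVidSample(
    reference_image=torch.rand(3, 512, 512),
    pose_sequence=torch.rand(48, 18, 2),
    camera_poses=torch.eye(4).repeat(48, 1, 1),
)
print(sample.pose_sequence.shape, sample.camera_poses.shape)
```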
We have released the MatrixCityPlugin, render sequences, configs, and related scripts (https://t.co/4Qu5NwJ8ZT):
- collect large-scale and high-quality city data
- control lighting, fog, and human and car crowds
- obtain depth, normal, and decomposed BRDF materials
github.com
[ICCV 2023] MatrixCity: A Large-scale City Dataset for City-scale Neural Rendering and Beyond. - city-super/MatrixCity
#ICCV2023 We curated MatrixCity, a large-scale, comprehensive, and high-quality synthetic dataset for city-scale neural rendering, built on the UE5 City Sample. @ICCVConference - Project: https://t.co/MwBxJWg4Kw - Paper: https://t.co/Xoix3IadpD
0
42
152
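Purely as illustration of the kind of controls listed above (lighting, fog, crowds, and which ground-truth buffers to export per camera sequence), here is a hypothetical config sketch; the field names are invented and are not the actual MatrixCityPlugin schema.

```python
# Field names below are invented for illustration; they are not the plugin's schema.
render_config = {
    "sequence": "camera_path_block_A.json",       # camera trajectory to replay
    "lighting": {"time_of_day": "noon", "sun_angle_deg": 35},
    "fog_density": 0.02,                          # 0 = clear sky, higher = denser fog
    "crowds": {"humans": True, "cars": True},     # toggle dynamic crowds
    "outputs": ["rgb", "depth", "normal",         # ground-truth buffers to export
                "brdf_albedo", "brdf_roughness"],
    "resolution": [1920, 1080],
}
print(render_config["outputs"])
```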
Yuwei (@GuoywGuo) just released #AnimateDiff v3 and #SparseCtrl, which allow you to animate ONE keyframe, generate a transition between TWO keyframes, and interpolate MULTIPLE sparse keyframes. RGB images and scribbles are supported for now. Github: https://t.co/IeQ5ui4TDC
14
56
312
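A conceptual sketch, not the AnimateDiff/SparseCtrl API: sparse keyframe control can be thought of as conditioning a short clip on images at a few frame indices plus a mask marking which frames are constrained, leaving the model to fill in the rest. The helper below is hypothetical.

```python
import torch

def build_sparse_condition(num_frames, keyframes):
    # keyframes: dict mapping frame index -> (3, H, W) condition image (RGB or scribble)
    h, w = next(iter(keyframes.values())).shape[-2:]
    cond = torch.zeros(num_frames, 3, h, w)  # condition images, zeros elsewhere
    mask = torch.zeros(num_frames, 1, h, w)  # 1 where a keyframe is provided
    for idx, image in keyframes.items():
        cond[idx] = image
        mask[idx] = 1.0
    return cond, mask

# One keyframe animates; two keyframes define a transition; more means interpolation.
cond, mask = build_sparse_condition(
    16, {0: torch.rand(3, 256, 256), 15: torch.rand(3, 256, 256)}
)
print(cond.shape, int(mask[:, 0, 0, 0].sum()))  # 2 conditioned frames
```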
Method: (1) define human interactions as joint contact pairs and let an LLM generate them; (2) train a spatially controllable MDM on every joint that takes contact pairs as its spatial condition. We can generalize to an arbitrary number of humans without any interaction training data.
1
0
2
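A hedged sketch of the contact-pair idea described above: an interaction is a set of joint contact pairs (e.g., person A's right wrist near person B's left shoulder over a frame range), which can be turned into per-frame spatial constraints on generated motion. The data structure and the violation measure below are illustrative assumptions, not InterControl's actual code.

```python
from dataclasses import dataclass
import torch

@dataclass
class ContactPair:
    person_a: int
    joint_a: int     # joint index on person A (e.g., right wrist)
    person_b: int
    joint_b: int     # joint index on person B (e.g., left shoulder)
    start: int       # first frame of the contact
    end: int         # last frame of the contact
    max_dist: float  # allowed distance between the two joints while in contact

def contact_violation(motions, pair):
    # motions: (num_people, frames, joints, 3) generated joint positions
    a = motions[pair.person_a, pair.start : pair.end + 1, pair.joint_a]
    b = motions[pair.person_b, pair.start : pair.end + 1, pair.joint_b]
    dist = (a - b).norm(dim=-1)
    # Penalize only the amount by which the joints exceed the allowed distance.
    return torch.clamp(dist - pair.max_dist, min=0.0).mean()

# Toy example: two people, 120 frames, 22 joints each; one hand-to-shoulder contact.
motions = torch.randn(2, 120, 22, 3)
pair = ContactPair(person_a=0, joint_a=20, person_b=1, joint_b=16,
                   start=20, end=60, max_dist=0.05)
print(contact_violation(motions, pair))
```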
Excited to present our new work, InterControl. TL;DR: We can generate human motion interactions with a spatially controllable MDM that is trained only on single-person data. arxiv: https://t.co/IYAD13rdky code: https://t.co/fwXZdPvrns
1
7
31