It finally happened. We are releasing Stable Cascade (Würstchen v3) together with
@StabilityAI!
And guess what? It's the best open-source text-to-image model now!
You can find the blog post explaining everything here:
🧵 1/5
We release Würstchen. TL;DR: it reduces the training time of text-to-image models by 16x compared to Stable Diffusion 1.4 while achieving similar results in metrics and visual appearance. 9,200 GPU hours vs. 150,000 GPU hours.
Wuerstchen: Efficient Pretraining of Text-to-Image Models
paper page:
introduce Wuerstchen, a novel technique for text-to-image synthesis that unites competitive performance with unprecedented cost-effectiveness and ease of training on constrained
UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks
Image generations between 4k and 6k!
Utilizes Würstchen v3 / Stable Cascade and finetunes / extends the model with some really cool ideas.
I'm super impressed with the details of the generations!
Würstchen v2 vs. Würstchen v3
"A grandpa holding a sign that says 'Thomas', photo."
Würstchen v3 is much better at text. We are still evaluating other categories, and the model is still being finetuned. Really hoping this model becomes a banger!
@StabilityAI
@pabloppp
I made a new video explaining Diffusion Models. I did a simple yet comprehensive explanation of both the idea and gave a full math derivation! Check it out:
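For anyone skimming, the two standard equations at the heart of that derivation (DDPM notation, Ho et al. 2020) are the closed-form forward process and the simplified noise-prediction loss:

```latex
% Closed-form forward (noising) process:
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)

% Simplified training objective (predict the added noise):
L_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon \sim \mathcal{N}(0,I),\,t}
\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\rVert^2\right]
```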
Here is some progress on Würstchen. The model trained for 260k steps at 512x512 and now 100k steps at 1024x1024. Batch size of 1280. 7,691 GPU hours. (Other models usually take > 100,000 GPU hours.)
I will explain what we changed + show some results.
1/9
Würstchen v2 - some cinematic 1024x2048 generated images. 4 images at 1024x2048 take 7 seconds to generate! Stable Diffusion XL takes 40 seconds to do the same. More images in the thread below.
Note: Würstchen was not finetuned on some fancy dataset, just pretraining!
OUT SOON
Introducing Dream Machine - a next generation video model for creating high quality, realistic shots from text instructions and images using AI. It’s available to everyone today! Try for free here
#LumaDreamMachine
Here it is! My PyTorch Implementation video on Diffusion Models. It contains unconditional and conditional code & training. And I'm also implementing classifier-free-guidance and exponential moving average!
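Minimal sketches of the two techniques mentioned, not the exact code from the video; `model` stands for any noise-prediction network with signature model(x, t, labels):

```python
import copy
import torch

def cfg_noise(model, x, t, labels, guidance_scale=3.0):
    # Classifier-free guidance: run the model with and without conditioning
    # and push the conditional prediction away from the unconditional one.
    eps_uncond = model(x, t, None)     # unconditional pass (condition dropped)
    eps_cond = model(x, t, labels)     # conditional pass
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

class EMA:
    """Exponential moving average of model weights; sample from ema_model."""
    def __init__(self, model, beta=0.995):
        self.beta = beta
        self.ema_model = copy.deepcopy(model).eval().requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # ema <- beta * ema + (1 - beta) * current weights
        for ema_p, p in zip(self.ema_model.parameters(), model.parameters()):
            ema_p.mul_(self.beta).add_(p, alpha=1 - self.beta)
```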
This is Dream Machine, our first generative text-to-video and image-to-video model. This video showcases some of the capabilities of
#LumaDreamMachine
that we're most proud of. Try Dream Machine for free today 👉
I really don't understand why Stability apparently is very much against Stable Cascade. What have we done lol? What's wrong with the model xd
It's not even mentioned on the models page :(
Following the release of Stable Cascade, I want to highlight a point that might be really interesting to researchers.
As you have seen, Stable Cascade compresses images down 42x spatially, while reconstructing them very accurately.
This means, ....
1/4
Training of Würstchen v3 has started! 1B and 3.6B versions are training. (
@pabloppp
started the training runs and took on the big challenge of fighting FSDP all by himself)
This is incredible. With that, you are able to run Stable Cascade in FP16 (which previously overflowed), and
@KBlueleaf
also shows how to use the model in FP8. Might be interesting to some and should make the model accessible on more GPUs that don't support bfloat16. Thank you so much!
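For reference, a minimal sketch of running Stable Cascade in half precision via the diffusers integration; class and argument names follow the diffusers (>= 0.27) docs and should be verified there, and the prompt is just an example:

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

# Stage C (prior) and Stage B (decoder), both loaded in FP16.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.float16
).to("cuda")
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
).to("cuda")

prompt = "an anthropomorphic dog wearing a hat, photo"  # example prompt
prior_out = prior(prompt=prompt, num_inference_steps=20, guidance_scale=4.0)
image = decoder(
    image_embeddings=prior_out.image_embeddings,
    prompt=prompt,
    num_inference_steps=10,
    guidance_scale=0.0,
).images[0]
image.save("cascade_fp16.png")
```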
Würstchen - Text-to-Video. Turns out using Würstchen for video generation might bring even bigger benefits than on images in terms of training & sampling efficiency. Still very early on, but we are working on it!
Model: 550k steps image + 220k steps video
GPU Hours: 11200
here is sora, our video generation model:
today we are starting red-teaming and offering access to a limited number of creators.
@_tim_brooks
@billpeeb
@model_mechanic
are really incredible; amazing work by them and the team.
remarkable moment.
JourneyDB: A Benchmark for Generative Image Understanding
paper page:
While recent advancements in vision-language models have revolutionized multi-modal understanding, it remains unclear whether they possess the capabilities of comprehending the
Does anyone else feel like diffusion models have a hard time generating high-frequency details? Any experience / thoughts / ideas / pointers on that matter? We have often observed that our models don't generate high frequencies in images and have found it hard to improve.
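One hedged way to quantify this, for anyone who wants to poke at it: compare radially averaged power spectra of real vs. generated images; missing high frequencies show up as a faster fall-off at large radii. `real_imgs` / `fake_imgs` are hypothetical grayscale batches:

```python
import numpy as np

def radial_power_spectrum(imgs):
    # imgs: (N, H, W) grayscale batch in [0, 1]
    f = np.fft.fftshift(np.fft.fft2(imgs), axes=(-2, -1))
    power = (np.abs(f) ** 2).mean(axis=0)              # batch-averaged 2D power
    h, w = power.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2).astype(int)   # integer frequency radius
    counts = np.bincount(r.ravel())
    sums = np.bincount(r.ravel(), weights=power.ravel())
    return sums / np.maximum(counts, 1)                # mean power per radius

# spectrum_real = radial_power_spectrum(real_imgs)
# spectrum_fake = radial_power_spectrum(fake_imgs)
# Plot both on a log scale; a steeper tail for the fakes = missing detail.
```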
You can test Würstchen v3 on our Discord Server in the
#würstchen-v3 channel. Feel free to join and let us know what you think of this preliminary version of Würstchen v3.
Link:
Würstchen v3:
'Cinematic realistic photography of an anthropomorphic dog wearing a hat and sunglasses standing in front of the eiffel tower holding a sign that says "WURST" in colourful letters'
*cherrypicked
You wouldn't believe how much work went into this release, the models and the codebase.
We release all of our code, including training, finetuning, ControlNet, LoRA and normal inference here:
1. Theory on Adam Instability in Large-Scale Machine Learning
paper:
abstract: We present a theory for the previously unexplained divergent behavior noticed in the training of large language models. We argue that the phenomenon is an artifact of the
"Pikachu dressed as an astronaut standing on mars, cgi, cinematic" (non-cherrypicked, not finetuned on some carefully crafted dataset, 1536x1024, 4 images generated in ~5 seconds) - Würstchen v2 - soon
I uploaded a new video explaining Cross Attention.
In my opinion, it's a technique that is rarely talked about, even though it powers models such as
#stablediffusion
#imagen
#muse
etc. Let me know what you think!
Würstchen v3 is also able to do image variations out of the box. The top image was generated with:
"A photo of a cow wearing a cowboy hat"
And the images below that are image variations based only on that given image, no caption.
This is really cool: We finetuned Würstchen v2 on some pretty data. Here are generations from the base model & the finetuned model AFTER JUST 2000 steps. (batch size = 384). Prompt: "portrait of a mysterious dog, creative concept trending on artstation" (no neg prompt, cfg = 4)
Or also for text-to-video generation: imagine having a really good model that compresses videos with 42x spatial and maybe 8x temporal compression. This would mean super efficient and fast generations.
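Rough arithmetic for what those factors would buy (the 8x temporal rate is the "maybe" from the tweet, and the clip size is hypothetical):

```python
# Latent sizes for video under the rates mentioned above.
C_lat, f_s, f_t = 16, 42, 8        # latent channels, spatial / temporal factors
T, H, W = 64, 1024, 1024           # a hypothetical 64-frame RGB clip

pixels = 3 * T * H * W                                  # 201,326,592 raw values
latents = C_lat * (T // f_t) * (H // f_s) * (W // f_s)  # 16*8*24*24 = 73,728
print(pixels / latents)            # ~2700x fewer numbers to diffuse over
```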
If you have other ideas, we can chat about them if you want:
That almost all information of a 3x1024x1024 image is stored in just 16x24x24 numbers.
You can test this yourself with the notebook we have that explains all the details:
That's why we think it makes so much sense to train the T2I model in that space.
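The claim in plain numbers (no model needed for this part):

```python
image_numbers = 3 * 1024 * 1024        # a 1024x1024 RGB image: 3,145,728 values
latent_numbers = 16 * 24 * 24          # the Stage C latent: 9,216 values
print(image_numbers / latent_numbers)  # ~341x fewer numbers overall
print(1024 / 24)                       # ~42.7x per spatial axis -> "f42"
```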
Finetuned Würstchen for 2k steps at 1536x1024 and 1024x1536. It's crazy how fast models can adapt to new image sizes and aspect ratios. This model has now trained in total for 28,000 GPU hours (916k steps). (SD 1.4 used 150,000.)
1 / N: We are trying some new conditions on Würstchen, (cheeky stealing of SDXL ideas lol) by conditioning on aesthetic scores, crop ratios and image sizes. The first set of images here is using an aesthetic score of 7 and the second using an aesthetic score of 5.
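A minimal sketch of the micro-conditioning idea, in case it's unclear: embed each scalar condition with sinusoidal features and fold the result into the model's conditioning vector. Names and dimensions are illustrative, not the actual Würstchen code:

```python
import math
import torch

def sinusoidal_embedding(x, dim=64, max_period=10_000):
    # x: (batch,) scalar condition values -> (batch, dim) embeddings
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    args = x[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

aesthetic = torch.tensor([7.0, 5.0])   # e.g. the two scores in the tweet
height = torch.tensor([1024.0, 1024.0])
width = torch.tensor([1024.0, 1024.0])

cond = torch.cat([sinusoidal_embedding(c) for c in (aesthetic, height, width)], dim=-1)
print(cond.shape)  # (2, 192) -> projected and added to the timestep embedding
```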
"Photography of an astronaut running scared in a cave, trying to escape from an extraterrestrial creature, ci, cinematic"
Würstchen v3 - 2048x1536
v3 will come with 4 models:
Stage C: 1B, 3.6B
Stage B: 700M, 3B
Images here are from Stage C 1B and Stage B 700M
@pabloppp
We will release the weights and the updated code on our GitHub.
Training of Stage C took less than 5 days on 64 GPUs and can be reproduced much more easily.
5/9
I love the paper. Finally an open-source zero-shot "finetuning" model. Also the approach seems to be pretty cool. Detect identities, (embed -> concat -> project) them, and regularize cross-attn maps during training with segmentation maps of the identities. I wanna try it too!
FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves a 300x-2500x speedup compared to fine-tuning-based methods and
"grainy closeup photo of an antropomorphic mushroom creature with a cute face and limbs sitting sad on a mossy rock in the middle of a forest"
#stablecascade
We will continue trying to improve Würstchen and soon go to text-to-videos :D
Thanks to
@StabilityAI
for providing compute to do this research!
Shared work with
@pabloppp
9/9
Dream Machine 1.5 is here 🎉 Now with higher-quality text-to-video, smarter understanding of your prompts, custom text rendering, and improved image-to-video! Level up.
#LumaDreamMachine
"Anthropomorphic blue owl, big green eyes, lots of details, portrait, finely detailed armor, cinematic lighting, intricate filigree metal design, 8k, unreal engine, octane render, realistic, redshift render" - Würstchen v2. We will release soon.
How time passes. One year ago
@pabloppp
and I started with text-to-image models and we went from this (left) to this (right) within one year. Let's see where we will be next year!
"A pikachu shaped hat"
MediSyn: Text-Guided Diffusion Models for Broad Medical 2D and 3D Image Synthesis
abs:
Two instruction-tuned text-guided latent diffusion models, one for 2D medical images and one for 3D medical images. Trained on a dataset of 5.7 million 2D medical
Generating very contrast-rich black and white images is also possible, which e.g. Stable Diffusion has problems with due to a different minimum signal-to-noise ratio: "A black square on a white background".
7/9
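The signal-to-noise point can be checked in a few lines: Stable Diffusion's scaled-linear schedule never reaches zero SNR at the final timestep, so the model always sees a faint mean signal (cf. Lin et al. 2023, "Common Diffusion Noise Schedules and Sample Steps Are Flawed"):

```python
import torch

# Scaled-linear beta schedule as used by Stable Diffusion (T = 1000).
betas = torch.linspace(0.00085**0.5, 0.012**0.5, 1000) ** 2
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
terminal_snr = alphas_cumprod[-1] / (1.0 - alphas_cumprod[-1])
print(terminal_snr)  # > 0, i.e. x_T still leaks low-frequency image info
```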
There might be other cool things to explore here. For example using the encoder to encode images and then train an image classifier in that space. Or other image tasks could use that model to first embed images to a small space, potentially leading to more efficient training.
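A minimal sketch of that linear-probe idea, assuming a frozen `encode` function standing in for the Stable Cascade encoder:

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),                   # (B, 16, 24, 24) -> (B, 9216)
    nn.Linear(16 * 24 * 24, 1000),  # e.g. 1000 ImageNet classes
)

def training_step(images, labels, encode, optimizer):
    with torch.no_grad():
        latents = encode(images)    # frozen encoder -> (B, 16, 24, 24)
    logits = classifier(latents)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```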
I created the animations of my last video with manim (the Python library created by 3B1B to animate his videos, now maintained by an open-source community) and put the code online here if anyone is interested:
"astronaut in the mushroom forest, psychedelic mood, astral figures, morning light, clear sky, extremely detailed, strong use of colors, pop surrealism, hard edges, heavy paint brush, 8k" - Würstchen v2 Finetuned - Release Next Week
Stay on topic with Classifier-Free Guidance
paper page:
Classifier-Free Guidance (CFG) has recently emerged in text-to-image generation as a lightweight technique to encourage prompt-adherence in generations. In this work, we demonstrate that CFG can be
One of the biggest improvements we saw comes from making use of findings from Consistency Models (
@YSongStanford
@prafdhar
@markchen90
@ilyasut
), which we combine with an epsilon objective. The model learns much, much faster.
2/9
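For context, the epsilon objective in code form: a single hedged training step for a noise-prediction model (the Consistency-Models-derived changes from the tweet are not reproduced here):

```python
import torch
import torch.nn.functional as F

def train_step(model, x0, text_emb, alphas_cumprod, optimizer):
    # x0: clean latents (B, C, H, W); alphas_cumprod: (T,) noise schedule
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps       # noised latents
    loss = F.mse_loss(model(x_t, t, text_emb), eps)  # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```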
That's how most of my videos look after finishing them. It's quite a lot of work, but I enjoy it a lot. (The latest Paella video took about 60 hours of work, my DDPM video took 200 hours of work back then)
"A serene digital illustration featuring a simplified, stylized mountain landscape bathed in sunlight, with gentle pastel hues dominated by soft yellows, where a lone figure is captured in a moment of bliss as they leisurely ride a bike across the foreground, invoking a sense of
This makes training the text-conditional stage very fast & cheap, opening doors to better investigate large text-to-image models and enabling more people to train & finetune for cheaper.
Work with
@pabloppp
@MAubreville
"a still shot from a mobile time-lapse photography shoot of a city at night. It's a beautiful view of the city at night with city and traffic lights gleaming."
We are using ChatGPT to enrich captions now.
You can try it out on our Discord:
We achieve this by decoupling the text-conditional model even further from high resolutions. We use two models to compress 512x512 images into a tiny low dimensional latent space of 12x12, resulting in an f42 spatial compression, while reconstructing them faithfully.
@nathanwchan
@_akhaliq
It just means that training classification models can now be improved by also training them on images generated by text-to-image models like StableDiffusion. The generated images are used to enlarge the dataset the classification models train on. And apparently it works.
Invictus.Redmond is here!
Invictus is a Stable Cascade Generalist finetune.
It's a Stage C finetune.
Thanks
@RedmondAI
for all the GPU support. Need GPU? Talk to Redmond.
Download it for free on Civitai and HF.
Links and more examples below!