Proud to finally showcase the 1024x1024 version of AuraFlow (the next generation of Lavenderflow)! Truly open, the largest text2image model out there, with (arguably) SoTA-level performance!
1/n
Has anyone created materials around “fundamentals of ML for AI Engineers”, not focused on building models but on things like evaluations, error analysis, etc.?
Maybe something already exists? I don’t want to do it lol - looking for a resource I can share with people
If you want to train everything from scratch,
1. Train VAE
2. Train CLIP
3. Train LLM
4. Using 3, train captioner based on CLIP.
5. Finetune dense captioner
6. Relabel the text-image pairs
7. Train UNet based on 1 and 6
8. Train pixel decoder
9. Train LLM for caption upsampling
So a year ago I introduced LoRA (which at the time was little known even to the LLM community; it was well before LLaMA / PEFT) to the image generation space.
Little did I realize that a year later, thousands of deepfake waifu LoRAs would be flooding the web... 🫥
My model is now ready to make thousands of consistent generations...
It's technically known as a LoRA (Low-Rank Adaptation), with SDXL as the base (foundation) model.
From here, two options are possible:
(i) Utilize your LoRA model independently,
(ii) Or blend this LoRA with
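Option (ii) is conceptually simple: a LoRA update is just a low-rank product added back into the base weight. A minimal numpy sketch (shapes, `alpha`, and names here are illustrative, not SDXL's actual config):

```python
import numpy as np

def merge_lora(W, A, B, alpha=1.0):
    """Merge a LoRA update into a base weight matrix.

    W: (out, in) base weight, frozen during LoRA training.
    A: (r, in) down-projection, B: (out, r) up-projection.
    Effective weight is W + alpha * B @ A.
    """
    return W + alpha * (B @ A)

rng = np.random.default_rng(0)
out_dim, in_dim, rank = 64, 32, 4
W = rng.standard_normal((out_dim, in_dim))
A = rng.standard_normal((rank, in_dim))
B = np.zeros((out_dim, rank))  # B is zero-init in LoRA, so the merge is a no-op at start

W_merged = merge_lora(W, A, B)
assert np.allclose(W_merged, W)  # zero-init B => identical to the base model
```

Since the rank-r factors are tiny compared to W, the same A, B pair can be re-merged into any model sharing the base architecture, which is what makes blending possible.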
This paper and their model are insane. It's highly likely these attention layers can be transferred to other fine-tuned models as well, which is a truly groundbreaking feature for the SD community.
Did you know SDXL can be implemented in 520 lines of code in a single file?
If you thought diffusers' UNet code is now too big to understand in an hour, and wanted a very limited but fully diffusers-compatible refactor of the SDXL UNet, this is for you
To fellow solo-working-outsider bros... Don't let comments like this discourage you.
There is *a lot* to do. We *all* need your help.
Ask for grants if you don't have compute (H100s cost < $3/hour anyway; prove it at small scale and reach out), like
@FAL
, or Google TPU
@tilmanbayer
@faustsoli
Einstein was able to advance the field with just a pen and paper, even when he wasn't able to verify his theories through experiments at the time. You can't do that in AI, at all
Personally, I feel very good today.
Achievement Unlocked: successfully trained a very large diffusion model from scratch, entirely on my own codebase! (Of course, it's not like the SD3 paper's codebase is out or anything...)
YES!!!! TOOK 26 hours to make this happen: a conditional D3PM implementation in PyTorch. Let's accelerate discrete diffusion research!!! 👏 I believe this is the only torch implementation of it out there!
Less than 400 LOC!
paper:
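For anyone who wants the gist of D3PM's forward process: with a uniform transition kernel, corrupting a category is just a product of row-stochastic matrices. A numpy sketch (K, T, and the beta schedule are made up for illustration; the paper also covers absorbing-state and other kernels):

```python
import numpy as np

K = 8                      # number of discrete categories
T = 100                    # diffusion steps
betas = np.linspace(1e-3, 0.1, T)

def uniform_Q(beta, K):
    """Single-step transition: stay with prob (1 - beta), else jump uniformly."""
    return (1.0 - beta) * np.eye(K) + beta / K * np.ones((K, K))

# Cumulative transition Q_bar_T = Q_1 @ Q_2 @ ... @ Q_T
Q_bar = np.eye(K)
for beta in betas:
    Q_bar = Q_bar @ uniform_Q(beta, K)

# q(x_T | x_0 = 3) is just row 3 of Q_bar
probs = Q_bar[3]
assert np.isclose(probs.sum(), 1.0)  # still a valid categorical distribution
```

As T grows, every row converges to the uniform distribution over K categories, which is the discrete analogue of diffusing to pure Gaussian noise.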
Here is a cool little hack I found with AnimateDiff: instead of just sampling, introducing variance-preserving self-correlation along the time axis gets you "less flickering motion". corr = [0.9, 0.7, 0.2, 0.0 (just sampling)].
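Concretely, the idea is AR(1) noise across the time axis, scaled so per-frame variance stays at 1 (that's the variance-preserving part). A numpy sketch (frame count and shapes are made up):

```python
import numpy as np

def correlated_noise(n_frames, shape, corr, rng):
    """Per-frame Gaussian noise where consecutive frames have
    correlation `corr`, while each frame stays ~N(0, 1)."""
    eps = rng.standard_normal((n_frames, *shape))
    out = np.empty_like(eps)
    out[0] = eps[0]
    for t in range(1, n_frames):
        # variance-preserving blend: corr^2 + (1 - corr^2) = 1
        out[t] = corr * out[t - 1] + np.sqrt(1.0 - corr**2) * eps[t]
    return out

rng = np.random.default_rng(0)
noise = correlated_noise(16, (64, 64), corr=0.9, rng=rng)
assert abs(noise.var() - 1.0) < 0.05  # unit variance preserved
```

corr = 0.0 recovers plain independent sampling, which is why it appears at the end of the list above.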
So over the course of this year and last year, I've learned a whole lot about scaling, especially regarding what the correct parameterization is when width / batch size / data size scales up.
Objective is both training stability & optimal hyperparameter transfer.
I thought I would share
So you've had your fun with
@karpathy
's minGPT. Now it's time to scale: introducing min-max-gpt, a really small codebase that scales with the help of
@MSFTDeepSpeed
. No huggingface accelerate, no transformers. Just deepspeed + torch: maximum hackability
Wondered how SD3 was trained? Me too 😅, but I tried my best to replicate it today!
A scalable transformer-based rectified flow, following SD3's logit-normal sampler and LLaMA-DiT architecture.
Enjoy!
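SD3's logit-normal sampler is tiny to implement: sample from a normal, squash through a sigmoid, and you get timesteps in (0, 1) concentrated mid-schedule. A numpy sketch (the m = 0, s = 1 defaults follow the logit-normal(0, 1) setting described in the SD3 paper):

```python
import numpy as np

def logit_normal_timesteps(n, m=0.0, s=1.0, rng=None):
    """Sample t in (0, 1) with density pushed toward the middle of
    the schedule, where rectified-flow training is hardest."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(n) * s + m
    return 1.0 / (1.0 + np.exp(-u))  # sigmoid

t = logit_normal_timesteps(10_000, rng=np.random.default_rng(0))
assert t.min() > 0.0 and t.max() < 1.0
# mass concentrates mid-schedule: most samples land in [0.25, 0.75]
assert ((t > 0.25) & (t < 0.75)).mean() > 0.5
```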
Hi, this is Lavenderflow-5.6B-v0.0
✅MMDiT, muP, CFM, FSDP, recaptioned, 768x768, T5
✅No strings attached, completely-open-every-step-of-the-way
✅Not SoTA 😅 (hey, it was trained by one grad student in under 3 weeks of total development.) Severely undertrained!
Again, the paper I'm advocating here is from OpenAI, is referenced all the time, and frankly is one of the papers every large-scale practitioner should read. The math here isn't complicated, and nothing here is either controversial or task-dependent.
What's cool about implementing your own ZeRO distributed optimizer is you get to touch every single aspect of your optimizer: the implementation, the sharding strategy, and its performance optimization.
For example, you now don't have to rely on fused apex-like
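The whole ZeRO-1 idea fits in a few lines once you see it: each rank keeps optimizer state for only its 1/N shard of the flat parameters, steps that shard, and an allgather reassembles the full updated vector. A single-process numpy simulation of the bookkeeping (plain SGD standing in for the real optimizer, and names are made up):

```python
import numpy as np

def zero1_step(flat_params, flat_grads, world_size, lr=0.1):
    """Simulate a ZeRO-1 optimizer step: each 'rank' updates only
    its contiguous shard, then shards are 'allgathered' back."""
    n = len(flat_params)
    shard = -(-n // world_size)           # ceil division; last shard may be short
    updated = []
    for rank in range(world_size):
        lo, hi = rank * shard, min((rank + 1) * shard, n)
        p, g = flat_params[lo:hi], flat_grads[lo:hi]
        updated.append(p - lr * g)        # per-rank local step on its shard only
    return np.concatenate(updated)        # the allgather

rng = np.random.default_rng(0)
params = rng.standard_normal(1000)
grads = rng.standard_normal(1000)
sharded = zero1_step(params, grads, world_size=8)
assert np.allclose(sharded, params - 0.1 * grads)  # identical to an unsharded step
```

The payoff is that optimizer state (momentum, variance, master weights) only needs to exist for the local shard, cutting its memory by the world size.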
I've managed to fine-tune the Kandinsky 2.1 model. I think I'm the first one to get it done (there is no documentation in the repo, the model structure is rather strange, and it's really not trivial to fine-tune). The model itself is really good, as the FID promised.
At this point so many SD-related techniques are getting pumped out that it's near impossible to catch up 🤣 Either way, here goes another ControlNet-like model from Tencent
Lavenderflow-pretrained-256x256-6.8B Hybrid MMDiT just reached 0.597 on GenEval! 🥳
It took me and
@isidentical
less than 7 weeks of part-time effort + 4k H100 hours to get to SDXL level (and this is just the pretrained model). Are the two of us worth a $1B valuation?
I've ported T2I-Adapter to be compatible with the diffusers library, go ahead and use them! Example with the Anythingv3 model + LoRA + T2I-Adapter. (all with diffusers!)
Ok, my 5.4B freaking-absolute-overkill ImageNet-1K rectified flow model is now finished. It was trained for 320K steps with batch size 128, meaning it's SIGNIFICANTLY undertrained. However, it's looking *very good* for its training budget. Also, training was very stable: 0 loss spikes!
Uhh excuse me wtf, Llama 3 ranking 1st????? In the lmsys arena, in English? Kudos to the team
@AIatMeta
, based AF 👏👏 for open sourcing literal GPT-4 level model, (almost) no strings attached🥳
Worked on this weekend: open-sourced f16-c32 VAE
(will release tomorrow or something, but it's quite a large model lol)
vibe checks out btw:
left is ground truth, right is reconstructed.
The trick was to use zero-init modulation (like DiT), groupnorm, latent upsampling, and
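For anyone wondering what "zero-init modulation" means here: the projections producing the scale/shift/gate are initialized to zero, so each modulated block starts as an identity map and early training stays stable. A numpy sketch of the DiT-style adaLN-zero pattern (class and variable names are made up):

```python
import numpy as np

class ZeroInitModulation:
    """Conditioning -> (scale, shift, gate), all zero at init,
    so the residual branch contributes nothing at step 0."""
    def __init__(self, cond_dim, hidden_dim):
        self.W = np.zeros((cond_dim, 3 * hidden_dim))  # the zero-init part
        self.b = np.zeros(3 * hidden_dim)

    def __call__(self, x, cond, branch):
        scale, shift, gate = np.split(cond @ self.W + self.b, 3)
        h = branch(x * (1.0 + scale) + shift)
        return x + gate * h  # gate == 0 at init => output == input

x = np.ones(64)
cond = np.random.default_rng(0).standard_normal(16)
mod = ZeroInitModulation(16, 64)
out = mod(x, cond, branch=np.tanh)
assert np.allclose(out, x)  # exact identity at initialization
```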
Fun fact: AuraFlow was < 800 LoC and < one month of training. Code is just open. It's just deepspeed and MDS.
You don't need bloated codebases to make a good model!
Cannot emphasize this enough, but you only have to train a LoRA once and you can apply it anywhere. The case below is with , which is a pretty awesome model. Configs from
Normal people's hobby : listening to music, sports, video games...
Me : speedrun pretraining 5B T2I DiT from scratch under 3 weeks
RELEASING SOON!!!!! (btw this is pretrained ver, gotta train on hi-res)
Did you know ImageNet fits in your Apple Watch's RAM?
Introducing imagenet.int8: a 5GB, cropped, VAE'd, quantized version of ImageNet, 26x compression in total, preprocessed in StreamingDataset format.
Enjoy.
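The recipe boils down to: VAE-encode each image once, then store the latents as int8 plus a scale for dequantization. A numpy sketch of the symmetric-quantization half (the actual dataset's scaling scheme may differ; this is just the principle):

```python
import numpy as np

def quantize_int8(latents):
    """Map float latents to int8 plus a scale for dequantization."""
    scale = np.abs(latents).max() / 127.0
    q = np.clip(np.round(latents / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 32, 32)).astype(np.float32)
q, scale = quantize_int8(latents)
recon = dequantize(q, scale)
assert q.dtype == np.int8                      # 4x smaller than float32 alone
assert np.abs(recon - latents).max() <= scale  # error bounded by one quant step
```

The 26x figure comes from stacking this with the VAE's spatial compression and cropping, not from int8 alone.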
At an equal compute budget, using a larger batch almost always implies worse performance.
The rationale for using a larger batch size should always be faster convergence in equal *time*, not better performance at an equal compute budget.
Final model 512x512 aesthetic training 😍
btw it's been an absolute wild run. I've learned SO DAMN much from this process. Not a lot of people get to make a foundation model from scratch with such freedom. I'm so glad
@burkaygur
from
@FAL
offered me such collaboration!!
AuraFlow got a lot more attention than I thought! First time is the hardest, we'll only get better from now.
My plan for next version.
* A much better encoder-decoder, with a higher spatial compression ratio, such as Stable Cascade-level compression (I'm looking at this btw
But to be honest, there have been tons of low-rank, quantized gradient-approximation methods for efficient allgathers that the paper didn't mention for some reason. Like, not citing PowerSGD?? Or this? ...Like man, totally not cool 🙄
fig from PowerSGD
GaLore
Memory-Efficient LLM Training by Gradient Low-Rank Projection
Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank
So Lavenderflow 6.8B *just* reached DALL-E 3 level on GenEval, but I suspect this was a lucky run. Still, it's getting somewhere close to DALL-E 3 (and SD3-8B), which makes me happy. Going to do a Parti-prompts GPT-V eval soon!!!
(+ GenEval is not a solid nor comprehensive benchmark so I
Something I learned from training 1024x1024 diffusion models... USE MULTI-SCALE LOSS from simple diffusion. Seriously. I've just used 4x4_loss + 1/4 * MSE, and it works great!!! btw it takes quite a bit of activation memory, so maybe a fused kernel for it would be great
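In case "4x4_loss + 1/4 * MSE" reads cryptic: it's an MSE computed on 4x4-average-pooled prediction/target, plus a down-weighted full-resolution MSE. A numpy sketch of my reading of it (pooling factor and weighting as stated above, everything else illustrative):

```python
import numpy as np

def avg_pool(x, k):
    """Average-pool a (H, W) array by factor k."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def multi_scale_loss(pred, target, k=4, w_full=0.25):
    """Pooled MSE at 1/k resolution + down-weighted full-res MSE."""
    low = np.mean((avg_pool(pred, k) - avg_pool(target, k)) ** 2)
    full = np.mean((pred - target) ** 2)
    return low + w_full * full

rng = np.random.default_rng(0)
pred = rng.standard_normal((64, 64))
target = rng.standard_normal((64, 64))
loss = multi_scale_loss(pred, target)
assert loss > 0
```

The pooled term emphasizes low-frequency / global structure, which is exactly what tends to drift at high resolutions.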
Ok hear me out, what if I told you the absolute bare-minimal implementation of FSDP is just 140 LoC?
My next toy project is a *sane* FSDP framework that any child can understand and build on. Maximal hackability, minimal abstraction.
* Keeping the params in contiguous row-major
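The contiguous-row-major point is what keeps the implementation tiny: flatten everything into one padded buffer, so each rank's shard is a contiguous slice and the "allgather" is a plain concatenation. A numpy sketch of that bookkeeping (function names are mine):

```python
import numpy as np

def flatten_and_shard(params, world_size):
    """Flatten a list of arrays into one padded buffer and split it
    into world_size equal, contiguous shards."""
    flat = np.concatenate([p.ravel() for p in params])
    pad = (-len(flat)) % world_size
    flat = np.concatenate([flat, np.zeros(pad, dtype=flat.dtype)])
    return np.split(flat, world_size), pad

def unflatten(shards, pad, shapes):
    """The 'allgather': concatenate shards, drop padding, restore shapes."""
    flat = np.concatenate(shards)
    flat = flat[: len(flat) - pad] if pad else flat
    out, i = [], 0
    for s in shapes:
        n = int(np.prod(s))
        out.append(flat[i : i + n].reshape(s))
        i += n
    return out

rng = np.random.default_rng(0)
params = [rng.standard_normal(s) for s in [(8, 8), (3, 5), (7,)]]
shards, pad = flatten_and_shard(params, world_size=4)
restored = unflatten(shards, pad, [p.shape for p in params])
assert all(np.array_equal(a, b) for a, b in zip(params, restored))
```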
I managed to get it working! 2 steps, no progressive distillation as promised, reasonable quality for a dumb UNet structure and 10 min of training. I think this is the only implementation out there (given it's like 4 days old). Not bad!
Reminder, my friends.
Tomorrow you will be gifted with SD3, 2B ver.
However,
It will be tiny.
It will be closed source.
It will require "license".
.
.
.
Mine won't.
Mine will be LARGE AF.
training code? It's ALREADY OPEN.
mine will be MIT at best, AL2 at worst.
The moon is high, the model is 44k steps in. I stopped the run to check on everything and to use multi-nodes; didn't expect *anything* at all... However, it's safe to say I've trained my FIRST ever 5.6B text2image MMDiT from scratch!!!
Fully fine-tuning SDXL on OW Kiriko images. This took about 10 min. Can you believe this is fine-tuned Base model? BASE????
@StabilityAI
is simply incredible.
Incredible analysis on CFG, the best I've read so far, by Karras (again). CFG reduces diversity but makes the prediction 'sharper', thus giving the illusion of a better result.
However, the "autoguidance" method has (unfortunately) been widely used before :P
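For context, the CFG "sharpening" is just extrapolating past the conditional prediction along the direction away from the unconditional one; autoguidance's twist is to swap the unconditional model for a weaker version of the conditional one. The standard CFG update as a one-liner (numpy, toy vectors):

```python
import numpy as np

def cfg(pred_uncond, pred_cond, w):
    """Classifier-free guidance: extrapolate beyond the conditional
    prediction; w = 1 recovers the plain conditional model."""
    return pred_uncond + w * (pred_cond - pred_uncond)

u = np.array([0.0, 0.0])
c = np.array([1.0, 2.0])
assert np.allclose(cfg(u, c, 1.0), c)  # w = 1: pure conditional
assert np.allclose(cfg(u, c, 0.0), u)  # w = 0: pure unconditional
```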
"bUt iT woN't Be aS goOd wiTH yoUR teeNy coMpUte"
nah I don't care, I'm not raising cash bro. Gaining this experience of handling a 100M-scale dataset, pretraining a billion-scale vision model from scratch, post-hoc analysis... *all as a hobby in my free time*, is what matters 😎
One YOLO decision I made on AuraFlow is that it DOES NOT HAVE CLIP & cropping conditions as a global conditioner, and it ONLY has one Pile-XL T5 encoder as the input.
This way, the way an image gets generated stems ONLY from the single textual embedding and nothing else; you will be able
Cool work, have a look! Interesting to see they tie the "probability" of discrete representation to, well, the probability of the dataset : Variational Inference itself.
So this might be the current best usable form of encoder-based inversion for SD 2.X models. Really good in terms of fidelity, but the NC license is a bit sad.
Google presents Mixture-of-Depths
Dynamically allocating compute in transformer-based language models
Same performance w/ a fraction of the FLOPs per forward pass
Math is... incredible. I just fixed the learning rate faithfully to what muP suggested, and now the gradient norm is much more stable, my depression is cured, my eyesight has improved, my posture is better, and it cured cancer.
... we depart from common practice and do not freeze the image encoder. However, the challenges outlined in LiT remain. In order to avoid destructive supervision signal from the initially unaligned language model, we use a slow linear warm-up for the image encoder's learning rate
Unlike ControlNet, T2I-Adapter is lightweight, generalizable out of the box, and very fast. It also doesn't generate additional features per timestep. However, it seems to be less strict than ControlNet, so one might prefer ControlNet for truly fine-grained control.
Friendly reminder that this is a truly open-source t2i model!
Every line of code to reproduce this model has been open-sourced from the very beginning!
(But it requires a lot of vram to run this code. You need substantial modification to save a lot
Today I realized that weight normalization of EDM2 (by Karras again, damn) is kinda just Riemannian optimization. Projecting gradient to tangent space of hypersphere is precisely the Riemannian gradient, and normalization after update is just retraction on the oblique manifold.
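In code, the correspondence looks like this (a numpy sketch; `lr` and shapes are arbitrary, and this is my reading of the geometry, not EDM2's actual training loop):

```python
import numpy as np

def riemannian_sgd_step(w, grad, lr, radius=1.0):
    """EDM2-style weight-normalized update seen as Riemannian SGD on a sphere.

    1) Project the Euclidean gradient onto the tangent space at w
       (remove the radial component): the Riemannian gradient.
    2) Take the step, then renormalize back onto the sphere: the retraction.
    """
    w_unit = w / np.linalg.norm(w)
    tangent_grad = grad - np.dot(grad, w_unit) * w_unit
    w_new = w - lr * tangent_grad
    return radius * w_new / np.linalg.norm(w_new)

rng = np.random.default_rng(0)
w = rng.standard_normal(16)
w = w / np.linalg.norm(w)
g = rng.standard_normal(16)
w2 = riemannian_sgd_step(w, g, lr=0.1)
assert np.isclose(np.linalg.norm(w2), 1.0)  # the update stays on the unit sphere
```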
📢 Introducing MPT: a new family of open-source commercially usable LLMs from
@MosaicML
. Trained on 1T tokens of text+code, MPT models match and - in many ways - surpass LLaMa-7B. This release includes 4 models: MPT-Base, Instruct, Chat, & StoryWriter (🧵)
How did I not know this before? Download a model from HF to a locally visible directory via:
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=True
huggingface-cli download TheBloke/Yi-34B-Chat-AWQ --local-dir ./yiawq
NO JOKE 100x speedup
a gray cat playing a saxophone is inside a large, transparent water tank strapped to the belly of a massive mecha robot, which is stomping down a bustling san francisco street, the mecha has large metal legs and arms with glowing joints and wires, towering over buildings and
This is pretty interesting, I never knew, wtf. Kahan summation compensates for lost precision, allowing you to *not* keep master weights in full precision & to keep Adam variables in much lower precision.
...and this free lunch is in neither torch's FSDP nor DeepSpeed?
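Kahan summation in a nutshell: carry a small compensation term that captures the low-order bits each rounded add throws away. A sketch in plain Python (float32 via numpy stands in for the low-precision optimizer state; the huge-value-plus-tiny-updates setup mimics a master weight absorbing small gradient steps):

```python
import numpy as np

def kahan_accumulate(values):
    """Sum float32 values with a float32 compensation term that
    recovers precision lost to rounding."""
    s = np.float32(0.0)
    c = np.float32(0.0)              # running compensation (lost low-order bits)
    for v in values:
        y = np.float32(v) - c
        t = np.float32(s + y)        # big + small: low bits of y are lost here...
        c = np.float32((t - s) - y)  # ...and recovered into c
        s = t
    return s

vals = [np.float32(1e8)] + [np.float32(0.01)] * 10_000
naive = np.float32(0.0)
for v in vals:
    naive = np.float32(naive + v)    # every 0.01 is rounded away entirely
kahan = kahan_accumulate(vals)
# Kahan lands far closer to the true sum of 1e8 + 100 than the naive loop
assert abs(float(kahan) - (1e8 + 100.0)) < abs(float(naive) - (1e8 + 100.0))
```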
First looks at training a 0.9B IN1K model: 67k steps in, I'm already getting pretty decent quality images!! minRF is damn scalable with the help of
@MSFTDeepSpeed
!
👉
[ rectified flow, muP, SDXL vae, MMDiT, cfg = 7.0!]
Huh, so it looks like Triton's Flash Attention is significantly faster than torch's integrated SDPA FlashAttention (which is much faster than naive attention). This was done on a 3070 Ti GPU.
btw Qwen2 has an arena-hard score of 48 (imo the toughest, most relevant benchmark out there), which puts it right beside Gemini, GPT-4, and Claude.
...except that it's truly open (Apache 2.0), 128k, multilingual, and 70B!! What a day!
💗Hello Qwen2!
Happy to share the Qwen2 models to you all!
📖 BLOG:
🤗 HF collection:
🤖
💻 GitHub:
We have base and Instruct models of 5 sizes, Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B,
Cool paper from Google!
Exciting idea to use multiple latents per cross-attention. There might be room for correlated optimization, where some tokens being injected share multiple common embeddings, i.e., inject another common token t_s during optimization
Recently, Karras demonstrated a post-hoc EMA method, where he was able to "simulate" an arbitrary EMA decay factor after training by saving two copies of the EMA and some clever math.
I took a deep breath to understand it, and wrote a tutorial + working example!
Now that my 5.4B model is stably training (pun intended), the next goal is to deduplicate the wds + filter + recaption.
I've done deduplication multiple times before, but here is my best attempt yet, fully following SD3's approach with SSCD embeddings
Enjoy!
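The dedup step itself boils down to: embed every image (SSCD in SD3's case), then drop anything whose embedding is too close to something already kept. A greedy numpy sketch (the 0.9 threshold is illustrative, not the one SD3 or I actually used; real pipelines also use ANN indexes instead of this O(n²) loop):

```python
import numpy as np

def dedup_by_embedding(embs, threshold=0.9):
    """Greedy dedup: keep an item only if its cosine similarity to
    every already-kept item is below `threshold`."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    kept = []
    for i, e in enumerate(embs):
        if all(np.dot(e, embs[j]) < threshold for j in kept):
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
base = rng.standard_normal((5, 128))
dup = base[0] + 0.01 * rng.standard_normal(128)  # near-duplicate of item 0
embs = np.vstack([base, dup])
kept = dedup_by_embedding(embs)
assert 0 in kept and 5 not in kept  # the near-duplicate gets dropped
```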
*reads NeMo*
*realizes 0.4 is the best MFU you will ever get, realistically*
💀 *inner soul dies of hopelessness*
HEY NVIDIA, WHAT'S THE POINT OF 60% OF THE CORES IF 0.4 IS THE BEST YOUR BEST ENGINEERS CAN EVER REALLY GET?????
Lucky enough to collaborate with
@huggingface
's diffusers team (more like watching them implement 🤣 I wrote no code) and... huge updates! Now LoRA is officially integrated with diffusers! There are major differences from my implementation, and it's very simple to use!
Fine-tune Stable Diffusion on a T4/V100 on a custom image-caption pair dataset 🧨 🔥 => memory efficiency
This is enabled by LoRA. With LoRA, the fine-tuned checkpoints are just **3 MBs** in size 🤯 => portability
Know about it👇
Since the authors didn't upload the code, here is my attempt at Prompt+! (The results below are from my implementation.)
Also further tested the "correlated extended embedding" idea, which seems to be working (whether it is actually better or not is unclear)
I made a huge blunder last run. I was actually not following an important scaling law regarding batch size and was SIGNIFICANTLY undertraining everything... DAMN IT!!!
Literally the 5th fundamental mistake I made while training AuraFlow. How tf is this thing SoTA on GenEval?
Btw, this was done on the int8 quantized dataset I shared a couple weeks ago, which is 26x smaller than the original dataset!!! Imo clever dataset quantization has a lot to offer.
golden apple next to a bronze orange, next to silver grapes
This is quite challenging indeed.. Still slightly better than non-commercial open-weight model out there :)
@cloneofsimo
I've had to do this for specific loras I've trained, and even then you can run into data balance issues. Painful if oranges are usually orange.
BTW, I tried to prompt 'golden apple next to a bronze orange, next to silver grapes' in SD3 Medium and it can only do 1 metal fruit.
last week has been weird.
Team "image is worth tokens"
Team "image is worth pixels"
(IMO, tokens > pixels purely based on hardware utilization POV. progressive diffusion OK as well)
Continuing the journey on the f16-c32 AE... so getting high-frequency details is tough... I am trying my best not to use LPIPS/GAN tricks, but damn, these things are hard to get right. Help me out with great ideas / literature...
left is reconstructed, right is input.
New trick that works insanely well! How would one mitigate spurious correlation that occurs during fine-tuning? Identify the dataset on the region of interest! [1/n]
I wouldn't have come up with using LoRA for DreamBooth if I'd had beefy A100 GPUs to play with 😂 Now even the "GPU-rich" use LoRA to fine-tune diffusion models.
I prefer to operate in “GPU-Poor” mode.
I don't agree with the take from the SemiAnalysis piece. Creative breakthroughs often occur under constraints: new systems, models, and methods that can better take advantage of even larger-scale compute.
Ok hear me out, even without upcycling, MoE-MMDiT just started to cross the MMDiT... Now now, let me start the era of time-routed Expert-parallelism diffusion models.
Got my hands on it. Super easy to use, and some findings :
1. Works with Textual inversion, custom models, and LoRA. Incredible flexibility
2. Prompting + Guidance has non-negligible effect here.
3. Sub-second upscaling. Almost free lunch.