Happy to share a cool hobby project I've been working on for the past few weeks: predicting attributes of cars from images.
Blog post:
Github:
Gradio demo:
GPT-4o is a huge step forward for image generation. Not only is it amazing at rendering text and following captions, it also provides a very natural way to iteratively edit and compose visual concepts. 1/8
There is a new 3D generative model in town: Genie by
@LumaLabsAI
. This model is surprisingly good and insanely fast! As someone who has worked on 3D generative modeling, here are my observations and guesses about how this system works (without any insider info):
Now anybody can tinker with and explore the current SOTA generative image models! We have released model weights for our paper "Diffusion Models Beat GANs on Image Synthesis":
Agent finally learning to beat the 10th floor! A few more days of training and I'll probably be ready to submit this. Whole thing is rewritten in PyTorch now, no more anyrl-py or TensorFlow.
My theory for why Python is so popular for deep learning: Python is so slow that the CPU basically looks like a GPU--you need vectorized ops or special kernels just to get good performance. Python devs were used to this from the start (e.g., numpy), so GPU frameworks felt natural.
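A toy illustration of what I mean (timings will vary by machine):

```python
import time
import numpy as np

x = np.random.rand(10_000_000)

# Pure-Python loop: the interpreter is the bottleneck, like running serial code on a GPU.
t0 = time.time()
total = 0.0
for v in x:
    total += v
print("python loop:", time.time() - t0)

# Vectorized "kernel": one call that runs in optimized native code.
t0 = time.time()
total = x.sum()
print("numpy sum:  ", time.time() - t0)
```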
TIL that matrix multiplications are rather inefficient on my 6-year-old NVIDIA Titan X, achieving only 25% occupancy even with optimized kernels. I noticed this while writing my own matmul kernel in pure PTX, where I bumped up against some annoying hardware constraints. 1/6
So RNNs are just really deep networks. We now seem to be good at training really deep networks, e.g., LLMs, with layer norm and residual connections. Has anybody gone back to see whether LSTM/GRU/other tricks are still necessary for RNNs, versus just a residual RNN?
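The kind of baseline I have in mind, as a rough sketch (not from any paper, just a plain tanh RNN cell with layer norm and a residual connection):

```python
import torch
import torch.nn as nn

class ResidualRNNCell(nn.Module):
    """Gateless RNN cell: h' = h + tanh(W_x x + W_h norm(h)), analogous to a residual block."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.inp = nn.Linear(dim, dim)
        self.rec = nn.Linear(dim, dim, bias=False)

    def forward(self, x, h):
        # Residual update instead of LSTM/GRU gating.
        return h + torch.tanh(self.inp(x) + self.rec(self.norm(h)))

cell = ResidualRNNCell(64)
h = torch.zeros(1, 64)
for x in torch.randn(10, 1, 64):   # a tiny "sequence"
    h = cell(x, h)
```

The question is whether gating still buys anything once the normalization and residual path are there.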
Happy to share my latest hobby project: representing 3D shapes as oblique decision trees. Surprisingly, this representation has a few neat properties that might make it useful!
Blog:
Github:
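Roughly, the idea is to treat the shape as an occupancy function computed by a tree whose splits are arbitrary planes rather than axis-aligned thresholds. A minimal sketch (illustrative only, not the code from the repo):

```python
import numpy as np

class Node:
    """Oblique split: an arbitrary plane w.p + b = 0, not an axis-aligned threshold."""
    def __init__(self, w=None, b=0.0, left=None, right=None, inside=None):
        self.w, self.b, self.left, self.right, self.inside = w, b, left, right, inside

def occupancy(node, p):
    """Walk the tree to decide whether point p is inside the shape."""
    if node.inside is not None:          # leaf: inside/outside label
        return node.inside
    if np.dot(node.w, p) + node.b > 0:   # which side of the plane?
        return occupancy(node.left, p)
    return occupancy(node.right, p)

# Toy two-plane tree carving out a wedge (purely illustrative).
tree = Node(w=np.array([1.0, 0.0, 0.0]), b=0.0,
            left=Node(w=np.array([0.0, 1.0, 1.0]), b=-0.5,
                      left=Node(inside=True), right=Node(inside=False)),
            right=Node(inside=False))
print(occupancy(tree, np.array([0.5, 0.8, 0.1])))
```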
This is something I always wanted to investigate, but never had time to. It would seem that MAML (and presumably Reptile as well) learns useful features in the outer loop, rather than learning how to learn useful features in the inner loop.
Rapid Learning or Feature Reuse? Meta-learning algorithms on standard benchmarks have much more feature reuse than rapid learning! This also gives us a way to simplify MAML -- (Almost) No Inner Loop (A)NIL. With Aniruddh Raghu
@maithra_raghu
Samy Bengio.
JoJoGAN works so well because GANs trained on real faces can project stylized faces onto the manifold of real faces zero-shot. I wonder why this is the case, and what other domains would have this property.
Announcing my own compression challenge. My program achieves 10x compression, but I will accept any solution that achieves at least 9x compression. I will pay $2,000 to the first person to solve this before May 29, 2024, at 12:00 PM PST.
@ak92501
This is ridiculous. None of these are novel contributions and they don't reference any of the many past works that apply diffusion to inpainting.
One use case I'm excited about is telling a story with images. In this example, we use the model to create a character and then immerse her in a visually-consistent, fictional world. 2/8
My latest vacation project was Bezier MNIST: a vector graphics version of the MNIST handwritten digit dataset. Earlier this week I wrote an algorithm to convert images to Bezier curves, and now I've applied it to all of MNIST!
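For context, each sample roughly becomes a set of cubic Bezier segments; evaluating one is just the Bernstein form (illustrative numpy, not the conversion algorithm itself):

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=32):
    """Sample n points along one cubic Bezier segment (Bernstein form)."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# One illustrative stroke; a digit is a handful of such segments.
stroke = cubic_bezier(np.array([0.0, 0.0]), np.array([0.2, 1.0]),
                      np.array([0.8, 1.0]), np.array([1.0, 0.0]))
```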
A neat thing about this model is that it can produce multiple consistent views of a 3D object, allowing us to reconstruct 3D models of complex shapes. 5/8
It continues to astound me that the built-in PyTorch Transformer isn't really used/usable in any serious projects. They should consider updating the standard library, since this is such a crucial building block.
DDIMs are generative models with a latent space that is never learned or explicitly defined--it's implicitly created from the data distribution. Mind-bending stuff.
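Concretely, the deterministic (eta = 0) DDIM update never injects fresh noise, so the whole sampler is a deterministic map from the initial noise x_T to an image, and x_T ends up playing the role of the latent. My paraphrase of one step (a sketch, not anyone's actual code):

```python
def ddim_step(x_t, eps, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0).

    x_t       -- current noisy sample
    eps       -- model's noise prediction eps_theta(x_t, t)
    abar_t    -- cumulative alpha-bar at step t
    abar_prev -- cumulative alpha-bar at the previous (less noisy) step
    """
    pred_x0 = (x_t - (1.0 - abar_t) ** 0.5 * eps) / abar_t ** 0.5
    # No fresh noise is added, so iterating this from x_T is deterministic:
    # the initial noise is effectively the latent code.
    return abar_prev ** 0.5 * pred_x0 + (1.0 - abar_prev) ** 0.5 * eps
```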
MuZero with Self-competition for Rate Control in VP9 Video Compression
abs:
MuZero-based rate control achieves an average 6.28% reduction in size of the compressed videos for the same delivered video quality level (measured as PSNR BD-rate)
New blog post on three of my recent ML research projects that _didn't_ pan out. Negative results can be informative, and I think the ideas are still interesting!
Turning a NeRF into a textured mesh is an interesting way to explore what the model is actually learning. In this case, it's clear which regions the training camera angles don't cover.
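For anyone curious, a common recipe (a sketch under my assumptions, not my actual pipeline): sample density on a grid, run marching cubes, then query the NeRF's color at each vertex for the texture. The density function here is a dummy stand-in for the trained model:

```python
import numpy as np
from skimage import measure

def nerf_density(points):
    # Stand-in for querying the trained NeRF; here, a soft sphere of radius 0.5.
    return 0.5 - np.linalg.norm(points, axis=-1)

def nerf_to_mesh(resolution=128, bound=1.0):
    # Sample density on a regular grid inside the scene bounds.
    xs = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
    density = nerf_density(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)

    # Extract the iso-surface; vertex colors would come from querying the NeRF's RGB output.
    verts, faces, _, _ = measure.marching_cubes(density, level=0.0)
    verts = verts / (resolution - 1) * 2 * bound - bound   # grid indices -> world coordinates
    return verts, faces
```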
Working on a new kind of discrete VAE that can generate good samples without any auto-regressive prior. Here are samples and test-set reconstructions on MNIST with a 60-bit(!!!) prior:
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
- A linear-time RNN whose hidden state is updated by a gradient step at every token, i.e., test-time training (rough sketch below)
- Achieves better perplexity than Mamba
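My rough mental model of that update (heavily simplified; the real layer uses learned projections and a fancier inner model, so treat this as a sketch of the idea only):

```python
import torch

def ttt_linear_step(W, x_t, lr=0.1):
    """One token of a toy test-time-training layer.

    The "hidden state" is the weight matrix W of a tiny inner model. At each token we take
    one gradient step on a self-supervised loss, then use the updated W to produce the output.
    """
    W = W.detach().requires_grad_(True)
    corrupted = 0.5 * x_t                               # stand-in for the paper's corruption/projection
    inner_loss = ((corrupted @ W - x_t) ** 2).mean()    # self-supervised reconstruction loss
    (grad,) = torch.autograd.grad(inner_loss, W)
    W_new = (W - lr * grad).detach()                    # the gradient step *is* the state update
    return W_new, x_t @ W_new                           # updated state, token output

d = 16
W = torch.zeros(d, d)
for x_t in torch.randn(8, d):   # a tiny "sequence"
    W, out = ttt_linear_step(W, x_t)
```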
Everyone seems excited about Mamba. What I can't glean is whether anybody has tried evals other than log-loss. Do we have a strong reason to believe these models actually compete with Transformers on realistic tasks?
As a totally hacky and preliminary demo, check out the web app I made for my VQVAE + diffusion voice changer!
I'd say it works reasonably well for me 50% of the time. Curious to see what you all can make it do!
Speaking of consistent characters, how about becoming movie stars? Here, the model is able to depict me and
@gabeeegoooh
as detectives in a stunning movie poster. Note how our names and the movie title are rendered properly! 3/8
What we really want is algorithms that use lots of knowledge and reasoning to choose what to explore, how to act, how to attribute credit, etc. Results on single games do not look like this at all. 5/5
Getting scooped on a really good idea sucks, but sometimes I have to remind myself that nothing is as bad as being forgotten as the inventor of calculus.
I'm reading this 2011 paper on parallelizable pseudorandom number generators, and it's packed with surprisingly useful knowledge for someone like me with no background in cryptography.
There's something unsettling about SimCLR. The paper shows that the network, at some intermediate layer, has more information relevant to classification than it does right before the output layer used for the contrastive loss. Why should this be the case?
I have now learned that it's a fallacy to equate occupancy with efficiency! It's possible to fully utilize the hardware with low occupancy by leaning on instruction-level parallelism.
@cHHillee
pointed me to this very elucidating slide deck:
I thought the codefusion paper was too good to be true. How could such a tiny model learn to write _any_ meaningful code? The thing they didn't mention in the abstract is that they use a pretrained CodeT5 encoder, which is much larger than the entire rest of their architecture.
My wife's summary of my workday:
1. Fix a bunch of bugs
2. Introduce one bug that breaks everything
3. Launch a bunch of experiments to run overnight
4. Discover the bug from step 2 the next day
By generating multiple images within a shared context, and leveraging the model's amazing text-rendering capabilities, we can do neat things like create custom fonts. 6/8
Solution to my compression challenge. The files were encrypted -- simple as that.
Encrypted files "look like noise", but they _do_ have structure and _can_ be compressed--if you are willing to invest lifetimes of compute into it!
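A toy illustration of the point, with a 16-bit XOR "cipher" standing in for real encryption (with a real cipher, the key search below is the part that takes lifetimes of compute):

```python
import zlib

def xor_cipher(data: bytes, key: int) -> bytes:
    """Toy 16-bit XOR cipher; applying it twice with the same key decrypts."""
    return bytes(b ^ ((key >> (8 * (i % 2))) & 0xFF) for i, b in enumerate(data))

def compress_encrypted(ciphertext: bytes) -> bytes:
    """Brute-force the key, then store (key, compressed plaintext)."""
    for key in range(2 ** 16):
        if xor_cipher(ciphertext[:6], key) == b"HEADER":   # some recognizable structure in the plaintext
            recovered = xor_cipher(ciphertext, key)
            return key.to_bytes(2, "big") + zlib.compress(recovered)
    raise ValueError("key not found")

plaintext = b"HEADER" + b"very compressible " * 100
packed = compress_encrypted(xor_cipher(plaintext, key=0x1234))
print(len(plaintext), "->", len(packed))   # much smaller, even though the ciphertext looks like noise
```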
Has anybody tried scaling up LPIPS since the original work ~5 years ago? It seems like we could get a much better metric with modern backbones and a larger dataset of perceptual judgments.
Just read this paper. My main question is why you'd inpaint a ball instead of outpainting a panorama, since I think they can accomplish the same thing.
Paper:
I got a parking ticket in the Bay Area. You know what that means! I scraped ticket data from the online payment website. I estimate that all the tickets add up to between $79M and $130M.
I deeply regret my participation in the board's actions. I never intended to harm OpenAI. I love everything we've built together and I will do everything I can to reunite the company.
So what is my guess about how this model works? My best guess is that an image diffusion model is fine-tuned to produce multi-view images, possibly with an additional depth channel, and then a novel second stage produces a NeRF quickly from this output.
In this example, we can see just how well the model does at rendering a complex image. It uses two separate chat bubbles for the messages, renders a ton of text correctly all at once, and almost perfectly depicts a QWERTY keyboard. 7/8
New paper - CURL: Contrastive Unsupervised Representations for RL! We use the simplest form of contrastive learning (instance-based) as an auxiliary task in model-free RL. SoTA by *significant* margin on DMControl and Atari for data-efficiency.
Diffusion Models Beat GANs on Image Synthesis
Achieves 3.85 FID on ImageNet 512×512 and matches BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution.
There are so many parallels between GPU programming and distributed systems design. In both cases, you have to think about synchronization, communication overhead, locality, and parallelism. The only substantial difference is reasoning about hardware failures.
Just wasted a day or two because optimizer.load_state_dict() also restores the optimizer's saved hyperparameters. No wonder my resumed runs with a changed LR and other HPs barely behaved any differently.
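A self-contained repro of the gotcha and the obvious workaround (toy model and checkpoint, just to show the behavior):

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
checkpoint = {"optimizer": opt.state_dict()}     # pretend this came from an old run

# Resuming with a new LR: load_state_dict restores the *saved* param_group hyperparameters,
# so the lr passed to the constructor gets silently overwritten.
new_lr = 3e-4
opt = torch.optim.Adam(model.parameters(), lr=new_lr)
opt.load_state_dict(checkpoint["optimizer"])
print(opt.param_groups[0]["lr"])                 # back to 1e-3, not 3e-4

# Workaround: re-apply the hyperparameters you actually want after loading.
for group in opt.param_groups:
    group["lr"] = new_lr
```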
@marwanmattar
Some of the best ML researchers I've ever met do not have PhDs--or any degree, for that matter. It's your choice as a company how you want to filter candidates, but there are likely better ways.
Noticed this 2018 NeurIPS paper, which doesn't actually train any neural networks. Instead, it learns decision trees with sparse linear SVMs at the branches.
@rasbt
Disclaimer: I was an author on that paper. Yes, diffusion models will really shine for most use cases. GANs might still be better for very narrow domains, but otherwise they wouldn't be my first choice.
I sometimes work on small-scale ML projects on vacation. It's always a reminder that you can still have interesting ideas and iterate on them with only a single GPU.