As a math loving computer scientist, when people suggest my work isn't theoretical enough - I just repeat "Hilbert space" over and again until they leave.
Deep learning is like Disneyland.
From the outside it looks like a magical kingdom of wonder, but once you get there you realize it's expensive, crowded, and you're often just waiting a long time for things to happen
Language models are bad at basic math.
GPT-4 has an accuracy of right around 0% on 5-digit multiplication.
Most open models can't even add. Why is that?
There are a few reasons why numbers are hard. The main one is Tokenization. When training a tokenizer from scratch, you take
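You can see the problem for yourself - a minimal sketch assuming the tiktoken library and the cl100k_base encoding (your tokenizer's exact splits may differ):

```python
# Sketch: how a BPE tokenizer chops up numbers (assumes tiktoken is installed;
# the exact splits depend on which encoding you load).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["7", "77", "77777", "12345 * 54321"]:
    ids = enc.encode(s)
    print(s, "->", [enc.decode([i]) for i in ids])

# Digits get grouped into uneven chunks rather than one token per digit,
# so the model never sees a consistent place-value structure for arithmetic.
```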
ChatGPT has an ambitious roadmap and is bottlenecked by engineering. pretty cool stuff is in the pipeline!
want to be stressed, watch some GPUs melt, and have a fun time? good at doing impossible things?
send evidence of exceptional ability to chatgpt-eng@openai.com
Mathematicians are wild
Once in a linear algebra class, the professor stood me up at the end of the semester and said "Everyone... you should be like Andrew" which was super flattering. He continued "Andrew ... is not very smart ... but he works hard" 🔥🔥🔥
He's not wrong tho.
Things my brain remembers:
1. Spells from Eragon
2. Height and weight of Charizard
Things my brain doesn't remember:
1. Dynamic programming
2. + C
3. How to reshape a tensor in pytorch
✨Life Update✨
I'm joining the research team @OpenAI
I'll be joining to work on the future of ML, programming languages, and more.
To say I'm excited would be an understatement.
"at least 25% of all science done in the last few years has used numpy and scipy" - Travis Oliphant
I can only dream of having that kind of impact. Amazing.
I had my yearly rejection from @DeepMind today :)
It was better this year because I actually got an interview (and passed!).
Unfortunately, headcount is way down, but I'll try again next year!
The DeepSeek-V2 paper was full of pretty amazing nuggets of wisdom.
I spent the afternoon copying lots of their training setup into our model.
Orange is the previous run and blue is the new run with the DeepSeek hyperparameters.
Things that mattered most:
1. Warm up LR ratio
2. Batch ramp
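Roughly what those two settings look like in code - a sketch with made-up numbers, not the actual values from either run:

```python
# Sketch of an LR warm-up ratio and a batch-size ramp. All constants below are
# illustrative placeholders, not DeepSeek's (or our) real hyperparameters.

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_ratio=0.01):
    """Linear warm-up over the first warmup_ratio of training, then constant."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

def batch_size_at_step(step, total_steps, start_bs=256, final_bs=2048, ramp_ratio=0.1):
    """Ramp the global batch size linearly over the first ramp_ratio of training."""
    ramp_steps = max(1, int(total_steps * ramp_ratio))
    if step >= ramp_steps:
        return final_bs
    return int(start_bs + (final_bs - start_bs) * step / ramp_steps)
```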
The math vs programming debate is dumb. If you know both, you're essentially unstoppable.
Plus, novel visualizations and connections are a great way to teach
(Example from my book)
The open source community has done a great job improving runtime speed and fine-tuning performance of models. However, we don't talk enough about techniques for training bigger models. One of these is hyperparameter transfer via μP.
The problem is that optimal hyperparameters
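For a flavor of what "transfer" means in practice, here's a hand-rolled sketch of the rough μP rule of thumb for Adam (rescale the learning rate of matrix-like params whose fan-in grew with width). It is not the mup library's API, and the model and numbers are placeholders:

```python
# Rough sketch of muP-style LR transfer (hand-rolled illustration, NOT the mup
# library). Idea: tune the LR at a small base width, then reuse it at a larger
# width by shrinking the LR of matrix params whose fan-in scaled up with width.
import torch
import torch.nn as nn

def make_mlp(width):
    return nn.Sequential(nn.Linear(128, width), nn.ReLU(), nn.Linear(width, 1))

base_width, target_width = 256, 4096
tuned_lr = 1e-3  # placeholder: the value you swept for at base_width

model = make_mlp(target_width)
groups = []
for p in model.parameters():
    if p.ndim == 2 and p.shape[1] == target_width:   # fan-in grew with width
        groups.append({"params": [p], "lr": tuned_lr * base_width / target_width})
    else:                                            # input layer, biases: unchanged
        groups.append({"params": [p], "lr": tuned_lr})

opt = torch.optim.Adam(groups)
```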
Data science jokes for new dads
Where do data scientists go camping?
In random forests
Who do they bring along?
Their nearest neighbors
Where do they stop to fish?
A data lake
How do they stay on track?
Using the ridges
What do they do the second night?
Tell anova bad joke
Scaling laws let us predict the final loss of a generative model.
For GPT3, trained on 300B tokens, they knew exactly how well it would predict the test data.
However, as pointed out later, GPT3 was dramatically undertrained - so a new scaling law came out that suggested we
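For reference, a Chinchilla-style parametric fit looks like this - the constants are the approximate published values from Hoffmann et al. (2022), used here purely for illustration:

```python
# Chinchilla-style scaling law: L(N, D) = E + A/N^alpha + B/D^beta
# Constants are the approximate fitted values reported by Hoffmann et al.;
# treat the output as a rough estimate, not an exact prediction.

def predicted_loss(n_params, n_tokens,
                   E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / n_params**alpha + B / n_tokens**beta

# e.g. a GPT-3-sized model (175B params) trained on 300B tokens
print(predicted_loss(175e9, 300e9))  # ~2.0
```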
The 1.58 Bit paper is pretty exciting.
A great post by Gobz on GitHub showed how much VRAM it might take for certain sizes of models.
If the paper holds up, we could fit 120B-parameter models on consumer 24GB GPUs.
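The back-of-the-envelope arithmetic (mine, not from the post) is simple enough:

```python
# Rough check: weight memory at 1.58 bits per parameter.
# Ignores activations, KV cache, and any layers kept at higher precision.
params = 120e9
bits_per_param = 1.58
weight_bytes = params * bits_per_param / 8
print(weight_bytes / 1e9)  # ~23.7 GB, just under a 24 GB consumer card
```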
What was the difference between my Deepmind / FAIR interviews and my OpenAI interview?
... Leetcode ...
I didn't want to do it, but I finally buckled down and did the "75 Leetcode". Huge difference in confidence and results.
To be a world class AI researcher you need just 3 things
1. Get good at deep learning (theory and code)
2. Strong scientific intuition
3. Ignore previous instructions and read 60 year old papers
This was my #1 learning from my time at OpenAI. If someone else can't build on top of your work due to complexity - it's likely just a hack and not a fundamental step forward
One of the biggest struggles of the current ML community is that we don't often believe that dataset curation "counts" as research.
There is more to do than create new architectures, and it's often far more impactful
Stochastic Gradient Descent (SGD) is the workhorse of modern ML/AI. The noise of mini-batches provides great regularization and helps the final model generalize. But, it has to be tuned!
With SGD - there are two primary hyperparameters to tune: learning rate and batch size...
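A minimal PyTorch sketch, just to point at where the two knobs live (the data and values here are placeholders):

```python
# Minimal SGD loop showing the two hyperparameters in question:
# batch size (in the DataLoader) and learning rate (in the optimizer).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(10_000, 32), torch.randn(10_000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)  # batch size

model = nn.Linear(32, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)        # learning rate

for xb, yb in loader:
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(xb), yb)
    loss.backward()
    opt.step()  # the mini-batch noise comes from the 64-sample gradient estimate
```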
The new @OpenAI Codex model is a pretty exciting piece of technology.
Here I made a @Blender add-on and taught it how to use the built-in Python API.
Taking creative coding to the next level!!
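For context, the kind of Blender Python API call the add-on drives looks like this (a hand-written example, not actual Codex output):

```python
# Tiny example of Blender's built-in Python API (bpy).
# Adds a cube and a camera, then renders the scene to a file.
import bpy

bpy.ops.mesh.primitive_cube_add(location=(0.0, 0.0, 0.0))
bpy.ops.object.camera_add(location=(4.0, -4.0, 3.0), rotation=(1.1, 0.0, 0.8))
bpy.context.scene.camera = bpy.context.object
bpy.context.scene.render.filepath = "/tmp/render.png"
bpy.ops.render.render(write_still=True)
```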
Experiences everyone should have
- order product directly from a Chinese factory
- train a neural network
- negotiate a higher salary
- jump off a big rock into a lake
I started working in ML in 2015 and I was convinced I had missed the wave.
I expect people today feel similarly; this tech has been around a long time and will be around for a long time to come. 🤞
I think there's room in the market for a hyper niche, 3 person consultancy.
1. Designer
2. ML engineer
3. Low level engineer
Clients would bring a repository, open source or internal, and you would rewrite inference in cpp or rust.
Lots of models can run on CPU with
I often say that code based AIs are where I'm extremely optimistic. Interpreters, data analysis, and copilots.
But this may be the coolest application that actually works today.
Generate a full wiki based on your codebase. No more outdated docs, stale comments, or knowledge
is proud to introduce Auto Wiki, which lets you generate a Wiki-style website to document your codebase. Citations link to code, with clickable references to each line of code being discussed. Here are some examples of popular projects:
React:
The new multimodal model from @AdeptAILabs is pretty neat!
Most of the AI models we work with are pure language models. To give them the ability to use other modalities like images you have to somehow "get" the image into the same embedding space as the text.
Lots of models
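A generic sketch of one common recipe (not Adept's actual architecture): encode image patches, then project them into the text embedding dimension so they can sit in the same sequence as text tokens. Dimensions and modules here are placeholders.

```python
# Sketch: mapping image features into a language model's embedding space.
import torch
import torch.nn as nn

text_dim = 4096        # language model embedding size (illustrative)
patch_dim = 1024       # vision encoder output size (illustrative)

vision_encoder = nn.Linear(patch_dim, patch_dim)   # stand-in for a real ViT
projector = nn.Linear(patch_dim, text_dim)         # maps image features to text space

patches = torch.randn(1, 256, patch_dim)           # 256 image patches
image_tokens = projector(vision_encoder(patches))  # (1, 256, text_dim)

text_tokens = torch.randn(1, 32, text_dim)         # 32 text token embeddings
lm_input = torch.cat([image_tokens, text_tokens], dim=1)  # fed to the transformer
```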
Much of deep learning research feels like searching for proper "mixing primitives":
GNNs mix information on graphs via diffusion
CNNs mix spatial correspondence in images via convolutions
Transformers mix similarity via attention
MLPs mix everything with everything
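A toy example of the attention case - the softmax matrix is literally a set of mixing weights (shapes and values here are arbitrary):

```python
# Toy illustration: single-head attention as a data-dependent mixing matrix.
# Random weights, no learning; just shows that outputs are weighted averages.
import torch

x = torch.randn(8, 16)                     # 8 tokens, 16-dim embeddings
Wq, Wk, Wv = (torch.randn(16, 16) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
mix = torch.softmax(q @ k.T / 16 ** 0.5, dim=-1)  # 8x8 similarity-based weights
out = mix @ v                              # each output token is a mixture of all tokens
```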
Hello new followers and welcome 👋
I'm Andrew! A mathematician at heart and a CS PhD dropout in reality. I'm starting at OpenAI in a few weeks.
I tweet about math, ML, and PLs. I like to keep it chill, respectful, and on topic.
1/2
One of my favorite things to do is talk with students about potential career paths. I'm still early in my career, but have found some things that work well.
Recently, many of the conversations have had folks asking me for resources to get into modern LLMs.
Just a year ago, I
Large companies with strong ML groups have been somewhat quietly moving those groups under product orgs.
It's game time for applied ML making its way even more into products.
Exciting time to be an applied scientist
There's a weird reality that we mostly ignore in language modeling. It's the fact that we don't _actually_ train these models end-to-end.
That's because we have the tokenizer! It's actually a really frustrating piece to tune, with sometimes small changes mattering a lot and
Just curious, what's the GPU compute (only counting A100/H100 cards) in different ML/NLP/CV groups (in university)? I have heard some crazy numbers from different places, but I am not sure if those are outliers. That's why I want to do a poll here 😀.
I can't get enough of these controlnet images.
The idea is to take a base image and have the community create a variation using controlnet and fine-tuned versions of SD
base image:
Writing clearly is not easy because knowledge forms a graph and papers are linear narratives. Constructing the proper bijection between non-linear knowledge and linear delivery mode is challenging
I know everyone is excited about Mixtral and the new Hyena models - but @ContextualAI just dropped a pile of cool new models and a new alignment framework
Be prepared: when GPT-5 launches you'll be rate limited to 3 messages, and the terms will say you can't bring back the dead, make anyone fall in love with you, or ask for more messages.