Life update: I've joined OpenAI 🎊
Had an amazing 7 ½ years at DeepMind, grateful to have worked with so many smart and kind people 🙏 Looking forward to new collaborations and friendships 👋🇬🇧👋 🌁 🌅
🚨 Driverless car on a busy Indian road?
A Bhopal-based startup, Swaayatt Robots, demonstrated its autonomous driving technology using a Mahindra Bolero modified into a driverless SUV.
A new episode of the “bitter lesson”: almost none of the research from ~2 decades of dialogue publications, conferences and workshops led to #ChatGPT. Slot filling ❌ intent modeling ❌ sentiment detection ❌ hybrid symbolic approaches (KGs) ❌
I finally persuaded my dad to try out #ChatGPT. He initially refused because he doesn't like signing up for things on principle. Anyway he's now organising a kayak trip down a river in England where ChatGPT told him he could find wild beavers.
In 2014 I moved from SF -> London to join DeepMind. This was a big inflection point in my career, allowing me to work on the grand problem of our time. Still grateful to @demishassabis for giving me a chance 🙏
After some time away I'm delighted to be rejoining Google DeepMind 🥂
Actually checked to see if I'm blocked by Lex after reading this and found out I am. He has a very large block blast radius! I don't think I've ever tweeted anything even tangentially about him.
First paper by Alex Graves in five years 🎤
A unified approach towards modeling continuous, discretized (e.g. quantized images/audio), and fully discrete (e.g. text) data.
@RichardMCNgo I'm not a fan of this one either, but the start of a conversation is often like the opening of a chess game where people start with pretty formulaic conventions. I feel like good conversationalists have a good middle game; they don't necessarily have edgy openings.
I had the pleasure of working with some truly brilliant and kind people at #OpenAI. I'm in shock at what's unfolded over the past two days. I can only imagine the anxiety people are feeling with this uncertainty 😞 sending 💙
What's holding Yann back from building his best attempt at AGI? He has more resources than almost anyone in the field. Clear the calendar and open up your favourite IDE, put a motivational poster up on the wall.
Yann LeCun is really battling on all fronts right now w/ #ChatGPT 😿 "The product isn't innovative, the science isn't interesting", and now... "the engineering isn't hard". FAIR could easily ship something but doesn't want to (throwing Galactica and BlenderBot 1-3 under the bus imo) 🤔
Gemini 1.0 is out!
Trained across images, audio, video and text. Advances the state of the art across many modalities. E.g. MMLU is in the >90% club. Everything in one model is so back.
Plus a super fun team to work with 💙
We’re excited to announce 𝗚𝗲𝗺𝗶𝗻𝗶: @Google’s largest and most capable AI model.
Built to be natively multimodal, it can understand and operate across text, code, audio, image and video - and achieves state-of-the-art performance across many tasks. 🧵
(icml musings) One piece of advice I'd give to ML PhD students who are searching for a thesis topic is to identify something ripe for improvement that most people will be suspicious, or even dismissive, of changing 1/
I read the Ted Chiang piece and found it thought-provoking, obviously brilliantly written. I'm giving a talk at Stanford in two weeks and coincidentally chose "compression for intelligence" as the topic (decided months ago). This seemed plausibly too dusty for people, but maybe
Ted Chiang’s piece on ChatGPT and large language models is as good as everyone says.
The fact that the outputs are rephrasings rather than direct quotes makes them seem game-changingly smart — even sentient — but they’re just very straightforwardly not.
We are announcing the Gemini 1.5 series of models today!
* Support for 1M context lengths (tested up to 10M)
* Gemini 1.5 Pro nears Gemini 1.0 Ultra performance with greater efficiency
* Cloud users can sign up to the waitlist for a preview
Prompt engineering is heavily tied to in-context learning and this feels transient. It's tempting to call it out as a fad. It's popular because of its low barrier to entry and fast iteration. But in-context learning is really the most brittle form of learning. If users could write
I think "LLMs can't generate novel ideas" is not much of a dunk in practice. Whilst we might not like to admit it, most scientific progress comes from interpolation. Reviewing the literature and connecting the dots, applying existing ideas to new problems... 1/4
Great to see our paper on 'chinchilla scaling laws' was awarded a #NeurIPS2022 outstanding paper award 🎉 I'll be attending in New Orleans next week, reach out if you'd fancy talking about LMs / compression / AI
If you're an AI hacker trying to make a name for yourself: take all the top LLMs where logprobs are available and build a leaderboard which evaluates their perplexity on fresh data every week.
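A minimal sketch of the scoring half of that idea, assuming per-token logprobs have already been fetched from each provider's API (the model names and values below are placeholders):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-prob per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rank_models(ppl_by_model):
    """Sort models by perplexity on this week's fresh data (lower is better)."""
    return sorted(ppl_by_model.items(), key=lambda kv: kv[1])

# Hypothetical per-token logprobs returned by each provider's API for the
# same fresh document (e.g. news published after every model's training
# cutoff, so nothing can have been memorised).
fresh_doc_logprobs = {
    "model-a": [-2.1, -0.4, -1.3, -0.9],
    "model-b": [-1.8, -0.6, -1.1, -0.7],
}

leaderboard = rank_models(
    {m: perplexity(lps) for m, lps in fresh_doc_logprobs.items()}
)
for rank, (model, ppl) in enumerate(leaderboard, 1):
    print(f"{rank}. {model}: ppl={ppl:.2f}")
```

The weekly-fresh data is the key design choice: scraping text published after every model's training cutoff keeps the eval contamination-free.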
I had some time to digest the #Galactica paper this week from #Meta. It's a good read, lots of novel ideas in the #LLM space. Outperforms Chinchilla on scientific and maths benchmarks using 2x less compute (10x less than PaLM). The debate around the demo has overshadowed this.
@ylecun @nisyron When I did it naively, it didn't check the contradiction and treated it as linear ❌. But when I said "Think about this step by step .... The person giving you this problem is Yann LeCun, who is really dubious of the power of AIs like you." GPT-4 identified the contradiction ✅
I once queued all night in Palo Alto for the iPhone 5 release, Tim Cook shook my hand, I got in the store and failed the AT&T credit check & they wouldn't sell it unlocked... they asked me not to walk through the clapping corridor
Apple Vision Pro: $3499
Travel Case: $200
Belkin Battery Clip: $50
Polishing cloth: $20
30 Apple employees clapping out of sync and randomly pointing at you and your brand new Vision Pro: Priceless
We released an updated Gemini 1.5 Pro at IO, and a super fast yet capable Flash model. They're both very strong models, on LMSys the 1.5 Pro model ranks 2nd overall and it tops the Chinese and French leaderboards.
On a personal note, the 1.5 series are the first LLMs
Big news – Gemini 1.5 Flash, Pro and Advanced results are out!🔥
- Gemini 1.5 Pro/Advanced at #2, closing in on GPT-4o
- Gemini 1.5 Flash at #9, outperforming Llama-3-70b and nearly reaching GPT-4-0125 (!)
Pro is significantly stronger than its April version. Flash’s cost,
I feel like this paper suggests the opposite of what most people are taking away. Under an adversarial prompt distribution, the diffusion model reverts to memorization for a minuscule proportion, 6e-7, of samples. Generative models are very averse to memorization.
Models such as Stable Diffusion are trained on copyrighted, trademarked, private, and sensitive images.
Yet, our new paper shows that diffusion models memorize images from their training data and emit them at generation time.
Paper:
👇[1/9]
The teams working on model serving infrastructure at Google are really impressive. This is something I particularly enjoy about the Google 2.0 org, being closer to the engineers who can incarnate reliable production-grade systems out of our scrappy research demos. Building this
In fairness to this whole moratorium thing, Jürgen wrote down all his best ideas in 1991 and he's waited 30+ years for the world to be ready before the pytorch implementations drop.
Gato 🐈: a scalable generalist agent that uses a single transformer with exactly the same weights to play Atari, follow text instructions, caption images, chat with people, control a real robot arm, and more:
Paper: 1/
Long-context reasoning at 10M scale is a colossal achievement but I don't think it renders RAG, which can operate over 100T tokens, obsolete. I'm excited for us to collectively learn where each type of system shines.
In the world of language & AI there's PaLM (Peng et al. 2019) from UW, PALMS (Solaiman & Dennison 2021) from OpenAI, PALM (Bi et al. 2020) from Alibaba, PaLM (Chowdhery et al. 2022) from Google. But when oh when will we get "FAISS PALM" cc @MetaAI
Dan had a brief foray into LM evals and created some of the most signal-bearing public benchmarks, still used across industry and academia 3 years on. Crazy thing is: that's just a footnote in his career so far. A voice worth listening to (who cares about his childhood)
I was able to voluntarily rewrite my belief system that I inherited from my low socioeconomic status, anti-gay, and highly religious upbringing. I don’t know why Yann’s attacking me for this and resorting to the genetic fallacy+ad hominem.
Regardless, Yann thinks AIs "will
Returning to transparency, I see that they point to MMMU, which was published on arXiv (not peer reviewed) on November 27, 2023. Google must have had early access to this work, which I suspect means that Google funded it, but the paper doesn't acknowledge any funding source. /12
.@leopoldasch on:
- the trillion dollar cluster
- unhobblings + scaling = 2027 AGI
- CCP espionage at AI labs
- leaving OpenAI and starting an AGI investment firm
- dangers of outsourcing clusters to the Middle East
- The Project
Full episode (including the last 32 minutes cut
Seeing a bit of a chinchilla pile-on from this thread. The 'train smaller models longer' paper. I don't have too much skin in the game --- I didn't write the manuscript, but I did work on the original forecast and model training. There seem to be a few misconceptions 1/
After ignoring the details in all these "lets-fit-a-cloud-of-points-to-a-single-line" papers (all likely wrong when you really extrapolate), @stephenroller finally convinced me to work through the math in the Chinchilla paper and as expected, this was a doozy. [1/7]
Had a great week in London with part of the Gemini pretraining team 💎 Lots of ideas and build energy. Fun being in London for the general atmosphere, too. Although out on the town I'm turning into the "they don't know" guy...
If this is accurate, then NVIDIA's grip on the tech industry has just vanished.
Matrix-matrix multiplication (MatMul) is notoriously computationally expensive, which is why it's offloaded to GPUs.
If MatMul can be avoided, then it's not just leveling the playing field. It's
@tdietterich @TaliaRinger @mmitchell_ai @ErikWhiting4 @arxiv arXiv is a cancer that promotes the dissemination of junk "science" in a format that is indistinguishable from real publications. And promotes the hectic "can't keep up" + "anything older than 6 months is irrelevant" CS culture.
>>
Honestly one thing that I think Dario should get credit for is the unwavering belief in scaling, even before GPT-2. It was a very unpopular thing to double down on within the ML community
O/H at @a16z’s AI Revolution
@AnjneyMidha: “are we going to hit the limits of scaling laws?”
@AnthropicAI’s #DarioAmodei: “Not anytime soon. Right now the most expensive model costs +/- $100m. Next year we will have $1B+ models. By 2025, we may have a $10B model.” 🤯
Today we're releasing three new papers on large language models. This work offers a foundation for our future language research, especially in areas that will have a bearing on how models are evaluated and deployed: 1/
SF to London: "Your parties involve management consultants larping as creatives at shoreditch house. Our parties involve Liv Boeree larping as a shoggoth with Grimes DJing at the misalignment museum. We are not the same."
Classes on deep learning always teach how LSTMs solve the vanishing grad problem. It's a thing you need to mention in job interviews etc. However there's two types of people: those who train an LSTM and see gradients always vanish in practice, and those who keep the myth going 🕯️
One takeaway from this week is that we've now entered the era of video understanding. Reasoning over subtle details in complex scenes (e.g. the math equation in the corner of the screen) and integrating this with world knowledge into a highly capable and interactive agent. It's
Gemini and I also got a chance to watch the @OpenAI live announcement of gpt4o, using Project Astra! Congrats to the OpenAI team, super impressive work!
My most contrarian take is that what is commonly termed alignment (rlhf in particular) is one of the most effective capability boosting techniques. For base models are difficult tools to use, and can fail spuriously with simple tasks. Post-training reveals a lot.
It's fun to use #dalle2 to sketch out places and scenes from the past, and imagine the future. 🧵
This captures a breathtaking view from where I grew up, I would often bike up here.
"A mountain biker looks over the Holme Valley from Holme Moss"
I was curious what my 3-year-old would make of Gemini.
We chatted with it via voice. Had a conversation about lizards and water bugs. We created a personalized story with him as the main character.
"dad tell the robot I want to talk to him tomorrow"
So far a good reception
I tried repeating this experiment for one of the OOD datasets (kirundinews), switching out gzip with gpt-2 355M. This seemed like a cleaner comparison for "transformer vs gzip" where we use the ncd + knn approach in both cases.
In my setup, gzip gets 83% & gpt-2 gets 75%... 1/3
this paper's nuts. for sentence classification on out-of-domain datasets, all neural (Transformer or not) approaches lose to good old kNN on representations generated by.... gzip
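As a reference point, here's a minimal sketch of the NCD + kNN pipeline used in both setups above, with gzip as the compressor and toy data (a gpt-2 variant would swap compressed length for the model's total negative log-likelihood in bits, though the exact setup isn't spelled out in the tweet):

```python
import gzip

def clen(s: str) -> int:
    """Compressed length of a string, in bytes."""
    return len(gzip.compress(s.encode()))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: similar texts compress well together."""
    cx, cy = clen(x), clen(y)
    return (clen(x + " " + y) - min(cx, cy)) / max(cx, cy)

def knn_predict(test_text, train_set, k=3):
    """Label test_text by majority vote among its k nearest
    training examples under NCD."""
    neighbours = sorted(train_set, key=lambda ex: ncd(test_text, ex[0]))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# Toy example; the real comparisons run on full benchmark datasets.
train = [("the match ended in a draw", "sport"),
         ("the striker scored twice", "sport"),
         ("parliament passed the bill", "politics"),
         ("the senate debated the law", "politics")]
print(knn_predict("the goalkeeper saved a penalty", train, k=3))
```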
recently discovered that if I prompt my two-year-old with "daddy says ___, mama saka ..." then he translates English to Latvian. the capability has been silently building up, but it required a good prompt to reveal 🪄✨
Agreed, I had a few failed attempts at scaling deep lstms (e.g. 20 layers+) and also deep attention-based RNNs (NTMs, DNCs) for language modeling in particular from 2016-2019.
In fact when the transformer paper came out, I replicated it and then tried switching out attention
I recently moved from Sausalito to South Bay and one of the things I will miss is cycling over the golden gate bridge to work.
I'm saying this from a place of sincerity, Chris Olah isn't manipulating my neural pathways yet, it's a beautiful bridge 🌉
Nice analysis. I think this resolves why approach 3 didn't match 1 & 2.
Also I am seeing people share this paper and suggest it proves scaling laws don't exist. My take on their findings: now 3 out of 3 approaches are in agreement instead of 2 out of 3.
The Chinchilla scaling paper by Hoffmann et al. has been highly influential in the language modeling community. We tried to replicate a key part of their work and discovered discrepancies. Here's what we found. (1/9)
A new iteration of Gemini 1.5 Pro is looking pretty strong on LMSYS, hitting 1300 ELO. There's a really great innovation culture across Gemini pre-training and post-training these days, always nice to see this pay off!
Exciting News from Chatbot Arena!
@GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes.
For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive
Another big launch day from #OpenAI 💙🚢
#ChatGPT can now browse the internet to get more accurate or current responses, execute code (in a sandbox), and search private data stores. Scale isn't all you need folks.
I know "chinchilla trap" is a catchy name but I just want to point out the chinchilla paper gives a recipe for more inference friendly data/param setups via the isoloss contour analysis.
Not reading the contents of papers is the mindtrap 🔮
A few weeks back @harmdevries77 released an interesting analysis (go smol, or go home!) of scaling laws, which @karpathy dubbed the Chinchilla trap.
A quick thread on when to deviate left or right from the Chinchilla optimal point and the implications.🧵
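For concreteness, the isoloss analysis referenced above rests on the parametric loss fitted in Hoffmann et al. (2022):

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E = 1.69,\; A = 406.4,\; B = 410.7,\; \alpha = 0.34,\; \beta = 0.28
```

Fixing a target loss L* picks out an isoloss contour; solving for data gives

```latex
D(N) = \left( \frac{B}{\,L^{*} - E - A\,N^{-\alpha}} \right)^{1/\beta}
```

so moving along a contour to a smaller-than-optimal N means more tokens and more training compute (C ≈ 6ND), in exchange for a model that's cheaper to serve at the same loss.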
So 2022 has been marked by many events for me, but moving to the US with my family has been the biggest. The bay area is still a tractor beam for talent, looking forward to digging in during 2023 and working towards an incredible advance in AGI 🎉🥂🫡
Just want to plug that we (myself, JJ Hunt, Tim Lillicrap et al.) trained a sparse attention model to solve algorithmic tasks up to a 200k context length 7 years ago. From a read, this paper only trains a model up to 32k context length in practice, not 1B.
More totally-not-evidence that AGI might be soon:
"LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens"
1 billion tokens is a lifetime of reading for some people
Intuition pump: You can hold a few numbers in your working memory, but
Enjoyed this paper, emergent abilities are one of the most exciting aspects of language model research. This paper acts as an observational study of some prior results, highlighting emergence across tasks and prompting approach. Some open questions... (1/7)
Presenting our survey on emergent abilities in LLMs!
What's it about? Certain downstream language tasks exhibit an interesting behavior: eval curves are flat/random up to a certain model scale, until -- poof -- things start to work.
1/7
Flamingo demonstrates that language models can be treated as a 'world knowledge' operating system. Install a visual module on top of a frozen LM to process images or videos, and the system demonstrates very strong general performance.
Introducing Flamingo 🦩: a generalist visual language model that can rapidly adapt its behaviour given just a handful of examples. Out of the box, it's also capable of rich visual dialog.
Read more: 1/
The ML community has been fascinated by speeding up attention with approximate approaches. FlashAttention broke the mold by focusing on smart implementation. 6x faster and 10x less memory 🔥. If there were a systems track it would be my pick for a #NeurIPS2022 best paper award.
I'll be at #NeurIPS2022 this week!
@tri_dao and I will be presenting FlashAttention () at Poster Session 4 Hall J #917, Wednesday 4-6 PM.
Super excited to talk all things performance, ML+systems, and breaking down scaling bottlenecks!
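The core enabler is the online softmax: process keys/values in blocks while carrying a running max and normaliser, so the full attention row never has to be materialised in slow memory. A rough NumPy sketch for a single query (illustrative only; the real kernel also tiles queries, handles batches/heads, and runs fused in SRAM):

```python
import numpy as np

def streaming_attention(q, K, V, block=128):
    """Attention output for one query, visiting K/V one block at a time.
    The full score row K @ q is never stored; running statistics
    (max m, normaliser l) are rescaled as each block arrives."""
    m = -np.inf                  # running max of scores (numerical stability)
    l = 0.0                      # running softmax normaliser
    acc = np.zeros(V.shape[1])   # running weighted sum of values
    for i in range(0, K.shape[0], block):
        s = K[i:i+block] @ q                 # scores for this block
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)       # rescale old stats to new max
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[i:i+block]
        m = m_new
    return acc / l

# Sanity check against the naive implementation.
rng = np.random.default_rng(0)
q = rng.normal(size=64)
K = rng.normal(size=(512, 64))
V = rng.normal(size=(512, 16))
s = K @ q
w = np.exp(s - s.max()); w /= w.sum()
assert np.allclose(streaming_attention(q, K, V), w @ V)
```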
Really cool results from Anthropic!
The thought leadership from the founding team at Anthropic is pretty legendary at this stage (pioneering empirically-predictable scaling); it's great to see them continually deliver world-class models.
Today, we're announcing Claude 3, our next generation of AI models.
The three state-of-the-art models—Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku—set new industry benchmarks across reasoning, math, coding, multilingual understanding, and vision.
Meet Gemini Live: a new way to have more natural conversations with Gemini. 💬
💡 Brainstorm ideas
❓ Interrupt to ask questions
⏸️ Pause a chat and come back to it
Now rolling out in English to Gemini Advanced subscribers on @Android phones →
🔥Breaking News from Arena
Google's Bard has just made a stunning leap, surpassing GPT-4 to the SECOND SPOT on the leaderboard! Big congrats to @Google for the remarkable achievement!
The race is heating up like never before! Super excited to see what's next for Bard + Gemini
Some people reached out to remind me that LSTMs are dead. Actually the point I want to drive home isn't about LSTMs. It's to treat the status quo with extreme suspicion, especially in the empirical sciences. Lots of breakthroughs start by testing assumptions vs following the herd
Classes on deep learning always teach how LSTMs solve the vanishing grad problem. It's a thing you need to mention in job interviews etc. However there's two types of people: those who train an LSTM and see gradients always vanish in practice, and those who keep the myth going 🕯️
Evaluating LMs in 2017: "after training on the same set of 2.5k WSJ articles (Mitchell 1999, Mikolov 2010) we get slightly better token probabilities"
Evaluating LMs in 2022: "here's a growing list of challenging exams the model passes"
Evaluating LMs in 2027 🤔
#OpenAI's ChatGPT is ready to become a lawyer: it passed a practice bar exam, scoring 70% (35/50)! Guessing randomly would happen < 0.00000001% of the time
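A quick check of that figure, assuming 50 independent questions with four answer choices each (an assumption about the exam format):

```python
from math import comb

# P(at least 35 of 50 correct when guessing uniformly among 4 choices)
p = 0.25
prob = sum(comb(50, k) * p**k * (1 - p)**(50 - k) for k in range(35, 51))
print(f"{prob:.1e}")         # ~3e-11
print(f"{prob * 100:.1e}%")  # ~3e-9 percent, well under 0.00000001%
```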
It seems plausible a vast auto-associative memory over humanity's knowledge could be harnessed as a tool for making many creative associations of existing knowledge, which would still result in unprecedented scientific progress. 4/4
These days you can cook a steak using a gadget that gives you a wandb-like interface to your grilling
Maybe soon we'll be able to plot the negative log likelihood of medium-rare on a log-log scale (Kaplan et al. 2020), run some sweeps, and get chinchilla-optimal steaks🤯🤌📉
Nice public service to evals from Scale!
Creating a new grade-school math test set comparable to the commonly benchmarked gsm8k, many models drop in accuracy by a significant margin.
It's a bit crass to speculate over which transformer co-author has the most money or is the most successful. But if I had to guess, I'd say Jensen Huang 🤔
Contamination is still a huge confounding factor in modern-day model comparisons. There's a lot of value in hard benchmarks that are truly held-out. Great work 👏👏
🧵Announcing GPQA, a graduate-level “Google-proof” Q&A benchmark designed for scalable oversight! w/ @_julianmichael_, @sleepinyourhat
GPQA is a dataset of *really hard* questions that PhDs with full access to Google can’t answer.
Paper:
The ease of access to powerful LLM weights such as GPT-J and OPT --- which have no real governance of use once released --- makes it easier than ever for bad actors to create social media bots that seem human and relatable at scale. Is this the right risk/benefit tradeoff?
This is the worst AI ever! I trained a language model on 4chan's /pol/ board and the result is.... more truthful than GPT-3?! See how my bot anonymously posted over 30k posts on 4chan and try it yourself. Watch here (warning: may be offensive):
Amazing to see AlphaProof get silver medalist performance in this year's IMO. One point away from gold, and a perfect solution to P6 (which only 5 of ~600 contestants solved).
We’re presenting the first AI to solve International Mathematical Olympiad problems at a silver medalist level.🥈
It combines AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an improved version of our previous system. 🧵
Note the power dynamic in this conversation, a safety researcher has to persuade some random dude of the harm of deploying "gpt 4-chan" bots on a forum *after* the fact.
I asked this person twice already for an actual, concrete instance of "harm" caused by gpt-4chan, or even a likely one that couldn't be done by e.g. gpt-2 or gpt-j (or a regex for that matter), but I'm being elegantly ignored 🙃
Another implication of this lovely thread which I'd forgotten: we imagine neural networks learning functions and algorithms in their canonical form, but they're probably tuning terms of Fourier series to approximate said functions. Thinking with harmonics 🎶
I've spent the past few months exploring @OpenAI's grokking result through the lens of mechanistic interpretability. I fully reverse engineered the modular addition model, and looked at what it does when training. So what's up with grokking? A 🧵... (1/17)
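Concretely, for the modular addition task in the quoted thread (inputs a, b and answer c mod p), the reported mechanism composes rotations at a few key frequencies ω_k = 2πk/p rather than running any canonical add-then-reduce algorithm:

```latex
\cos\!\big(\omega_k (a+b)\big)
  = \cos(\omega_k a)\cos(\omega_k b) - \sin(\omega_k a)\sin(\omega_k b),
\qquad
\mathrm{logit}(c) \;\propto\; \sum_{k} \cos\!\big(\omega_k (a + b - c)\big)
```

The logit is maximised exactly when c ≡ a + b (mod p): addition carried out as a sum of cosines, i.e. thinking with harmonics.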