My entire feed is OpenAI employees retweeting Sam with the heart emoji.
If the board doesn't let him back, he's going to start a new company and take a large chunk of those people with him.
If the board does let him back, Ilya is going to leave and start a competitor. (1/2)
We've found a new way to prompt language models that improves their ability to answer complex questions
Our Self-ask prompt first has the model ask and answer simpler subquestions. This structure makes it easy to integrate Google Search into an LM. Watch our demo with GPT-3 🧵⬇️
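For illustration, a minimal sketch of what a self-ask prompt looks like (a one-shot prompt in the paper's style; treat the exact wording as an approximation, and note that each "Follow up:" line is where a search engine can be plugged in):

```python
# One-shot self-ask prompt: the model is shown how to decompose a hard
# question into follow-ups before committing to a final answer.
PROMPT = """Question: Who lived longer, Theodor Haecker or Harry Vaughan Watkins?
Are follow up questions needed here: Yes.
Follow up: How old was Theodor Haecker when he died?
Intermediate answer: Theodor Haecker was 65 years old when he died.
Follow up: How old was Harry Vaughan Watkins when he died?
Intermediate answer: Harry Vaughan Watkins was 69 years old when he died.
So the final answer is: Harry Vaughan Watkins.

Question: {question}
Are follow up questions needed here:"""

def build_prompt(question: str) -> str:
    return PROMPT.format(question=question)

print(build_prompt("What is the calling code of the birthplace of Adele?"))
```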
There's no moat. You just need $400M and a bunch of good engineers and you can build your own GPT-4.
Now we gotta get someone to build an open version.
New (1h32m) video lecture:
Transformers From Scratch: Building 5 Language Models at Increasing Complexity Levels
It's an intuitive way to learn what every component of a modern transformer LM does and why they're there.
Cool new idea from DeepMind:
They evaluate LMs by giving them a piece of code, having them describe it, and then asking the LM to rewrite that code given only the description. The metric is the similarity between the original code and the rewritten code.
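The loop is easy to sketch; assuming a generic `lm(prompt) -> text` function, and using simple token overlap as a stand-in for whatever similarity metric they actually use:

```python
# Hypothetical sketch of the describe-then-rewrite evaluation loop.
# `lm` is any text-completion callable; token overlap here is only a
# placeholder for the real code-similarity metric.
def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def roundtrip_score(lm, code: str) -> float:
    description = lm(f"Describe this code:\n{code}")
    rewritten = lm(f"Write code matching this description:\n{description}")
    return token_overlap(code, rewritten)

# Toy "LM" that just echoes the payload, to show the plumbing:
identity_lm = lambda prompt: prompt.split("\n", 1)[1]
print(roundtrip_score(identity_lm, "def add(a, b):\n    return a + b"))  # → 1.0
```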
I'm sure this chaos and uncertainty sucks for all of those involved but if the world gets 2 strong competing LMing companies out of what used to be OpenAI, we'll all win... Especially if the Sam-led one ends up actually being a bit more open. (2/2)
Since Transformer LMs were invented, we’ve wanted them to be able to read longer inputs during inference than they saw during training. Our Attention with Linear Biases enables this, in very few lines of code, without requiring extra params or runtime 🧵⬇
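The core of ALiBi is just an additive, head-specific linear penalty on the attention scores. A minimal NumPy sketch (slope schedule as in the paper for power-of-two head counts):

```python
import numpy as np

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    # Head-specific slopes: a geometric sequence 2^-1, 2^-2, ... for 8 heads.
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]          # j - i; negative for the past
    return slopes[:, None, None] * distance[None]   # shape (heads, seq, seq)

# Add this to the attention scores before the causal mask and softmax.
# No learned position embeddings, no extra parameters.
print(alibi_bias(n_heads=8, seq_len=4)[0])
```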
As language models grow in size they know more, but do they get better at reasoning? To test GPT-3, we generated lots of questions such as "What is the calling code of the birthplace of Adele?".
We show that as GPT size grows, it does not improve its compositional abilities🧵⬇️
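Generating such 2-hop questions is mechanical once you have single-hop facts; a toy sketch (the tiny hand-written fact table is purely illustrative):

```python
# Compose a 2-hop question from two single-hop facts.
facts = {
    "birthplace": {"Adele": "the United Kingdom"},
    "calling code": {"the United Kingdom": "+44"},
}

def compose(outer: str, inner: str, entity: str):
    question = f"What is the {outer} of the {inner} of {entity}?"
    answer = facts[outer][facts[inner][entity]]
    return question, answer

print(compose("calling code", "birthplace", "Adele"))
# → ('What is the calling code of the birthplace of Adele?', '+44')
```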
Everyone thinks that you have to increase the input length of language models to improve their performance. Our new Shortformer model shows that by *shortening* inputs performance improves while speed and memory efficiency go up. ⬇(1/n) (code below)
Reddit launched in 2005. StackOverflow in 2008.
Both are shutting off access to their data because they're annoyed that they aren't getting paid when it gets used for LM training.
Silly move- the value of future data is minuscule given that we already have data from 2008-now.
@TheSeaMouse
Of course they would. He's one of the smartest people in ML.
I disagree with his views but I'm sure that lots of VCs either agree with him or don't care about those things.
ChatGPT can solve novel, undergrad-level problems in *computational complexity* 🤯
"Please prove that the following problem is NP-hard..."
Solution in next tweet -->
Credit:
@TzvikaGeft
(1/3)
I made a simple UI to ChatGPT that lets you easily build complex matplotlib plots, visualize them in the browser and get ChatGPT to solve your bugs.
Try it out at:
Open source on GitHub.
I needed to analyze data; would've taken me 15 mins to code
I asked GPT-4 to code it, because it's 2024 and humans don't code anymore
It made a mistake, so I asked it to fix it. There was a mistake in the fix, so I asked again
Anyways it's 1 hour later now and I still don't have the code
If someone doesn't stop Tim soon he's gonna run Guanaco-65B on a globally distributed cluster of 5 toasters and 3 electric toothbrushes at 70 tokens/sec.
I forgot how much better Guanaco-65B is compared to 33B. You can try here via Petals (globally distributed inference):
With Petals, you can also run a 65B model in a colab or locally on a small GPU at ~5 tokens/sec (see below).
Chapyter is a new Jupyter extension that lets ChatGPT assist you in writing Python notebooks. It can also read previous cells and the output of their execution.
This is awesome!
Transformers are made of interleaved self-attention and feedforward sublayers. Can we find a better pattern?
New work on *improving transformers by reordering their sublayers* with
@nlpnoah
and
@omerlevy_
I'm going to defend next week!
You're invited to the livestream!
I'll talk about ALiBi, evaluating models trained on different sequence lengths, and other things that I've worked on for the past few years. I'll also talk about what directions I think we should explore next.
It's been just 10 days since we launched SWE-agent but we already have 1.5k people in our Discord and lots of contributors on GitHub.
We've been making the agent easier to use and there are lots more exciting updates coming soon, including a web UI! Join us :)
SWE-agent is blazing fast, and when it works it feels like magic!
In this short demo I show how it solved a real bug in the neural network training code in scikit-learn. I also explain the process behind our agent-computer interface design choices.
I think GPT-5 will just be GPT-4 finetuned on agent trajectories, meaning that it'll be really good at:
1. Browsing the web: 'Find me a hotel near ICLR 2025 that has a pool'
2. Using GUIs: 'Remove the intro of this video clip'
3. Software Engineering: SWE-bench
Nicholas Carlini has a new HumanEval-esque (but slightly harder & more creative) benchmark for LMs, with ~100 tasks.
Results are as you'd expect:
gpt-4-0125: 49%
claude 2.1: 31%
gpt-3.5: 30%
mistral-medium: 25%
gemini-pro-1.0: 21%
When a student sadly tells me that the idea we've been working on for weeks was just arXived, I say:
"Great! We've just gotten *strong* confirmation that our thinking was in the right direction. We've had the initial work done for us. Let's figure out how to make this 10x better"
New preprint by
@Ale_Raganato
et al. shows that NMT models with manually engineered, fixed (i.e. position-based) attention patterns perform as well as models that learn how to attend. Super cool!
DeepMind's Gopher and BigScience's BLOOM already use relative position embeddings, but most other language models don't. I believe we should all start using relative positioning.
In this new post, I discuss the use case for relative position methods:
The next big leap in language modeling is going to come from finetuning them on agent trajectories. This will lead to a big accuracy improvement in end-to-end programming (e.g. SWE-agent), controlling desktop apps (e.g. OSWorld) and web browsing (e.g. SeeAct).
People are asking us how Claude 3 does with SWE-agent- not well. On SWE-bench Lite (a 10% subset of the test set) it gets almost 6% less (absolute) than GPT-4.
It's also much slower.
We'll have all the data in the preprint next week.
SciCode is our new benchmark, with 338 programming challenges written by PhDs in physics, math, and bio, based on papers in their fields. A bunch of the questions are from Nobel-winning papers!
I hope this becomes the new HumanEval.
SciCode is our new benchmark that challenges LMs to code solutions for scientific problems from advanced papers. The challenges were crafted by PhDs;
~10% of our benchmark is based on Nobel-winning research.
GPT-4 and Sonnet 3.5 get <5% ACC.
🧵 1/6
If Claude 2 turns out to be as strong as GPT-4, thereby breaking the OpenAI monopoly on strong LMing, the number of companies building products on top of LMs will increase substantially.
It's finally here- ALiBi is officially in FlashAttention 2!
As expected, it's much faster than the PyTorch implementation and just as fast as the Rotary FlashAttention implementation.
ALiBi is the simplest and best positioning method- go try it :)
People have been asking me why I think sinusoidal embeddings don’t extrapolate while ALiBi does.
I think it’s because with position embeddings, transformers don’t actually “understand” the concept of positioning.
The following is my hypothesis- 🧵⬇
If this result is correct, it will show that the LLaMA length scaling trick is almost useless: yes, it lets the model consume longer sequences, but it performs *much worse* across the board when extrapolating.
Better to use the original model with a sliding window.
I believe that in 6-12 months we'll have an open source GPT-4 replication.
But GPT-5 will be built based on immense amounts of human feedback collected like shown here and I'm not sure how the open community will replicate that
First time seeing ChatGPT give me 2 possible responses in this style (I did not press the "regenerate response" button). When did this functionality get added?
I assume it detected I had corrected it multiple times in a row to trigger this mode. Good way to gather RL data.
@karpathy
Super excited to push this even further:
- Next week: bitsandbytes 4-bit closed beta that allows you to finetune 30B/65B LLaMA models on a single 24/48 GB GPU (no degradation vs full fine-tuning in 16-bit)
- Two weeks: Full release of code, paper, and a collection of 65B models
Get state of the art results in word-level language modeling by simply shuffling the training data! Naively shuffling all of the sentences would not work, so here I present a new method that *partially* shuffles the training data:
Sandwiches will be served at ACL 2020! In our updated paper, we show that sandwiching improves strong models in word *and* character-level language modeling. We match the results of DeepMind's Compressive Transformer on enwik8 even though our model is both much faster and smaller.
Sparks of stupidity?
We've found a wide array of questions that lead GPT-4 & ChatGPT to hallucinate so badly that, in a separate chat session, they can point out that what they previously said was incorrect.
@zhang_muru
et al🧵⬇️
Cool new paper by
@XiangLisaLi2
and
@percyliang
that shows that you can train small, continuous vectors to act as 'prompts' for different downstream tasks in GPT-2 and BART.
Love is all you need, and attention may not be all you need! We show that a simple, attentionless translation model that uses a constant amount of memory performs on par with the Bahdanau attention model. with
@nlpnoah
Combining Self-ask + Google Search + a Python Interpreter leads to a super powerful LM that is really easy to implement and run!
Super excited to see what else people do with this :)
5/
Finally, a potential game changer:
They provide a simple API around "agents," or LLMs that can somehow interact with "tools" (like a python interpreter or search engine) in order to answer questions
The below program hits up both Google and Python to arrive at an answer
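The whole pattern fits in a few lines; a hypothetical minimal loop (the `Action:`/`Final:` protocol and the tool names are made up for illustration, not the library's actual API):

```python
# Minimal agent loop: the LM emits either a tool call or a final answer;
# tools are plain Python callables keyed by name.
def run_agent(lm, tools, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = lm(transcript)  # e.g. "Action: python: 2**10" or "Final: 1024"
        transcript += step + "\n"
        if step.startswith("Final:"):
            return step[len("Final:"):].strip()
        name, _, arg = step[len("Action:"):].strip().partition(": ")
        transcript += f"Observation: {tools[name](arg)}\n"
    return None

# Toy LM that calls the python tool once, then answers:
def toy_lm(transcript):
    if "Observation:" not in transcript:
        return "Action: python: 2**10"
    return "Final: 1024"

tools = {"python": lambda expr: str(eval(expr))}  # never eval untrusted input!
print(run_agent(toy_lm, tools, "What is 2 to the 10th?"))  # → 1024
```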
Last year we made Bamboogle, a set of questions that Google answered *incorrectly*
I was told that it's now used internally to eval search & LMs, and indeed some of the wrong answers were fixed
So apparently the best way to get support at Google is to write an EMNLP paper 😉 ->
GPT-4 probably trained on:
1. LibGen (4M+ books)
2. Some of Sci-Hub (80M+ papers)
3. All of GitHub
The Stack is an open source release of (3) but we don't really have open releases for 1 and 2.
I think this is a really important step in building the next strong & open model.
Figuring out how data (size and/or type) affects LM performance and how to get more of the types of it that we need (code? papers? YouTube video transcripts?) is the most important research direction in NLP right now.
I have met a lot of people who did their PhD on reinforcement learning.
None of them like RL.
I think there's a very good chance that we figure out how to do the 'alignment' step in LM training without RL. Hopefully without human annotations too. Self-align is a major step.
The progress on SWE-bench is nuts. I think my prediction of 2 systems surpassing 35% pass@1 on the full test set by Aug 1 will come true.
When we launched in October, nobody wanted to work on the dataset because it was considered "too hard" or "impossible". Acc was 1.96% then.
Google doesn't answer compositional questions well.
We made a dataset composed just of questions that Google answers incorrectly- Bamboogle.
We show that LMs also struggle with these Qs and that self-ask helps LMs answer these (better than CoT).
I heard of a company that bought 100+ H100s because they want to train their own LMs (to replace their GPT-4 usage), even though they have no one who knows how to train large LMs.
There's reason to be hyped about the future of LMing but some people are being silly. (1/2)
AI assistants have been improving but they still can't answer complex but natural questions like "Which restaurants near me have vegan and gluten-free entrées for under $25?"
Today we're launching a new benchmark to evaluate this ability.
I hope this leads to better assistants!
Can AI agents solve realistic, time-consuming web tasks such as “Which gyms near me have fitness classes on the weekend, before 7AM?"
We introduce AssistantBench, a benchmark with 214 such tasks.
Our new GPT-4 based agent gets just 25% accuracy!
We built a super tough benchmark to test whether models can browse the web to correctly attribute scientific claims.
The GPT-4o-powered agent gets 35%.
Also- the first author is my brother
@_jasonwei
+
@JerryWeiAI
: we're coming for you 🤠
Can AI help you cite papers?
We built the CiteME benchmark to answer that.
Given the text:
"We evaluate our model on [CITATION], a dataset consisting of black and white handwritten digits"
The answer is: MNIST
CiteME has 130 questions; our best agent gets just 35.3% acc (1/5)🧵
The only two numbers worth looking at here are GPQA and HumanEval
On GPQA the result is very impressive. On HumanEval, they compare to GPT-4's perf at launch. GPT-4 is now much better- see the EvalPlus leaderboard, where it gets 88.4
I bet OpenAI will respond with GPT-4.5 soon
Today, we're announcing Claude 3, our next generation of AI models.
The three state-of-the-art models—Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku—set new industry benchmarks across reasoning, math, coding, multilingual understanding, and vision.
Reddit still hasn't figured out how to be profitable but I don't really see how they get there by shutting off API access.
I get their frustration in not getting paid for "their" data, but it never really was "theirs". It's all user-generated...
Do transformers trained with absolute position embeddings overfit to specific positions?
Do transformers benefit from being trained on >1k context tokens?
How can we correctly evaluate LMs trained on different context lengths?
Watch my ALiBi talk!
Since lots of people are interested in ALiBi now, I'm sharing my video lecture here, which contains a lot of insights into how transformers work and why we wanted to make them work without position embeddings.
Code LLaMA has good results and good eval. It's cool to see PPL decrease all the way up to 100K tokens (after finetuning on 100K token-long inputs).
Facebook is close to replicating GPT-4 performance on HumanEval. Great news for the open source/science communities!
Asking Goldman Sachs about AI is as productive as asking a group of penguins about architecture.
AI is going to make programmers much more efficient, it's going to do other things as well, but just that programming bit is going to be worth more than $1TN.
From a recent Goldman Sachs report on generative AI: limited $$ upside, capital intensive, can't solve complex problems, no killer app. The bull case is that they somehow figure it all out or simply that "bubbles take a long time to burst." [PDF]
If you ask Google "When did
@chrmanning
's PhD advisor finish their PhD?" it won't answer correctly.
Self-ask + Google Search answers this correctly!
(Green text is generated by GPT-3, blue is retrieved from Google)
Play with this demo at:
ALiBi+FlashAttention runs faster than Rotary in realistic scenarios, sometimes with a substantial gap
This is even though the Rotary implementation uses Triton and the ALiBi one doesn't yet, meaning that there's more speed to be gained
Credit: shcho1118
I disagree- there's a lot still left to explore in how we can build *on top* of LMs to make them much more useful.
SWE-agent took GPT-4 from 1% on SWE-bench to 12%.
We didn't do any finetuning or training, so this type of research is super accessible to academics.
Hi Dan- I've got ~1,700 questions for you that no existing AI system can solve, you can get them at and you don't even have to pay me.
~250 of these unsolved questions have also been verified by humans as being definitely solvable.
Have a question that is challenging for humans and AI?
We (
@ai_risks
+
@scale_AI
) are launching Humanity's Last Exam, a massive collaboration to create the world's toughest AI benchmark.
Submit a hard question and become a co-author.
Best questions get part of $500,000 in
We just launched SWE-bench Multimodal, a brand new benchmark with 617 tasks *all of which have an image*.
This benchmark challenges agents in new but realistic angles.
We also launch SWE-agent Multimodal to start tackling some of these issues.
We're launching SWE-bench Multimodal to eval agents' ability to solve visual GitHub issues.
- 617 *brand new* tasks from 17 JavaScript repos
- Each task has an image!
Existing agents struggle here! We present SWE-agent Multimodal to remedy some issues
Led w/
@_carlosejimenez
🧵
I love this! Earlier layers in transformers make worse predictions than later ones so we can improve decoding performance by biasing against tokens that were assigned a lot of probability by earlier layers.
(1/5)🚨Can LLMs be more factual without retrieval or finetuning?🤔 -yes✅
🦙We find factual knowledge often lies in higher layers of LLaMA
💪Contrast high/low layers can amplify factuality & boost TruthfulQA by 12-17%
📝
🧑💻
#NLProc
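The core idea reduces to contrasting two distributions; a toy NumPy sketch (real DoLa selects the premature layer dynamically per token; this shows only the static gist):

```python
import numpy as np

# Layer-contrastive scoring: rank tokens by the log-ratio of the final
# layer's distribution to an earlier ("premature") layer's distribution,
# down-weighting tokens the early layers were already confident about.
def contrastive_logits(final_probs, early_probs, eps=1e-9):
    return np.log(final_probs + eps) - np.log(early_probs + eps)

final = np.array([0.5, 0.3, 0.2])    # final-layer token distribution
early = np.array([0.45, 0.1, 0.45])  # earlier-layer distribution
scores = contrastive_logits(final, early)
print(scores.argmax())  # token 1: the one the later layers "learned"
```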
Entering the research world is hard, my 6 tips:
* Research involves lots of failure, and that’s ok
* Don’t hide negative results
Working with a mentor:
* It’s ok to say I don’t understand/know/agree
* Advisors don't have all the answers
Full post at
About Scaled LLaMA:
If you train on 2k tokens and then extrapolate to 8k using this, your LM will actually only be looking back at 2k tokens during each timestep. So you are able to input longer sequences but perf. doesn't improve.
I explain this at
@ggerganov
@yacineMTB
Regular LLaMA 7B:
arc_c: 0.41 arc_e: 0.52 piqa: 0.77 wic: 0.5
Scaled LLaMA 7B:
arc_c: 0.37 arc_e: 0.48 piqa: 0.75 wic: 0.49
So it does seem to have a slight performance downgrade, but this is with zero finetuning (!)
AFAIK this is the first release of data that shows what people are actually using LMs for.
The top 2 uses:
30% is for generating/explaining code.
18% is for text manipulation: summarization, expansion, translation, QA about a given text.
(1/2)
Looking for use-cases people actually have for LLMs?
The folks from Vicuna did the number crunching for you! (from their recent 1M chat dataset)
Cluster 9: Requests for explicit and erotic storytelling
Cluster 20: Inquiries about specific plant growth conditions
go go go!
The days of easy LM benchmarking might be over. HELM puts davinci-002 above 003, but my experience with both models makes it pretty obvious that 003 is better than 002.
How can we build better benchmarks?
(HELM is *awesome* btw, I just think we need to rethink benchmarking)
OpenAI has privately announced a new developer product called Foundry, which enables customers to run OpenAI model inference at scale w/ dedicated capacity.
It also reveals that DV (Davinci; likely GPT-4) will have up to 32k max context length in the public version. 🔥
Predictions:
>=2 orgs will get 35% on SWE-bench by Aug 1, 2024.
A fully open source system will reach 35% by Nov 1, 2024. Probably based on SWE-agent + ACI improvements: debugger, better code retrieval, lang. server protocol. The LM will be finetuned on ~500 good trajectories
You can now download & run SWE-agent (on any GitHub issue) in 1 line!
Check our repo for deets:
Join our Discord to hear first about updates like this:
OpenAI just released a small subset of SWE-bench tasks, verified by humans to be solvable.
I would treat this subset as "SWE-bench Easy"- useful for debugging your system.
But eventually when you're ready for launch, we still recommend running on SWE-bench Lite or the full set
We're releasing a new iteration of SWE-bench, in collaboration with the original authors, to more reliably evaluate AI models on their ability to solve real-world software issues.
I don't think it's productive or effective for a PhD student to ever lead more than 1 project simultaneously.
If anything, I think leading 0.5 projects is even better (see SWE-bench & SWE-agent which Carlos and John co-led)
Focusing is really important.
Out of curiosity, do AI PhDs normally work (lead) on several projects simultaneously?
I have never managed to work on more than one project during my PhD and I tried to convince my students not to do so. The paradigm might have already changed, so I am asking here.
Just spoke
@WeizmannScience
about building benchmarks that are tough, natural & easily checkable
i.e.
A guy I didn't recognize in the front kept on asking questions the entire talk
After the talk I asked who it was
It was Adi Shamir, the S of RSA! OMG!!
"it is possible to distill an approximation of Stockfish 16 into a transformer via standard supervised training. The resulting predictor generalizes well to unseen board states, and, when used in a policy, leads to strong chess play (Lichess Elo of 2895 against humans)"
Awesome!
Google Deepmind presents Grandmaster-Level Chess Without Search
paper page:
largest model reaches a Lichess blitz Elo of 2895 against humans, and successfully solves a series of challenging chess puzzles, without any domain-specific tweaks or explicit
If you still think that in language modeling, bigger inputs always lead to better models, you should watch the first 5 minutes of my ACL presentation 🤓
Our poster session will be on Wed at 9am UTC, calendar event available at:
How many years is it going to take for the prompt
"a step by step diagram on how to make the first move in a chess game"
to lead to a correct output from the leading image generation model?
Here's DALLE-3's best current take:
I'm at
#NeurIPS
presenting my work on infinitely long ImageNet-C and test time adaptation. You can stream the dataset right now (), with no download required! Feel free to reach out to chat about robustness, domain adaptation, or related topics 😀
New rule: all new SWE-bench submissions must now include reasoning trajectories showing all the thoughts/actions/... the system took in order to solve the given issue.
You can use proprietary LMs, you can use proprietary actions, but we want to see your system logs.
HumanEval continues to be the best benchmark right now for LMs. I feel like usually results here correlate pretty well with performance on non-coding tasks too. Crazy to get so much value from ~100 manually written programming challenges.
Source:
I disagree- lots of things to work on in language modeling:
1. Find weird phenomena in LMs and understand why they happen- hallucination, the compositionality gap, the reversal curse.
2. Take my self-ask + google search system and use it to build a better
@perplexity_ai
🧵
As PhD applications season draws closer, I have an alternative suggestion for people starting their careers in artificial intelligence/machine learning:
Don't Do A PhD in Machine Learning ❌
(or, at least, not right now)
1/4 🧵
Our team at
@MosaicML
has been working on releasing something special:
We're proud to announce that we are OPEN SOURCING a 7B LLM trained to 1T tokens
The MPT model outperforms ALL other open source models!
Code:
Blog:
🧵