Introducing LongLLaMA 🦙, an unlimited-context version of OpenLLaMA fine-tuned at 8k & capable of extrapolating to 256k tokens!
We train it using our new Focused Transformer 🎯 technique (FoT). No degradation on short context, drop-in compatibility & Apache 2.0 license 🔥🔥
🧵
🎇Introducing LongLLaMA-Instruct 32K!🎇
Inspired by @p_nawrot's #nanoT5, we fine-tune LongLLaMA on a *single GPU* for ~48h to improve upon OpenLLaMA: 55% on lm-eval (vs. 53%), better perf on long context and code!
We open-source our optimized fine-tuning code in PyTorch/HF!🧵
✨Announcing LongLLaMA-Code 7B!✨
Have you wondered how GPT-3.5 obtained its capabilities?
Are base models of code better reasoners? 🤔
We continue pre-training CodeLLaMA on text & code to improve reasoning 🧠
Bonus: 3x faster inference @ 16K context, using Focused Transformer 🎯
Honored to win Poland's best CS master thesis prize for my work on long-context LLMs w/ @PiotrRMilos 🎉
Can't make it to #NeurIPS2023 😭, but @CStanKonrad will present the LongLLaMA paper tmr! Thu 10:45, Poster #326, Session 5.
Interested in extending context to 256K? Come and say hi!
LLMs struggle with solving simple competitive programming problems (e.g. Codeforces) outside their training data.
Our #ACL #NLRSE paper (Thurs 19:30) investigates their ability to comprehend and reason about human-coded solutions. Can they grasp the main idea from just the code?
Chatbot Arena update❤️🔥
Exciting news: @xAI's Grok-2 and Grok-mini are now officially on the leaderboard!
With over 6000 community votes, Grok-2 has claimed the #2 spot, surpassing GPT-4o (May) and tying with the latest Gemini! Grok-2-mini also impresses at #5.
Grok-2 excels in
Thrilled to be a first-gen MSc! 🎓 Just defended my thesis on ‘Fine-tuning Large Language Models for Long Context Utilization’ at the University of Warsaw.
Check out our recent work if you're curious how to extend the context of LLaMA 🦙 up to 256K while remaining efficient at inference!
Extremely lucky to get Focused Transformer (FoT) accepted at #NeurIPS2023 🎉! It is my first first-author paper at a big conference, which makes this moment even more special 🎇
Feel free to check out our recent LongLLaMA release using FoT!
Unthinkable under the ACL arXiv embargo policy
FoT is a simple modification of the vanilla Transformer: instead of increasing the context window length in all layers, we access previous windows of the training batch (containing tokens from the current and other docs) in a subset of attention layers called memory layers.
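Roughly, a memory layer just extends the keys/values it attends over. A minimal PyTorch sketch (names, shapes, and the caching scheme here are illustrative, not our actual implementation):

```python
import torch
import torch.nn.functional as F

def memory_layer_attention(q, k, v, mem_k, mem_v):
    # q, k, v: (batch, heads, local_len, head_dim) for the current local window.
    # mem_k, mem_v: keys/values cached from previous windows of the training
    # batch, which may contain tokens from the current and from other documents.
    k_ext = torch.cat([mem_k, k], dim=2)  # prepend memory keys
    v_ext = torch.cat([mem_v, v], dim=2)  # and memory values
    # Attention masking is omitted for brevity.
    return F.scaled_dot_product_attention(q, k_ext, v_ext)
```

Only the memory layers do this; all other layers attend over the local context window as usual.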
📄 We show that LMs suffer from the "distraction issue", i.e. they struggle to handle multiple documents in one context. Our Focused Transformer (FoT) training objective alleviates this by attending to tokens from the same doc (positives) and from other docs (negatives).
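Schematically, the cross-batch construction during training might look like this (a hedged sketch; d and all names are illustrative):

```python
import torch

def build_crossbatch_kv(pos_kv, neg_kvs, d):
    # pos_kv: (k, v) cached from previous windows of the SAME document (positive).
    # neg_kvs: list of (k, v) caches from OTHER documents in the batch (negatives).
    # With d contexts total, the memory layer sees 1 positive and d-1 negatives,
    # so training pushes the model to focus on relevant (same-doc) keys
    # and ignore the rest.
    ks = [pos_kv[0]] + [k for k, _ in neg_kvs[: d - 1]]
    vs = [pos_kv[1]] + [v for _, v in neg_kvs[: d - 1]]
    return torch.cat(ks, dim=2), torch.cat(vs, dim=2)
```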
Announcing the xAI PromptIDE
The xAI PromptIDE is an integrated development environment for prompt engineering and interpretability research.
It accelerates prompt engineering through an SDK that allows implementing complex prompting techniques and rich analytics that visualize
Unlike prior work focusing on position encodings, we follow and achieve extrapolation by simply keeping positional encodings constant for memory tokens, while leaving the local context intact. This makes LongLLaMA backward compatible with LLaMA inference code.
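In pseudocode, the position handling could look like this (a sketch; using index 0 for every memory token is an illustrative assumption):

```python
import torch

def fot_position_ids(n_mem, n_local):
    # All memory tokens share one constant position, so inference never sees
    # position values unseen in training, no matter how much memory we attach;
    # tokens in the local context keep their ordinary increasing positions.
    mem_pos = torch.zeros(n_mem, dtype=torch.long)
    local_pos = torch.arange(1, n_local + 1, dtype=torch.long)
    return torch.cat([mem_pos, local_pos])
```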
@Yampeleg In our Focused Transformer paper we propose the contrary view: packing multiple examples into one context can be beneficial if you optimize for long-context capabilities, because the model learns to ignore irrelevant tokens in context.
Surprisingly, we observe that apart from the expected gains in multi-document settings, FoT (d=2) also improves perplexity on long, single documents compared to training on positive documents only (d=1). We find this important, as the amount of long-context training data is limited.
We improve GSM8K from 13% to 17% after 35B tokens without in-distribution training.
We also publish Focused Transformer code for long-context pre-training, used in LongLLaMA!
GitHub:
HF (checkpoint):
arXiv:
Colab:
HF:
Code:
We also announce LongLLaMA-v1.1, a 3B base model trained on 5B tokens at 32K context with our FoT method: we improved long-context and code (12% HumanEval pass@1) capabilities
#ACL2023 is over!
It was so exciting to talk LLMs with y'all!
I hope that some of these insights will shape the future of large language models. Hopefully see you at #NeurIPS2023!
📄 We show that despite poor performance in solving competitive programming problems, LLMs exhibit a strong capacity for describing and explaining solutions. We try to disentangle the contributions of reasoning and coding when LLMs solve these problems.
Thanks to the Focused Transformer (FoT), our inference is >3x faster than the baseline at 16K tokens. We only use long-range attention in a subset of layers (3 out of 32), which amounts to only 10% of the vanilla attention FLOPs. See the analysis by @harmdevries77 for details
We obtain exactly the same HumanEval perf as CodeLLaMA, and improve MMLU from 37.2% to 40% (scoring setup, no CoT) and gsm8k-py from 23.4% to 24.9%. We also outperform LLaMA2.
SFT could unleash the full potential of this model; stay tuned for the instruct version - coming soon!
LongLLaMA-Instruct was initialized from LongLLaMA-v1.1 32K and fine-tuned with a context of just 2048 tokens(!) for 0.07 epochs. We observe that despite short-context fine-tuning, we don't lose the long-context capabilities of the base model (see the analysis by @Francis_YAO_).
As indicated by , current LLMs are no good at solving competitive programming problems outside of their training distribution, achieving a very low rating of 392 as reported by @OpenAI, corresponding to barely the 5th percentile of human competitors (avg. ~1450)
I suspect GPT-4's performance is influenced by data contamination, at least on Codeforces.
Of the easiest problems on Codeforces, it solved 10/10 pre-2021 problems and 0/10 recent problems.
This strongly points to contamination.
1/4
CodeLLaMA is a great model, but apparently it degrades GSM8K from 42.2% to 32.7% at the 34B size. Is there a reason for not using a more balanced mixture during fine-tuning, instead of 85% code? I feel like the community might close this gap very soon :)
Thrilled to be spotlighted in an interview with @AICoffeeBreak presenting my work at #ACL2023! Dive into the dialogue at 1:26 to catch my spicy 🌶️ insights on LLMs for code 👨🏻💻, competitive programming, and the "understanding" of these models. Don't miss it!
The base model is fine-tuned from OpenLLaMA v2 and released under the fully commercial Apache 2.0 license. We used a combination of OpenOrca and for SFT. We open-source the code to facilitate efficient instruction tuning on your own data
Using insights from designing our prompt for the backward reasoning process, we propose a structured prompt that boosts the solve rate of these models just from problem statement to code (the original task, without golden explanations in the input) from 6.1% to 9.1% for pass@10.
In human evaluation, we observe that GPT-4 is better than GPT-3.5 at understanding the main idea, but there's still a long way to go. We hypothesize our backward explanations could be useful to improve the model's forward reasoning process with techniques such as
Language models can dramatically improve their reasoning by learning from chains of thought that they generate.
With STaR, just a few worked examples can boost accuracy to that of a 30X larger model (GPT-J to GPT-3).
W. @ericzelikman, Noah Goodman
1/
Instead of solving the problem directly just from its NL description as LLM input (NL to code), we study the backward process of extracting the idea from a correct code solution. We show that our extracted rationales can significantly boost the solve rate of LLMs on CodeContests.
LongLLaMA-Code was trained with a modest 35B pretraining tokens (a mix of webtext & code) to improve reasoning.
While still exploratory, these results suggest base models of code are a promising avenue for enhancing reasoning capabilities
@Francis_YAO_
Which LLMs are generally good at math and which are overfitting to benchmarks?
With the release of Grok, @xai evaluated several closed models on a Hungarian national finals math exam which was published after the models were trained. This means it is impossible to train on or
🎇Introducing LongLLaMA-Code 7B Instruct 🦙!🎇
A step towards an open-source alternative to Claude 2.
Runs in 🆓 Colab (8-bit).
🗨 Answers questions about 📑 papers and >_ code.
SOTA 7B reasoning:
🎓 GSM8K: 65% with 🐍 PoT 0-shot, 42% in the std CoT 8-shot setting.
>_ HumanEval: 37%
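A rough loading recipe (a sketch: the checkpoint id, flags, and prompt are assumptions - see the HF repo/Colab for the exact snippet):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "syzymon/long_llama_code_7b_instruct"  # assumed HF id - check the repo
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    load_in_8bit=True,       # 8-bit weights (bitsandbytes) fit a free Colab GPU
    trust_remote_code=True,  # custom FoT attention code lives in the repo
)
prompt = "Explain what this paper proposes:\n..."
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```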
@Francis_YAO_ Perplexity/LM loss gains (compression) seem pretty clear, e.g. from the Memorizing Transformers paper, and context could be seen as an additional scaling dimension. Imo it's harder to quantify the benefits of applying LCLMs, due to the lack of reasonable downstream benchmarks 🤔
@XueFz @DrJimFan Not sure if GPT-4 is capable of solving novel LeetCode medium/hard problems (outside of the training data). IMHO these tasks still require some amount of reasoning before coding. Once you come up with the idea, the LLM can code it up pretty easily. Check our work for more
In the colab demo we try to feed the entire Focused Transformer paper into context and ask questions about it, achieving reasonable results. In the same colab, we also provide a chat interface to interact with the model!
This was an amazing project primarily done by Jierui Li from UT Austin with my help, advised by Yingying Wu and Raymond Mooney. Acknowledgements to @IDEAS_NCBR for making the in-person presentation possible. For me, the project was huge fun, bringing back NOI and ICPC memories
LongLLaMA v1.1 shows competitive performance on long-context retrieval tasks (see the evaluation by @nelsonfliu), without degradation after instruction tuning. Also, the short-context performance on downstream tasks improves due to instruction tuning: 55% vs. 53% on lm-eval
Exploring the roots of the famous "6ND" formula, I stumbled upon this insightful post by @DBahdanau about LLM training compute estimation 👉 . Still on point!
Wondering what >64k contexts/flashattn bring to the table 🧐, does FFN still dominate attention?
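For reference, the formula is just C ≈ 6·N·D: roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward. A toy calculation with example values:

```python
N = 3e9        # parameters (example value)
D = 1e12       # training tokens (example value)
C = 6 * N * D  # total training FLOPs: ~2 forward + ~4 backward per param-token
print(f"{C:.1e} FLOPs")  # 1.8e+22
```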
I'm presenting this paper on explaining competitive programming solutions with LLMs at 1:30pm (in 10 min) in the big poster hall at #ACL2023 - number 24, drop by and say hi!
@Francis_YAO_ There are very few evaluation datasets for long-context LLMs. shows a comparison between open-source models and Claude, which is pretty miserable - still a long way to go for OSS 🤔, but developing a leaderboard like MT-Bench for LCLMs would be immensely useful!
@JohnGal43951639 @Yampeleg The memory overhead from longer context is much smaller than for standard models, since we only access k/v from the extended context in a tiny fraction of layers (3 out of 26). With a simple trick we fit a 32K context onto a Colab GPU (see the HF repo):
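Back-of-envelope, per sequence in fp16 and assuming OpenLLaMA-3B-ish dims (26 layers, hidden size 3200, multi-head attention) - all numbers illustrative:

```python
bytes_fp16, hidden = 2, 3200
ctx, local = 32768, 2048

def kv_bytes(layers, length):
    # Key + value cache: 2 tensors of (length, hidden) per layer.
    return 2 * hidden * length * bytes_fp16 * layers

full = kv_bytes(26, ctx)                       # every layer caches 32K k/v
fot = kv_bytes(3, ctx) + kv_bytes(23, local)   # only 3 memory layers go long
print(full / 2**30, fot / 2**30)               # ~10.2 GiB vs ~1.7 GiB
```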
Nothing works better for chilling out than Google reclaiming your compute in the middle of training, 2 days before a planned release. You can just watch Netflix or write some Codeforces round 🤣
Our model is also significantly faster at inference, since only a subset of layers (3/26) attends to tokens beyond the local context window. LongLLaMA-Instruct is a competitive 3B model you can run on a Colab or locally and use as a long-context chatbot (see the Colab demo!)
Everyone's invited to stop by the poster session of the NLP-OSS Workshop at #EMNLP, where you can see this piece-of-art poster for yourself in person
This is the last post about nanoT5 from me - if you haven't seen it, check out
Thanks for all the kind feedback!
@4evaBehindSOTA @p_nawrot @CStanKonrad The FoT pretraining code is written in JAX; we're planning to release it after the LongLLaMA v2 release - stay tuned :))
No plans to implement Focused Transformer pretraining in PyTorch, so if you'd like to learn, doing this could be a great exercise and useful for the community!
@PiotrRMilos I just feel lucky there are conferences that allow putting papers on arXiv and making an impact way before acceptance, which is not true for some other conferences 😉. Thanks a lot to the entire team for the extremely hard work!
@CStanKonrad @MikolajPacek @Yuhu_ai_ @hmichalewski Amazing initiative with COLM 🎉! One more step for inclusivity could be making the location accessible to all 🌍. It'd cut down on the visa hassle 🛂 that often plagues U.S. conferences (e.g. #NeurIPS2023). Any thoughts on the potential location? 🙂
Introducing COLM (), the Conference on Language Modeling: a new research venue dedicated to the theory, practice, and applications of language models.
Submissions: March 15 (it's pronounced "collum" 🕊️)
@KujoJot32604166 The current state is roughly here:
We are working towards better pretraining data for long context, and are very likely to release something new in Oct/Nov :)
ACL has removed the anonymity period.
This means that ACL submissions can be posted and discussed online at any time, although extensive PR is discouraged.
@tugot17 @dylan522p @mvpatel2000 @abhi_venigalla The pretraining data mixture for LLaMA 2 is not public I guess, but it doesn't seem to be qualitatively better than LLaMA 1's, given its perf / train tokens. I think Mistral used much more code in the mixture, which is likely to help on reasoning benchmarks (although that's still speculative)
@atgambardella Wish I could motivate myself to study Mandarin 2h daily... typically it's 30 min at most. Extremely good job on your side, and fingers crossed you master Japanese!
@4evaBehindSOTA @p_nawrot @CStanKonrad The code would have been much more performant in PyTorch (FlashAttention etc.); it's just a compute constraint that I only have TPU compute to pretrain on, so PyTorch is not useful for me, but it definitely would be for the community!
@minimario1729 @terese14711217 @AICoffeeBreak My suspicion is that OpenAI's GPT has already seen multiple epochs of solutions (based on its pre-cutoff solve rate) & an unclear number of epochs of editorials. You can check out our work to see how GPT-generated editorials/rationales can boost performance.
@Yampeleg Specifically, in Figure 6 we study what happens if we train the model on multiple examples, compared to just the same example. Perplexity on long, single documents gets better for a model that could see multiple unrelated examples in context, which we find quite surprising.
@zhu_zhaocheng Pretty much, I'm interested in learning Mandarin these days :) I've just seen in what contexts my friends use the fake smile and do the same myself hhh
@KujoJot32604166 Also, we are working on an instruction-tuned version of LongLLaMA-Code 7B. It should be released in early Oct, and from preliminary results we expect it to be pretty good :) stay tuned!
@atgambardella Rn on my side it's mostly memorising basic vocab and learning to distinguish between tones hhh, but I hope I can start listening to some comprehensible input / reading simple stuff soon, like I did with European languages
@dylan522p @mvpatel2000 @abhi_venigalla You mean training-time FLOPs? I guess they have much higher-quality data than LLaMA 2, so I wouldn't be too surprised if they do even 2x better in terms of training FLOPs to achieve the same downstream perf :p
@laion_ai I don't think it can compare even to GPT-3.5 in terms of math/coding/reasoning performance. Looking forward to their paper and GSM8K & HumanEval numbers. Given their data mixture, it's not looking promising ngl :))
@abacaj I totally agree with that - training data is the most important factor for long-context utilisation. I hope some upcoming ICLR submission will discuss this question!