Tim Dettmers

@Tim_Dettmers

Followers: 35K
Following: 4K
Media: 132
Statuses: 3K

Creator of bitsandbytes. Research Scientist @allen_ai and incoming professor @CarnegieMellon. I blog about deep learning and PhD life at https://t.co/Y78KDJJFE7.

Seattle, WA
Joined October 2012
@Tim_Dettmers
Tim Dettmers
6 months
After 7 months on the job market, I am happy to announce:
- I joined @allen_ai
- Professor at @CarnegieMellon from Fall 2025
- New bitsandbytes maintainer @Titus_vK
My main focus will be to strengthen open-source for real-world problems and bring the best AI to laptops 🧵
155
86
2K
@Tim_Dettmers
Tim Dettmers
2 years
QLoRA: 4-bit finetuning of LLMs is here! With it comes Guanaco, a chatbot on a single GPU, achieving 99% ChatGPT performance on the Vicuna benchmark. Paper: Code+Demo: Samples: Colab:
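For readers who want to try this, below is a minimal sketch of 4-bit QLoRA-style finetuning with the Hugging Face stack. The base model name, LoRA rank, and target module names are placeholder assumptions (not the paper's exact setup), and argument names can differ between transformers/peft/bitsandbytes versions.

```python
# Minimal QLoRA-style sketch (illustrative; model name and hyperparameters are placeholders).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # placeholder base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type from the QLoRA paper
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # compute in bf16 on top of 4-bit storage
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # names assume a LLaMA-style model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then train the adapter with your usual Trainer / SFT loop.
```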
89
930
4K
@Tim_Dettmers
Tim Dettmers
3 months
This is the most important paper in a long time. It shows with strong evidence we are reaching the limits of quantization. The paper says this: the more tokens you train on, the more precision you need. This has broad implications for the entire field and the future of GPUs 🧵
@tanishqkumar07
Tanishq Kumar
3 months
[1/7] New paper alert! Heard about the BitNet hype or that Llama-3 is harder to quantize? Our new work studies both! We formulate scaling laws for precision, across both pre- and post-training. TL;DR:
- Models become harder to post-train quantize as they
66
491
3K
@Tim_Dettmers
Tim Dettmers
3 years
I am excited to share my latest work: 8-bit optimizers – a replacement for regular optimizers. Faster 🚀, 75% less memory 🪶, same performance 📈, no hyperparam tuning needed 🔢. 🧵/n. Paper: Library: Video:
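As a rough illustration of "drop-in replacement", a usage sketch with bitsandbytes (the toy model and hyperparameters are placeholders, not a recommended setup):

```python
# Sketch: replace torch.optim.Adam with the 8-bit Adam from bitsandbytes.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()                   # toy placeholder model
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)  # instead of torch.optim.Adam(...)

x = torch.randn(16, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```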
18
279
1K
@Tim_Dettmers
Tim Dettmers
2 years
@karpathy Super excited to push this even further:
- Next week: bitsandbytes 4-bit closed beta that allows you to finetune 30B/65B LLaMA models on a single 24/48 GB GPU (no degradation vs full fine-tuning in 16-bit)
- Two weeks: Full release of code, paper, and a collection of 65B models.
39
191
1K
@Tim_Dettmers
Tim Dettmers
2 years
We present SpQR, which allows lossless LLM inference at 4.75 bits with a 15% speedup. You can run a 33B LLM on a single 24GB GPU fully lossless. SpQR works by isolating sensitive weights with higher precision and roughly doubles improvements from GPTQ:
36
297
1K
@Tim_Dettmers
Tim Dettmers
2 years
We release LLM.int8(), the first 8-bit inference method that saves 2x memory and does not degrade performance for 175B models, by exploiting emergent properties. Read more: Paper: Software: Emergence:
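A toy illustration of the core idea: feature dimensions with emergent outliers stay in 16-bit, everything else goes through an int8 matmul with row/column scaling. This is not the library's actual kernel code, just a sketch of the decomposition; the threshold value is an assumption.

```python
# Toy mixed-precision decomposition in the spirit of LLM.int8() (illustrative only).
import torch

def toy_int8_matmul(x, W, threshold=6.0):
    # x: [tokens, in_features], W: [in_features, out_features]
    outliers = (x.abs() > threshold).any(dim=0)          # feature dims with emergent outliers
    x_hi, W_hi = x[:, outliers], W[outliers, :]          # tiny 16-bit path
    x_lo, W_lo = x[:, ~outliers], W[~outliers, :]        # int8 path for the rest

    sx = x_lo.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0   # per-row scale
    sw = W_lo.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.0   # per-column scale
    xq = (x_lo / sx).round().clamp(-127, 127)
    wq = (W_lo / sw).round().clamp(-127, 127)

    return (xq @ wq) * sx * sw + x_hi @ W_hi             # dequantized int8 product + outlier part

y = toy_int8_matmul(torch.randn(4, 512), torch.randn(512, 256))
print(y.shape)  # torch.Size([4, 256])
```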
17
249
1K
@Tim_Dettmers
Tim Dettmers
5 years
How can you successfully train transformers on small datasets like PTB and WikiText-2? Are LSTMs better on small datasets? I ran 339 experiments worth 568 GPU hours and came up with some answers. I do not have time to write a blog post, so here is a Twitter thread instead. 1/n.
15
307
1K
@Tim_Dettmers
Tim Dettmers
1 month
Reading the report, this is such clean engineering under resource constraints. The DeepSeek team directly engineered solutions to known problems under hardware constraints. All of this looks so elegant -- no fancy "academic" solutions, just pure, solid engineering. Respect.
@deepseek_ai
DeepSeek
1 month
🚀 Introducing DeepSeek-V3! Biggest leap forward yet:
⚡ 60 tokens/second (3x faster than V2!)
💪 Enhanced capabilities
🛠 API compatibility intact
🌐 Fully open-source models & papers
🐋 1/n
19
107
974
@Tim_Dettmers
Tim Dettmers
2 years
We release the public beta for bnb-int8 🟪 for all @huggingface 🤗 models, which allows for Int8 inference without performance degradation up to scales of 176B params 📈. You can run OPT-175B/BLOOM-176B easily on a single machine 🖥️. You can try it here: 1/n
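The HF-side usage at the time of this beta looked roughly like the sketch below. The model name is a small placeholder rather than OPT-175B/BLOOM-176B, and newer transformers versions express the same option via BitsAndBytesConfig(load_in_8bit=True).

```python
# Sketch: Int8 inference through the bitsandbytes + transformers integration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-3b"   # placeholder; the release targets OPT-175B / BLOOM-176B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```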
27
222
911
@Tim_Dettmers
Tim Dettmers
4 years
Updated GPU recommendations for the new Ampere RTX 30 series are live! Performance benchmarks, architecture details, Q&A of frequently asked questions, and detailed explanations of how GPUs and Tensor Cores work for those that want to learn more:
31
247
880
@Tim_Dettmers
Tim Dettmers
2 years
In the RTX 40 post, I introduce a GPU recommendation chart and discuss the new Tensor Memory Accelerator (TMA) and FP8 computation. Overall, RTX 40s are faster for inference and shine through their FP8 performance but are inefficient for 16-bit training.
38
168
871
@Tim_Dettmers
Tim Dettmers
2 years
The 4-bit bitsandbytes private beta is here! Our method, QLoRA, is integrated with the HF stack and supports all models. You can finetune a 65B model on a single 48 GB GPU. This beta will help us catch bugs and issues before our full release. Sign up:
25
152
847
@Tim_Dettmers
Tim Dettmers
2 years
The result of long days of CUDA optimizations: the new bitsandbytes release includes 4-bit inference, which is up to 4.2x faster than 16-bit inference (bsz=1). Full HF integration for all models. No code change needed. Bnb is growing rapidly, just shy of 1M installs/month 🧵
24
144
851
@Tim_Dettmers
Tim Dettmers
2 months
Just to clarify this benchmark: this is an apples-to-oranges comparison.
- Cerebras is fast for batch size 1 but slow for batch size n.
- GPUs are slow for batch size 1 but fast for batch size n.
I get >800 tok/s on 8x H100 for a 405B model for batch size = n. Cerebras' system.
@CerebrasSystems
Cerebras
3 months
Llama 3.1 405B is now running on Cerebras!
– 969 tokens/s, frontier AI now runs at instant speed
– 12x faster than GPT-4o, 18x Claude, 12x fastest GPU cloud
– 128K context length, 16-bit weights
– Industry's fastest time-to-first token @ 240ms
28
84
829
@Tim_Dettmers
Tim Dettmers
2 years
Finished RTX 4090 modeling. Not good. If you have an RTX 3090, probably best to wait 4 years for chiplets and consumer HBM. This is what dead Moore's law looks like. You can only scale cost/perf with features, but you can only add Tensor Cores once. We are stuck. More soon!
26
55
699
@Tim_Dettmers
Tim Dettmers
1 year
Just a reminder that the default hyperparameters of LoRA perform poorly. You need to attach LoRA modules to all layers for it to perform as well as full fine-tuning. Once you do that, we find there is no difference between LoRA and full fine-tuning.
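For concreteness, a sketch of the difference using peft's LoraConfig; the module names below assume a LLaMA-style architecture and are illustrative, not a prescribed recipe.

```python
from peft import LoraConfig

# Common default-style config: LoRA only on the attention query/value projections.
attention_only = LoraConfig(r=8, lora_alpha=16,
                            target_modules=["q_proj", "v_proj"],
                            task_type="CAUSAL_LM")

# What the tweet recommends: attach LoRA modules to all linear layers (attention + MLP).
all_linear_layers = LoraConfig(
    r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```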
@Shahules786
ikka
1 year
LoRA is not a drop-in replacement for full finetuning. Even though it reduces the compute requirements by 3x, it comes with certain limitations. The data preparation needed for both is also different. 🔑
- LoRA requires much more data to converge compared to full FT. This can be
17
110
661
@Tim_Dettmers
Tim Dettmers
2 years
I got excited about a paper, implemented it, and then saw that they cheated: (1) copy baseline results from another paper, (2) do much more hyperparameter tuning on their own method, (3) get accepted to EMNLP. The results look good, but their method is crap! Why waste people's time like this?
39
38
612
@Tim_Dettmers
Tim Dettmers
6 years
I just updated my full deep learning hardware for the latest recommendations and advice. I reframed the blog post to help you avoid the most costly mistakes when you are building a deep learning machine.
11
156
504
@Tim_Dettmers
Tim Dettmers
3 months
This is actually a great argument for using MoEs. When I think about MoEs, I think about the cerebellum and its relationship to the rest of the brain. Here is my intuition: The human brain has ~20% "recurrent" neurons (cerebrum) and ~80% MoE-style forward neurons (cerebellum).
@EranMalach
Eran Malach
3 months
MoEs increase parameter count but not FLOPs. Do they offer a "free lunch", improving performance without paying in compute? Our answer: for memorization, MoEs give performance gains "for free", but have limited benefit for reasoning! Arxiv: 🦜🦜🦜
20
77
521
@Tim_Dettmers
Tim Dettmers
4 months
Open-source models beating closed models will become more and more common. Scaling has diminishing returns. The best solution will not come from the largest scale but from the best approach or data. Especially with test-time compute, you do not need the best model to have the best solution.
@allen_ai
Ai2
4 months
Meet Molmo: a family of open, state-of-the-art multimodal AI models. Our best model outperforms proprietary systems, using 1000x less data. Molmo doesn't just understand multimodal dataโ€”it acts on it, enabling rich interactions in both the physical and virtual worlds. Try it
11
80
504
@Tim_Dettmers
Tim Dettmers
2 months
The new bitsandbytes is here:
- ~15% faster 4-bit
- ~70% faster 8-bit inference
- 8-bit support for H100s
Great engineering from @mattkdouglas. bitsandbytes now receives about 100,000 installations daily. A little history on 8-bit implementations in bnb 🧵
7
70
487
@Tim_Dettmers
Tim Dettmers
4 years
I am curious why people are not talking more about the OpenAI scaling law papers. For me, they seem very significant. What I heard so far: "Too complicated. I don't understand and I don't care", "NLP is not physics". Other criticism? Any insights into why people ignore it?
22
68
450
@Tim_Dettmers
Tim Dettmers
4 years
New GPUs have arrived, and they come with GDDR6X! You can expect a ~45% speed increase with the RTX 3090 vs RTX 2080 Ti. 3-slot-width is a problem though as is the fan-design. 4x RTX 2080 Ti >> 2x RTX 3090. 24GB mem is great, but RTX 3080 with 10GB is not very useful.
20
38
422
@Tim_Dettmers
Tim Dettmers
2 years
Our work on loss spikes and stable 8-bit CLIP training is the largest Int8 training to date (1B). We introduce the SwitchBack layers and StableAdamW to ensure stability at these scales. Work with the awesome @Mitchnw. Paper: Colab:
4
97
427
@Tim_Dettmers
Tim Dettmers
4 months
I use Qwen models in my work. They are very high quality, and they have an extended hierarchy of model sizes, making it very easy to study scaling behavior in detail. Qwen 2.5 is awesome!
@BlancheMinerva
Stella Biderman
4 months
Qwen and DeepSeek don't get nearly as much applause and attention as they deserve.
5
26
426
@Tim_Dettmers
Tim Dettmers
6 years
I just updated my GPU recommendation blog post! I included the RTX Titan and GTX 1660 Ti in my analysis. The analysis now separates word RNNs from char RNNs/Transformers. I also recommend TPUs for larger transformers/CNNs. This and more in the update:
6
124
421
@Tim_Dettmers
Tim Dettmers
4 months
Just as a warning: I tried all of these, and all of these worked ... at the small scale. When scaled up, none of these worked for me (except padding embeddings -- but what you should really do is optimize the layout to align with memory tiles). That being said, I do not want to.
@kellerjordan0
Keller Jordan
4 months
New NanoGPT training speed record: 3.28 Fineweb validation loss in 15.2 minutes. Previous record: 22.3 minutes. Changelog:
- pad embedding to nearest 64
- switch from GELU to ReLU²
- zero-init projection layers
- QKNorm
All four changes driven by @Grad62304977. 1/8
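As a small illustration of the "pad embedding to nearest 64" item above (the concrete vocabulary and model sizes are just example numbers):

```python
# Sketch: round the vocabulary size up to a multiple of 64 so the embedding and
# output-projection matmuls align better with GPU tile sizes.
import math
import torch

vocab_size = 50257                                  # e.g. GPT-2's vocabulary
padded_vocab = 64 * math.ceil(vocab_size / 64)      # -> 50304

d_model = 768
embedding = torch.nn.Embedding(padded_vocab, d_model)          # extra rows are never indexed
lm_head = torch.nn.Linear(d_model, padded_vocab, bias=False)   # extra logits are ignored/masked
```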
21
32
414
@Tim_Dettmers
Tim Dettmers
2 months
I am very confused by Claude 3.6. It is more convincing than 3.5, but it has more subtle yet very consequential hallucinations. This got to a point where I no longer "trust" the model. In my workflow, I now have an extra step to debug its own outputs. How is your experience?
56
12
406
@Tim_Dettmers
Tim Dettmers
7 years
Out of nowhere: far better translator than Google. Begs the question: Can Google be overtaken in search too? #dlearn
17
220
368
@Tim_Dettmers
Tim Dettmers
5 months
Surprisingly many details here (for OpenAI-level secrecy) on how they built the model.
@OpenAI
OpenAI
5 months
Some of our researchers behind OpenAI o1
12
23
394
@Tim_Dettmers
Tim Dettmers
2 years
Looking at the comments, some people missed the Guanaco-33B demo because it was added later: Big thanks to @huggingface for sponsoring this demo! The second thing I noticed was that people were a bit lost on how to use the adapters. So here is a tutorial 🧵
@Tim_Dettmers
Tim Dettmers
2 years
QLoRA: 4-bit finetuning of LLMs is here! With it comes Guanaco, a chatbot on a single GPU, achieving 99% ChatGPT performance on the Vicuna benchmark. Paper: Code+Demo: Samples: Colab:
11
73
383
@Tim_Dettmers
Tim Dettmers
2 years
This is the main driving assumption of my research and it is still holding up after 10 years: Humans are not special, scale is. The other main fact (sparsity): Humans are not special, but primates are. Only primates and birds have neurons not proportional to their body size.
@SilverVVulpes
Siberian fox🔸
2 years
People claimed the human brain was special relative to other primates in the size of the temporal lobes, involved in functions such as language. Newer data once again shows that no, the human brain is just a scaled up primate brain
14
42
375
@Tim_Dettmers
Tim Dettmers
4 years
Turns out a lot of open-domain QA datasets have test set leakage. If you control for it, model performance drops by a mean absolute of 63%. Yikes! If we missed this for such a long time, I wonder if there are problems with other NLP datasets too.
4
97
369
@Tim_Dettmers
Tim Dettmers
1 year
@karpathy I have also seen this before. I think it's the psychology of material coming all at once that can be overwhelming for newcomers. If one builds up things bit by bit, there is not this overwhelming feeling of "this is too much; I am not good enough to learn this".
6
6
356
@Tim_Dettmers
Tim Dettmers
2 years
We ran +35,000 zero-shot experiments for our work on k-bit Inference Scaling Laws 📈. A 30B 8-bit and 60B 4-bit LLM have the same model bits/inference latency, but different zero-shot accuracy. What is the best trade-off? The answer is clear: 4-bit is best.
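The "same model bits" comparison is simple arithmetic: total model bits = parameters × bits per parameter.

```python
# 30B parameters at 8 bits vs. 60B parameters at 4 bits: identical total model bits.
params_30b, params_60b = 30e9, 60e9
assert params_30b * 8 == params_60b * 4                        # 2.4e11 bits in both cases
print(params_30b * 8 / 8 / 1e9, "GB of weights either way")    # ~30 GB
```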
6
66
344
@Tim_Dettmers
Tim Dettmers
4 years
A friend asked me for a reference for why we did not increase the frequency further in CPUs and why parallelism was necessary to increase performance. This puts it quite bluntly (from .
3
65
334
@Tim_Dettmers
Tim Dettmers
2 years
Just finished the final update for the RTX 40 GPU blog post:
- Performance/$ now includes Total Cost of Ownership in the cost estimate (computer + 5y electricity)
- Discussion of Async copy vs. TMA
- Small update on FP8 training
- Font and figure improvements
12
57
336
@Tim_Dettmers
Tim Dettmers
10 months
This is excellent work — a big step forward in quantization! It enables full 4-bit matmuls, which can speed up large-batch inference by a lot. Anyone deploying LLMs at scale will soon use this or similar techniques.
@AshkboosSaleh
Saleh Ashkboos
10 months
[1/7] Happy to release 🥕QuaRot, a post-training quantization scheme that enables 4-bit inference of LLMs by removing the outlier features. With @akmohtashami_a @max_croci @DAlistarh @thoefler @jameshensman and others. Paper: Code:
5
56
335
@Tim_Dettmers
Tim Dettmers
11 months
This model can automatically debug CUDA version errors. AGI achieved ✅😂
@cognition_labs
Cognition
11 months
3/4 Devin can train and fine tune its own AI models.
5
26
335
@Tim_Dettmers
Tim Dettmers
7 years
This work presents strong and rigorous evidence that we should abandon RNNs and move on to using convolutions for sequence modeling. I have had similar experiences in other domains such as graph embeddings and knowledge compression. Definitely an important read!
7
77
320
@Tim_Dettmers
Tim Dettmers
2 years
The latest release of bitsandbytes has an improved CUDA setup and A100 4-bit inference. I thought that A40 and A100 GPUs were close enough, and optimized for A40s, but they are very different. A100 performance is now 40% faster with a small hit for other GPUs.
9
53
318
@Tim_Dettmers
Tim Dettmers
6 years
I updated my guide with new GPU recommendations: RTX 2080 is the most cost-efficient choice. GTX 1080/1070 (+Ti) cards remain very good choices, especially as prices drop. Some discussion on TPUs/AWS — they can be good in some cases.
9
123
321
@Tim_Dettmers
Tim Dettmers
2 months
I am very excited to be selected as one of the #AI2050 Early Career Fellows! My research is shifting, and my main focus will be on building open-source AI agents that use dynamic computation to enable powerful AI systems on consumer devices. I am hiring PhD students at CMU!
@schmidtsciences
Schmidt Sciences
2 months
We're thrilled to welcome the 2024 cohort of AI2050 Senior and Early Career Fellows –– 25 visionary researchers tackling AI's toughest challenges to ensure it serves humanity for the better. Learn more about this year's cohort of fellows:
34
58
320
@Tim_Dettmers
Tim Dettmers
2 years
Did some optimizations for Ada/Ampere/Turing for 4-bit inference (bsz=1, arbitrary datatype e.g. NF4). It is now 3.71x, 3.13x, and 1.72x speedup vs 16-bit. The expected max would be 3.55x if NVIDIA kernels were 100% efficient. Will be released on Monday (no code change needed).
9
34
309
@Tim_Dettmers
Tim Dettmers
3 months
All of this means that the paradigm will soon shift from scaling to "what can we do with what we have". I think the paradigm of "how do we help people be more productive with AI" is the best mindset forward. This mindset is about processes and people rather than technology.
14
31
306
@Tim_Dettmers
Tim Dettmers
2 years
Continued pretraining with QLoRA is just around the corner! A second pretraining of models like Falcon-40B in 4-bit would be super-efficient.
@guitaricet
Vlad Lialin
2 years
Parameter-efficient fine-tuning revolutionized the accessibility of LLM fine-tuning, but can they also revolutionize pre-training? We present ReLoRA — the first PEFT method that can be used for training from scratch! 🔥🔥
9
41
302
@Tim_Dettmers
Tim Dettmers
2 years
I never had time to do the proper bitsandbytes 4-bit release. The 0.39 release includes the 4-bit quantization variants and CUDA kernels, paged optimizers, Lion, as well as an important bugfix for a memory leak in 8-bit training/inference.
7
32
292
@Tim_Dettmers
Tim Dettmers
1 year
@typedfemale Yes, it is a big problem. I really want to create a class for machine learning systems that also has an emphasis on CUDA programming for deep learning. So many people were interested in this. I will probably get on this once I finish the faculty application process.
7
7
298
@Tim_Dettmers
Tim Dettmers
1 year
Today, I will give a talk about "The making of QLoRA" at the LLM Efficiency Challenge at 2:30pm, Room 356. I will also talk a bit about how I go about doing research, running experiments and figuring out "what works".
13
24
295
@Tim_Dettmers
Tim Dettmers
2 years
A major bug in 8-bit optimizers that could cause some instabilities later in training has been fixed. Please update bitsandbytes to 0.41.1 via `pip install -U bitsandbytes`. Now 8-bit optimizer should again reproduce 32-bit optimizer performance.
12
46
286
@Tim_Dettmers
Tim Dettmers
4 years
I have been working on 8-bit optimizers, and I am looking for testers for the initial release to test installation and ease of use. Uses up to 63% less GPU memory, faster/stabler training while maintaining performance. Currently, 8-bit Adam and 8-bit Momentum are supported. 1/5
3
52
278
@Tim_Dettmers
Tim Dettmers
6 years
My new work with @LukeZettlemoyer on accelerated training of sparse networks from random weights to dense performance levels — no retraining required! Paper: Blog post: Code:
2
81
271
@Tim_Dettmers
Tim Dettmers
1 year
Bitsandbytes now supports 4-bit store/load of any model. Load in 4-bit via: from_pretrained(name, ..., load_in_4bit=True, device_map='auto'). Then save/push the model to the hub. Get the newest bnb: pip install -U bitsandbytes. Implemented by Ruslan Svirschevski (gh: poedator).
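Roughly, the load-then-save/push flow looks like the sketch below; the model and repo names are placeholders, and newer transformers versions spell the 4-bit options via BitsAndBytesConfig instead of the load_in_4bit flag.

```python
# Sketch: load a model in 4-bit and save/push the 4-bit weights.
from transformers import AutoModelForCausalLM

name = "huggyllama/llama-7b"   # placeholder
model = AutoModelForCausalLM.from_pretrained(name, load_in_4bit=True, device_map="auto")

model.save_pretrained("my-4bit-model")                 # weights stored in 4-bit
# model.push_to_hub("your-username/my-4bit-model")     # or push directly to the Hub
```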
4
37
271
@Tim_Dettmers
Tim Dettmers
2 years
We now have Int8 backprop support for all GPUs for bitsandbytes! Now available via 'pip install bitsandbytes'. This was a contribution from @sasha_borzunov. We will release Int8 fine-tuning for all @huggingface models soon — stay tuned!
5
40
267
@Tim_Dettmers
Tim Dettmers
2 years
The GPU blog post update is 90% done now. I think tomorrow morning, we will have an update! 🚀
7
5
257
@Tim_Dettmers
Tim Dettmers
1 year
We made a QLoRA promo video for @UWITNews. It is a very nice summary of the motivation behind QLoRA and what the environment was like to develop this research. @uwcse is a perfect place for doing such research! Article: YouTube:
3
43
252
@Tim_Dettmers
Tim Dettmers
2 years
@HamelHusain If you wait for another two weeks, we have something nice for you ;) With the right methods you can fine-tune a 30B model on that GPU. A 30B policy with 30B value function also works for RLHF.
16
14
246
@Tim_Dettmers
Tim Dettmers
1 year
@willie_agnew Literally curing cancer. I talked to a biologist who used my methods in conjunction with open models to develop new methods for drug discovery. They developed drugs for previously incurable pediatric cancers. These are real wet-lab in vitro results — it just works.
14
12
241
@Tim_Dettmers
Tim Dettmers
1 year
The 0.42.0 bitsandbytes release adds 4-bit serialization, so you can save/load 4-bit weights directly. Otherwise, there are lots of bug fixes. Thank you, contributors! The next goal is Apple/AMD/Intel and Windows integration. We now have 1.5M installs per month.
10
34
245
@Tim_Dettmers
Tim Dettmers
1 year
An excellent end-to-end guide for finetuning. It has all the details from data prep to deployment. If you want to finetune, this is a great resource to get started.
@_philschmid
Philipp Schmid
1 year
What's the best way to fine-tune open LLMs in 2024? Look no further! 👀 I am excited to share "How to Fine-Tune LLMs in 2024 with Hugging Face" using the latest research techniques, including Flash Attention, Q-LoRA, @OpenAI dataset formats (messages), ChatML, Packing, all built.
3
45
241
@Tim_Dettmers
Tim Dettmers
6 years
This is really great work! For layer 5 pyramidal neurons: a dendritic branch = MLP with 1 layer, 4 units; the entire neuron = MLP with 7 layers, 128 units each. One bio neuron > most MNIST models. We have about 85bn neurons in total and >1tn dendrites — that is a lot of compute!
@DavidBeniaguev
David Beniaguev
6 years
A story of a Cortical Neuron as a Deep Artificial Neural Net:. 1) Neurons in the brain are bombarded with massive synaptic input distributed across a large tree like structure - its dendritic tree. During this bombardment, the tree goes wild. preprint:
6
63
233
@Tim_Dettmers
Tim Dettmers
8 years
I updated my GPU advice blog post with the GTX 1080 Ti; also cleaned it so it is easier to find relevant information
2
98
233
@Tim_Dettmers
Tim Dettmers
3 months
(1) Scaling data centers: This still scales for ~2 years. (2) Scaling through dynamics: Route to smaller specialized models or larger/smaller models. (3) Knowledge distillation: I believe distillation behaves differently than other techniques and might have different properties.
13
20
240
@Tim_Dettmers
Tim Dettmers
4 years
I am a huge fan of einsum notation. Here is a multi-layer transformer in a couple lines of code (without norms though). I think it's simple to read, but whenever I show this to somebody in excitement they do not like it. I am curious: How is that for you? Easy to read or not?
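For readers unfamiliar with einsum-style model code, here is a minimal single-head self-attention sketch in that style. It is illustrative only, not the code from the screenshot; shapes and names (b=batch, t/s=sequence, d=model dim, k=head dim) are assumptions.

```python
# Illustrative einsum-style single-head self-attention (no norms, no masking).
import torch

def einsum_attention(x, Wq, Wk, Wv, Wo):
    q = torch.einsum("btd,dk->btk", x, Wq)
    k = torch.einsum("bsd,dk->bsk", x, Wk)
    v = torch.einsum("bsd,dk->bsk", x, Wv)
    scores = torch.einsum("btk,bsk->bts", q, k) / k.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)
    out = torch.einsum("bts,bsk->btk", attn, v)
    return torch.einsum("btk,kd->btd", out, Wo)

x = torch.randn(2, 16, 64)
Wq, Wk, Wv = (torch.randn(64, 32) for _ in range(3))
Wo = torch.randn(32, 64)
print(einsum_attention(x, Wq, Wk, Wv, Wo).shape)  # torch.Size([2, 16, 64])
```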
29
22
233
@Tim_Dettmers
Tim Dettmers
2 years
FP8 training works well and has large benefits. It has steep networking requirements to achieve good utilization, but there are solutions to that too (we will release one in the coming days). It's a big shift, and everyone with RTX 40 / H100 GPUs should look into FP8 training.
@NamanGoyal21
Naman Goyal
2 years
It's crazy that, at 60% Model FLOPS (FP8) Utilization on H100, the original GPT-3 configuration can be trained in 3 days on 1024 H100s, and PaLM in 12 days on 2048 H100s. That's roughly 50x fewer GPU hours than the GPT-3 paper 3 years back, and 9x fewer than PaLM released 9 months back.
6
20
224
@Tim_Dettmers
Tim Dettmers
3 months
Arguably, most progress in AI came from improvements in computational capabilities, which mainly relied on low-precision for acceleration (32-> 16 -> 8 bit). This is now coming to an end. Together with physical limitations, this creates the perfect storm for the end of scale.
7
9
229
@Tim_Dettmers
Tim Dettmers
2 years
I think it will take another day or two for the full integration, but the kernels (batch size 1) are ready.
5
15
226
@Tim_Dettmers
Tim Dettmers
1 year
If you are merging adapters with QLoRA 4-bit weights, please use the gist below for merging. This will increase the performance of the QLoRA model. I think I have seen a PR on PEFT, so this will soon come to PEFT by default, but for now, it's better to merge it in this way.
@chris_hayduk1
Chris Hayduk
1 year
Just put together a gist for merging QLoRA with the quantized model weights, as mentioned by @Tim_Dettmers @Teknium1 @erhartford, since I know you guys were looking into it. Should be able to quantize the whole thing after this without issue.
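One common way to do such a merge, sketched below: apply the adapter to an unquantized 16-bit copy of the base model and merge there, rather than merging into re-quantized weights. This is a sketch of the general idea, not necessarily the exact procedure in the linked gist; paths and model names are placeholders.

```python
# Sketch: merge a QLoRA adapter into a 16-bit copy of the base model.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "path/to/qlora-adapter").merge_and_unload()
merged.save_pretrained("merged-model")   # can be re-quantized afterwards for deployment
```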
1
24
220
@Tim_Dettmers
Tim Dettmers
2 years
I forgot how much better Guanaco-65B is compared to 33B. You can try here via Petals (globally distributed inference): With Petals, you can also run a 65B model in a colab or locally on a small GPU at ~5 tokens/sec (see below).
@m_ryabinin
Max Ryabinin
2 years
@fernavid @Tim_Dettmers We've updated one just today: this notebook shows you:
- how to run a 65B model from Colab,
- how to plug in adapters between its layers,
- and how to write custom generation methods (you can't do this with an API).
1
36
217
@Tim_Dettmers
Tim Dettmers
2 years
bitsandbytes is on track to surpass half a million pip installs this month! Upcoming features:
- LLM.int8() support for all GPUs
- Int8 backward for fine-tuning
- Fast 4-bit float (FP4) kernels for inference
Always looking for more people to get involved. There is lots to do!
6
15
210
@Tim_Dettmers
Tim Dettmers
2 years
The 0.38.0 release of bitsandbytes introduces:
- 8-bit Lion, which is 8x more memory efficient than standard Adam
- Serialization of 8-bit layers, which allows storing/loading 8-bit models to/from the HF Hub
We are now at half a million installs per month!
5
39
203
@Tim_Dettmers
Tim Dettmers
4 years
I have my first draft right now: 7,000 words. It will be quite a comprehensive post. If you have any more GPU-related questions for me, right now is the last chance for them to be added. I will freeze the draft tonight, rewrite tomorrow, and then publish it on Monday.
15
13
198
@Tim_Dettmers
Tim Dettmers
7 years
I am very excited and proud to announce that I will join UW as a PhD student this fall. I will work with Yejin Choi on common sense knowledge and reasoning. I believe that with common sense, intelligent machines will be able to benefit everyone equally.
12
3
199
@Tim_Dettmers
Tim Dettmers
4 years
Just working on an update to my GPU recommendation blog post. Will cover the new GPUs and focus on the cost-effectiveness of 2/4/8 GPU systems. Any other things that you would like to see discussed?
19
11
193
@Tim_Dettmers
Tim Dettmers
1 year
This model flew under the radar. It has the highest MMLU score of any open-source model. I have not tried it myself, but I am curious how it compares to other models when evaluated across a broad range of tasks. Can somebody give it a try?
@01AI_Yi
Yi-01.AI
1 year
Our team at @01AI_Yi is very proud to introduce the release of the Yi-34B model, now on top of the @huggingface pretrained LLM leaderboard! A Yi-6B is also available. Welcome to give it a try and build fantastic projects!
12
12
194
@Tim_Dettmers
Tim Dettmers
3 years
An important but elusive quality to learn in a PhD is research style. It is valuable to be aware of this before you start a PhD. Among other updates, I added an extensive discussion on research style to my "choosing a grad school" blog post. Enjoy!
4
35
183
@Tim_Dettmers
Tim Dettmers
4 years
Making good progress on the updated GPU recommendation blog post. Have almost all data crunched, and it seems that I will have pretty accurate estimates of performance. Will probably publish it Friday morning. If you have any more questions that I should include, let me know!
12
8
182
@Tim_Dettmers
Tim Dettmers
2 years
Guanaco-33B holds up well. Controlled for the memory footprint, it's the best model. Since it was trained in 4-bit, it uses as much memory as a regular 7B model. The memory needed during fine-tuning is 17x less, so a 7B model is much more expensive to fine-tune than Guanaco 33B.
@lmarena_ai
lmarena.ai (formerly lmsys.org)
2 years
We are excited to announce the first major release of the Chatbot Arena conversation dataset!
- 33K conversations with pairwise human preferences
- 20 SOTA models such as GPT-4, Claude, and LLaMA-based Vicuna
- From 13K unique IPs in the wild
- An additional 3K expert-level
4
24
183
@Tim_Dettmers
Tim Dettmers
2 years
One thing that I care about in bitsandbytes is to provide _broad_ accessibility to LLMs. GPUs up to 9 years old are supported by 4-bit inference in bitsandbytes and you will see good speedups.
@daryl_imagineai
Daryl
2 years
@Tim_Dettmers Wow! This just gave Volta cards a new lease on life: Testing with 4xV100S and a 30B~ model. Got a 3.2x speedup! 7-8 tokens per second is very usable for an interactive chat experience.
4
16
180
@Tim_Dettmers
Tim Dettmers
3 months
Claude even felt better today than usual. I was surprised that it could do things it could not do before. It felt much more nuanced. I tried Claude before reading this, so I thought, "Maybe I just prompted it right" 😂. Now, I think this is just the new model.
@alexalbert__
Alex Albert
3 months
I'm excited to share what we've been working on lately at Anthropic.
- Computer use API
- New Claude 3.5 Sonnet
- Claude 3.5 Haiku
Let's walk through everything:
5
9
179
@Tim_Dettmers
Tim Dettmers
7 years
Deep learning hardware limbo is the battle between @Nvidia vs @AMD vs @IntelNervana for the throne of deep learning hardware. Learn who might win and why #dlearn #nlproc #ai.
6
80
176
@Tim_Dettmers
Tim Dettmers
2 years
Below highlights some problems with QLoRA (I should not have been so smirky 😅), and I wanted to highlight some issues but also resolve some others. We integrated our QLoRA codebase with 5 other open-source codebases before release, and it seems we created some issues along the way 🧵
@Tim_Dettmers
Tim Dettmers
2 years
No cons :).
1
26
179
@Tim_Dettmers
Tim Dettmers
4 years
Going to write another GPU blog post update in the coming days. Are there any GPU questions that you would like to have answered? Will include popular Q&A in the blog post.
28
9
182
@Tim_Dettmers
Tim Dettmers
6 years
A really nice blog post by @agrinh about recent progress in GANs and variational autoencoders. Gives a short overview about GANs and their problems and then dives deep into the newest methods from ICML2018.
0
47
180
@Tim_Dettmers
Tim Dettmers
5 years
After talking to many students about their grad school experience I compiled this blog post on "How to pick your grad school". I discuss all the important factors and details from contrasting but complementary perspectives. I hope it will be helpful!
6
46
178
@Tim_Dettmers
Tim Dettmers
6 months
@CarnegieMellon won me over. It is an amazing place. Highly collaborative, very collegial, close-knit, with excellent students and great support. Looking forward to my time there! I will take 2-3 PhD students for Fall 2025. Please apply to the CMU PhD program to work with me.
8
9
179
@Tim_Dettmers
Tim Dettmers
2 years
Catch my talk on k-bit Inference Scaling Laws at the @ESFoMo workshop, ballroom A (fourth floor), 10:50am. Slides:
2
32
177
@Tim_Dettmers
Tim Dettmers
4 years
We have confirmation that Tensor Cores in RTX 30 GPUs will be limited to make Quadro / Tesla cards more attractive for deep learning. This is the same as in the RTX 20s series. I will update my performance figures later today and will post an update.
7
29
166
@Tim_Dettmers
Tim Dettmers
4 years
This is pretty significant for custom CUDA code. Even with years of CUDA experience, it is very difficult to write peak performance matrix multiplication code. CUTLASS is great, but it seems Triton has better performance, is more customizable, and you can write code in Python.
@OpenAI
OpenAI
4 years
We're releasing Triton 1.0, an open-source Python-like programming language for writing efficient GPU code. OpenAI researchers with no GPU programming experience have used Triton to produce kernels that are 2x faster than their PyTorch equivalents.
4
19
171
@Tim_Dettmers
Tim Dettmers
2 years
I am currently preparing a new GPU blog post updated for the RTX 4090 etc. I am collecting some Q&A questions. If you have any questions that you would like me to answer in the blog post, please leave them here as a comment.
26
9
170
@Tim_Dettmers
Tim Dettmers
3 months
From my own experience (a lot of failed research), you cannot cheat efficiency. If quantization fails, then sparsification fails too, and so do other efficiency mechanisms. If this is true, we are close to optimal now. With this, there are only three ways forward that I see.
1
9
172
@Tim_Dettmers
Tim Dettmers
6 months
The six months on the academic job market were brutal but also very successful. More than 125 individual interviews across 17 universities, leading to 15 job offers. It was a unique experience for which I am very grateful. I will write up my learnings and insights soon.
4
0
172
@Tim_Dettmers
Tim Dettmers
4 months
This is a big deal. You no longer need labels to get good robot performance.
@jang_yoel
Joel Jang
4 months
Excited to introduce LAPA: the first unsupervised pretraining method for Vision-Language-Action models. Outperforms SOTA models trained with ground-truth actions. 30x more efficient than conventional VLA pretraining. 🧵 1/9
2
15
167
@Tim_Dettmers
Tim Dettmers
1 year
@srush_nlp @4evaBehindSOTA Regular transformers are notoriously difficult to sparsify; this is even true for the FFN layers in MoE transformers. But MoE layers are very different. You can also quantize them to 1 bit without any problem, but sparsification gives you better memory benefits than 1-bit quantization.
5
22
166
@Tim_Dettmers
Tim Dettmers
2 years
Just pushed a major CUDA-related update to pip for bnb. I need feedback because it's so difficult to test CUDA envs. It will either fix 90% of all CUDA issues, or fix 90% of issues and create many new ones 🫠. Please let me know if it works for you. I am ready to hotfix things.
7
14
165
@Tim_Dettmers
Tim Dettmers
6 years
It seems that first data suggest that the RTX 2080 Ti deep learning performance is very close to Titan V performance. Also key facts: Tensor Cores are programmable and NVLink can be used for data (+50GB/s). NVLink makes PCIe lanes obsolete for parallelism.
8
40
165
@Tim_Dettmers
Tim Dettmers
4 months
Now you can use bitsandbytes on AMD GPUs and Intel hardware. This is a big milestone and was a huge undertaking. @Titus_vK did an amazing job here. Eager to hear feedback! Let us know how it works for you.
@Titus_vK
Titus von Koeller
4 months
🚀 Big news! After months of hard work and incredible community contributions, we're thrilled to announce the bitsandbytes multi-backend alpha release! 💥 Now supporting:
- 🔥 AMD GPUs (ROCm)
- ⚡ Intel CPUs & GPUs
(1/2)
1
19
163
@Tim_Dettmers
Tim Dettmers
2 years
Catch my posters today:
SWARM parallelism (fault-tolerant, globally distributed): 11am, slot 217
k-bit Inference Scaling Laws (foundation of QLoRA and SpQR): 2pm, slot 824
0
34
163
@Tim_Dettmers
Tim Dettmers
1 month
Training in low precision looks good ... until it doesn't. People should be more aware of the following work, which basically says that low-precision training will not work at scale:
@_brickner
Will
1 month
wrote a paper: it lets you *train* in 1.58b! could use 97% less energy, 90% less weight memory. leads to a new model format which can store a 175B model in ~20mb. also, no backprop!
7
11
164