Samuel Müller

@SamuelMullr

Followers: 1,088
Following: 403
Media: 27
Statuses: 342

Deep Learning PhD Student focused on (Tab)PFNs, supervised by @FrankRHutter. Ex-DeepL, Ex-Amazon. ETH BSc, Cambridge MPhil. Opinions are my own. (he/him)

Berlin
Joined February 2020
Pinned Tweet
@SamuelMullr
Samuel Müller
10 months
There is a paper by Google trending right now that claims transformer in-context learning cannot generalize between two function classes. I have reproduced their experiment in a Colab and come to a very different conclusion...
Tweet media one
12
142
1K
@SamuelMullr
Samuel Müller
3 years
Transformers Can Do Bayesian Inference (, ICLR '22): We show how to train Transformers to (pretty exactly) approximate Bayesian predictions in a single forward pass for any prior you can sample from. This yields >200x speedups compared to VI and MCMC. (1/n)
2
32
156
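For readers wondering what "train on samples from a prior" looks like in code, here is a minimal PyTorch-style sketch of the PFN training idea. The prior sampler, the model sizes, and the MSE loss are illustrative stand-ins, not the paper's exact setup (the paper scores the full predictive distribution on the hold-out points).

```python
import torch
import torch.nn as nn

# Hypothetical prior: noisy linear functions. Any prior you can sample from works.
def sample_dataset_from_prior(n_points=50):
    w = torch.randn(2)                                  # random slope and offset
    x = torch.rand(n_points, 1) * 10 - 5
    y = w[0] * x + w[1] + 0.1 * torch.randn_like(x)
    return x, y

class TinyPFN(nn.Module):
    """Toy PFN-style model: encode (x, y) context pairs and x-only queries,
    run a transformer over all tokens, and read off predictions for the queries."""
    def __init__(self, d_model=256, n_layers=4):
        super().__init__()
        self.enc_xy = nn.Linear(2, d_model)
        self.enc_x = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x_ctx, y_ctx, x_qry):
        ctx = self.enc_xy(torch.cat([x_ctx, y_ctx], dim=-1))
        qry = self.enc_x(x_qry)
        h = self.backbone(torch.cat([ctx, qry], dim=1))  # real PFNs mask so queries only attend to the context
        return self.head(h[:, ctx.shape[1]:])            # predictions for the query positions

model = TinyPFN()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for step in range(100):                                  # every step sees a freshly sampled dataset
    x, y = sample_dataset_from_prior()
    pred = model(x[:25][None], y[:25][None], x[25:][None])
    loss = ((pred - y[25:][None]) ** 2).mean()           # stand-in loss; the paper uses a proper scoring rule
    opt.zero_grad(); loss.backward(); opt.step()
```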
@SamuelMullr
Samuel Müller
1 year
@EvMill This trick is not new; it is even part of the standard torch implementation of multi-head attention. The option is called add_zero_attn. They add a zero to the logits, resulting in a one in the denominator, as e^0 = 1.
5
11
151
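For anyone who wants to see the arithmetic: appending a single zero logit before the softmax adds e^0 = 1 to the denominator, so the weights over the real keys no longer have to sum to one. A tiny self-contained demo (not the torch internals themselves):

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5])          # attention logits for the real keys

plain = torch.softmax(logits, dim=0)             # sums to 1 over the real keys
with_zero = torch.softmax(torch.cat([logits, torch.zeros(1)]), dim=0)

print(plain.sum())            # tensor(1.)
print(with_zero[:-1].sum())   # < 1: the extra e^0 = 1 lets the head "attend to nothing"
```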
@SamuelMullr
Samuel Müller
10 months
There are two important things to note here: i) This experiment is very reproducible. ii) This fits with the interpretation of in-context learning as an approximation to the posterior predictive distribution (PPD) that we push in our PFN work, see
3
12
135
@SamuelMullr
Samuel Müller
10 months
And my super small transformer (4 layers, 256 embedding size) was able to generalize to a sloped sine after 5 minutes of training on a single colab GPU. Here are its predictions:
Tweet media one
3
6
111
@SamuelMullr
Samuel Müller
10 months
The conclusion: think about in-context learning as approximating the PPD and think about your training data as a prior, like we do in our (Tab)PFN work. Have fun with my colab: For more background see our 2021 paper:
2
11
111
@SamuelMullr
Samuel Müller
10 months
They () claim that a transformer trained to predict hold-out data from datasets sampled from either sine curves or sloped lines can't generalize to predict a sloped sine.
1
5
77
@SamuelMullr
Samuel Müller
10 months
This interpretation requires that there is some probability that the data seen during in-context learning comes from either distribution, the sine curves or the sloped lines. I made sure this is the case by adding noise to all data points and to the sine; see the sample.
3
2
64
@SamuelMullr
Samuel Müller
10 months
I tried it out and trained on data like they describe. Here is a sample of my training data.
Tweet media one
1
3
64
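A rough sketch of how such training data can be generated: each curve comes from either the sine class or the sloped-line class, and noise is added to every point. The amplitude/phase/slope ranges are illustrative, not the exact ones from the paper or the Colab.

```python
import torch

def sample_curve(n_points=64, noise_std=0.1):
    """Draw one training curve: either a sine or a sloped line (illustrative ranges)."""
    x = torch.rand(n_points) * 10 - 5
    if torch.rand(()) < 0.5:
        amp, phase = torch.rand(()) * 2 + 0.5, torch.rand(()) * 6.28
        y = amp * torch.sin(x + phase)
    else:
        slope, offset = torch.randn(()), torch.randn(())
        y = slope * x + offset
    return x, y + noise_std * torch.randn(n_points)   # noise gives both classes overlapping support

# The probe at test time is a function from *neither* class: a sloped sine.
x_test = torch.linspace(-5, 5, 64)
y_test = 0.5 * x_test + torch.sin(x_test)
```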
@SamuelMullr
Samuel Müller
10 months
Now our predictions become worse... but there still seems to be some generalization for some reason...
Tweet media one
1
3
61
@SamuelMullr
Samuel Müller
10 months
If we do not add noise, and thus make a sloped sine have 0 probability, the PPD is actually not defined. Our training data then looks like this:
Tweet media one
1
2
57
@SamuelMullr
Samuel Müller
10 months
@graduatedescent what would be your take on this? Why do you think we saw such different outcomes?
1
1
48
@SamuelMullr
Samuel Müller
2 years
Yeah 🎉 Best paper award for TabPFN ☺️
@TrlWorkshop
Table Representation Learning @NeurIPS
2 years
We congratulate @noahholl @SamuelMullr @KEggensperger @frankhutter1 for receiving the best-paper award and @dja_vos @tdoehmen @sscdotopen for the best-paper runner-up award. Check out their great work! Best-paper: . Runner-up: . 4/6
1
3
12
2
1
31
@SamuelMullr
Samuel Müller
2 years
@danijarh @FrankRHutter [Author here] We actually tried this and, to our surprise, found that the TabPFN does generalize to longer sequences than it was trained on. Figure 9 in the Appendix (…)
Tweet media one
2
0
32
@SamuelMullr
Samuel Müller
3 years
I am happy to announce that TrivialAugment (paper: , blog: ) got accepted as an Oral at ICCV '21. We show that a baseline that is almost too simple to be true outperforms most of the automatic augmentation methods out there. (1/4)
2
2
30
@SamuelMullr
Samuel Müller
10 months
@LaurenceBrem I think this is not very special tbh, and standard MLPs can do it, too, in many instances. For this you might want to check out the work on MLP-Mixers :)
0
1
27
@SamuelMullr
Samuel Müller
2 years
I am very proud of our achievements in this paper towards learning an algorithm that solves real-world tabular classification problems. This might be the first step towards a new category of (Bayesian) classification algorithms that are super fast and accurate. 🚀
@FrankRHutter
Frank Hutter
2 years
This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes 1 second & yields SOTA performance (better than hyperparameter-optimized gradient boosting in 1h). Current limits: up to 1k data points, 100 features, 10 classes. 🧵1/6
Tweet media one
113
779
4K
3
3
27
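For anyone who wants to try it, a usage sketch assuming the released `tabpfn` package and its scikit-learn-style interface (install and argument names may differ between versions):

```python
# pip install tabpfn
from tabpfn import TabPFNClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier(device="cpu")   # "fit" essentially just stores the data ...
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:3])   # ... prediction is a single forward pass over train + test
```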
@SamuelMullr
Samuel Müller
3 years
The TrivialAugment Oral for the ICCV is online!! 🎉 Yes, of course I sing my presentation to the tune of "Hey Jude" 😉😂 Join us in session 1, you can also watch the entire talk on . Joint work with @FrankRHutter : .
2
6
22
@SamuelMullr
Samuel Müller
1 year
Can we replace the Gaussian Process in BayesOpt with in-context learning? Most certainly. We attain strong real-world performance on a variety of benchmarks with a PFN, which only uses in-context learning to yield acquisition values. A 🧵on our ICML paper
Tweet media one
1
6
21
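A conceptual sketch of what "acquisition values from in-context learning" means: the PFN gets the observed (x, y) history plus candidate points in its context and returns a discretized predictive distribution, from which an acquisition function such as expected improvement is a simple sum. The `pfn` call and the bucket grid below are hypothetical stand-ins, not the pfns4bo API.

```python
import torch

def expected_improvement(pfn, x_obs, y_obs, x_cand, y_grid):
    """Hypothetical PFN-based acquisition: one forward pass yields a discretized
    predictive distribution p(y | x_cand, history); EI is a sum over the grid."""
    probs = pfn(x_obs, y_obs, x_cand)            # (n_cand, n_buckets) softmax output -- stand-in call
    best = y_obs.max()                           # incumbent value
    improvement = (y_grid - best).clamp(min=0)   # improvement of each bucket over the incumbent
    return (probs * improvement).sum(dim=-1)     # expectation under the predictive distribution
```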
@SamuelMullr
Samuel Müller
2 years
@JulienMouchnino @FrankRHutter Both together. The TabPFN is a neural network that accepts a training set of (x,y) pairs and a set of test inputs x and predicts the y for all test inputs at once. The whole process (training + testing) is a single forward-pass.
3
0
19
@SamuelMullr
Samuel Müller
2 years
What about changing neural networks to make hyper-parameters easy to tune instead of finding good tuning methods? That is what @TheGregYang et al propose in their new paper . Let's dive into it and try it out. 1/n
2
1
16
@SamuelMullr
Samuel Müller
2 years
@rasbt @PyTorch Check out TrivialAugment, if you are at it. (It has *no* hyperparameters.)
1
0
17
@SamuelMullr
Samuel Müller
3 years
I am happy to announce that the TrivialAugment algorithm () is part of the newest version of torchvision, released yesterday. For more info, see . It does not have hyper-parameters and usually outperformed RandAugment in our tests.
2
3
16
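Usage is a one-liner, assuming the `TrivialAugmentWide` transform name that torchvision ships:

```python
import torchvision.transforms as T

# TrivialAugment has no hyperparameters to tune -- just drop it into the pipeline.
train_transform = T.Compose([
    T.TrivialAugmentWide(),
    T.ToTensor(),
])
```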
@SamuelMullr
Samuel Müller
1 year
@swati1729 Cold emailing can work wonders for you, too, though :)
3
0
14
@SamuelMullr
Samuel Müller
10 months
Looking forward to the AutoML Fall School, where I will give a tutorial on the next generation of AutoML and its connection to in-context learning. We will train our own little (Tab)PFN :)
@AutoML_org
AutoML.org
10 months
🚨 Only 1 week left to secure your spot at the #AutoMLFallSchool in Munich! Don't miss out on 6 keynotes, 7 hands-on tutorials and a poster session, all on #AutoML 📚💡 Register now at for one of the few remaining tickets.
2
6
12
0
3
13
@SamuelMullr
Samuel Müller
2 years
@ilyaraz2 Hi there :) would you mind sharing your setup? We would like to look into that!
0
0
10
@SamuelMullr
Samuel Müller
1 year
Very proud to be part of this cool work. We could use GPT-4 to automatically do feature engineering for us. This improved scores a lot (>2% on average) and led to very reasonable extra features. 🎉 We have open-sourced everything; feel free to try it out on your data :)
@FrankRHutter
Frank Hutter
1 year
#GPT meets #AutoML : in an effort to integrate user knowledge into AutoML, our new tool CAAFE uses LLMs to generate semantically meaningful features for tabular data (and also explains them). Towards an AI assistant for human data scientists🚀 Paper Demo
Tweet media one
2
9
48
3
1
10
@SamuelMullr
Samuel Müller
2 years
@predict_addict @FrankRHutter [Author here] We only tested calibration for smaller models, so I am not sure about the final model. In our previous paper (…), we found good calibration for PFNs trained on a similar prior in Table 1.
1
0
9
@SamuelMullr
Samuel Müller
2 years
@lnsmith613 @FrankRHutter A direct link to the paper: :)
1
1
8
@SamuelMullr
Samuel Müller
2 years
@WvanAmsterdam @FrankRHutter [Author here] We only tested calibration for smaller models, so I am not sure about the final model. In our previous paper (), we found good calibration for PFNs trained on a similar prior in Table 1.
1
0
7
@SamuelMullr
Samuel Müller
2 years
@m_elantkowski @FrankRHutter @noahholl @KEggensperger [Author here] 1. The prior that generates the synthetic datasets is chosen to be very broad. Additionally, we take care to normalize data before feeding it to the TabPFN for prediction. Thus, we believe most real-world datasets are in-distribution. 2. Yes! We are on it ;)
2
0
6
@SamuelMullr
Samuel Müller
3 years
If this thread caught your interest and you wonder “How do I train these models!?”, check out our paper at . This is joint work and I want to thank the great team consisting of @noahholl , Sebastian, Josif and @FrankRHutter . (7/n)
0
1
7
@SamuelMullr
Samuel Müller
2 years
Just felt outcompeted by an algorithm on my job for the first time. It was GitHub's Copilot. I couldn't come up with a nice way to write a function that manipulates all parameters of a neural net in PyTorch while treating embeddings differently. 1/n
1
1
7
@SamuelMullr
Samuel Müller
2 years
@ArnaudovKrum @FrankRHutter [Author here] There are a few reasons we focused on small datasets first. The biggest are: i) Training is cheaper on small datasets (we train on around 500M artificial datasets) ii) GPU memory is limited and transformers scale quadratically with the number of inputs.
3
0
7
@SamuelMullr
Samuel Müller
2 years
@tobias_sterbak @FrankRHutter [Author here] There are a few reasons we focused on small datasets first. The biggest are: i) Training is cheaper on small datasets ii) GPU memory is limited and transformers scale quadratically with the number of inputs. Both are limitations we are working on. 😊
0
0
6
@SamuelMullr
Samuel Müller
3 years
🤗
@abidlabs
Abubakar Abid
3 years
Props to the authors of "Transformers Can Do Bayesian Inference" for releasing a demo of their model along with the code! Great for accessibility and reproducibility in machine learning 👏 Paper: Demo:
Tweet media one
Tweet media two
0
84
331
0
0
6
@SamuelMullr
Samuel Müller
2 years
@pradanadimass @FrankRHutter [Author here] Transformers do not require a lot of memory at inference time, as nothing has to be saved for backprop. We did not try it, but I expect about a GB of memory is enough to make predictions using the TabPFN.
0
0
6
@SamuelMullr
Samuel Müller
2 years
@francoisfleuret torch.ones_like(x[:,:1]) maybe this?
1
0
6
@SamuelMullr
Samuel Müller
1 year
@hgoldstein95 @swati1729 Totally, but the original tweet is horrible advice for young people without contacts. She also pushes the expectation of the PhD (publish papers, visit conferences) before the PhD, which is toxic bullshit.
2
0
6
@SamuelMullr
Samuel Müller
2 years
@hbouammar I don't think this is true. An inner product (the simplest neural network) cannot be represented by a finite decision tree. One would need to map every input to a different leaf. No?
1
0
6
@SamuelMullr
Samuel Müller
1 year
Exciting! Seems to be comparable to ChatGPT for non-coding tasks but is available for download with a permissive license 🏋️
@_akhaliq
AK
1 year
Meta releases Llama 2: Open Foundation and Fine-Tuned Chat Models paper: blog: develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion
Tweet media one
35
569
2K
1
2
6
@SamuelMullr
Samuel Müller
3 years
@michalwols @giffmana @karpathy @PreetumNakkiran @francoisfleuret This is actually something I ran into with Jax also, just never with PyTorch.
1
0
5
@SamuelMullr
Samuel Müller
3 years
We created a demo of the experiment above, so that you can play around with it yourself and thoroughly test our model. (3/n)
1
0
5
@SamuelMullr
Samuel Müller
3 years
As an example, we show how our method can approximate GPs. In the plot below, we compare the exact GP posterior (🟩) to the Transformer approximating it (🟦). We marked two differences with arrows for easier distinction. This is how exact our approximation is here. (2/n)
Tweet media one
1
0
5
@SamuelMullr
Samuel Müller
3 years
I don't usually ski, but when I do, I do it like an NN trained with SGD+Momentum.
1
0
5
@SamuelMullr
Samuel Müller
8 months
Synthetic data rules. A model trained on randomly sampled data from a rather simple prior can perform almost as strongly as a math olympiad winner at writing geometry proofs, as a new Nature paper shows... (1/n)
1
0
5
@SamuelMullr
Samuel Müller
1 year
Kigali was clean, safe, fun and had good weather. What a great place to have a conference at. 🌴🛵🐆 Thank you @iclr_conf for showing us Rwanda.
0
0
5
@SamuelMullr
Samuel Müller
9 months
@marktenenholtz A TabPFN update is coming! 😌
0
0
4
@SamuelMullr
Samuel Müller
10 months
@abacaj I think this might not be true; a shameless plug for my Colab notebook so you can try it out yourself :)
@SamuelMullr
Samuel Müller
10 months
There is a paper by Google trending right now that claims transformer in-context learning cannot generalize between two function classes. I have reproduced their experiment in a Colab and come to a very different conclusion...
Tweet media one
12
142
1K
0
1
5
@SamuelMullr
Samuel Müller
2 years
Great 🙏 that’s enough for this year! I can stop working on my neurips submission then.. 😅
@srush_nlp
Sasha Rush
2 years
Samuel Müller ( @SamuelMullr ) has extremely impressive pytorch skills.
Tweet media one
1
0
2
1
0
5
@SamuelMullr
Samuel Müller
3 years
Since our method allows for such flexible priors, we developed a BNN prior that not only includes a distribution over weights, but architectures, too. Something that is easy with our method, but hard to do with VI or MCMC. (4/n)
Tweet media one
1
0
5
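What "a prior over weights and architectures you can sample from" can look like, as a rough sketch; the depth/width ranges and the noise level are illustrative, not the paper's actual prior.

```python
import torch
import torch.nn as nn

def sample_dataset_from_bnn_prior(n_points=100, n_features=3):
    """Sample an architecture, then weights, then a dataset generated by that network."""
    depth = int(torch.randint(1, 4, ()))                 # prior over architectures
    width = int(torch.randint(8, 65, ()))
    layers, d_in = [], n_features
    for _ in range(depth):
        layers += [nn.Linear(d_in, width), nn.Tanh()]
        d_in = width
    layers.append(nn.Linear(d_in, 1))
    net = nn.Sequential(*layers)                         # prior over weights = the default init

    x = torch.randn(n_points, n_features)
    with torch.no_grad():
        y = net(x) + 0.05 * torch.randn(n_points, 1)     # noisy targets from the sampled BNN
    return x, y
```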
@SamuelMullr
Samuel Müller
1 year
@DimitrisPapail I think they can actually learn algorithms. Check out the TabPFN paper, we train a transformer to learn to do tabular prediction on unseen datasets, the model is only trained on artificial data.
1
0
5
@SamuelMullr
Samuel Müller
2 years
@ChristophMolnar [Author here] Yes it is different from traditional fit and predict. The setup is more similar to meta-learning, but on artificial data. This has the side-effect that we are approximating a Bayesian solution, actually.
0
0
4
@SamuelMullr
Samuel Müller
2 years
RLHF-inspired training for vision. First, train a standard model with CE. Then fine-tune with RL on the non-differentiable metric you actually care about.
@__kolesnikov__
Alexander Kolesnikov
2 years
Vision meets RL! We reveal that policy gradient can be used for tuning vision models to optimize complex metrics, such as mAP, PQ or “color diversity”, observing large performance boosts on tasks like object detection, panoptic segmentation, etc.
Tweet media one
4
131
641
0
0
4
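A skeleton of that recipe under the usual REINFORCE formulation: sample predictions from the model, score them with the non-differentiable metric, and weight the log-probabilities by the baseline-subtracted reward. `compute_metric` is a placeholder for mAP, PQ, or whatever you care about; the paper's exact estimator may differ.

```python
import torch

def reinforce_step(model, optimizer, images, targets, compute_metric):
    """One policy-gradient step on a non-differentiable reward (REINFORCE sketch)."""
    logits = model(images)                                    # (batch, n_classes)
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()                                   # sample predictions from the model
    reward = compute_metric(sampled, targets)                 # (batch,) non-differentiable scores
    baseline = reward.mean()                                  # simple variance-reduction baseline
    loss = -((reward - baseline) * dist.log_prob(sampled)).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward.mean().item()
```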
@SamuelMullr
Samuel Müller
1 year
Reading papers that write out ideas, you thought in the direction of too always gives me such a warm feeling 🥰 Just had this with where they try out whether some of the recently proposed extra training methodology actually pays off, it mostly doesn’t :(
0
0
4
@SamuelMullr
Samuel Müller
2 years
@pradanadimass @JulienMouchnino @FrankRHutter The method can be seen as similar to in-context learning with LLMs like GPT-3.
1
0
4
@SamuelMullr
Samuel Müller
3 years
The method is so simple, I can describe it in less than a tweet: Given an image, you randomly sample an augmentation from a set of augmentations. Now you sample uniformly at random how strongly to apply this augmentation. You apply it with that strength. That is it. ✔️ (3/4)
1
0
4
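The whole method, written out as a sketch. The three PIL operations and the strength range are placeholders; the real implementation uses the augmentation set from the paper.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

def trivial_augment(img: Image.Image) -> Image.Image:
    """Sample one augmentation, sample its strength uniformly, apply it. That is it."""
    ops = [
        lambda im, m: im.rotate(30 * m),                         # placeholder op set
        lambda im, m: ImageEnhance.Contrast(im).enhance(1 + m),
        lambda im, m: ImageOps.posterize(im, max(1, 8 - int(7 * m))),
    ]
    op = random.choice(ops)              # one augmentation, uniformly at random
    strength = random.uniform(0.0, 1.0)  # one strength, uniformly at random
    return op(img, strength)
```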
@SamuelMullr
Samuel Müller
2 years
@hbouammar On current computers this can of course be seen as true, as our floats are discrete. I don't think that is what is meant, though.
1
0
4
@SamuelMullr
Samuel Müller
3 years
Using this prior, we built a probabilistic model that beats standard baselines, like CatBoost or XGBoost, on small tabular datasets in terms of ROC AUC, ECE and speed at once. PFN-BNN is our model with the prior presented above, PFN-GP is our model approximating a GP. (5/n)
Tweet media one
1
0
4
@SamuelMullr
Samuel Müller
11 months
@francoisfleuret Kinda second-law-of-thermodynamics dynamics, right?
2
0
3
@SamuelMullr
Samuel Müller
3 years
You want to know how they created that nice table/figure in that other paper? arXiv lets you simply access the LaTeX source code of papers to find out. i) Click "Other formats". ii) Click "Download source". iii) Unzip in the terminal, or add ".tar.gz" as the extension and unzip.
Tweet media one
Tweet media two
0
0
4
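The same steps from Python, for convenience. This assumes arXiv's e-print endpoint serves a .tar.gz (some very old submissions are a single gzipped .tex instead), and the paper ID is just an example.

```python
import tarfile
import urllib.request

paper_id = "1706.03762"   # e.g. "Attention Is All You Need"; substitute the paper you care about
urllib.request.urlretrieve(f"https://arxiv.org/e-print/{paper_id}", "source.tar.gz")

with tarfile.open("source.tar.gz") as tar:   # the LaTeX sources, figures, and .bbl land here
    tar.extractall("source")
```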
@SamuelMullr
Samuel Müller
3 years
@karpathy This will be a viable direction. Randomised simulations can implicitly generate a prior. We can show that meta-learning on such data yields a posterior approximation for this prior, see So it is all still Bayesian! (shameless plug end)
0
0
4
@SamuelMullr
Samuel Müller
1 year
We also show how to build the knowledge gradient acquisition function straight into the neural network for lookahead optimization. If you want to try out our models, simply `pip install pfns4bo`. Our documentation and training code can be found at 4/6
1
1
3
@SamuelMullr
Samuel Müller
1 year
@SuryaGanguli @AllanRaventos @mansiege @FCHEN_AI Looking forward to reading the paper and understanding exactly what you mean by becoming better than bayes-optimal with task diversity. Sounds great! I think we actually demonstrated ICL out-of-dist performance in the real-world on tabular tasks with .
1
0
3
@SamuelMullr
Samuel Müller
3 years
Always good to remember: Even if ML will eat the world, it might be so automatic that it’ll eat our ML Engineer/Scientist jobs, too. 🧑🏻‍💻🥢
1
0
3
@SamuelMullr
Samuel Müller
3 years
Finally, due to how we train our model, we can fine-tune it on real-world data when meta-datasets are available. We created a prior for hand-written letters and then fine-tuned on Omniglot to be competitive with the state of the art in this setup. (6/n)
Tweet media one
1
0
3
@SamuelMullr
Samuel Müller
2 years
A year ago I thought I needed to know more neuroscience to be a better ML researcher. With the LLM advances, I believe I am rather lacking pedagogy knowledge..
2
0
3
@SamuelMullr
Samuel Müller
2 years
While this idea was based on "knowledge" (it seems to know PyTorch better than me), all ideas are somewhat based on knowledge. I can see how AI evolves to outperform me on more coding tasks. BTW, if you use Copilot, don't forget to write docstrings and comments to help it.
0
0
3
@SamuelMullr
Samuel Müller
1 year
Our models are trained only on data drawn from a prior, i.e. we did not fine tune on any real-world data. Since our models are simple transformers under the hood, we can use them very flexibly, e.g. we use two priors that were impossible to model with previous methods. 2/6
Tweet media one
1
1
3
@SamuelMullr
Samuel Müller
9 months
@BlackHC @zacharylipton Haha they are but I think this is actually one where the Bayesians are right and the connection is not very far fetched:
0
0
3
@SamuelMullr
Samuel Müller
8 months
They just generate random geometry settings and random statements. Then they let a transformer learn to do proofs by treating the paths to the random statements as proof sequences.
1
0
2
@SamuelMullr
Samuel Müller
1 year
We even show how to allow users to change the prior of a trained model post hoc, by feeding the model a prior over the location of the optimum to integrate human expert knowledge. If the user sets a reasonable prior, this can substantially improve performance. 3/6
Tweet media one
1
1
3
@SamuelMullr
Samuel Müller
2 years
@ArashVahdat Reviewers actually can't answer your emails, as they are delivered from a no reply email. :(
0
0
3
@SamuelMullr
Samuel Müller
2 years
For a similar technique, SWA (), the authors lay out the intuition that these averages lead to choosing a flatter region of the local optimum, which thus generalizes better. They saw test accuracy increase and train accuracy decrease, i.e. less overfitting.
1
0
3
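For reference, this is roughly how SWA looks with the utilities that ship in torch.optim.swa_utils; the epochs, swa_start, and swa_lr values are arbitrary placeholders.

```python
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, optimizer, loader, loss_fn, epochs=20, swa_start=15):
    """Plain training first, then keep a running average of the weights (SWA)."""
    swa_model = AveragedModel(model)
    swa_scheduler = SWALR(optimizer, swa_lr=0.05)
    for epoch in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)   # average in the current weights
            swa_scheduler.step()
    update_bn(loader, swa_model)                 # recompute BatchNorm stats for the averaged weights
    return swa_model
```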
@SamuelMullr
Samuel Müller
10 months
@graduatedescent Re sine frequencies: I only trained on one, but that arguably makes it even harder!? But yeah, I overlooked that.. Re architecture: Yeah, right! I did not use pos embs. I wonder whether they actually destroy it all. I guess for length generalization it is well known that they are bad. 🤔
0
0
2
@SamuelMullr
Samuel Müller
1 year
@marktenenholtz In a one epoch setting everything is the validation setting, right?
1
0
3
@SamuelMullr
Samuel Müller
2 years
I was about to discern embeddings by their shape... ugly 🤢 Then Copilot chimed in and suggested a perfect loop over `model.named_parameters()`. This has a new quality to me: Copilot had an idea I didn't have.
1
0
3
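The actual Copilot suggestion was never posted, but one plausible version of such a loop over named parameters, here splitting embeddings out of weight decay, looks like this:

```python
import torch.nn as nn

def param_groups(model: nn.Module, weight_decay: float = 0.01):
    """Split parameters by name so embeddings are treated differently (here: no weight decay)."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (no_decay if "embedding" in name.lower() else decay).append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# optimizer = torch.optim.AdamW(param_groups(model))
```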
@SamuelMullr
Samuel Müller
2 years
@rasbt @FrankRHutter 1. We use datasets with up to 2K examples, as we use a 50/50 train/val split. 2. Categoricals are simply treated as scalars (with integer values). A training example is treated as a single vector, in this case of size 5, which is encoded with a simple linear layer.
0
0
3
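A minimal sketch of the encoding described above; the feature values and the embedding width are made up.

```python
import torch
import torch.nn as nn

# One training example = one vector of scalar features (categoricals as integer codes).
row = torch.tensor([5.1, 3.5, 1.4, 0.2, 2.0])   # 4 numeric features + 1 categorical code

encoder = nn.Linear(5, 512)                      # a simple linear layer produces the token embedding
token = encoder(row)                             # this token is what the transformer attends over
```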
@SamuelMullr
Samuel Müller
3 years
@vlastelicap @_arohan_ @giffmana Not significant. The usual rule of thumb I know is >= 1 BLEU is considered significant.
0
0
3
@SamuelMullr
Samuel Müller
3 years
Feeling cute to work in a field where leading works have names like BERT, DrNAS, RoBERTa and YOLOv3. ☺️
0
0
3
@SamuelMullr
Samuel Müller
2 years
You need to comfort the LLM, you need to ask it to think step-by-step.
0
0
3
@SamuelMullr
Samuel Müller
4 years
@fhuszar There is not even a paper, only a number.
0
0
3
@SamuelMullr
Samuel Müller
3 years
Unlike previous methods, we only apply a single augmentation, but we use varying strengths. See the paper or blog for more. We even open-sourced it as a mini-library: . A big thank you to my supervisor, @FrankRHutter, for the collaboration on this project. (4/4)
1
0
3
@SamuelMullr
Samuel Müller
1 year
Thank you @__mfeurer__, @noahholl, and @FrankRHutter for being amazing colleagues and co-authors! If this caught your interest and you have questions, please feel free to reach out. (6/6)
0
1
3
@SamuelMullr
Samuel Müller
8 months
@giffmana @cgarciae88 @predict_addict @kaggle @tunguz Mmh I do not know any work that does that for tabular
0
0
3
@SamuelMullr
Samuel Müller
2 years
What if we had a version with down-sampled images for each paper on arxiv? Download times recently exploded for me for vision papers. This would surely improve accessibility and could probably be done automatically by @arxiv .
0
0
3
@SamuelMullr
Samuel Müller
4 years
@mkhoury @will_lawrenceTO @Austen I agree, but that is clearly a mistake of YouTube. I use like 10 apps on my Apple TV and YouTube is the only one I struggle with.
0
0
3
@SamuelMullr
Samuel Müller
10 months
@martin_casado I think this might not be true... shameless plug...
@SamuelMullr
Samuel Müller
10 months
There is a paper by Google trending right now that claims transformer in-context learning cannot generalize between two function classes. I have reproduced their experiment in a Colab and come to a very different conclusion...
Tweet media one
12
142
1K
1
0
2
@SamuelMullr
Samuel Müller
10 months
@chiefluddite This might actually not pan out; I tried it out and it did generalize...
@SamuelMullr
Samuel Müller
10 months
There is a paper by Google trending right now that claims transformer in-context learning cannot generalize between two function classes. I have reproduced their experiment in a Colab and come to a very different conclusion...
Tweet media one
12
142
1K
0
0
2
@SamuelMullr
Samuel Müller
3 years
@skoularidou I see! In my feed your post was stuffed between ML posts, so I had the wrong context, I guess. 😅
0
0
2
@SamuelMullr
Samuel Müller
1 year
@fchollet Isn't torch just clearly the best? Can't Google allow its employees to also switch to it, as it is real OSS now? I don't think Google's exceptionalism helps research, as it reduces reproducibility a lot. A step in the right direction is Keras support for torch as a backend, I think.
2
0
2
@SamuelMullr
Samuel Müller
1 year
Tweet media one
0
0
2
@SamuelMullr
Samuel Müller
10 months
@GaryMarcus For me it just generalized out of the box, you can try it yourself. Everything is public
@SamuelMullr
Samuel Müller
10 months
There is a paper by Google trending right now that claims transformer in-context learning cannot generalize between two function classes. I have reproduced their experiment in a Colab and come to a very different conclusion...
Tweet media one
12
142
1K
0
0
1
@SamuelMullr
Samuel Müller
3 years
@giffmana @samsamoa Thanks for the insight! I just realized that either I am overlooking it or you didn't mention the pre-training peak LR in your paper. So, you just used peak LR = 8e-4 in all experiments? That is beautifully simple :)
1
0
2
@SamuelMullr
Samuel Müller
10 months
@graduatedescent Re overlap: Agreed. That is my main finding, too. Just make sure to add some noise and all distributions overlap. I have to say, though, that even when I made sure there was no overlap (second experiment), I saw some generalization. This is weird to me, too, though...
2
0
1
@SamuelMullr
Samuel Müller
1 year
I did not expect that there could be further improvements in FlashAttention of this magnitude! Looking forward to trying this on our old RTX 2080 Tis; let's pray it works there, too 🙏🙏🙏
@tri_dao
Tri Dao
1 year
Announcing FlashAttention-2! We released FlashAttention a year ago, making attention 2-4x faster, and it is now widely used in most LLM libraries. Recently I've been working on the next version: 2x faster than v1, 5-9x vs standard attention, reaching 225 TFLOPs/s training speed on A100. 1/
Tweet media one
Tweet media two
40
682
3K
0
0
2
@SamuelMullr
Samuel Müller
2 years
Hot Take: Shortening attention spans due to digital media (TikTok etc.) are actually good for scientific progress. Researchers can't focus on too-lengthy setups, like overly complicated ML algorithms, making them adhere to Occam's Razor 🪒 more closely. Come @ me
2
0
2
@SamuelMullr
Samuel Müller
3 years
Wondering about the (a tad esoteric) original Transformer LR schedule: Is anyone still using this? It incorporates the size of the model directly, yet the scaling papers I know (Scaling LMs, Scaling ViTs) do not use it. @samsamoa @giffmana, did you consider it? Any thoughts?
Tweet media one
1
0
2
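For reference, the schedule in question (from "Attention Is All You Need") sets lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5), i.e. the model width fixes the peak learning rate.

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Original Transformer schedule: linear warmup, then inverse-sqrt decay,
    with the peak LR set implicitly by the model width d_model."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```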