Samuel Müller

@SamuelMullr

Followers: 1,088
Following: 403
Media: 27
Statuses: 342

Deep Learning PhD Student focused on (Tab)PFNs, supervised by @FrankRHutter. Ex-DeepL, Ex-Amazon. ETH BSc, Cambridge MPhil. Opinions are my own. (he/him)

Berlin
Joined February 2020
Pinned Tweet
@SamuelMullr
Samuel Müller
10 months
There is a paper by Google trending right now that claims transformer in-context learning cannot generalize between two function classes. I have reproduced their experiment in a Colab and come to a very different conclusion...
Tweet media one
12
142
1K
@SamuelMullr
Samuel Müller
3 years
Transformers Can Do Bayesian Inference (, ICLR '22): We show how to train Transformers to (pretty exactly) approximate Bayesian predictions in a single forward pass for any prior you can sample from. This yields >200x speedups compared to VI and MCMC. (1/n)
2
32
156
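For readers wondering what "train on samples from a prior" looks like in code, here is a minimal PyTorch-style sketch of the PFN training idea. The prior sampler, the model sizes, and the MSE loss are illustrative stand-ins, not the paper's exact setup (the paper scores the full predictive distribution on the hold-out points).

```python
import torch
import torch.nn as nn

# Hypothetical prior: noisy linear functions. Any prior you can sample from works.
def sample_dataset_from_prior(n_points=50):
    w = torch.randn(2)                                  # random slope and offset
    x = torch.rand(n_points, 1) * 10 - 5
    y = w[0] * x + w[1] + 0.1 * torch.randn_like(x)
    return x, y

class TinyPFN(nn.Module):
    """Toy PFN-style model: encode (x, y) context pairs and x-only queries,
    run a transformer over all tokens, and read off predictions for the queries."""
    def __init__(self, d_model=256, n_layers=4):
        super().__init__()
        self.enc_xy = nn.Linear(2, d_model)
        self.enc_x = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x_ctx, y_ctx, x_qry):
        ctx = self.enc_xy(torch.cat([x_ctx, y_ctx], dim=-1))
        qry = self.enc_x(x_qry)
        h = self.backbone(torch.cat([ctx, qry], dim=1))  # real PFNs mask so queries only attend to the context
        return self.head(h[:, ctx.shape[1]:])            # predictions for the query positions

model = TinyPFN()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for step in range(100):                                  # every step sees a freshly sampled dataset
    x, y = sample_dataset_from_prior()
    pred = model(x[:25][None], y[:25][None], x[25:][None])
    loss = ((pred - y[25:][None]) ** 2).mean()           # stand-in loss; the paper uses a proper scoring rule
    opt.zero_grad(); loss.backward(); opt.step()
```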
@SamuelMullr
Samuel Müller
1 year
@EvMill This trick is not new; it is even part of the standard torch implementation of multi-head attention. The option is called add_zero_attn. They add a zero to the logits, resulting in a one in the denominator, as e^0 = 1.
5
11
151
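For anyone who wants to see the arithmetic: appending a single zero logit before the softmax adds e^0 = 1 to the denominator, so the weights over the real keys no longer have to sum to one. A tiny self-contained demo (not the torch internals themselves):

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5])          # attention logits for the real keys

plain = torch.softmax(logits, dim=0)             # sums to 1 over the real keys
with_zero = torch.softmax(torch.cat([logits, torch.zeros(1)]), dim=0)

print(plain.sum())            # tensor(1.)
print(with_zero[:-1].sum())   # < 1: the extra e^0 = 1 lets the head "attend to nothing"
```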
@SamuelMullr
Samuel Müller
10 months
There are two important things to note here: i) This experiment is very reproducible. ii) This fits with the interpretation of in-context learning as an approximation to the posterior predictive distribution (PPD) that we push in our PFN work, see
3
12
135
@SamuelMullr
Samuel Müller
10 months
And my super small transformer (4 layers, 256 embedding size) was able to generalize to a sloped sine after 5 minutes of training on a single colab GPU. Here are its predictions:
Tweet media one
3
6
111
@SamuelMullr
Samuel Müller
10 months
The conclusion: think about in-context learning as approximating the PPD and think about your training data as a prior, like we do in our (Tab)PFN work. Have fun with my colab: For more background see our 2021 paper:
2
11
111
@SamuelMullr
Samuel Müller
10 months
They () claim that a transformer trained to predict hold-out data from datasets sampled from either sine curves or sloped lines can't generalize to predict a sloped sine.
1
5
77
@SamuelMullr
Samuel Müller
10 months
This interpretation requires that there is some probability that the data seen during in-context learning comes from either distribution, the sine curves or the sloped lines. I made sure this is the case by adding noise to all data points and to the sine; see the sample.
3
2
64
@SamuelMullr
Samuel Müller
10 months
I tried it out and trained on data like they describe. Here is a sample of my training data.
Tweet media one
1
3
64
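A rough sketch of how such training data can be generated: each curve comes from either the sine class or the sloped-line class, and noise is added to every point. The amplitude/phase/slope ranges are illustrative, not the exact ones from the paper or the Colab.

```python
import torch

def sample_curve(n_points=64, noise_std=0.1):
    """Draw one training curve: either a sine or a sloped line (illustrative ranges)."""
    x = torch.rand(n_points) * 10 - 5
    if torch.rand(()) < 0.5:
        amp, phase = torch.rand(()) * 2 + 0.5, torch.rand(()) * 6.28
        y = amp * torch.sin(x + phase)
    else:
        slope, offset = torch.randn(()), torch.randn(())
        y = slope * x + offset
    return x, y + noise_std * torch.randn(n_points)   # noise gives both classes overlapping support

# The probe at test time is a function from *neither* class: a sloped sine.
x_test = torch.linspace(-5, 5, 64)
y_test = 0.5 * x_test + torch.sin(x_test)
```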
@SamuelMullr
Samuel Müller
10 months
Now our predictions become worse... but there still seems to be some generalization for some reason...
Tweet media one
1
3
61
@SamuelMullr
Samuel Müller
10 months
If we do not add noise, and thus make a sloped sine have 0 probability, the PPD is actually not defined. Our training data then looks like this:
Tweet media one
1
2
57
@SamuelMullr
Samuel Müller
10 months
@graduatedescent what would be your take on this? Why do you think we saw such different outcomes?
1
1
48
@SamuelMullr
Samuel Müller
2 years
Yeah 🎉 Best paper award for TabPFN ☺️
@TrlWorkshop
Table Representation Learning @NeurIPS
2 years
We congratulate @noahholl @SamuelMullr @KEggensperger @frankhutter1 for receiving the best-paper award and @dja_vos @tdoehmen @sscdotopen for the best-paper runner-up award. Check out their great work! Best-paper: . Runner-up: . 4/6
1
3
12
2
1
31
@SamuelMullr
Samuel Müller
2 years
@danijarh @FrankRHutter [Author here] We actually tried this and, to our surprise, found that the TabPFN does generalize to longer sequences than it was trained on. Figure 9 in the Appendix (…)
Tweet media one
2
0
32
@SamuelMullr
Samuel Müller
3 years
I am happy to announce that TrivialAugment (paper: , blog: ) got accepted as an Oral at ICCV '21. We show that a baseline that is almost too simple to be true outperforms most of the automatic augmentation methods out there. (1/4)
2
2
30
@SamuelMullr
Samuel Müller
10 months
@LaurenceBrem I think this is not very special tbh, and standard MLPs can do it, too, in many instances. For this you might want to check out the work on MLP-Mixers :)
0
1
27
@SamuelMullr
Samuel Müller
2 years
I am very proud of our achievements in this paper towards learning an algorithm that solves real-world tabular classification problems. This might be the first step towards a new category of (Bayesian) classification algorithms that are super fast and accurate. 🚀
@FrankRHutter
Frank Hutter
2 years
This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes 1 second & yields SOTA performance (better than hyperparameter-optimized gradient boosting in 1h). Current limits: up to 1k data points, 100 features, 10 classes. 🧵1/6
Tweet media one
113
779
4K
3
3
27
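For anyone who wants to try it, a usage sketch assuming the released `tabpfn` package and its scikit-learn-style interface (install and argument names may differ between versions):

```python
# pip install tabpfn
from tabpfn import TabPFNClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier(device="cpu")   # "fit" essentially just stores the data ...
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:3])   # ... prediction is a single forward pass over train + test
```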
@SamuelMullr
Samuel Müller
3 years
The TrivialAugment Oral for the ICCV is online!! 🎉 Yes, of course I sing my presentation to the tune of "Hey Jude" 😉😂 Join us in session 1, you can also watch the entire talk on . Joint work with @FrankRHutter : .
2
6
22
@SamuelMullr
Samuel Müller
1 year
Can we replace the Gaussian Process in BayesOpt with in-context learning? Most certainly. We attain strong real-world performance on a variety of benchmarks with a PFN, which only uses in-context learning to yield acquisition values. A 🧵on our ICML paper
Tweet media one
1
6
21
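A conceptual sketch of what "acquisition values from in-context learning" means: the PFN gets the observed (x, y) history plus candidate points in its context and returns a discretized predictive distribution, from which an acquisition function such as expected improvement is a simple sum. The `pfn` call and the bucket grid below are hypothetical stand-ins, not the pfns4bo API.

```python
import torch

def expected_improvement(pfn, x_obs, y_obs, x_cand, y_grid):
    """Hypothetical PFN-based acquisition: one forward pass yields a discretized
    predictive distribution p(y | x_cand, history); EI is a sum over the grid."""
    probs = pfn(x_obs, y_obs, x_cand)            # (n_cand, n_buckets) softmax output -- stand-in call
    best = y_obs.max()                           # incumbent value
    improvement = (y_grid - best).clamp(min=0)   # improvement of each bucket over the incumbent
    return (probs * improvement).sum(dim=-1)     # expectation under the predictive distribution
```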
@SamuelMullr
Samuel Müller
2 years
@JulienMouchnino @FrankRHutter Both together. The TabPFN is a neural network that accepts a training set of (x,y) pairs and a set of test inputs x and predicts the y for all test inputs at once. The whole process (training + testing) is a single forward-pass.
3
0
19
@SamuelMullr
Samuel Müller
2 years
What about changing neural networks to make hyper-parameters easy to tune instead of finding good tuning methods? That is what @TheGregYang et al propose in their new paper . Let's dive into it and try it out. 1/n
2
1
16
@SamuelMullr
Samuel Müller
2 years
@rasbt @PyTorch Check out TrivialAugment, if you are at it. (It has *no* hyperparameters.)
1
0
17
@SamuelMullr
Samuel Müller
3 years
I am happy to announce that the TrivialAugment algorithm () is part of the newest version of torchvision, released yesterday. For more info, see . It does not have hyper-parameters and usually outperformed RandAugment in our tests.
2
3
16
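Usage is a one-liner, assuming the `TrivialAugmentWide` transform name that torchvision ships:

```python
import torchvision.transforms as T

# TrivialAugment has no hyperparameters to tune -- just drop it into the pipeline.
train_transform = T.Compose([
    T.TrivialAugmentWide(),
    T.ToTensor(),
])
```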
@SamuelMullr
Samuel Müller
1 year
@swati1729 Cold emailing can work wonders for you, too, though :)
3
0
14
@SamuelMullr
Samuel Müller
10 months
Looking forward to the AutoML Fall School, where I will give a tutorial on the next generation of AutoML and its connection to in-context learning. We will train our own little (Tab)PFN :)
@AutoML_org
AutoML.org
10 months
🚨 Only 1 week left to secure your spot at the #AutoMLFallSchool in Munich! Don't miss out on 6 keynotes, 7 hands-on tutorials and a poster session, all on #AutoML 📚💡 Register now at for one of the few remaining tickets.
2
6
12
0
3
13
@SamuelMullr
Samuel Müller
2 years
@ilyaraz2 Hi there :) would you mind sharing your setup? We would like to look into that!
0
0
10
@SamuelMullr
Samuel Müller
1 year
Very proud to be part of this cool work. We could use GPT-4 to automatically do feature engineering for us. This improved scores a lot (>2% on average) and led to very reasonable extra features. 🎉 We have open-sourced everything; feel free to try it out on your data :)
@FrankRHutter
Frank Hutter
1 year
#GPT meets #AutoML : in an effort to integrate user knowledge into AutoML, our new tool CAAFE uses LLMs to generate semantically meaningful features for tabular data (and also explains them). Towards an AI assistant for human data scientists🚀 Paper Demo
Tweet media one
2
9
48
3
1
10
@SamuelMullr
Samuel Müller
2 years
@predict_addict @FrankRHutter [Author here] We only tested calibration for smaller models, so I am not sure about the final model. In our previous paper (…), we found good calibration for PFNs trained on a similar prior in Table 1.
1
0
9
@SamuelMullr
Samuel Müller
2 years
@lnsmith613 @FrankRHutter A direct link to the paper: :)
1
1
8
@SamuelMullr
Samuel Müller
2 years
@WvanAmsterdam @FrankRHutter [Author here] We only tested calibration for smaller models, so I am not sure about the final model. In our previous paper (), we found good calibration for PFNs trained on a similar prior in Table 1.
1
0
7
@SamuelMullr
Samuel Müller
2 years
@m_elantkowski @FrankRHutter @noahholl @KEggensperger [Author here] 1. The prior that generates the synthetic datasets is chosen to be very broad. Additionally, we take care to normalize data before feeding it to the TabPFN for prediction. Thus, we believe most real-world datasets are in-distribution. 2. Yes! We are on it ;)
2
0
6
@SamuelMullr
Samuel Müller
3 years
If this thread caught your interest and you wonder “How do I train these models!?”, check out our paper at . This is joint work and I want to thank the great team consisting of @noahholl , Sebastian, Josif and @FrankRHutter . (7/n)
0
1
7
@SamuelMullr
Samuel Müller
2 years
Just felt outcompeted by an algorithm on my job for the first time. It was GitHub's Copilot. I couldn't come up with a nice way to write a function that manipulates all parameters of a neural net in PyTorch while treating embeddings differently. 1/n
1
1
7
@SamuelMullr
Samuel Müller
2 years
@ArnaudovKrum @FrankRHutter [Author here] There are a few reasons we focused on small datasets first. The biggest are: i) Training is cheaper on small datasets (we train on around 500M artificial datasets) ii) GPU memory is limited and transformers scale quadratically with the number of inputs.
3
0
7
@SamuelMullr
Samuel Müller
2 years
@tobias_sterbak @FrankRHutter [Author here] There are a few reasons we focused on small datasets first. The biggest are: i) Training is cheaper on small datasets ii) GPU memory is limited and transformers scale quadratically with the number of inputs. Both are limitations we are working on. 😊
0
0
6
@SamuelMullr
Samuel Müller
3 years
🤗
@abidlabs
Abubakar Abid
3 years
Props to the authors of "Transformers Can Do Bayesian Inference" for releasing a demo of their model along with the code! Great for accessibility and reproducibility in machine learning 👏 Paper: Demo:
Tweet media one
Tweet media two
0
84
331
0
0
6
@SamuelMullr
Samuel Müller
2 years
@pradanadimass @FrankRHutter [Author here] Transformers do not require a lot of memory at inference time, as nothing has to be saved for backprop. We did not try it, but I expect about a GB of memory is enough to make predictions using the TabPFN.
0
0
6
@SamuelMullr
Samuel Müller
2 years
@francoisfleuret torch.ones_like(x[:,:1]) maybe this?
1
0
6
@SamuelMullr
Samuel Müller
1 year
@hgoldstein95 @swati1729 Totally, but the original tweet is horrible advice for young people without contacts. She also pushes the expectation of the PhD (publish papers, visit conferences) before the PhD, which is toxic bullshit.
2
0
6
@SamuelMullr
Samuel Müller
2 years
@hbouammar I don't think this is true. An inner product (the simplest neural network) cannot be represented by a finite decision tree. One would need to map every input to a different leaf. No?
1
0
6
@SamuelMullr
Samuel Müller
1 year
Exciting! Seems to be comparable to ChatGPT for non-coding tasks but is available for download with a permissive license 🏋️
@_akhaliq
AK
1 year
Meta releases Llama 2: Open Foundation and Fine-Tuned Chat Models paper: blog: develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion
Tweet media one
35
569
2K
1
2
6
@SamuelMullr
Samuel Müller
3 years
@michalwols @giffmana @karpathy @PreetumNakkiran @francoisfleuret This is actually something I ran into with Jax also, just never with PyTorch.
1
0
5
@SamuelMullr
Samuel Müller
3 years
We created a demo of the experiment above, so that you can play around with it yourself and thoroughly test our model. (3/n)
1
0
5
@SamuelMullr
Samuel Müller
3 years
As an example, we show how our method can approximate GPs. In the plot below, we compare the exact GP posterior (🟩) to the Transformer approximating it (🟦). We marked two differences with arrows for easier distinction. This is how exact our approximation is here. (2/n)
Tweet media one
1
0
5
@SamuelMullr
Samuel Müller
3 years
I don't usually ski, but when I do, I do it like an NN trained with SGD+Momentum.
1
0
5
@SamuelMullr
Samuel Müller
8 months
Synthetic data rules. A model trained on randomly sampled data from a rather simple prior can perform almost as strongly as a math olympiad winner at writing geometry proofs, as a new Nature paper shows... (1/n)
1
0
5
@SamuelMullr
Samuel Müller
1 year
Kigali was clean, safe, fun and had good weather. What a great place to have a conference at. 🌴🛵🐆 Thank you @iclr_conf for showing us Rwanda.
0
0
5
@SamuelMullr
Samuel Müller
9 months
@marktenenholtz A TabPFN update is coming! 😌
0
0
4
@SamuelMullr
Samuel Müller
10 months
@abacaj I think this might not be true; a shameless plug for my Colab notebook so you can try it out yourself :)
@SamuelMullr
Samuel Müller
10 months
There is a paper by Google trending right now that claims transformer in-context learning cannot generalize between two function classes. I have reproduced their experiment in a Colab and come to a very different conclusion...
Tweet media one
12
142
1K
0
1
5
@SamuelMullr
Samuel Müller
2 years
Great 🙏 that’s enough for this year! I can stop working on my neurips submission then.. 😅
@srush_nlp
Sasha Rush
2 years
Samuel Müller ( @SamuelMullr ) has extremely impressive pytorch skills.
Tweet media one
1
0
2
1
0
5
@SamuelMullr
Samuel Müller
3 years
Since our method allows for such flexible priors, we developed a BNN prior that not only includes a distribution over weights, but architectures, too. Something that is easy with our method, but hard to do with VI or MCMC. (4/n)
Tweet media one
1
0
5
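What "a prior over weights and architectures you can sample from" can look like, as a rough sketch; the depth/width ranges and the noise level are illustrative, not the paper's actual prior.

```python
import torch
import torch.nn as nn

def sample_dataset_from_bnn_prior(n_points=100, n_features=3):
    """Sample an architecture, then weights, then a dataset generated by that network."""
    depth = int(torch.randint(1, 4, ()))                 # prior over architectures
    width = int(torch.randint(8, 65, ()))
    layers, d_in = [], n_features
    for _ in range(depth):
        layers += [nn.Linear(d_in, width), nn.Tanh()]
        d_in = width
    layers.append(nn.Linear(d_in, 1))
    net = nn.Sequential(*layers)                         # prior over weights = the default init

    x = torch.randn(n_points, n_features)
    with torch.no_grad():
        y = net(x) + 0.05 * torch.randn(n_points, 1)     # noisy targets from the sampled BNN
    return x, y
```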
@SamuelMullr
Samuel Müller
1 year
@DimitrisPapail I think they can actually learn algorithms. Check out the TabPFN paper, we train a transformer to learn to do tabular prediction on unseen datasets, the model is only trained on artificial data.
1
0
5
@SamuelMullr
Samuel Müller
2 years
@ChristophMolnar [Author here] Yes it is different from traditional fit and predict. The setup is more similar to meta-learning, but on artificial data. This has the side-effect that we are approximating a Bayesian solution, actually.
0
0
4
@SamuelMullr
Samuel Müller
2 years
RLHF-inspired training for vision. First, train a standard model with CE. Then fine-tune with RL on the non-differentiable metric you actually care about.
@__kolesnikov__
Alexander Kolesnikov
2 years
Vision meets RL! We reveal that policy gradient can be used for tuning vision models to optimize complex metrics, such as mAP, PQ or “color diversity”, observing large performance boosts on tasks like object detection, panoptic segmentation, etc.
Tweet media one
4
131
641
0
0
4
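A skeleton of that recipe under the usual REINFORCE formulation: sample predictions from the model, score them with the non-differentiable metric, and weight the log-probabilities by the baseline-subtracted reward. `compute_metric` is a placeholder for mAP, PQ, or whatever you care about; the paper's exact estimator may differ.

```python
import torch

def reinforce_step(model, optimizer, images, targets, compute_metric):
    """One policy-gradient step on a non-differentiable reward (REINFORCE sketch)."""
    logits = model(images)                                    # (batch, n_classes)
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()                                   # sample predictions from the model
    reward = compute_metric(sampled, targets)                 # (batch,) non-differentiable scores
    baseline = reward.mean()                                  # simple variance-reduction baseline
    loss = -((reward - baseline) * dist.log_prob(sampled)).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward.mean().item()
```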
@SamuelMullr
Samuel Müller
1 year
Reading papers that write out ideas, you thought in the direction of too always gives me such a warm feeling 🥰 Just had this with where they try out whether some of the recently proposed extra training methodology actually pays off, it mostly doesn’t :(
0
0
4
@SamuelMullr
Samuel Müller
2 years
@pradanadimass @JulienMouchnino @FrankRHutter The method can be seen as similar to in-context learning with LLMs like GPT-3.
1
0
4
@SamuelMullr
Samuel Müller
3 years
The method is so simple, I can describe it in less than a tweet: Given an image, you randomly sample an augmentation from a set of augmentations. Now you sample uniformly at random how strongly to apply this augmentation. You apply it with that strength. That is it. ✔️ (3/4)
1
0
4
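The whole method, written out as a sketch. The three PIL operations and the strength range are placeholders; the real implementation uses the augmentation set from the paper.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

def trivial_augment(img: Image.Image) -> Image.Image:
    """Sample one augmentation, sample its strength uniformly, apply it. That is it."""
    ops = [
        lambda im, m: im.rotate(30 * m),                         # placeholder op set
        lambda im, m: ImageEnhance.Contrast(im).enhance(1 + m),
        lambda im, m: ImageOps.posterize(im, max(1, 8 - int(7 * m))),
    ]
    op = random.choice(ops)              # one augmentation, uniformly at random
    strength = random.uniform(0.0, 1.0)  # one strength, uniformly at random
    return op(img, strength)
```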
@SamuelMullr
Samuel Müller
2 years
@hbouammar On current computers this can of course be seen as true, as our floats are discrete. I don't think that is what is meant, though.
1
0
4
@SamuelMullr
Samuel Müller
3 years
Using this prior, we built a probabilistic model that beats standard baselines, like CatBoost or XGBoost, on small tabular datasets in terms of ROC AUC, ECE and speed at once. PFN-BNN is our model with the prior presented above, PFN-GP is our model approximating a GP. (5/n)
Tweet media one
1
0
4
@SamuelMullr
Samuel Müller
11 months
@francoisfleuret Kinda second-law-of-thermodynamics dynamics, right?
2
0
3
@SamuelMullr
Samuel Müller
3 years
You want to know how they created that nice table/figure in that other paper? arXiv lets you simply access the LaTeX source code of papers to find out. i) Click "Other formats". ii) Click "Download source". iii) Unzip in the terminal, or add ".tar.gz" as the extension and unzip.
Tweet media one
Tweet media two
0
0
4
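The same steps from Python, for convenience. This assumes arXiv's e-print endpoint serves a .tar.gz (some very old submissions are a single gzipped .tex instead), and the paper ID is just an example.

```python
import tarfile
import urllib.request

paper_id = "1706.03762"   # e.g. "Attention Is All You Need"; substitute the paper you care about
urllib.request.urlretrieve(f"https://arxiv.org/e-print/{paper_id}", "source.tar.gz")

with tarfile.open("source.tar.gz") as tar:   # the LaTeX sources, figures, and .bbl land here
    tar.extractall("source")
```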
@SamuelMullr
Samuel Müller
3 years
@karpathy This will be a viable direction. Randomised simulations can implicitly generate a prior. We can show that meta-learning on such data yields a posterior approximation for this prior, see So it is all still Bayesian! (shameless plug end)
0
0
4
@SamuelMullr
Samuel Müller
1 year
We also show how to build the knowledge gradient acquisition function straight into the neural network for lookahead optimization. If you want to try out our models, simply `pip install pfns4bo`. Our documentation and training code can be found at 4/6
1
1
3
@SamuelMullr
Samuel Müller
1 year
@SuryaGanguli @AllanRaventos @mansiege @FCHEN_AI Looking forward to reading the paper and understanding exactly what you mean by becoming better than bayes-optimal with task diversity. Sounds great! I think we actually demonstrated ICL out-of-dist performance in the real-world on tabular tasks with .
1
0
3
@SamuelMullr
Samuel Müller
3 years
Always good to remember: Even if ML will eat the world, it might be so automatic that it’ll eat our ML Engineer/Scientist jobs, too. 🧑🏻‍💻🥢
1
0
3
@SamuelMullr
Samuel Müller
3 years
Finally, due to how we train our model, we can fine-tune it on real-world data when meta-datasets are available. We created a prior for hand-written letters and then fine-tuned on Omniglot to be competitive with the state of the art in this setup. (6/n)
Tweet media one
1
0
3
@SamuelMullr
Samuel Müller
2 years
A year ago I thought I needed to know more neuroscience to be a better ML researcher. With the LLM advances, I believe I am rather lacking pedagogy knowledge..
2
0
3
@SamuelMullr
Samuel Müller
2 years
While this idea was based on "knowledge" (it seems to know PyTorch better than me), all ideas are somewhat based on knowledge. I can see how AI evolves to outperform me on more coding tasks. BTW, if you use Copilot, don't forget to write docstrings and comments to help it.
0
0
3
@SamuelMullr
Samuel Müller
1 year
Our models are trained only on data drawn from a prior, i.e. we did not fine tune on any real-world data. Since our models are simple transformers under the hood, we can use them very flexibly, e.g. we use two priors that were impossible to model with previous methods. 2/6
Tweet media one
1
1
3
@SamuelMullr
Samuel Müller
9 months
@BlackHC @zacharylipton Haha they are but I think this is actually one where the Bayesians are right and the connection is not very far fetched:
0
0
3
@SamuelMullr
Samuel Müller
8 months
They just generate random geometry settings and random statements. Then they let a transformer learn to do proofs by treating the paths to the random statements as proof sequences.
1
0
2
@SamuelMullr
Samuel Müller
1 year
We even show how to allow users to change the prior of a trained model post hoc, by feeding the model a prior over the location of the optimum to integrate human expert knowledge. If the user sets a reasonable prior, this can substantially improve performance. 3/6
Tweet media one
1
1
3
@SamuelMullr
Samuel Müller
2 years
@ArashVahdat Reviewers actually can't answer your emails, as they are delivered from a no reply email. :(
0
0
3
@SamuelMullr
Samuel Müller
2 years
For a similar technique, SWA (), the authors lay out the intuition that these averages lead to choosing a flatter region of the local optimum, which thus generalizes better. They saw test accuracy increase and train accuracy decrease, i.e. less overfitting.
1
0
3
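For reference, this is roughly how SWA looks with the utilities that ship in torch.optim.swa_utils; the epochs, swa_start, and swa_lr values are arbitrary placeholders.

```python
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, optimizer, loader, loss_fn, epochs=20, swa_start=15):
    """Plain training first, then keep a running average of the weights (SWA)."""
    swa_model = AveragedModel(model)
    swa_scheduler = SWALR(optimizer, swa_lr=0.05)
    for epoch in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)   # average in the current weights
            swa_scheduler.step()
    update_bn(loader, swa_model)                 # recompute BatchNorm stats for the averaged weights
    return swa_model
```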
@SamuelMullr
Samuel Müller
10 months
@graduatedescent Re sine frequencies: I only trained on one, but that arguably makes it even harder!? But yeah, I overlooked that.. Re architecture: Yeah, right! I did not use pos embs. I wonder whether they actually destroy it all. I guess for length generalization it is well known that they are bad. 🤔
0
0
2
@SamuelMullr
Samuel Müller
1 year
@marktenenholtz In a one epoch setting everything is the validation setting, right?
1
0
3
@SamuelMullr
Samuel Müller
2 years
I was about to discern embeddings by their shape... ugly 🤢 Then Copilot chimed in and suggested a perfect loop over `model.named_parameters()`. This has a new quality to me: Copilot had an idea I didn't have.
1
0
3
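The actual Copilot suggestion was never posted, but one plausible version of such a loop over named parameters, here splitting embeddings out of weight decay, looks like this:

```python
import torch.nn as nn

def param_groups(model: nn.Module, weight_decay: float = 0.01):
    """Split parameters by name so embeddings are treated differently (here: no weight decay)."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (no_decay if "embedding" in name.lower() else decay).append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# optimizer = torch.optim.AdamW(param_groups(model))
```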
@SamuelMullr
Samuel Müller
2 years
@rasbt @FrankRHutter 1. We use datasets with up to 2K examples, as we use a 50/50 train/val split. 2. Categoricals are simply treated as scalars (with integer values). A training example is treated as a single vector, in this case of size 5, which is encoded with a simple linear layer.
0
0
3
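A minimal sketch of the encoding described above; the feature values and the embedding width are made up.

```python
import torch
import torch.nn as nn

# One training example = one vector of scalar features (categoricals as integer codes).
row = torch.tensor([5.1, 3.5, 1.4, 0.2, 2.0])   # 4 numeric features + 1 categorical code

encoder = nn.Linear(5, 512)                      # a simple linear layer produces the token embedding
token = encoder(row)                             # this token is what the transformer attends over
```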
@SamuelMullr
Samuel Müller
3 years
@vlastelicap @_arohan_ @giffmana Not significant. The usual rule of thumb I know is >= 1 BLEU is considered significant.
0
0
3
@SamuelMullr
Samuel Müller
3 years
Feeling cute to work in a field where leading works have names like BERT, DrNAS, RoBERTa and YOLOv3. ☺️
0
0
3
@SamuelMullr
Samuel Müller
2 years
You need to comfort the LLM, you need to ask it to think step-by-step.
0
0
3
@SamuelMullr
Samuel Müller
4 years
@fhuszar There is not even a paper, only a number.
0
0
3
@SamuelMullr
Samuel Müller
3 years
Unlike previous methods, we only apply a single augmentation, but we use varying strengths. See the paper or blog for more. We even open-sourced it as a mini-library: . A big thank you to my supervisor, @FrankRHutter, for the collaboration on this project. (4/4)
1
0
3
@SamuelMullr
Samuel Müller
1 year
Thank you @__mfeurer__, @noahholl, and @FrankRHutter for being amazing colleagues and co-authors! If this caught your interest and you have questions, please feel free to reach out. (6/6)
0
1
3
@SamuelMullr
Samuel Müller
8 months
@giffmana @cgarciae88 @predict_addict @kaggle @tunguz Mmh I do not know any work that does that for tabular
0
0
3
@SamuelMullr
Samuel Müller
2 years
What if we had a version with down-sampled images for each paper on arxiv? Download times recently exploded for me for vision papers. This would surely improve accessibility and could probably be done automatically by @arxiv .
0
0
3
@SamuelMullr
Samuel Müller
4 years
@mkhoury @will_lawrenceTO @Austen I agree, but that is clearly a mistake of YouTube. I use like 10 apps on my Apple TV and YouTube is the only one I struggle with.
0
0
3
@SamuelMullr
Samuel Müller
10 months
@martin_casado I think this might not be true... shameless plug...
@SamuelMullr
Samuel Müller
10 months
There is a paper by Google trending right now that claims transformer in-context learning cannot generalize between two function classes. I have reproduced their experiment in a Colab and come to a very different conclusion...
Tweet media one
12
142
1K
1
0
2
@SamuelMullr
Samuel Müller
10 months
@chiefluddite This might actually not pan out; I tried it out and it did generalize...
@SamuelMullr
Samuel Müller
10 months
There is a paper by Google trending right now that claims transformer in-context learning cannot generalize between two function classes. I have reproduced their experiment in a Colab and come to a very different conclusion...
Tweet media one
12
142
1K
0
0
2
@SamuelMullr
Samuel Müller
3 years
@skoularidou I see! In my feed your post was stuffed between ML posts, so I had the wrong context, I guess. 😅
0
0
2
@SamuelMullr
Samuel Müller
1 year
@fchollet Isn't torch just clearly the best? Can't Google allow its employees to also switch to it, as it is real OSS now? I don't think Google's exceptionalism helps research, as it reduces reproducibility a lot. A step in the right direction is Keras support for torch as a backend, I think.
2
0
2
@SamuelMullr
Samuel Müller
1 year
Tweet media one
0
0
2
@SamuelMullr
Samuel Müller
10 months
@GaryMarcus For me it just generalized out of the box, you can try it yourself. Everything is public
@SamuelMullr
Samuel Müller
10 months
There is a paper by Google trending right now that claims transformer in-context learning cannot generalize between two function classes. I have reproduced their experiment in a Colab and come to a very different conclusion...
Tweet media one
12
142
1K
0
0
1
@SamuelMullr
Samuel Müller
3 years
@giffmana @samsamoa Thanks for the insight! I just realized that either I am overlooking it or you didn't mention the pre-training peak LR in your paper. So, you just used peak LR = 8e-4 in all experiments? That is beautifully simple :)
1
0
2
@SamuelMullr
Samuel Müller
10 months
@graduatedescent Re overlap: Agreed. That is my main finding, too. Just make sure to add some noise and all distributions overlap. I have to say, though, that even when I made sure there was no overlap (second experiment), I saw some generalization. This is weird to me, too, though...
2
0
1
@SamuelMullr
Samuel Müller
1 year
I did not expect that there could be further improvements in FlashAttention of this magnitude! Looking forward to trying this on our old RTX 2080 Tis; let's pray it works there, too 🙏🙏🙏
@tri_dao
Tri Dao
1 year
Announcing FlashAttention-2! We released FlashAttention a year ago, making attention 2-4x faster, and it is now widely used in most LLM libraries. Recently I've been working on the next version: 2x faster than v1, 5-9x vs standard attention, reaching 225 TFLOPs/s training speed on A100. 1/
Tweet media one
Tweet media two
40
682
3K
0
0
2
@SamuelMullr
Samuel Müller
2 years
Hot Take: Shortening attention spans due to digital media (TikTok etc.) are actually good for scientific progress. Researchers can't focus on too-lengthy setups, like overly complicated ML algorithms, making them adhere to Occam's Razor 🪒 more closely. Come @ me
2
0
2
@SamuelMullr
Samuel Müller
3 years
Wondering about the (a tad esoteric) original Transformer LR schedule: Is anyone still using this? It incorporates the size of the model directly, yet the scaling papers I know (Scaling LMs, Scaling ViTs) do not use it. @samsamoa @giffmana, did you consider it? Any thoughts?
Tweet media one
1
0
2
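For reference, the schedule in question (from "Attention Is All You Need") sets lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5), i.e. the model width fixes the peak learning rate.

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Original Transformer schedule: linear warmup, then inverse-sqrt decay,
    with the peak LR set implicitly by the model width d_model."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```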