Randall Balestriero

@randall_balestr

Followers
3,067
Following
243
Media
124
Statuses
454

AI Researcher: From theory to practice (and back). Postdoc @MetaAI with @ylecun. PhD @RiceUniversity with @rbaraniuk. Masters @ENS_Ulm @Paris_Sorbonne.

USA
Joined April 2020
Pinned Tweet
@randall_balestr
Randall Balestriero
2 months
Latest preprint gathering some of our papers using affine splines to dive into AI models. The math is exact, intuitive, and actionable... allowing us to derive new methods that improved SOTAs. Dive in if you want to make AI less of a trial-and-error science!
Tweet media one
4
33
169
@randall_balestr
Randall Balestriero
2 years
Deep Neural Networks are powerful... but how do you provably enforce some constraints into them? With @ylecun we introduce POLICE a simple method that does just that provably without sampling or changes in your loss/training (and it uses affine splines)!
Tweet media one
7
83
545
@randall_balestr
Randall Balestriero
2 years
Do you want a self-supervised learning based kernel (and embedding) of your data without training a deep network? Here it is... SimCLR and VICReg in the kernel regime (no training), to be used whenever training a deep network is not an option!
Tweet media one
2
75
370
@randall_balestr
Randall Balestriero
8 months
Keep training your Deep Network past the point of perfect training set accuracy and its robustness will increase. Why? Because the spline partition keeps concentrating near the decision boundary ➡️the DN is affine all around the training samples!
Tweet media one
11
50
323
@randall_balestr
Randall Balestriero
2 years
- A new paper explaining batch norm!
- Please... there are already >>1 papers showing it helps optimization, @ylecun even said it in 1998
- Wait! it does more and can be studied from a spline viewpoint e.g. batch norm fits a random weight DN to your data! ⬇️
Tweet media one
3
52
318
@randall_balestr
Randall Balestriero
1 year
If you train your AI system without labels, SSL is probably what you will end up using. But you might hit many walls along the way as SSL builds upon decades of research. To help, we compiled this guide: whether you train/deploy/research, give it a read!
@ylecun
Yann LeCun
1 year
Everything you ever wanted to know about Self-Supervised Learning but were afraid to ask. A giant cookbook of SSL recipes. By a large crowd from Meta-FAIR with various academic collaborators led by @randall_balestr and Mark Ibrahim.
44
684
3K
5
39
236
@randall_balestr
Randall Balestriero
2 years
HUGE diff between Decision Tree (any variant) and Deep Network explaining generalization/extrapolation: DTs only partition the space where there is training data, DNs also partition the space where there is no training data by extrapolating the subdivision
Tweet media one
10
45
245
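One quick way to see this difference, as a toy sketch with sklearn and made-up 1D data (not from the paper): a fitted tree returns the same leaf value everywhere outside the training range, while a ReLU MLP keeps extrapolating affinely.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(200, 1))          # training data lives only in [-1, 1]
y_train = np.sin(3 * x_train).ravel()

tree = DecisionTreeRegressor(max_depth=6).fit(x_train, y_train)
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000).fit(x_train, y_train)

x_far = np.array([[5.0], [10.0]])                     # far outside the training support
print(tree.predict(x_far))   # identical values: the outermost leaf is constant beyond the data
print(mlp.predict(x_far))    # different values: the outermost linear region extends to infinity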
@randall_balestr
Randall Balestriero
2 years
If you don't use self-supervised learning today, it can only be due to two reasons:
1. you did not hear of SSL yet
2. you don't have time/GPUs to use it
Now that we resolved 1., check out our latest preprint where we resolve 2. !!
Tweet media one
4
35
238
@randall_balestr
Randall Balestriero
2 years
Very happy to *finally* share our latest findings with @ylecun tying different SSL methods to known spectral embedding methods (in addition to providing as many insights/ideas as we could)... ⬇️ is a very brief summary of some key results :)
8
42
222
@randall_balestr
Randall Balestriero
2 years
Supervised and self-supervised learning? Two separate methods for different cases... one might say! With @CabannesVivien @ylecun Leon Bottou we show instead that both live on the same continuum... opening the door to novel principled learning strategies!
Tweet media one
2
49
204
@randall_balestr
Randall Balestriero
1 year
Happy that our Active Self-Supervised Learning got accepted at ICCV! We prove that DNNs learn optimal representations only from positive data pairing. Since positive pairs are way cheaper than labels to query we also study that new active learning strategy
Tweet media one
2
40
201
@randall_balestr
Randall Balestriero
3 years
Latest preprint with Léon Bottou and @ylecun on the impact of regularization/data-augmentation on per-class performances (for better or worse)! Using them improves average generalization but some classes will have worse performance than without them 🧵1/4
Tweet media one
4
46
191
@randall_balestr
Randall Balestriero
2 months
Batch-normalization (BN)--used in pretty much all non-transformer AI models--minimizes the total least squares objective between the training points and the model's input space partition! TLDR: total least squares is all you need to dive into AI theory!
Tweet media one
@docmilanfar
Peyman Milanfar
2 months
Every technical person knows about ordinary least-squares (OLS) but most don’t know *total* least-squares (TLS). These measure fitting error differently: OLS minimizes sum of sq. vertical distances whereas TLS minimizes the sum of orthogonal distances from data to fit line 1/2
Tweet media one
15
92
857
2
35
185
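For readers who have not met TLS before, a small numpy sketch on toy data: OLS via np.polyfit minimizes vertical residuals, while TLS takes the top principal direction of the centered point cloud, which minimizes orthogonal distances.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

# OLS: minimize the sum of squared vertical distances
a_ols, b_ols = np.polyfit(x, y, 1)

# TLS: minimize the sum of squared orthogonal distances
# = direction of largest variance of the centered cloud (top right-singular vector)
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(X, full_matrices=False)
vx, vy = Vt[0]
a_tls = vy / vx
b_tls = y.mean() - a_tls * x.mean()

print(a_ols, b_ols)   # vertical-residual fit
print(a_tls, b_tls)   # orthogonal-residual fit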
@randall_balestr
Randall Balestriero
3 years
Very happy to share our preprint, a joint-work with @imisra_ and @ylecun , which is about data-augmentations (DAs), or rather, the expectation and variance of models' predictions and training losses under randomly augmented samples! (1/5)
Tweet media one
5
37
173
@randall_balestr
Randall Balestriero
1 year
Very happy to share that our paper at the intersection of Information Theory/Self-Supervised Learning/Spline Theory got into #NeurIPS ! We show how to (i) do information theory with deterministic networks and (ii) derive new SSL guarantees/methods from it!
4
34
173
@randall_balestr
Randall Balestriero
2 years
Affine splines enable you to do deep learning theory without resorting to the linearized/kernel regime i.e. you study what practitioners actually deploy. But even more important, splines provide the coolest viz. of deep networks you could dream of! List of useful spline papers⬇️
@imtiazprio
Imtiaz Humayun
2 years
How are Deep Neural Networks black-boxes if you can visualize them in an 'exact' manner? Our new #CVPR23 paper, presents a fast and scalable PyTorch toolbox to visualize the linear regions, aka partition+decision boundary, of any DNN (red🔻)! 🧵 1/N
7
57
275
1
26
163
@randall_balestr
Randall Balestriero
2 years
Learning good representations using manifold learning? Spectral embedding? Energy based models? Self-supervised learning? All share one goal: learning non-collapsed representations with minimal variations. Join @CabannesVivien @albertobietti for a journey:
Tweet media one
4
33
159
@randall_balestr
Randall Balestriero
3 years
100% true. That is why I strongly recommend anyone learning deep learning to also take a basic digital signal processing course. At least to get the basics of convolution (CNNs), aliasing (sub-sampling/pre-processing), FIR and IIR filters (RNNs), wavelet thresholding (AEs)
@AlexGDimakis
Alex Dimakis
3 years
Here is a very good reason why the Nyquist–Shannon sampling theorem requires that your function is low-pass before you sub-sample to downscale. If you just sub-sample without smoothing, a bad guy can place another image exactly on the pixels you sub-sample. Adversarial aliasing.
8
54
348
5
19
146
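The adversarial-aliasing point in the quoted tweet is easy to reproduce numerically; a sketch assuming scipy is available (arbitrary toy frequencies): sub-sampling a 330 Hz tone down to 100 Hz makes it show up as a clean 30 Hz tone, while low-pass filtering first removes it.

import numpy as np
from scipy import signal

fs = 1000                                  # original sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 330 * t)            # 330 Hz tone, above the post-decimation Nyquist (50 Hz)

naive = x[::10]                            # plain sub-sampling: no low-pass first
proper = signal.decimate(x, 10)            # anti-aliasing low-pass, then sub-sample

f_naive = np.fft.rfftfreq(naive.size, d=10 / fs)
print(f_naive[np.argmax(np.abs(np.fft.rfft(naive)))])   # ~30 Hz: the 330 Hz tone aliased down
print(np.abs(proper).max())                              # tiny: the tone was filtered out instead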
@randall_balestr
Randall Balestriero
2 years
Wanna
- use Information Theory
- but with deterministic deep networks
- to study and improve self-supervised learning?
We do just that and explain how in our latest preprint with @ziv_ravid @ylecun @timrudner and Kenji! Bonus: it uses affine splines ;)
2
31
151
@randall_balestr
Randall Balestriero
8 months
Learning by reconstruction ``easily'' provides eye-candy samples... but the learned representation's ability to solve perception tasks is often a letdown. We pinpoint that misalignment, measure it, and show how some denoising tasks (masking) sometimes help
Tweet media one
4
28
152
@randall_balestr
Randall Balestriero
11 months
Very happy to introduce our preprint working out the geometry of LLMs... no approximation or simplification! Side effects: we extract informative features from LLMs that can solve various tasks such as toxic prompt detection and we bypass Llama2's RLHF!
Tweet media one
1
24
146
@randall_balestr
Randall Balestriero
2 months
Honored to join @BrownCSDept to keep pushing for theoretically grounded AI solutions! From self-supervised learning (what else do you need?) to fairness, we have one motto: Prove Once, Train Once. I want to thank everyone I have talked/pdb/trained/published with... you made me!
Tweet media one
@BrownCSDept
Brown CS
2 months
Please welcome @randall_balestr , joining @BrownCSDept as assistant professor! His research focuses on novel theoretical solutions to guide practitioners, to safeguard users, and to pave the way towards a truly autonomous AI solution. Learn more:
Tweet media one
0
3
52
13
11
145
@randall_balestr
Randall Balestriero
2 years
Decision trees do not combine input dims at each node but an oblique DT (ODT) does.
1. ODTs are not easily interpretable due to that fact
2. some deep networks can be turned into ODTs (very deep + lots of nodes)
This does not help much for DN interpretability (1+2)
Tweet media one
6
17
132
@randall_balestr
Randall Balestriero
1 year
It has never been simpler to prevent DNs from overfitting! Guillotine Regularization (accepted at TMLR) (i) adds a few layers on top of your favorite DN during training, (ii) removes them post-training, (iii) trains a linear layer on top of the frozen DN!
Tweet media one
4
31
129
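A minimal PyTorch sketch of that recipe, with hypothetical layer sizes (not the paper's code): train with a few extra layers on top, throw them away, then fit a linear probe on the frozen backbone.

import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 256), nn.ReLU())
extra_head = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))  # the "guillotined" layers

model = nn.Sequential(backbone, extra_head)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
# ... (i)+(ii): train `model` as usual, then discard `extra_head` ...

for p in backbone.parameters():           # (iii): freeze the backbone
    p.requires_grad_(False)
probe = nn.Linear(256, 10)                # and train only this linear layer on top
probe_opt = torch.optim.SGD(probe.parameters(), lr=1e-2)
x = torch.randn(32, 1, 28, 28)
logits = probe(backbone(x))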
@randall_balestr
Randall Balestriero
2 years
How to assess SSL models’ downstream performance with no labels, no tuning/training, and in a matter of minutes? With @garridoq_ , @laurentnajman , and @ylecun , we answer this question by introducing RankMe, a simple metric based on the rank of embeddings!
Tweet media one
2
20
125
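As a rough sketch of the kind of quantity involved, here is an effective-rank estimate computed from the entropy of normalized singular values; see the RankMe paper for the exact definition and recommended usage.

import torch

def effective_rank(embeddings: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # embeddings: (num_samples, dim) matrix of SSL representations
    s = torch.linalg.svdvals(embeddings)        # singular values
    p = s / (s.sum() + eps)                     # normalize into a distribution
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy)                   # smooth rank, between 1 and min(num_samples, dim)

z = torch.randn(4096, 256)
print(effective_rank(z))                        # near full rank for random Gaussian embeddings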
@randall_balestr
Randall Balestriero
3 months
Aaaand we are live from Vienna at poster 1002! Come by to discuss training dynamics, splines, and the two-stage learning that secretly occurs within your deep networks!
Tweet media one
2
14
120
@randall_balestr
Randall Balestriero
3 years
Very happy to share our preprint that explains why residual connections provably make the loss surface of deep networks everywhere less erratic and eccentric (better conditioned)... hence resnet/densenet are easier to optimize under SGD out-of-the-box. 1/2
3
24
118
@randall_balestr
Randall Balestriero
2 years
Less is more, which is why we put unsupervised learning on a DIET! By predicting the datum index (as if it were its class) DIET learns SOTA representations without labels! + it works without projector/siamese nets/... on resnets/vits/convnexts/.. WYSIWYG⬇️
Tweet media one
9
18
118
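The core trick fits in a few lines; a hedged PyTorch sketch on toy tensors (the real method works on augmented views and, to my understanding, relies on heavy label smoothing): every sample's "label" is simply its index in the dataset.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 3, 32, 32)                    # unlabeled data
dataset = TensorDataset(X, torch.arange(len(X)))    # target = datum index
loader = DataLoader(dataset, batch_size=128, shuffle=True)

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
index_head = nn.Linear(256, len(X))                 # one "class" per datum index
opt = torch.optim.AdamW(list(encoder.parameters()) + list(index_head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.8)  # smoothing value is a guess, see the paper

for x, idx in loader:
    # in the real method, x would be a data-augmented view of sample idx
    loss = loss_fn(index_head(encoder(x)), idx)
    opt.zero_grad(); loss.backward(); opt.step()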
@randall_balestr
Randall Balestriero
2 years
Happy to be at #ICML2022 ! And happy to chat/brainstorm about SSL/splines/data-augmentation/... at the @MetaAI booth (Tuesday/Wednesday, 8:30 am until early afternoon)... or DM me!
Tweet media one
4
7
115
@randall_balestr
Randall Balestriero
1 year
We had found that training with a projector (MLP layers topping your DN) reduces the DN's learned biases e.g. to poor data-augmentation. We now find that you can control this effect simply by changing the projector's input dimension!
Tweet media one
3
20
110
@randall_balestr
Randall Balestriero
2 years
Self-supervised learning involves many design choices (architecture, data-augmentation, ...) and cross-validation is not always an option. That is why, in our latest paper, we theoretically study the interplay between those choices and provide guidelines:
Tweet media one
2
19
113
@randall_balestr
Randall Balestriero
9 months
How to inject prior knowledge into Self Supervised Learning:
- loss
- architecture
- data augmentation
We add a fourth🕑dimension with Guided Positive Sampling:
- embedding space to query positive samples
Removing the need to define strong DA + trains faster!
Tweet media one
1
27
107
@randall_balestr
Randall Balestriero
3 years
Even if the Fourier transform was not explicitly invoked, it has been present for decades as the preferred convolution algorithm for large image and/or filter sizes! Here is yet another classic read from @ylecun on the subject
@JFPuget
JFPuget 🇺🇦
3 years
It looks like Fourier transform is everywhere now in deep learning. at least in the papers I am reading now.
28
34
443
3
17
107
@randall_balestr
Randall Balestriero
4 months
Self Supervised Learning learns informative and organized representations of unlabeled data... but involves many moving pieces... Q: which are necessary and which are sugar coating? A: Bonus: removing the sugar coating makes SSL training stable and reliable
2
25
104
@randall_balestr
Randall Balestriero
1 year
Happy that our work on understanding the interplay between architecture/data-augmentation on Self-Supervised Learning downstream perfs. has been accepted at #ICML2023 ! YES, you can successfully use SSL with ``bad'' DA as long as your DN archit. is right
Tweet media one
1
17
91
@randall_balestr
Randall Balestriero
1 year
We hope you have found all the answers you needed in our cookbook around SOTA representation learning with SSL! But wait, we will be giving away even more tips and tricks at our #ICML2023 tutorial! Monday/1:30pm local/exhibit hall2 speakers include @imisra_ @mcaron31 @endernewton
Tweet media one
@ylecun
Yann LeCun
1 year
Everything you ever wanted to know about Self-Supervised Learning but were afraid to ask. A giant cookbook of SSL recipes. By a large crowd from Meta-FAIR with various academic collaborators led by @randall_balestr and Mark Ibrahim.
44
684
3K
0
15
88
@randall_balestr
Randall Balestriero
2 years
You know a deadline is approaching when you start using np.sqrt and \sqrt interchangeably...!
1
4
86
@randall_balestr
Randall Balestriero
3 months
We previously showed () how many SSL methods could be unified using an inter-sample relationship graph (spectral embedding). From that, we now propose a new SSL method: 𝕏-CLR ()! better loss=less spurious correlations being learned
@vlad_is_ai
Vlad Sobal
3 months
Representation learning is often done by considering samples to be either identical (same class, positive pairs) or not–with no middle ground. We propose 𝕏-CLR to learn from soft inter-sample relationships, and get better accuracy & improved robustness.
Tweet media one
2
20
79
1
20
82
@randall_balestr
Randall Balestriero
3 months
Aaaaand we are back on the ground at poster 602 to cover a breaking news: learning a representation by reconstruction will not produce something useful for perception tasks! They don't have the same taste in features! Come by to learn why and to discuss alternative solutions!
Tweet media one
@randall_balestr
Randall Balestriero
3 months
Aaaand we are live from Vienna at poster 1002! Come by to discuss training dynamics, splines, and the two-stage learning that secretly occurs within your deep networks!
Tweet media one
2
14
120
5
6
83
@randall_balestr
Randall Balestriero
11 months
Excited to share our #NeurIPS2023 paper explaining part of the per-class accuracy degradation that data augmentation introduces: it creates asymmetric label-noise between coarse/fine classes of the same object e.g. car and wheel! We also find a remedy⬇️
Tweet media one
1
12
82
@randall_balestr
Randall Balestriero
1 year
Training dynamics of surrogate quantities e.g. the loss are well studied but do not provide many insights into the DN's geometry. But linear region concentration does just that and still exhibits a double descent dynamic that is controlled by regularization
Tweet media one
2
18
75
@randall_balestr
Randall Balestriero
2 years
POLICE code is now available:
Quick facts:
- POLICE only takes 5 lines of code
- code is jit/CPU/GPU friendly (PyTorch)
- it will only take a few minutes to generate all the figures
Eager to see the figures/papers/ideas you will create from it!
Tweet media one
@randall_balestr
Randall Balestriero
2 years
Deep Neural Networks are powerful... but how do you provably enforce some constraints into them? With @ylecun we introduce POLICE a simple method that does just that provably without sampling or changes in your loss/training (and it uses affine splines)!
Tweet media one
7
83
545
0
10
71
@randall_balestr
Randall Balestriero
10 months
Interestingly the ReLU and Swish relation is well understood from a spline viewpoint akin to the relation between k-NN and isotropic GMM: deterministic vs probabilistic region assignment! The same goes for absolute value vs Mish, and many more! More at
Tweet media one
@mayfer
murat 🍥
10 months
with all the sparsity-aware context based memory loading papers coming out, (PowerInfer getting 11x and Apple getting 25x speedup on GPU) ReLU's dead zone is turning out to be important llama-class models (SwiGLU) might not have much longevity afterall once all the Metal work
Tweet media one
10
20
245
0
10
69
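The deterministic-vs-probabilistic gating view is easy to check numerically with a temperature on Swish: x*sigmoid(beta*x) approaches ReLU as beta grows (toy sketch, beta values chosen arbitrarily).

import torch

x = torch.linspace(-4, 4, steps=9)
relu = torch.relu(x)
for beta in (1.0, 5.0, 50.0):
    swish = x * torch.sigmoid(beta * x)        # soft, probabilistic gating of the region assignment
    print(beta, (swish - relu).abs().max())    # gap shrinks as beta grows: hard assignment in the limit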
@randall_balestr
Randall Balestriero
2 years
Our #CVPR2023 submission has been accepted! () We develop an exact+fast algo to compute a Deep Network partition characterizing its geometry and decision boundary e.g. to rapidly sample from the latter for viz/active learning! Code:
Tweet media one
2
14
67
@randall_balestr
Randall Balestriero
2 years
Not too surprising since e.g. batch-norm with random weights provably aligns the DN's partition to the data geometry: just from its mini-batch statistics!
Tweet media one
@DimitrisPapail
Dimitris Papailiopoulos
2 years
"The Expressive Power of Tuning Only the Norm Layers" lead by @AngelikiGiannou & @shashank_r12 We show that large frozen networks maintain expressivity even if we only fine-tune the norm & bias layers.
Tweet media one
5
40
263
2
14
62
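For the quoted setting (fine-tuning only the norm and bias parameters), a rough PyTorch/torchvision sketch of the setup, as an illustration only and not the paper's exact protocol:

import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights=None)             # pretrained weights would be loaded in practice
for p in model.parameters():
    p.requires_grad_(False)                # freeze everything
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):      # unfreeze only the normalization layers
        for p in m.parameters():
            p.requires_grad_(True)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable, "trainable parameters out of", sum(p.numel() for p in model.parameters()))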
@randall_balestr
Randall Balestriero
2 years
Awesome website summarizing our latest TMLR paper demonstrating how deep networks pruning can be easily explained/visualized and improved simply by formulating it in terms of the DN's spline partition! Paper: Code:
Tweet media one
@ranery1998
Haoran You
2 years
@randall_balestr @TmlrSub @eiclab We just built a website for this project 🤗:
0
1
5
0
11
63
@randall_balestr
Randall Balestriero
2 months
Vision Language Models have fueled recent AI breakthroughs... but the next generation will need to do more than just scale up dataset and model sizes! Dive into our latest preprint and benchmark library to understand why and to stress-test your ideas!
Tweet media one
6
10
62
@randall_balestr
Randall Balestriero
2 years
Happy to have four papers accepted to #NeurIPS2022 ! Shoutout to incredible co-authors/colleagues @imisra_ @ylecun @bobak_kiani and Leon Bottou! I will tweet about each in the coming days... but⬇️ TLDR: Never stop improving papers from reviews/comments... perseverance is the key!
0
1
60
@randall_balestr
Randall Balestriero
3 years
Happy to share our #CVPR2022 paper w/ @imtiazprio , @rbaraniuk providing a simple solution to provably sample from the (anti-)modes of pre-trained generative networks... also leading to new StyleGAN2/3/BigGAN FID SOTAs 🧵(1/4) colab:
4
12
58
@randall_balestr
Randall Balestriero
1 year
Delighted to share that our work with @garridoq_ and @ylecun got an oral+poster at #ICML2023 ! We enable truly label-free hyper-parameter search for SSL (validated on SimCLR/VICReg/DINO/.. and many datasets) aiming for best linear perf. without fine-tuning!
Tweet media one
3
6
56
@randall_balestr
Randall Balestriero
2 years
A (Deep) Network has always been "any computational graph with forward and (optionally) backward data-flow" (see⬇️). This is a big class that includes kernels, trees, k-NN... just by a change of arch. So when people say moving away from DNs, do they mean moving away from computers?
6
5
55
@randall_balestr
Randall Balestriero
2 years
Happy to share our accepted @TmlrSub with @eiclab about deep network (DN) pruning from an affine spline perspective! In short, pruning removes/projects the DN partition boundaries (nice avenues to theoretically understand/improve pruning) Some insights ⬇️
Tweet media one
2
11
54
@randall_balestr
Randall Balestriero
2 years
Interested in (A) self-supervised learning <-> spectral embedding (B) data-augmentation's unfair per-class impact (C) DA implicit/explicit regularizer (D) fast orthogonal/unitary weight learning? Come see us #NeurIPS2022 4:30/Hall J A: #228 Tue B: #642 Thu C: #322 Wed D: #542 Wed
3
9
54
@randall_balestr
Randall Balestriero
2 years
Amazing work resulting from an amazing collaboration made possible by @forai_ml and @sarahookr ! TLDR: we still have a lot to learn around what brings stochasticity in deep network training (init/batching/DA) and by how much. This paper takes an important step in quantifying them!
Tweet media one
@CohereForAI
Cohere For AI
2 years
Our newest Paper Profiles video goes behind the scenes of our recent community-driven research collaboration, "FAIR-Ensemble: When Fairness Naturally Emerges from Deep Ensembling." Thanks to @weiyinko_ml and @mrdanieldsouza for taking the time to chat!
Tweet media one
1
10
23
2
15
50
@randall_balestr
Randall Balestriero
3 years
Accepted as an oral #CVPR2022 !🥳 Taking this opportunity to say that this comes as a result of years of work in building a theoretical bridge between deep networks<->continuous piecewise affine operators. Theory is a guide that reduces the set of unknowns to be cross-validated 🧑‍🔧!
@randall_balestr
Randall Balestriero
3 years
Happy to share our #CVPR2022 paper w/ @imtiazprio , @rbaraniuk providing a simple solution to provably sample from the (anti-)modes of pre-trained generative networks... also leading to new StyleGAN2/3/BigGAN FID SOTAs 🧵(1/4) colab:
4
12
58
0
10
46
@randall_balestr
Randall Balestriero
1 year
SSL and supervised learning unified under one loss (only the inter-sample similarity graph varies between them) at #ICCV23 Friday/10:30/Nord/023 Hello to cheap expert-free active/supervised learning by asking if samples come from the same class, not asking for the class label!
Tweet media one
1
6
44
@randall_balestr
Randall Balestriero
2 years
New Constructive Approximation paper: Deep Networks with (leaky-)relu and least-square loss have continuous piecewise quadratic per-layer loss landscape. From that we precisely study how the DN architecture impacts that loss landscape and SGD convergence!
Tweet media one
1
3
45
@randall_balestr
Randall Balestriero
2 years
Two ICLR23: - spotlight: new fine-grain labels for Imagenet+insights into failure modes of models/DA/losses - poster: theory unraveling the failures that emerge when deploying self-supervised learning on uncurated data One goal: understanding when/why deep learning can fail ⬇️
Tweet media one
1
6
45
@randall_balestr
Randall Balestriero
2 years
3 ICASSP papers!
- the infamous POLICE that provably tames the beast (DN) to obey input space constraints
- the unifying minimal variations tying SSL/spectral embedding/generative models
- DNs' partition enumeration: not out/stay tuned
Tweet media one
1
6
41
@randall_balestr
Randall Balestriero
3 years
Delighted to share our latest preprint with Bobak Kiani, @ylecun , and Seth Lloyd where we propose an **efficient and scalable** gradient based training of orthogonal/unitary matrices (e.g. used in each layer of a recurrent network/convolutional network).
Tweet media one
Tweet media two
5
11
43
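Their algorithm is detailed in the paper; for quick experiments, a related constraint can also be imposed with PyTorch's built-in orthogonal parametrization, shown below purely as a point of comparison and not as their method.

import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

layer = orthogonal(nn.Linear(128, 128, bias=False))   # weight is re-parametrized to stay orthogonal
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)

x = torch.randn(32, 128)
loss = (layer(x) ** 2).mean()
loss.backward()
opt.step()

W = layer.weight
print(torch.allclose(W @ W.T, torch.eye(128), atol=1e-5))  # True: orthogonality preserved after the update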
@randall_balestr
Randall Balestriero
2 years
+1! Academic programs should favor candidates who did not get that chance, to mentor them and to bring them to the top for post-PhD adventures. Taking PhD candidates to bloat your group's publication count from year one goes against the academic spirit... (teaching statement anyone?)
@sarahookr
Sara Hooker
2 years
If you have multiple papers before you even began a PhD, it likely means you had access that others didn't. I wish more PhD programs would take a step back and stop this absurd practice of favoring multiple papers before someone even begins a training program.
78
304
3K
1
0
38
@randall_balestr
Randall Balestriero
1 year
RankMe: cheap/fast label-free hparam selection for DNNs will be at ICML - oral: ballroom B 4pm (local) Wed. - poster⬇️: exhibit hall 1 #609 1:30pm Thur. Also includes insights around representations' ranks, their surprising consistency across datasets, ...
Tweet media one
1
9
37
@randall_balestr
Randall Balestriero
2 years
Delighted to share our @TmlrOrg paper with F. Bordes and P. Vincent! We use the latest diffusion model to interpret/visualize the features of black-box models (DNNs, ...) by conditioning the generation with the model's features. We obtain many insights⬇️⬇️
Tweet media one
1
8
36
@randall_balestr
Randall Balestriero
3 years
"We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. (1/2)
2
3
34
@randall_balestr
Randall Balestriero
1 year
Do you remember the POLICE (fast PrOvable LInear Constraints Enforcement for deep networks ) ? An application to adversarial robustness is now available POLICE can be used as a one-shot robustifier, or during training/fine-tuning!⬇️
Tweet media one
@randall_balestr
Randall Balestriero
2 years
Deep Neural Networks are powerful... but how do you provably enforce some constraints into them? With @ylecun we introduce POLICE a simple method that does just that provably without sampling or changes in your loss/training (and it uses affine splines)!
Tweet media one
7
83
545
1
4
35
@randall_balestr
Randall Balestriero
3 months
Delighted to share that--by dint of all my coauthors--I will be at #ICML2024 to present our findings! From LLM geometry to adversarial grokking, without forgetting the provable benefits of moving away from reconstruction for representation learning! link:
Tweet media one
6
10
33
@randall_balestr
Randall Balestriero
2 years
Self-Supervised Learning methods have strong a priori assumptions about the type of data distribution you train on. With @mido_assran et al. we highlight what those a priori assumptions are and how to tune them to our advantage e.g. to improve SSL on uncurated and/or imbalanced data
Tweet media one
0
6
33
@randall_balestr
Randall Balestriero
6 months
Very happy to share that we will be presenting that work at #ICML2024 ! Moving away from reconstruction is key to learn better semantic abstractions... but we can only do so by first understanding why learning by reconstruction falls short!
@randall_balestr
Randall Balestriero
8 months
Learning by reconstruction ``easily'' provides eye-candy samples... but the learned representation's ability to solve perception tasks is often a letdown. We pinpoint that misalignment, measure it, and show how some denoising tasks (masking) sometimes help
Tweet media one
4
28
152
1
2
27
@randall_balestr
Randall Balestriero
2 years
With very large models and/or slow training frameworks (GPT-3, self-supervised learning, ...) I believe that theoretically-backed methods will regain ground... brute-force cross-validation of everything is no longer an option! MuTransfer of @TheGregYang embodies that perfectly
0
2
26
@randall_balestr
Randall Balestriero
2 years
Another benefit of batch norm lies in the randomness of the mini-batch statistics (from one batch to the next) inducing a jittering effect in the partition and increasing the decision boundary margin to training samples! The batch size controls the jittering strength and can thus be used as a tuning knob
Tweet media one
1
3
26
@randall_balestr
Randall Balestriero
6 months
We will be presenting this work at #ICML2024 diving deeper into LLMs' geometry and how that can help in understanding their current limitations. For example, increasing a prompt's intrinsic dimension bypasses RLHF! Congrats @Rom_Cosentino @shekkizh
@randall_balestr
Randall Balestriero
11 months
Very happy to introduce our preprint working out the geometry of LLMs... no approximation or simplification! Side effects: we extract informative features from LLMs that can solve various tasks such as toxic prompt detection and we bypass Llama2's RLHF!
Tweet media one
1
24
146
0
8
24
@randall_balestr
Randall Balestriero
10 months
🔔LLM update! - The few hundred features we extract from Mistral/Llama2-7B to characterize your prompt (e.g. for domain separation or toxicity detection) also work on Llama2-70B - We validate them on the official Jigsaw Kaggle challenge and reach SOTA
Tweet media one
@randall_balestr
Randall Balestriero
11 months
Very happy to introduce our preprint working out the geometry of LLMs... no approximation or simplification! Side effects: we extract informative features from LLMs that can solve various tasks such as toxic prompt detection and we bypass Llama2's RLHF!
Tweet media one
1
24
146
1
4
24
@randall_balestr
Randall Balestriero
2 years
In our latest preprint we show that Deep Ensembles have fairness benefits even when each model uses the same training set/architecture/optimizer. We also characterize by how much random init./data-augmentation/data-ordering impact the learned model between training episodes :)
@weiyinko_ml
Wei-Yin Ko
2 years
Our new preprint is out! In FAIR-Ensemble, we explore per-group performances after predictions averaging of Deep Networks (same architecture, hyper-parameters) and fairness naturally emerges! Paper:   Code: 1/8
Tweet media one
1
7
39
1
5
23
@randall_balestr
Randall Balestriero
2 years
Thanks to an amazing team ( @byoubii @D_Bouchacourt @marksibrahim et al.) we are releasing fine-grained distribution shift annotations for each Imagenet eval image and many train ones along with controlled robustness analysis of many SOTA models e.g. looking at the impact of DA!
Tweet media one
@_akhaliq
AK
2 years
ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations abs: project page: github:
Tweet media one
1
22
108
1
3
23
@randall_balestr
Randall Balestriero
18 days
Why you should think twice before setting the `max_new_tokens` parameter!
@weilunchao
Wei-Lun Chao
20 days
NeurIPS 2024 Best AC Award ...
Tweet media one
32
38
713
0
0
22
@randall_balestr
Randall Balestriero
2 years
Even when looking at high-dimensional spaces and architectures with thousands of units per layer + multiple layers, training time of the constrained model is only about 4~5x slower than the unconstrained one, which is a small cost to pay for a provable constraint enforcement method!
Tweet media one
0
1
20
@randall_balestr
Randall Balestriero
3 years
In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort." Sir Ronald Fisher, (last page)
1
0
20
@randall_balestr
Randall Balestriero
4 months
Very happy to speak at SSL4EO! The core part of my talk will follow our latest papers to (i) provide principled insights into SSL, and (ii) give guidelines to design your own pipeline: 1/4: why do we need to move away from learning by reconstruction () ⬇️
@nicolangnl
Nico Lang
4 months
At University of Copenhagen, we are organizing a summer PhD course on SSL4EO. Registration is now open via: (seats are limited). We are looking forward to hear from: @randall_balestr , @MarcCoru , @kklmmr , @brunosan , @JanDirkWegner1 , @xiaoxiang_zhu
Tweet media one
3
44
166
2
2
20
@randall_balestr
Randall Balestriero
3 years
@shortstein @icmlconf As sad as it is, it seems that such situations have existed across fields and ages, as formulated by Fisher in 1958!
@randall_balestr
Randall Balestriero
3 years
"We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. (1/2)
2
3
34
0
0
20
@randall_balestr
Randall Balestriero
3 months
Aaaand we are now covering that surprising event happening on poster 705 with guest @shekkizh : diving into LLMs geometry and using those insights to derive features from pretrained models or to bypass RLHF through natural prompt manipulations!
Tweet media one
@randall_balestr
Randall Balestriero
3 months
Aaaaand we are back on the ground at poster 602 to cover a breaking news: learning a representation by reconstruction will not produce something useful for perception tasks! They don't have the same taste in features! Come by to learn why and to discuss alternative solutions!
Tweet media one
5
6
83
0
2
19
@randall_balestr
Randall Balestriero
2 years
POLICE can also be applied to classification/SSL tasks, you do not need to change your loss, optimizer, or architecture, and its complexity is only determined by the cost of a single forward-pass in your model × the number of vertices defining the constrained region
Tweet media one
1
1
18
@randall_balestr
Randall Balestriero
2 years
By upgrading the FFCV library, we enable super fast/single GPU training of SSL. We also explore how to cross-validate SSL methods, and show that known failure cases were just the result of poor hyper-parameters! Huge effort by Florian Bordes (MVP!), Pascal Vincent and myself :)
Tweet media one
1
0
17
@randall_balestr
Randall Balestriero
3 months
Our geometric characterization of LLMs ( at #ICML2024 ) tied the prompts' intrinsic dimensions to their ability to make an LLM's generation toxic. @Tenyx_AI researchers extended our results for reasoning! Q: Can reasoning and safe generation coexist with LLMs?
Tweet media one
@shekkizh
Sarath Shekkizhar
3 months
Unlocking better reasoning in LLMs can go beyond just longer context & bigger models! Our recent research () offers a geometric view of the expressive power and reasoning capabilities of LLMs. Stay tuned for more insights! @Rom_Cosentino #LLM #Reasoning
0
2
3
6
6
18
@randall_balestr
Randall Balestriero
2 years
"Symmetric positive definiteness is arguably one of the highest mathematical accolades to which a matrix can aspire." Prof. Nicholas J. Higham
0
2
17
@randall_balestr
Randall Balestriero
3 years
Impressive program at the upcoming World AI Cannes Festival in France @WAICANNES ! The AI Society / AI Today & Tomorrow track alone has an impressive list of speakers including @ylecun , anyone can attend (for free) with the Discovery Pass !
0
5
18
@randall_balestr
Randall Balestriero
8 months
As a bonus, our findings also explain why long training time is required to ``finally'' capture the features useful for perception tasks as part of the representation. Our findings open new avenues to speed up training through new denoising tasks!
Tweet media one
1
2
17
@randall_balestr
Randall Balestriero
4 months
New preprint + AI4Science #ICML2024 workshop: ScaLES! ScaLES provides a differentiable confidence score for samples generated from pretrained models. Applied to Latent Space Optimization, ScaLES improves the solutions to black-box optimization problems!
Tweet media one
1
5
16
@randall_balestr
Randall Balestriero
10 months
You have to train an ensemble of Deep Networks with the same training set and architecture. Q: How to maximize the ensemble fairness?
1. vary the weight initialization
2. vary the data sampling
3. vary the data-augmentation seed
4. all the above
Answer at the AFT workshop on Friday!
@CohereForAI
Cohere For AI
2 years
@weiyinko_ml @mrdanieldsouza Find out more about what model design choices can mitigate unfair outcomes by reading "FAIR-Ensemble: When Fairness Naturally Emerges from Deep Ensembling." 📜
Tweet media one
1
0
4
1
5
16
@randall_balestr
Randall Balestriero
2 years
The DN partition boundary (random weights) shows a higher concentration of regions around the data distribution (from using batch norm alone!). This fitting is proved analytically: batch norm statistics shift/bend the partition boundaries to the data, and depth is crucial!⬇️
Tweet media one
1
2
15
@randall_balestr
Randall Balestriero
3 years
We always hear about applied deep learning results relying on many tricks, but never about theoretical results around deep learning that often rely on even more assumptions! Both are highly specialized, need fine-tuning to work on new cases and both struggle to impress each other
0
1
15
@randall_balestr
Randall Balestriero
2 years
Provable control of the quality and diversity of sampling for pre-trained deep generative networks ... without any additional learning! Check us out at #CVPR2022 tomorrow at 8:30 am in Hall B1, Oral Session 3.1.1, and Poster Session 3.1
@randall_balestr
Randall Balestriero
3 years
Happy to share our #CVPR2022 paper w/ @imtiazprio , @rbaraniuk providing a simple solution to provably sample from the (anti-)modes of pre-trained generative networks... also leading to new StyleGAN2/3/BigGAN FID SOTAs 🧵(1/4) colab:
4
12
58
0
2
15
@randall_balestr
Randall Balestriero
3 years
Real-world deep learning interview problems and solutions: Interesting resource for everyone! I especially enjoy the thorough references that have been put throughout the set of problems!
2
5
15
@randall_balestr
Randall Balestriero
2 years
@ISusmelj can't anyone add a feature saying that if the type of inferred vehicle changed chaotically during 5 sec, let's just represent it as an "unidentifiable blob" until it stabilizes? at least would not look as buggy on the frontend...
1
1
14
@randall_balestr
Randall Balestriero
8 months
Furthermore, Batch-Normalization, which is known to force the DN partition to concentrate near the training samples (), prevents robustness from emerging! In fact, Grokking is all about understanding the DN partition migration dynamics...
Tweet media one
1
0
15
@randall_balestr
Randall Balestriero
2 years
We indeed need a more closed-loop SSL e.g. where data-augmentations of positive pairs are guided by the deep network's guess (action) of what new views (sensory inputs) would provide it with new information leading to a sharper understanding of the presented image/scene!
Tweet media one
@timos_m
Timoleon (Timos) Moraitis
2 years
#NeuroAI : Could principles of embodied sensorimotor neuroscience unify and improve the various Self-Supervised Learning (SSL) methods? How could the brain self-supervise itself? We are happy to share our #NeurIPS2022 paper with @franz_scherr and Q. Guo🧵:
6
31
148
1
1
14
@randall_balestr
Randall Balestriero
3 years
@WriteArthur Well, yes there are the natural statistics of the training data that will influence the generation, but the type of DA/regularization that was used during training also plays a huge role. See for example our latest preprint on that point
@randall_balestr
Randall Balestriero
3 years
Latest preprint with Léon Bottou and @ylecun on the impact of regularization/data-augmentation on per-class performances (for better or worse)! Using them improves average generalization but some classes will have worse performance than without them 🧵1/4
Tweet media one
4
46
191
0
0
14
@randall_balestr
Randall Balestriero
1 year
One can only imagine the vast amount of knowledge that was distilled into this, which is why that end-result would have never been possible without all the incredible co-authors that decided to collaborate for one purpose: sharing their knowledge and past experiences!
1
1
14