Randall Balestriero

@randall_balestr

Followers
3,067
Following
243
Media
124
Statuses
454

AI Researcher: From theory to practice (and back). Postdoc @MetaAI with @ylecun. PhD @RiceUniversity with @rbaraniuk. Masters @ENS_Ulm @Paris_Sorbonne.

USA
Joined April 2020
Pinned Tweet
@randall_balestr
Randall Balestriero
2 months
Latest preprint gathering some of our papers using affine splines to dive into AI models. The math is exact, intuitive, and actionable... allowing us to derive new methods that improved SOTAs. Dive in if you want to make AI less of a trial-and-error science!
Tweet media one
4
33
169
@randall_balestr
Randall Balestriero
2 years
Deep Neural Networks are powerful... but how do you provably enforce some constraints into them? With @ylecun we introduce POLICE a simple method that does just that provably without sampling or changes in your loss/training (and it uses affine splines)!
Tweet media one
7
83
545
@randall_balestr
Randall Balestriero
2 years
Do you want a self-supervised learning based kernel (and embedding) of your data without training a deep network? Here it is... SimCLR and VICReg in the kernel regime (no training), to be used whenever training a deep network is not an option!
Tweet media one
2
75
370
@randall_balestr
Randall Balestriero
8 months
Keep training your Deep Network past the point of perfect training set accuracy and its robustness will increase. Why? Because the spline partition keeps concentrating near the decision boundary ➡️the DN is affine all around the training samples!
Tweet media one
11
50
323
@randall_balestr
Randall Balestriero
2 years
- A new paper explaining batch norm!
- Please... there are already >>1 papers showing it helps optimization, @ylecun even said it in 1998
- Wait! it does more and can be studied from a spline viewpoint e.g. batch norm fits a random weight DN to your data! ⬇️
Tweet media one
3
52
318
@randall_balestr
Randall Balestriero
1 year
If you train your AI system without labels, SSL is probably what you will end up using. But you might hit many walls along the way as SSL builds upon decades of research. To help, we compiled this guide: whether you train/deploy/research, give it a read!
@ylecun
Yann LeCun
1 year
Everything you ever wanted to know about Self-Supervised Learning but were afraid to ask. A giant cookbook of SSL recipes. By a large crowd from Meta-FAIR with various academic collaborators led by @randall_balestr and Mark Ibrahim.
44
684
3K
5
39
236
@randall_balestr
Randall Balestriero
2 years
HUGE diff between Decision Tree (any variant) and Deep Network explaining generalization/extrapolation: DTs only partition the space where there is training data, DNs also partition the space where there is no training data by extrapolating the subdivision
Tweet media one
10
45
245
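One quick way to see this difference, as a toy sketch with sklearn and made-up 1D data (not from the paper): a fitted tree returns the same leaf value everywhere outside the training range, while a ReLU MLP keeps extrapolating affinely.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(200, 1))          # training data lives only in [-1, 1]
y_train = np.sin(3 * x_train).ravel()

tree = DecisionTreeRegressor(max_depth=6).fit(x_train, y_train)
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000).fit(x_train, y_train)

x_far = np.array([[5.0], [10.0]])                     # far outside the training support
print(tree.predict(x_far))   # identical values: the outermost leaf is constant beyond the data
print(mlp.predict(x_far))    # different values: the outermost linear region extends to infinity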
@randall_balestr
Randall Balestriero
2 years
If you don't use self-supervised learning today, it can only be due to two reasons:
1. you did not hear of SSL yet
2. you don't have time/GPUs to use it
Now that we resolved 1., check out our latest preprint where we resolve 2. !!
Tweet media one
4
35
238
@randall_balestr
Randall Balestriero
2 years
Very happy to *finally* share our latest findings with @ylecun tying different SSL methods to known spectral embedding methods (in addition to providing as many insights/ideas as we could)... ⬇️ is a very brief summary of some key results :)
8
42
222
@randall_balestr
Randall Balestriero
2 years
Supervised and self-supervised learning? Two separate methods for different cases... one might say! With @CabannesVivien @ylecun Leon Bottou we show instead that both live on the same continuum... opening the door to novel principled learning strategies!
Tweet media one
2
49
204
@randall_balestr
Randall Balestriero
1 year
Happy that our Active Self-Supervised Learning got accepted at ICCV! We prove that DNNs learn optimal representations only from positive data pairing. Since positive pairs are way cheaper than labels to query we also study that new active learning strategy
Tweet media one
2
40
201
@randall_balestr
Randall Balestriero
3 years
Latest preprint with Léon Bottou and @ylecun on the impact of regularization/data-augmentation on per-class performances (for better or worse)! Using them improves average generalization but some classes will have worse performance than without them 🧵1/4
Tweet media one
4
46
191
@randall_balestr
Randall Balestriero
2 months
Batch-normalization (BN)--used in pretty much all non-transformer AI models--minimizes the total least squares objective between the training points and the model's input space partition! TLDR: total least squares is all you need to dive into AI theory!
Tweet media one
@docmilanfar
Peyman Milanfar
2 months
Every technical person knows about ordinary least-squares (OLS) but most don’t know *total* least-squares (TLS). These measure fitting error differently: OLS minimizes sum of sq. vertical distances whereas TLS minimizes the sum of orthogonal distances from data to fit line 1/2
Tweet media one
15
92
857
2
35
185
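For readers who have not met TLS before, a small numpy sketch on toy data: OLS via np.polyfit minimizes vertical residuals, while TLS takes the top principal direction of the centered point cloud, which minimizes orthogonal distances.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

# OLS: minimize the sum of squared vertical distances
a_ols, b_ols = np.polyfit(x, y, 1)

# TLS: minimize the sum of squared orthogonal distances
# = direction of largest variance of the centered cloud (top right-singular vector)
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(X, full_matrices=False)
vx, vy = Vt[0]
a_tls = vy / vx
b_tls = y.mean() - a_tls * x.mean()

print(a_ols, b_ols)   # vertical-residual fit
print(a_tls, b_tls)   # orthogonal-residual fit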
@randall_balestr
Randall Balestriero
3 years
Very happy to share our preprint, a joint-work with @imisra_ and @ylecun , which is about data-augmentations (DAs), or rather, the expectation and variance of models' predictions and training losses under randomly augmented samples! (1/5)
Tweet media one
5
37
173
@randall_balestr
Randall Balestriero
1 year
Very happy to share that our paper at the intersection of Information Theory/Self-Supervised Learning/Spline Theory got into #NeurIPS ! We show how to (i) do information theory with deterministic networks and (ii) derive new SSL guarantees/methods from it!
4
34
173
@randall_balestr
Randall Balestriero
2 years
Affine splines enable you to do deep learning theory without resorting to the linearized/kernel regime i.e. you study what practitioners actually deploy. But even more important, splines provide the coolest viz. of deep networks you could dream of! List of useful spline papers⬇️
@imtiazprio
Imtiaz Humayun
2 years
How are Deep Neural Networks black-boxes if you can visualize them in an 'exact' manner? Our new #CVPR23 paper, presents a fast and scalable PyTorch toolbox to visualize the linear regions, aka partition+decision boundary, of any DNN (red🔻)! 🧵 1/N
7
57
275
1
26
163
@randall_balestr
Randall Balestriero
2 years
Learning good representations using manifold learning? Spectral embedding? Energy based models? Self-supervised learning? All share one goal: learning non-collapsed representations with minimal variations. Join @CabannesVivien @albertobietti for a journey:
Tweet media one
4
33
159
@randall_balestr
Randall Balestriero
3 years
100% true. That is why I strongly recommend anyone learning deep learning to also take a basic digital signal processing course. At least to get the basics of convolution (CNNs), aliasing (sub-sampling/pre-processing), FIR and IIR filters (RNNs), wavelet thresholding (AEs)
@AlexGDimakis
Alex Dimakis
3 years
Here is a very good reason why the Nyquist–Shannon sampling theorem requires that your function is low-pass before you sub-sample to downscale. If you just sub-sample without smoothing, a bad guy can place another image exactly on the pixels you sub-sample. Adversarial aliasing.
8
54
348
5
19
146
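The adversarial-aliasing point in the quoted tweet is easy to reproduce numerically; a sketch assuming scipy is available (arbitrary toy frequencies): sub-sampling a 330 Hz tone down to 100 Hz makes it show up as a clean 30 Hz tone, while low-pass filtering first removes it.

import numpy as np
from scipy import signal

fs = 1000                                  # original sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 330 * t)            # 330 Hz tone, above the post-decimation Nyquist (50 Hz)

naive = x[::10]                            # plain sub-sampling: no low-pass first
proper = signal.decimate(x, 10)            # anti-aliasing low-pass, then sub-sample

f_naive = np.fft.rfftfreq(naive.size, d=10 / fs)
print(f_naive[np.argmax(np.abs(np.fft.rfft(naive)))])   # ~30 Hz: the 330 Hz tone aliased down
print(np.abs(proper).max())                              # tiny: the tone was filtered out instead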
@randall_balestr
Randall Balestriero
2 years
Wanna
- use Information Theory
- but with deterministic deep networks
- to study and improve self-supervised learning?
We do just that and explain how in our latest preprint with @ziv_ravid @ylecun @timrudner and Kenji! Bonus: it uses affine splines ;)
2
31
151
@randall_balestr
Randall Balestriero
8 months
Learning by reconstruction ``easily'' provides eye-candy samples... but the learned representation's ability to solve perception tasks is often a letdown. We pinpoint that misalignment, measure it, and show how some denoising tasks (masking) sometimes help
Tweet media one
4
28
152
@randall_balestr
Randall Balestriero
11 months
Very happy to introduce our preprint working out the geometry of LLMs... no approximation or simplification! Side effects: we extract informative features from LLMs that can solve various tasks such as toxic prompt detection and we bypass Llama2's RLHF!
Tweet media one
1
24
146
@randall_balestr
Randall Balestriero
2 months
Honored to join @BrownCSDept to keep pushing for theoretically grounded AI solutions! From self-supervised learning (what else do you need?) to fairness, we have one motto: Prove Once, Train Once. I want to thank everyone I have talked/pdb/trained/published with... you made me!
Tweet media one
@BrownCSDept
Brown CS
2 months
Please welcome @randall_balestr , joining @BrownCSDept as assistant professor! His research focuses on novel theoretical solutions to guide practitioners, to safeguard users, and to pave the way towards a truly autonomous AI solution. Learn more:
Tweet media one
0
3
52
13
11
145
@randall_balestr
Randall Balestriero
2 years
Decision trees do not combine input dims at each node but an oblique DT (ODT) does.
1. ODTs are not easily interpretable due to that fact
2. some deep networks can be turned into ODTs (very deep + lots of nodes)
This does not help much for DN interpretability (1+2)
Tweet media one
6
17
132
@randall_balestr
Randall Balestriero
1 year
It has never been simpler to prevent DNs from overfitting! Guillotine Regularization (accepted at TMLR) (i) adds a few layers on top of your favorite DN during training, (ii) removes them post-training, (iii) trains a linear layer on top of the frozen DN!
Tweet media one
4
31
129
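A minimal PyTorch sketch of that recipe, with hypothetical layer sizes (not the paper's code): train with a few extra layers on top, throw them away, then fit a linear probe on the frozen backbone.

import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 256), nn.ReLU())
extra_head = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))  # the "guillotined" layers

model = nn.Sequential(backbone, extra_head)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
# ... (i)+(ii): train `model` as usual, then discard `extra_head` ...

for p in backbone.parameters():           # (iii): freeze the backbone
    p.requires_grad_(False)
probe = nn.Linear(256, 10)                # and train only this linear layer on top
probe_opt = torch.optim.SGD(probe.parameters(), lr=1e-2)
x = torch.randn(32, 1, 28, 28)
logits = probe(backbone(x))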
@randall_balestr
Randall Balestriero
2 years
How to assess SSL models’ downstream performance with no labels, no tuning/training, and in a matter of minutes? With @garridoq_ , @laurentnajman , and @ylecun , we answer this question by introducing RankMe, a simple metric based on the rank of embeddings!
Tweet media one
2
20
125
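As a rough sketch of the kind of quantity involved, here is an effective-rank estimate computed from the entropy of normalized singular values; see the RankMe paper for the exact definition and recommended usage.

import torch

def effective_rank(embeddings: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # embeddings: (num_samples, dim) matrix of SSL representations
    s = torch.linalg.svdvals(embeddings)        # singular values
    p = s / (s.sum() + eps)                     # normalize into a distribution
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy)                   # smooth rank, between 1 and min(num_samples, dim)

z = torch.randn(4096, 256)
print(effective_rank(z))                        # near full rank for random Gaussian embeddings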
@randall_balestr
Randall Balestriero
3 months
Aaaand we are live from Vienna at poster 1002! Come by to discuss training dynamics, splines, and the two-stage learning that secretly occurs within your deep networks!
Tweet media one
2
14
120
@randall_balestr
Randall Balestriero
3 years
Very happy to share our preprint that explains why residual connections provably make the loss surface of deep networks everywhere less erratic and eccentric (better conditioned)... hence resnet/densenet are easier to optimize under SGD out-of-the-box. 1/2
3
24
118
@randall_balestr
Randall Balestriero
2 years
Less is more, which is why we put unsupervised learning on a DIET! By predicting the datum index (as if it were its class) DIET learns SOTA representations without labels! + it works without projector/siamese nets/... on resnets/vits/convnexts/.. WYSIWYG⬇️
Tweet media one
9
18
118
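The core trick fits in a few lines; a hedged PyTorch sketch on toy tensors (the real method works on augmented views and, to my understanding, relies on heavy label smoothing): every sample's "label" is simply its index in the dataset.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 3, 32, 32)                    # unlabeled data
dataset = TensorDataset(X, torch.arange(len(X)))    # target = datum index
loader = DataLoader(dataset, batch_size=128, shuffle=True)

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
index_head = nn.Linear(256, len(X))                 # one "class" per datum index
opt = torch.optim.AdamW(list(encoder.parameters()) + list(index_head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.8)  # smoothing value is a guess, see the paper

for x, idx in loader:
    # in the real method, x would be a data-augmented view of sample idx
    loss = loss_fn(index_head(encoder(x)), idx)
    opt.zero_grad(); loss.backward(); opt.step()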
@randall_balestr
Randall Balestriero
2 years
Happy to be at #ICML2022 ! And happy to chat/brainstorm about SSL/splines/data-augmentation/... at the @MetaAI booth (Tuesday/Wednesday, 8:30 am until early afternoon)... or DM me!
Tweet media one
4
7
115
@randall_balestr
Randall Balestriero
1 year
We had found that training with a projector (MLP layers topping your DN) reduces the DN's learned biases e.g. to poor data-augmentation. We now find that you can control this effect simply by changing the projector's input dimension!
Tweet media one
3
20
110
@randall_balestr
Randall Balestriero
2 years
Self-supervised learning involves many design choices (architecture, data-augmentation, ...) and cross-validation is not always an option. That is why, in our latest paper, we theoretically study the interplay between those choices and provide guidelines:
Tweet media one
2
19
113
@randall_balestr
Randall Balestriero
9 months
How to inject prior knowledge into Self Supervised Learning:
- loss
- architecture
- data augmentation
We add a fourth🕑dimension with Guided Positive Sampling:
- embedding space to query positive samples
Removing the need to define strong DA + trains faster!
Tweet media one
1
27
107
@randall_balestr
Randall Balestriero
3 years
Even if the Fourier transform was not explicitly invoked, it has been present for decades as the preferred convolution algorithm for large image and/or filter sizes! Here is yet another classic read from @ylecun on the subject
@JFPuget
JFPuget 🇺🇦
3 years
It looks like Fourier transform is everywhere now in deep learning. at least in the papers I am reading now.
28
34
443
3
17
107
@randall_balestr
Randall Balestriero
4 months
Self Supervised Learning learns informative and organized representations of unlabeled data... but involves many moving pieces... Q: which are necessary and which are sugar coating? A: Bonus: removing the sugar coating makes SSL training stable and reliable
2
25
104
@randall_balestr
Randall Balestriero
1 year
Happy that our work on understanding the interplay between architecture/data-augmentation on Self-Supervised Learning downstream perfs. has been accepted at #ICML2023 ! YES, you can successfully use SSL with ``bad'' DA as long as your DN archit. is right
Tweet media one
1
17
91
@randall_balestr
Randall Balestriero
1 year
We hope you have found all the answers you needed in our cookbook around SOTA representation learning with SSL! But wait, we will be giving away even more tips and tricks at our #ICML2023 tutorial! Monday/1:30pm local/exhibit hall2 speakers include @imisra_ @mcaron31 @endernewton
Tweet media one
@ylecun
Yann LeCun
1 year
Everything you ever wanted to know about Self-Supervised Learning but were afraid to ask. A giant cookbook of SSL recipes. By a large crowd from Meta-FAIR with various academic collaborators led by @randall_balestr and Mark Ibrahim.
44
684
3K
0
15
88
@randall_balestr
Randall Balestriero
2 years
You know a deadline is approaching when you start using np.sqrt and \sqrt interchangeably...!
1
4
86
@randall_balestr
Randall Balestriero
3 months
We previously showed () how many SSL methods could be unified using an inter-sample relationship graph (spectral embedding). From that, we now propose a new SSL method: 𝕏-CLR ()! better loss=less spurious correlations being learned
@vlad_is_ai
Vlad Sobal
3 months
Representation learning is often done by considering samples to be either identical (same class, positive pairs) or not–with no middle ground. We propose 𝕏-CLR to learn from soft inter-sample relationships, and get better accuracy & improved robustness.
Tweet media one
2
20
79
1
20
82
@randall_balestr
Randall Balestriero
3 months
Aaaaand we are back on the ground at poster 602 to cover a breaking news: learning a representation by reconstruction will not produce something useful for perception tasks! They don't have the same taste in features! Come by to learn why and to discuss alternative solutions!
Tweet media one
@randall_balestr
Randall Balestriero
3 months
Aaaand we are live from Vienna at poster 1002! Come by to discuss training dynamics, splines, and the two-stage learning that secretly occurs within your deep networks!
Tweet media one
2
14
120
5
6
83
@randall_balestr
Randall Balestriero
11 months
Excited to share our #NeurIPS2023 paper explaining part of the per-class accuracy degradation that data augmentation introduces: it creates asymmetric label-noise between coarse/fine classes of the same object e.g. car and wheel! We also find a remedy⬇️
Tweet media one
1
12
82
@randall_balestr
Randall Balestriero
1 year
Training dynamics of surrogate quantities e.g. the loss are well studied but do not provide many insights into the DN's geometry. But linear region concentration does just that and still exhibits a double descent dynamic that is controlled by regularization
Tweet media one
2
18
75
@randall_balestr
Randall Balestriero
2 years
POLICE code is now available:
Quick facts:
- POLICE only takes 5 lines of code
- code is jit/CPU/GPU friendly (PyTorch)
- it will only take a few minutes to generate all the figures
Eager to see the figures/papers/ideas you will create from it!
Tweet media one
@randall_balestr
Randall Balestriero
2 years
Deep Neural Networks are powerful... but how do you provably enforce some constraints into them? With @ylecun we introduce POLICE a simple method that does just that provably without sampling or changes in your loss/training (and it uses affine splines)!
Tweet media one
7
83
545
0
10
71
@randall_balestr
Randall Balestriero
10 months
Interestingly the ReLU and Swish relation is well understood from a spline viewpoint akin to the relation between k-NN and isotropic GMM: deterministic vs probabilistic region assignment! The same goes for absolute value vs Mish, and many more! More at
Tweet media one
@mayfer
murat 🍥
10 months
with all the sparsity-aware context based memory loading papers coming out, (PowerInfer getting 11x and Apple getting 25x speedup on GPU) ReLU's dead zone is turning out to be important llama-class models (SwiGLU) might not have much longevity afterall once all the Metal work
Tweet media one
10
20
245
0
10
69
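The deterministic-vs-probabilistic gating view is easy to check numerically with a temperature on Swish: x*sigmoid(beta*x) approaches ReLU as beta grows (toy sketch, beta values chosen arbitrarily).

import torch

x = torch.linspace(-4, 4, steps=9)
relu = torch.relu(x)
for beta in (1.0, 5.0, 50.0):
    swish = x * torch.sigmoid(beta * x)        # soft, probabilistic gating of the region assignment
    print(beta, (swish - relu).abs().max())    # gap shrinks as beta grows: hard assignment in the limit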
@randall_balestr
Randall Balestriero
2 years
Our #CVPR2023 submission has been accepted! () We develop an exact+fast algo to compute a Deep Network partition characterizing its geometry and decision boundary e.g. to rapidly sample from the latter for viz/active learning! Code:
Tweet media one
2
14
67
@randall_balestr
Randall Balestriero
2 years
Not too surprising since e.g. batch-norm with random weights provably aligns the DN's partition to the data geometry: just from its mini-batch statistics!
Tweet media one
@DimitrisPapail
Dimitris Papailiopoulos
2 years
"The Expressive Power of Tuning Only the Norm Layers" lead by @AngelikiGiannou & @shashank_r12 We show that large frozen networks maintain expressivity even if we only fine-tune the norm & bias layers.
Tweet media one
5
40
263
2
14
62
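For the quoted setting (fine-tuning only the norm and bias parameters), a rough PyTorch/torchvision sketch of the setup, as an illustration only and not the paper's exact protocol:

import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights=None)             # pretrained weights would be loaded in practice
for p in model.parameters():
    p.requires_grad_(False)                # freeze everything
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):      # unfreeze only the normalization layers
        for p in m.parameters():
            p.requires_grad_(True)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable, "trainable parameters out of", sum(p.numel() for p in model.parameters()))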
@randall_balestr
Randall Balestriero
2 years
Awesome website summarizing our latest TMLR paper demonstrating how deep networks pruning can be easily explained/visualized and improved simply by formulating it in terms of the DN's spline partition! Paper: Code:
Tweet media one
@ranery1998
Haoran You
2 years
@randall_balestr @TmlrSub @eiclab We just built a website for this project 🤗:
0
1
5
0
11
63
@randall_balestr
Randall Balestriero
2 months
Vision Language Models have fueled recent AI breakthroughs... but the next generation will need to do more than just scale up dataset and model sizes! Dive into our latest preprint and benchmark library to understand why and to stress-test your ideas!
Tweet media one
6
10
62
@randall_balestr
Randall Balestriero
2 years
Happy to have four papers accepted to #NeurIPS2022 ! Shoutout to incredible co-authors/colleagues @imisra_ @ylecun @bobak_kiani and Leon Bottou! I will tweet about each in the coming days... but⬇️ TLDR: Never stop improving papers from reviews/comments... perseverance is the key!
0
1
60
@randall_balestr
Randall Balestriero
3 years
Happy to share our #CVPR2022 paper w/ @imtiazprio , @rbaraniuk providing a simple solution to provably sample from the (anti-)modes of pre-trained generative networks... also leading to new StyleGAN2/3/BigGAN FID SOTAs 🧵(1/4) colab:
4
12
58
@randall_balestr
Randall Balestriero
1 year
Delighted to share that our work with @garridoq_ and @ylecun got an oral+poster at #ICML2023 ! We enable truly label-free hyper-parameter search for SSL (validated on SimCLR/VICReg/DINO/.. and many datasets) aiming for best linear perf. without fine-tuning!
Tweet media one
3
6
56
@randall_balestr
Randall Balestriero
2 years
A (Deep) Network has always been "any computational graph with forward and (optionally) backward data-flow" (see⬇️). This is a big class that includes kernels, trees, k-NN... just by a change of arch. So when people say moving away from DNs, do they mean moving away from computers?
6
5
55
@randall_balestr
Randall Balestriero
2 years
Happy to share our accepted @TmlrSub with @eiclab about deep network (DN) pruning from an affine spline perspective! In short, pruning removes/projects the DN partition boundaries (nice avenues to theoretically understand/improve pruning) Some insights ⬇️
Tweet media one
2
11
54
@randall_balestr
Randall Balestriero
2 years
Interested in (A) self-supervised learning <-> spectral embedding (B) data-augmentation's unfair per-class impact (C) DA implicit/explicit regularizer (D) fast orthogonal/unitary weight learning? Come see us #NeurIPS2022 4:30/Hall J A: #228 Tue B: #642 Thu C: #322 Wed D: #542 Wed
3
9
54
@randall_balestr
Randall Balestriero
2 years
Amazing work resulting from an amazing collaboration made possible by @forai_ml and @sarahookr ! TLDR: we still have a lot to learn around what brings stochasticity in deep network training (init/batching/DA) and by how much. This paper takes an important step in quantifying them!
Tweet media one
@CohereForAI
Cohere For AI
2 years
Our newest Paper Profiles video goes behind the scenes of our recent community-driven research collaboration, "FAIR-Ensemble: When Fairness Naturally Emerges from Deep Ensembling." Thanks to @weiyinko_ml and @mrdanieldsouza for taking the time to chat!
Tweet media one
1
10
23
2
15
50
@randall_balestr
Randall Balestriero
3 years
Accepted as an oral #CVPR2022 !🥳 Taking this opportunity to say that this comes as a result of years of work in building a theoretical bridge between deep networks<->continuous piecewise affine operators. Theory is a guide that reduces the set of unknowns to be cross-validated 🧑‍🔧!
@randall_balestr
Randall Balestriero
3 years
Happy to share our #CVPR2022 paper w/ @imtiazprio , @rbaraniuk providing a simple solution to provably sample from the (anti-)modes of pre-trained generative networks... also leading to new StyleGAN2/3/BigGAN FID SOTAs 🧵(1/4) colab:
4
12
58
0
10
46
@randall_balestr
Randall Balestriero
1 year
SSL and supervised learning unified under one loss (only the inter-sample similarity graph varies between them) at #ICCV23 Friday/10:30/Nord/023 Hello to cheap expert-free active/supervised learning by asking if samples come from the same class, not asking for the class label!
Tweet media one
1
6
44
@randall_balestr
Randall Balestriero
2 years
New Constructive Approximation paper: Deep Networks with (leaky-)relu and least-square loss have continuous piecewise quadratic per-layer loss landscape. From that we precisely study how the DN architecture impacts that loss landscape and SGD convergence!
Tweet media one
1
3
45
@randall_balestr
Randall Balestriero
2 years
Two ICLR23: - spotlight: new fine-grain labels for Imagenet+insights into failure modes of models/DA/losses - poster: theory unraveling the failures that emerge when deploying self-supervised learning on uncurated data One goal: understanding when/why deep learning can fail ⬇️
Tweet media one
1
6
45
@randall_balestr
Randall Balestriero
2 years
3 ICASSP papers!
- the infamous POLICE that provably tames the beast (DN) to obey input space constraints
- the unifying minimal variations tying SSL/spectral embedding/generative models
- DNs' partition enumeration: not out/stay tuned
Tweet media one
1
6
41
@randall_balestr
Randall Balestriero
3 years
Delighted to share our latest preprint with Bobak Kiani, @ylecun , and Seth Lloyd where we propose an **efficient and scalable** gradient based training of orthogonal/unitary matrices (e.g. used in each layer of a recurrent network/convolutional network).
Tweet media one
Tweet media two
5
11
43
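Their algorithm is detailed in the paper; for quick experiments, a related constraint can also be imposed with PyTorch's built-in orthogonal parametrization, shown below purely as a point of comparison and not as their method.

import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

layer = orthogonal(nn.Linear(128, 128, bias=False))   # weight is re-parametrized to stay orthogonal
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)

x = torch.randn(32, 128)
loss = (layer(x) ** 2).mean()
loss.backward()
opt.step()

W = layer.weight
print(torch.allclose(W @ W.T, torch.eye(128), atol=1e-5))  # True: orthogonality preserved after the update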
@randall_balestr
Randall Balestriero
2 years
+1! Academic programs should favor candidates who did not get that chance, to mentor them and to bring them to the top for post-PhD adventures. Taking PhD candidates to bloat your group's publication count from year one goes against the academic spirit... (teaching statement anyone?)
@sarahookr
Sara Hooker
2 years
If you have multiple papers before you even began a PhD, it likely means you had access that others didn't. I wish more PhD programs would take a step back and stop this absurd practice of favoring multiple papers before someone even begins a training program.
78
304
3K
1
0
38
@randall_balestr
Randall Balestriero
1 year
RankMe: cheap/fast label-free hparam selection for DNNs will be at ICML - oral: ballroom B 4pm (local) Wed. - poster⬇️: exhibit hall 1 #609 1:30pm Thur. Also includes insights around representations' ranks, their surprising consistency across datasets, ...
Tweet media one
1
9
37
@randall_balestr
Randall Balestriero
2 years
Delighted to share our @TmlrOrg paper with F. Bordes and P. Vincent! We use the latest diffusion model to interpret/visualize the features of black-box models (DNNs, ...) by conditioning the generation with the model's features. We obtain many insights⬇️⬇️
Tweet media one
1
8
36
@randall_balestr
Randall Balestriero
3 years
"We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. (1/2)
2
3
34
@randall_balestr
Randall Balestriero
1 year
Do you remember the POLICE (fast PrOvable LInear Constraints Enforcement for deep networks ) ? An application to adversarial robustness is now available POLICE can be used as a one-shot robustifier, or during training/fine-tuning!⬇️
Tweet media one
@randall_balestr
Randall Balestriero
2 years
Deep Neural Networks are powerful... but how do you provably enforce some constraints into them? With @ylecun we introduce POLICE a simple method that does just that provably without sampling or changes in your loss/training (and it uses affine splines)!
Tweet media one
7
83
545
1
4
35
@randall_balestr
Randall Balestriero
3 months
Delighted to share that--by dint of all my coauthors--I will be at #ICML2024 to present our findings! From LLM geometry to adversarial grokking, without forgetting the provable benefits of moving away from reconstruction for representation learning! link:
Tweet media one
6
10
33
@randall_balestr
Randall Balestriero
2 years
Self-Supervised Learning methods have strong a priori assumptions about the type of data distribution you train on. With @mido_assran et al. we highlight what those a priori assumptions are and how to tune them to our advantage e.g. to improve SSL on uncurated and/or imbalanced data
Tweet media one
0
6
33
@randall_balestr
Randall Balestriero
6 months
Very happy to share that we will be presenting that work at #ICML2024 ! Moving away from reconstruction is key to learn better semantic abstractions... but we can only do so by first understanding why learning by reconstruction falls short!
@randall_balestr
Randall Balestriero
8 months
Learning by reconstruction ``easily'' provides eye-candy samples... but the learned representation's ability to solve perception tasks is often a letdown. We pinpoint that misalignment, measure it, and show how some denoising tasks (masking) sometimes help
Tweet media one
4
28
152
1
2
27
@randall_balestr
Randall Balestriero
2 years
With very large models and/or slow training frameworks (GPT-3, self-supervised learning, ...) I believe that theoretically-backed methods will regain ground... brute-force cross-validation of everything is no longer an option! MuTransfer of @TheGregYang embodies that perfectly
0
2
26
@randall_balestr
Randall Balestriero
2 years
Another benefit of batch norm lies in the randomness of the mini-batch statistics (from one batch to the next) inducing a jittering effect in the partition and increasing the decision boundary margin to training samples! The batch size controls the jittering strength and can thus be used as a tuning knob
Tweet media one
1
3
26
@randall_balestr
Randall Balestriero
6 months
We will be presenting this work at #ICML2024 diving deeper into LLMs' geometry and how that can help in understanding their current limitations. For example, increasing a prompt's intrinsic dimension bypasses RLHF! Congrats @Rom_Cosentino @shekkizh
@randall_balestr
Randall Balestriero
11 months
Very happy to introduce our preprint working out the geometry of LLMs... no approximation or simplification! Side effects: we extract informative features from LLMs that can solve various tasks such as toxic prompt detection and we bypass Llama2's RLHF!
Tweet media one
1
24
146
0
8
24
@randall_balestr
Randall Balestriero
10 months
🔔LLM update! - The few hundred features we extract from Mistral/Llama2-7B to characterize your prompt (e.g. for domain separation or toxicity detection) also work on Llama2-70B - We validate them on the official Jigsaw Kaggle challenge and reach SOTA
Tweet media one
@randall_balestr
Randall Balestriero
11 months
Very happy to introduce our preprint working out the geometry of LLMs... no approximation or simplification! Side effects: we extract informative features from LLMs that can solve various tasks such as toxic prompt detection and we bypass Llama2's RLHF!
Tweet media one
1
24
146
1
4
24
@randall_balestr
Randall Balestriero
2 years
In our latest preprint we show that Deep Ensembles have fairness benefits even when each model uses the same training set/architecture/optimizer. We also characterize by how much random init./data-augmentation/data-ordering impact the learned model between training episodes :)
@weiyinko_ml
Wei-Yin Ko
2 years
Our new preprint is out! In FAIR-Ensemble, we explore per-group performances after predictions averaging of Deep Networks (same architecture, hyper-parameters) and fairness naturally emerges! Paper:   Code: 1/8
Tweet media one
1
7
39
1
5
23
@randall_balestr
Randall Balestriero
2 years
Thanks to an amazing team ( @byoubii @D_Bouchacourt @marksibrahim et al.) we are releasing fine-grained distribution shift annotations for each Imagenet eval image and many train ones along with controlled robustness analysis of many SOTA models e.g. looking at the impact of DA!
Tweet media one
@_akhaliq
AK
2 years
ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations abs: project page: github:
Tweet media one
1
22
108
1
3
23
@randall_balestr
Randall Balestriero
18 days
Why you should think twice before setting the `max_new_tokens` parameter!
@weilunchao
Wei-Lun Chao
20 days
NeurIPS 2024 Best AC Award ...
Tweet media one
32
38
713
0
0
22
@randall_balestr
Randall Balestriero
2 years
Even when looking at high-dimensional spaces and architectures with thousands of units per layer + multiple layers, training time of the constrained model is only about 4~5x slower than the unconstrained one, which is a small cost to pay for a provable constraint enforcement method!
Tweet media one
0
1
20
@randall_balestr
Randall Balestriero
3 years
In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort." Sir Ronald Fisher, (last page)
1
0
20
@randall_balestr
Randall Balestriero
4 months
Very happy to speak at SSL4EO! The core part of my talk will follow our latest papers to (i) provide principled insights into SSL, and (ii) give guidelines to design your own pipeline: 1/4: why do we need to move away from learning by reconstruction () ⬇️
@nicolangnl
Nico Lang
4 months
At University of Copenhagen, we are organizing a summer PhD course on SSL4EO. Registration is now open via: (seats are limited). We are looking forward to hear from: @randall_balestr , @MarcCoru , @kklmmr , @brunosan , @JanDirkWegner1 , @xiaoxiang_zhu
Tweet media one
3
44
166
2
2
20
@randall_balestr
Randall Balestriero
3 years
@shortstein @icmlconf As sad as it is, it seems that such situations have existed across fields and ages, as formulated by Fisher in 1958!
@randall_balestr
Randall Balestriero
3 years
"We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. (1/2)
2
3
34
0
0
20
@randall_balestr
Randall Balestriero
3 months
Aaaand we are now covering that surprising event happening on poster 705 with guest @shekkizh : diving into LLMs geometry and using those insights to derive features from pretrained models or to bypass RLHF through natural prompt manipulations!
Tweet media one
@randall_balestr
Randall Balestriero
3 months
Aaaaand we are back on the ground at poster 602 to cover a breaking news: learning a representation by reconstruction will not produce something useful for perception tasks! They don't have the same taste in features! Come by to learn why and to discuss alternative solutions!
Tweet media one
5
6
83
0
2
19
@randall_balestr
Randall Balestriero
2 years
POLICE can also be applied to classification/SSL tasks, you do not need to change your loss, optimizer, or architecture, and its complexity is only determined by the cost of a single forward-pass in your model × the number of vertices defining the constrained region
Tweet media one
1
1
18
@randall_balestr
Randall Balestriero
2 years
By upgrading the FFCV library, we enable super fast/single GPU training of SSL. We also explore how to cross-validate SSL methods, and show that known failure cases were just the result of poor hyper-parameters! Huge effort by Florian Bordes (MVP!), Pascal Vincent and myself :)
Tweet media one
1
0
17
@randall_balestr
Randall Balestriero
3 months
Our geometric characterization of LLMs ( at #ICML2024 ) tied the prompts' intrinsic dimensions to their ability to make an LLM's generation toxic. @Tenyx_AI researchers extended our results for reasoning! Q: Can reasoning and safe generation coexist with LLMs?
Tweet media one
@shekkizh
Sarath Shekkizhar
3 months
Unlocking better reasoning in LLMs can go beyond just longer context & bigger models! Our recent research () offers a geometric view of the expressive power and reasoning capabilities of LLMs. Stay tuned for more insights! @Rom_Cosentino #LLM #Reasoning
0
2
3
6
6
18
@randall_balestr
Randall Balestriero
2 years
"Symmetric positive definiteness is arguably one of the highest mathematical accolades to which a matrix can aspire." Prof. Nicholas J. Higham
0
2
17
@randall_balestr
Randall Balestriero
3 years
Impressive program at the upcoming World AI Cannes Festival in France @WAICANNES ! The AI Society / AI Today & Tomorrow track alone has an impressive list of speakers including @ylecun , anyone can attend (for free) with the Discovery Pass !
0
5
18
@randall_balestr
Randall Balestriero
8 months
As a bonus, our findings also explain why long training time is required to ``finally'' capture the features useful for perception tasks as part of the representation. Our findings open new avenues to speed up training through new denoising tasks!
Tweet media one
1
2
17
@randall_balestr
Randall Balestriero
4 months
New preprint + AI4Science #ICML2024 workshop: ScaLES! ScaLES provides a differentiable confidence score for samples generated from pretrained models. Applied to Latent Space Optimization, ScaLES improves the solutions to black-box optimization problems!
Tweet media one
1
5
16
@randall_balestr
Randall Balestriero
10 months
You have to train an ensemble of Deep Networks with the same training set and architecture. Q: How to maximize the ensemble fairness?
1. vary the weight initialization
2. vary the data sampling
3. vary the data-augmentation seed
4. all the above
Answer at the AFT workshop on Friday!
@CohereForAI
Cohere For AI
2 years
@weiyinko_ml @mrdanieldsouza Find out more about what model design choices can mitigate unfair outcomes by reading "FAIR-Ensemble: When Fairness Naturally Emerges from Deep Ensembling." 📜
Tweet media one
1
0
4
1
5
16
@randall_balestr
Randall Balestriero
2 years
The DN partition boundary (random weights) shows a higher concentration of regions around the data distribution (from using batch norm alone!). This fitting is proved analytically: batch norm statistics shift/bend the partition boundaries to the data, and depth is crucial!⬇️
Tweet media one
1
2
15
@randall_balestr
Randall Balestriero
3 years
We always hear about applied deep learning results relying on many tricks, but never about theoretical results around deep learning that often rely on even more assumptions! Both are highly specialized, need fine-tuning to work on new cases and both struggle to impress each other
0
1
15
@randall_balestr
Randall Balestriero
2 years
Provable control of the quality and diversity of sampling for pre-trained deep generative networks ... without any additional learning! Check us out at #CVPR2022 tomorrow at 8:30 am in Hall B1, Oral Session 3.1.1, and Poster Session 3.1
@randall_balestr
Randall Balestriero
3 years
Happy to share our #CVPR2022 paper w/ @imtiazprio , @rbaraniuk providing a simple solution to provably sample from the (anti-)modes of pre-trained generative networks... also leading to new StyleGAN2/3/BigGAN FID SOTAs 🧵(1/4) colab:
4
12
58
0
2
15
@randall_balestr
Randall Balestriero
3 years
Real-world deep learning interview problems and solutions: Interesting resource for everyone! I especially enjoy the thorough references that have been put throughout the set of problems!
2
5
15
@randall_balestr
Randall Balestriero
2 years
@ISusmelj can't anyone add a feature saying that if the type of inferred vehicle changed chaotically during 5 sec, let's just represent it as an "unidentifiable blob" until it stabilizes? at least would not look as buggy on the frontend...
1
1
14
@randall_balestr
Randall Balestriero
8 months
Furthermore, Batch-Normalization, which is known to force the DN partition to concentrate near the training samples (), prevents robustness from emerging! In fact, Grokking is all about understanding the DN partition migration dynamics...
Tweet media one
1
0
15
@randall_balestr
Randall Balestriero
2 years
We indeed need a more closed-loop SSL e.g. where data-augmentations of positive pairs are guided by the deep network's guess (action) of what new views (sensory inputs) would provide it with new information leading to a sharper understanding of the presented image/scene!
Tweet media one
@timos_m
Timoleon (Timos) Moraitis
2 years
#NeuroAI : Could principles of embodied sensorimotor neuroscience unify and improve the various Self-Supervised Learning (SSL) methods? How could the brain self-supervise itself? We are happy to share our #NeurIPS2022 paper with @franz_scherr and Q. Guo🧵:
6
31
148
1
1
14
@randall_balestr
Randall Balestriero
3 years
@WriteArthur Well, yes there are the natural statistics of the training data that will influence the generation, but the type of DA/regularization that was used during training also plays a huge role. See for example our latest preprint on that point
@randall_balestr
Randall Balestriero
3 years
Latest preprint with Léon Bottou and @ylecun on the impact of regularization/data-augmentation on per-class performances (for better or worse)! Using them improves average generalization but some classes will have worse performance than without them 🧵1/4
Tweet media one
4
46
191
0
0
14
@randall_balestr
Randall Balestriero
1 year
One can only imagine the vast amount of knowledge that was distilled into this, which is why that end-result would have never been possible without all the incredible co-authors that decided to collaborate for one purpose: sharing their knowledge and past experiences!
1
1
14