Alex Sablayrolles Profile
Alex Sablayrolles

@alexsablay

Followers: 1,587 · Following: 867 · Media: 9 · Statuses: 300

Research Scientist, Mistral AI. Interested in LLMs, deep learning, fast nearest-neighbor search and privacy. ex: @Meta, @NYUniversity, @Polytechnique.

Paris, France
Joined October 2016
@alexsablay
Alex Sablayrolles
5 years
Our latest paper, ☢️ Radioactive data: tracing through training, is now on arXiv. TL;DR: you can modify your data in an imperceptible way so that any model trained on it will have an identifiable mark. (1/7)
[4 images attached]
4
82
224
@alexsablay
Alex Sablayrolles
9 months
Our latest release @MistralAI Mixtral 8x7B mixture of experts:
- performance of GPT-3.5
- inference cost of a 12B model
- context length of 32K
- speaks English, French, Italian, German and Spanish
Blog post
[1 image attached]
2
17
131
@alexsablay
Alex Sablayrolles
5 years
"Large Memory Layers with Product Keys" with @GuillaumeLample , @LudovicDenoyer , Marc'Aurelio Ranzato and @hjegou TL;DR We introduce a large key-value memory layer with millions of values for a negligible computational cost. 1/2
[3 images attached]
1
41
114
@alexsablay
Alex Sablayrolles
11 months
Proud to share Mistral 7B, the first step in Mistral's journey!
0
21
99
@alexsablay
Alex Sablayrolles
10 months
@ClementDelangue @huggingface 4 out of 7 are Mistral-based, it's amazing to see what the community is building on our 7B 🔥
2
0
82
@alexsablay
Alex Sablayrolles
4 years
Last Thursday, I defended my PhD (online), I am now officially a Doctor!
10
0
66
@alexsablay
Alex Sablayrolles
4 years
Our paper "Radioactive data" got accepted to #icml2020. See you all online!
3
12
53
@alexsablay
Alex Sablayrolles
9 months
I'll be at #NeurIPS2023, happy to chat about LLMs and what we are cooking @MistralAI.
0
0
52
@alexsablay
Alex Sablayrolles
4 years
We are hosting a research internship in differential privacy / privacy assessment of machine learning models with @hastagiri at Facebook AI Paris this summer. If you are currently pursuing a PhD and are interested in these topics, let’s get in touch!
0
14
18
@alexsablay
Alex Sablayrolles
3 years
We (@PierreStock and I) are hiring a Master's student for an internship in privacy in summer 2022, with a potential CIFRE PhD following the internship. Details in the attached proposal.
[1 image attached]
0
5
14
@alexsablay
Alex Sablayrolles
2 years
Interesting paper by @thegautamkamath @florian_tramer and N. Carlini. In particular, fine-tuning on ImageNet when you pre-train on 4B data is sidestepping privacy in my opinion. Let's choose training from scratch as a reproducible benchmark for privacy 🔥
@thegautamkamath
Gautam Kamath
2 years
🧵New paper w Nicholas Carlini & @florian_tramer : "Considerations for Differentially Private Learning with Large-Scale Public Pretraining." We critique the increasingly popular use of large-scale public pretraining in private ML. Comments welcome. 1/n
[1 image attached]
4
20
148
1
2
14
@alexsablay
Alex Sablayrolles
5 years
A 12-layer memory-augmented transformer outperforms a 24-layer transformer while being twice as fast. Our key insight is to use product keys, which enable fast and exact nearest neighbor search and reduce the complexity from N to sqrt(N) for a memory with N values. 2/2
1
0
11
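To make the sqrt(N) trick concrete, here is a minimal sketch of product-key top-k search (illustrative names; the paper's version adds multi-head queries and learned sub-keys):

```python
import torch

def product_key_topk(query, subkeys1, subkeys2, k):
    """Exact top-k over n*n product keys via two top-k searches of size n.

    The score of product key (i, j) is <q1, subkeys1[i]> + <q2, subkeys2[j]>,
    so every global top-k key must combine per-half top-k candidates.
    """
    q1, q2 = query.chunk(2)                       # split query into two halves
    scores1, idx1 = (subkeys1 @ q1).topk(k)       # best half-keys, first half
    scores2, idx2 = (subkeys2 @ q2).topk(k)       # best half-keys, second half
    grid = scores1[:, None] + scores2[None, :]    # (k, k) candidate scores
    top = grid.flatten().topk(k)
    rows, cols = top.indices // k, top.indices % k
    n = subkeys1.size(0)
    return top.values, idx1[rows] * n + idx2[cols]  # scores, flat key indices

# 1M virtual keys from two sets of 1K sub-keys:
scores, indices = product_key_topk(
    torch.randn(64), torch.randn(1000, 32), torch.randn(1000, 32), k=8)
```

The two sub-key searches each cost O(sqrt(N)), which is where the complexity reduction comes from.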
@alexsablay
Alex Sablayrolles
9 months
Important point on the Mixtral model: the sliding window must be set to 32K, otherwise context beyond 4K is not taken into account properly.
@WolframRvnwlf
Wolfram Ravenwolf 🐺🐦‍⬛
9 months
Wondered why Mixtral 8x7B Instruct w/ 32K context wasn't summarizing 16K text. Prompt started with instruction to summarize following text, but model ignored it. Sliding Window Attention must have "unattended" my instructions? Set Window from 4K to 32K, et voilà, got the summary!
3
5
70
1
1
11
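For illustration, here is how that setting might look with the Hugging Face transformers API (the model id and the `sliding_window` config field are assumptions; check the equivalent knob in your inference stack):

```python
# Sketch: widen the sliding window so attention covers the full 32K
# context instead of stopping at 4K (field name and model id assumed).
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
config.sliding_window = 32768  # a 4096 window silently drops long-range context
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1", config=config)
```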
@alexsablay
Alex Sablayrolles
3 years
We are hosting a research internship in differential privacy / membership inference with @PierreStock at Facebook AI Paris next summer (2022). If you are currently pursuing a PhD and are interested in these topics, let’s get in touch!
0
2
9
@alexsablay
Alex Sablayrolles
5 years
We are at #NeurIPS2019 for the whole week, come check out our poster on Product Key Memory Layers on Thursday at 5pm! Spotlight is at 4:20pm.
@GuillaumeLample
Guillaume Lample @ ICLR 2024
5 years
We are at #NeurIPS2019 this week to present our two papers on Product Key Memory Layers, and Cross-lingual Language Model Pretraining. Please stop by our posters, Thursday at 5pm! Spotlight presentations are at 4:20 and 4:40pm with @alexsablay and @alex_conneau
[2 images attached]
0
17
114
1
3
9
@alexsablay
Alex Sablayrolles
5 years
For example, if someone merges your data with a 100x larger training set, you can predict with 99.99% confidence that the model was trained using your data. (2/7)
2
0
8
@alexsablay
Alex Sablayrolles
1 year
@gabrielpeyre Isn't this missing additive / multiplicative constants? If I remember correctly the estimator stems from approximating the density by balls around each point that extend to its nearest neighbor (and log(Volume) = C_1 + d * log(distance))
1
0
7
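For reference, the estimator under discussion is presumably the Kozachenko-Leonenko nearest-neighbor entropy estimator, whose standard form makes those constants explicit:

$$
\hat{H} = \psi(N) - \psi(1) + \log V_d + \frac{d}{N}\sum_{i=1}^{N}\log r_i,
\qquad V_d = \frac{\pi^{d/2}}{\Gamma(d/2+1)},
$$

where $r_i$ is the distance from point $i$ to its nearest neighbor; the $\psi$ terms and $\log V_d$ are exactly the additive constants in question, and $d \cdot \log r_i$ matches the $\log(\text{Volume}) = C_1 + d\log(\text{distance})$ approximation above.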
@alexsablay
Alex Sablayrolles
4 years
@arthurmensch It'd be nice if we could set up bullet points as reviewers such that authors can answer below each bullet point (with LaTeX formatting à la MathExchange) and a budget for the total number of characters in the answers.
1
0
7
@alexsablay
Alex Sablayrolles
3 years
Just accepted reviewer invitation for TMLR. Can’t believe it is not called Transactions in Learning Deep Research
0
0
6
@alexsablay
Alex Sablayrolles
4 years
@Crimir4 @PTruq Another proof: the expectation is the series of P(T>k) = P(sum_{i=1}^k x_i < 1) = 1/k! (the volume of the simplex in dimension k)
1
0
6
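Writing out the series referred to above:

$$
\mathbb{E}[T] = \sum_{k \ge 0} P(T > k)
             = \sum_{k \ge 0} P\Big(\textstyle\sum_{i=1}^{k} x_i < 1\Big)
             = \sum_{k \ge 0} \frac{1}{k!} = e,
$$

since $\{x \in [0,1]^k : \sum_i x_i < 1\}$ is the standard simplex, whose volume is $1/k!$.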
@alexsablay
Alex Sablayrolles
5 years
Here's how it works: for each class, we sample a random direction in the feature space (the carrier), and modify pixel values so that the features of each image of this class move in the direction of the carrier. (3/7)
1
0
4
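A much-simplified sketch of this marking step (all names illustrative; the paper additionally penalizes visible distortion and handles data augmentation):

```python
import torch

def add_radioactive_mark(images, labels, feature_extractor, carriers,
                         steps=10, lr=0.1, eps=8 / 255):
    """Nudge pixels so each image's features drift toward its class carrier.

    carriers: (num_classes, d) random unit vectors in feature space.
    The perturbation is clipped to an L-infinity ball of radius eps.
    """
    marked = images.clone().requires_grad_(True)
    opt = torch.optim.SGD([marked], lr=lr)
    for _ in range(steps):
        feats = feature_extractor(marked)                      # (B, d)
        loss = -(feats * carriers[labels]).sum(dim=1).mean()   # align w/ carrier
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                                  # stay imperceptible
            marked.data = torch.min(torch.max(marked.data, images - eps),
                                    images + eps).clamp(0, 1)
    return marked.detach()
```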
@alexsablay
Alex Sablayrolles
5 years
Even if you retrain a model *from scratch* on these images, we can match the feature spaces, and observe that the classifier will be aligned in the direction of the carrier. This also works across architectures, and with different datasets. (5/7)
1
0
4
@alexsablay
Alex Sablayrolles
5 years
Since the carriers are chosen randomly, we can compute the probability that the classifier aligns with the carrier "by chance" (i.e. the p-value), and show that it is very low (10^{-4} with only 1% of radioactive data). (6/7)
1
0
4
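That probability has a closed form for random directions on the sphere; a small sketch of the computation (a standard result, not code from the paper):

```python
from scipy.special import betainc

def cosine_pvalue(c, d):
    """P(cos(u, w) >= c) for u uniform on the unit sphere in R^d, with c >= 0.

    This is the chance that a random carrier aligns at least this well
    "by chance"; it decays quickly as the dimension d grows.
    """
    return 0.5 * betainc((d - 1) / 2, 0.5, 1.0 - c * c)
```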
@alexsablay
Alex Sablayrolles
10 months
@RakshitNaidu The original Déjà vu!
0
0
4
@alexsablay
Alex Sablayrolles
4 years
@Crimir4 The concept of entropy, for my part.
2
0
2
@alexsablay
Alex Sablayrolles
5 years
@JiliJeanlouis Very interesting topic, cf. recent work
0
0
3
@alexsablay
Alex Sablayrolles
5 years
If you train a classifier on top of these radioactive images, the classifier will align to the carrier direction as it is correlated with the class label. This works even if a very small part of the training data is radioactive (1%). (4/7)
1
0
3
@alexsablay
Alex Sablayrolles
4 years
@BayesReality That's the plot of the latest season of Baron noir
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@TLesort @JiliJeanlouis I'd tend to say that if you have a generative model p(x) you can just measure -log p(x) on your new data. You can also measure -log p(y|x) if you only have p(y|x), which should give you a "weak indication"
1
0
3
@alexsablay
Alex Sablayrolles
5 years
@DanielOberski Differential privacy will degrade into group privacy, so you would need a very small epsilon to protect against radioactivity (on the order of 1/n_radioactive)
0
1
2
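To unpack the order-of-magnitude claim: ε-DP for individual samples degrades to kε-DP for groups of size k, so hiding the joint effect of all marked samples requires

$$
n_{\text{radioactive}} \cdot \varepsilon \lesssim 1
\quad\Longleftrightarrow\quad
\varepsilon \lesssim \frac{1}{n_{\text{radioactive}}}.
$$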
@alexsablay
Alex Sablayrolles
2 years
@thegautamkamath Where do you draw the line for tools? I regularly use a thesaurus to find synonyms and expect to use chatGPT for stuff like that in the future (i.e. minor rewritings), but I feel like reporting the use of chatGPT makes it sound like it's writing entire sections of the paper...
1
0
2
@alexsablay
Alex Sablayrolles
2 years
@bozavlado @giffmana Great point, now I can't unsee a world of weirdly shaped furniture that keeps breaking... but I believe similar effects come into play (more value to recommendations from people you know or from people who have a reputation for making good DIY tutorials)
1
0
2
@alexsablay
Alex Sablayrolles
5 years
@CarTrawler We rented a car from Recordgo through your services but they were unable to deliver it (they wanted us to pay €120 extra in insurance because the deposit "did not work"). We would like a refund; how should we proceed?
1
0
0
@alexsablay
Alex Sablayrolles
11 months
@dhuynh95 @huggingface This measure captures both the capacity of the model and true memorization, right? Like if I prompt "The solution to x^2-2=0 is", a model that answers "±sqrt(2)" may actually be solving the problem rather than memorizing the training set.
0
0
2
@alexsablay
Alex Sablayrolles
2 years
@francoisfleuret That works for a verbatim text (I think there are roughly 2 nats per token to hide the watermark, assuming you don't do top-k/top-p), but is this robust to people slightly modifying the text afterwards (like adding/removing or changing a few words)?
1
0
2
@alexsablay
Alex Sablayrolles
4 years
@shortstein If you have a 50% prior that a sample is in the training set, the posterior is bounded by 0.5 + ε/4. My rule: make DP non-vacuous by having ε<2
0
0
2
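Spelling out where 0.5 + ε/4 comes from: with a uniform prior, Bayes' rule under ε-DP gives

$$
P(\text{member} \mid \text{output}) \le \frac{e^{\varepsilon}}{1 + e^{\varepsilon}}
= \frac{1}{2} + \frac{\varepsilon}{4} - O(\varepsilon^{3}),
$$

so at ε = 2 the posterior can already reach about 0.88, which is why larger ε is hard to interpret.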
@alexsablay
Alex Sablayrolles
2 years
ExpandedWeights have been added to Opacus. Simply stated, it creates a virtual "expanded" weight whose first dimension is the batch size, so that each element of the batch has its own corresponding weight. (2/5)
1
0
1
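A minimal illustration of the per-sample gradients involved, using Opacus' hook-based GradSampleModule (ExpandedWeights computes the same quantity through a different mechanism; attribute names here are assumed from the Opacus docs):

```python
# Sketch: per-sample gradients in Opacus; one gradient per batch element.
import torch
import torch.nn as nn
import torch.nn.functional as F
from opacus import GradSampleModule

model = GradSampleModule(nn.Linear(10, 2))
x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
F.cross_entropy(model(x), y).backward()

# grad_sample has shape (batch, *weight.shape): here (4, 2, 10)
print(model._module.weight.grad_sample.shape)
```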
@alexsablay
Alex Sablayrolles
2 years
@thegautamkamath I'm also interested in this. So far my impression is that 1) gradients are seen as much more obfuscated than they actually are and 2) FL updates have this "low bandwidth" feeling, like those "anonymized" computer crash reports
0
0
1
@alexsablay
Alex Sablayrolles
4 years
@MonniauxD @ahcohen Meeting other researchers? Especially for younger researchers who are new to the field. But the carbon footprint could definitely be reduced.
0
0
0
@alexsablay
Alex Sablayrolles
5 years
@SNCF We had to buy a new ticket for the Bern - Basel leg from the CFF (and were able to take the TGV Basel - Paris)
1
0
1
@alexsablay
Alex Sablayrolles
8 years
@ADssx The sample variance?
0
0
1
@alexsablay
Alex Sablayrolles
2 years
@giffmana The ancient meaning of "vit" in French is not really better 😅
1
0
1
@alexsablay
Alex Sablayrolles
2 years
@mrdrozdov Isn't `` robust to context? Regarding compression I'd argue that most of the size is probably the model weights which shouldn't be compressed?
1
0
1
@alexsablay
Alex Sablayrolles
7 years
@drivyFR My profile was verified, then I received an email saying it no longer was... your support service is not responding, what should I do?
0
0
1
@alexsablay
Alex Sablayrolles
4 years
@Aaroth @DorotheaBaur @mkearnsupenn What about getting back to something similar to "randomized response", where you randomly add fake cases & contacts?
1
0
1
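For readers unfamiliar with the reference, a toy version of classical randomized response (the general mechanism, not a concrete contact-tracing proposal):

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Answer truthfully with probability p, otherwise flip a fair coin.

    Any single answer is deniable, yet the population rate t is still
    estimable from E[yes] = p * t + (1 - p) / 2.
    """
    if random.random() < p:
        return truth
    return random.random() < 0.5
```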
@alexsablay
Alex Sablayrolles
2 years
@mrdrozdov Sounds like you just can’t kill the beast
1
0
1
@alexsablay
Alex Sablayrolles
1 year
@BenTheEgg KoLeo is from this lab as well! Although it comes from retrieval / NN search, not SSL
1
0
1
@alexsablay
Alex Sablayrolles
4 years
@dohmatobelvis Rearrangement inequality?
1
0
1
@alexsablay
Alex Sablayrolles
5 years
@quasimondo Not necessarily if the directions are in a high enough dimension.
0
0
1
@alexsablay
Alex Sablayrolles
2 years
@ChSimonSU @RauxJF I don't know whether chatGPT has been trained on many French legal texts; in my opinion there is good room for improvement in fine-tuning it on French statutes, rulings, etc.
2
0
1
@alexsablay
Alex Sablayrolles
4 years
@ADssx Besides, if it's an epidemic that spreads through clusters, it's ineffective (even if we didn't know that before)
1
0
1
@alexsablay
Alex Sablayrolles
5 years
@roydanroy @frankmcsherry Not an expert but they seem to be missing references to
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@deliprao You can also craft the compressed image such that its decompressed version will be radioactive (using a differentiable operator that approximates JPEG); this is similar to what we do in the paper for data augmentation
0
0
1
@alexsablay
Alex Sablayrolles
4 years
@xbresson I'm actually surprised by Table 8 in : PE are not *that* important
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@freakonometrics For #2 you end up with an exponential distribution, right? (At least when n is large)
0
0
1
@alexsablay
Alex Sablayrolles
2 years
@giffmana I tend to think/hope that if you can fabricate news articles, their value will just go to zero (nice writing is no longer "proof of work") and the model will shift back to trusted news sources (reputation/proof of stake)
1
0
1
@alexsablay
Alex Sablayrolles
4 years
@ABelgo_optimum @adelaigue There is also a bias toward things we need much less during lockdown (e.g. car manufacturers), on top of the short-term/long-term bias (we need new cars all the less given the existing fleet)
0
0
1
@alexsablay
Alex Sablayrolles
2 years
@francoisfleuret This one almost fooled me. Bring back Giscardpunk!
0
0
1
@alexsablay
Alex Sablayrolles
2 years
Functorch is also available in Opacus. Functorch is the equivalent of JAX in the PyTorch ecosystem. One way to use functorch is through the "no-op" GradSampleModule: Opacus relies on users to provide the grad_samples, but still takes care of the rest. (3/5)
1
0
1
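A sketch of the per-sample gradients functorch makes easy, which the "no-op" mode would then consume (names follow the functorch API of that era):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from functorch import make_functional, vmap, grad

model = nn.Linear(10, 2)
fmodel, params = make_functional(model)

def loss_fn(params, x, y):
    # single-sample loss; vmap maps it over the batch dimension
    logits = fmodel(params, x.unsqueeze(0))
    return F.cross_entropy(logits, y.unsqueeze(0))

x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
per_sample_grads = vmap(grad(loss_fn), in_dims=(None, 0, 0))(params, x, y)
print(per_sample_grads[0].shape)  # torch.Size([4, 2, 10]): weight grad per sample
```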
@alexsablay
Alex Sablayrolles
4 years
@florian_tramer Anecdotally, I am not sure that the subset of Tiny Images is 100% "private", as it seems Carmon et al. used a model trained on CIFAR-10 to mine it.
1
0
1
@alexsablay
Alex Sablayrolles
6 years
@aympontier @OlivierSlomK the P of a Greek beach?
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@ilyaraz2 @icmlconf I’d like to chat if you have some spare time!
0
0
0
@alexsablay
Alex Sablayrolles
2 years
@_arohan_ I tried to trigger chatGPT to not answer my request but it doesn't have a problem with bypassing ethics review...
[1 image attached]
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@yoavgo @GuillaumeLample The alternative view is to see the memory as an approximation of a very large FFN (the first linear layer is approximated by a "product matrix", and the second linear layer corresponds to the set of values). 1/2
1
0
1
@alexsablay
Alex Sablayrolles
4 years
@Theo_Lacombe_ @gabrielpeyre If you admit continuity of the roots, can't you say it's the preimage of an open set under a continuous function?
1
0
1
@alexsablay
Alex Sablayrolles
4 years
@tonyduan_ We haven't yet. The code is available online to play with if you are interested!
0
0
1
@alexsablay
Alex Sablayrolles
2 years
@thegautamkamath @florian_tramer Yes I meant from scratch on a public dataset! 100% agree that we need reproducible research.
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@dohmatobelvis Independently of the variance s^2?
1
0
1