Alex Sablayrolles Profile
Alex Sablayrolles

@alexsablay

Followers: 1,587 · Following: 867 · Media: 9 · Statuses: 300

Research Scientist, Mistral AI. Interested in LLMs, deep learning, fast nearest-neighbor search and privacy. ex: @Meta, @NYUniversity, @Polytechnique.

Paris, France
Joined October 2016
@alexsablay
Alex Sablayrolles
5 years
Our latest paper, ☢️ Radioactive data: tracing through training, is now on arXiv. TL;DR: you can modify your data in an imperceptible way so that any model trained on it will have an identifiable mark. (1/7)
[4 images attached]
4
82
224
@alexsablay
Alex Sablayrolles
9 months
Our latest release @MistralAI Mixtral 8x7B mixture of experts:
- performance of GPT-3.5
- inference cost of a 12B model
- context length of 32K
- speaks English, French, Italian, German and Spanish
Blog post
[1 image attached]
2
17
131
@alexsablay
Alex Sablayrolles
5 years
"Large Memory Layers with Product Keys" with @GuillaumeLample , @LudovicDenoyer , Marc'Aurelio Ranzato and @hjegou TL;DR We introduce a large key-value memory layer with millions of values for a negligible computational cost. 1/2
[3 images attached]
1
41
114
@alexsablay
Alex Sablayrolles
11 months
Proud to share Mistral 7B, the first step in Mistral's journey!
0
21
99
@alexsablay
Alex Sablayrolles
10 months
@ClementDelangue @huggingface 4 out of 7 are Mistral-based, it's amazing to see what the community is building on our 7B 🔥
2
0
82
@alexsablay
Alex Sablayrolles
4 years
Last Thursday, I defended my PhD (online), I am now officially a Doctor!
10
0
66
@alexsablay
Alex Sablayrolles
4 years
Our paper "Radioactive data" got accepted to #icml2020. See you all online!
3
12
53
@alexsablay
Alex Sablayrolles
9 months
I'll be at #NeurIPS2023, happy to chat about LLMs and what we are cooking @MistralAI.
0
0
52
@alexsablay
Alex Sablayrolles
4 years
We are hosting a research internship in differential privacy / privacy assessment of machine learning models with @hastagiri at Facebook AI Paris this summer. If you are currently pursuing a PhD and are interested in these topics, let’s get in touch!
0
14
18
@alexsablay
Alex Sablayrolles
3 years
We (@PierreStock and I) are hiring a Master's student for an internship in privacy in summer 2022, with a potential CIFRE PhD following the internship. Details in the attached proposal.
[1 image attached]
0
5
14
@alexsablay
Alex Sablayrolles
2 years
Interesting paper by @thegautamkamath @florian_tramer and N. Carlini. In particular, fine-tuning on ImageNet when you pre-train on 4B data is sidestepping privacy in my opinion. Let's choose training from scratch as a reproducible benchmark for privacy 🔥
@thegautamkamath
Gautam Kamath
2 years
🧵New paper w Nicholas Carlini & @florian_tramer : "Considerations for Differentially Private Learning with Large-Scale Public Pretraining." We critique the increasingly popular use of large-scale public pretraining in private ML. Comments welcome. 1/n
[1 image attached]
4
20
148
1
2
14
@alexsablay
Alex Sablayrolles
5 years
A 12-layer memory-augmented transformer outperforms a 24-layer transformer while being twice as fast. Our key insight is to use product keys, which enable fast and exact nearest neighbor search and reduce the complexity from N to sqrt(N) for a memory with N values. 2/2
1
0
11
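To make the sqrt(N) trick concrete, here is a minimal sketch of product-key top-k search (illustrative names; the paper's version adds multi-head queries and learned sub-keys):

```python
import torch

def product_key_topk(query, subkeys1, subkeys2, k):
    """Exact top-k over n*n product keys via two top-k searches of size n.

    The score of product key (i, j) is <q1, subkeys1[i]> + <q2, subkeys2[j]>,
    so every global top-k key must combine per-half top-k candidates.
    """
    q1, q2 = query.chunk(2)                       # split query into two halves
    scores1, idx1 = (subkeys1 @ q1).topk(k)       # best half-keys, first half
    scores2, idx2 = (subkeys2 @ q2).topk(k)       # best half-keys, second half
    grid = scores1[:, None] + scores2[None, :]    # (k, k) candidate scores
    top = grid.flatten().topk(k)
    rows, cols = top.indices // k, top.indices % k
    n = subkeys1.size(0)
    return top.values, idx1[rows] * n + idx2[cols]  # scores, flat key indices

# 1M virtual keys from two sets of 1K sub-keys:
scores, indices = product_key_topk(
    torch.randn(64), torch.randn(1000, 32), torch.randn(1000, 32), k=8)
```

The two sub-key searches each cost O(sqrt(N)), which is where the complexity reduction comes from.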
@alexsablay
Alex Sablayrolles
9 months
Important point on the Mixtral model: the sliding window must be set to 32K, otherwise context beyond 4K is not taken into account properly.
@WolframRvnwlf
Wolfram Ravenwolf 🐺🐦‍⬛
9 months
Wondered why Mixtral 8x7B Instruct w/ 32K context wasn't summarizing 16K text. Prompt started with instruction to summarize following text, but model ignored it. Sliding Window Attention must have "unattended" my instructions? Set Window from 4K to 32K, et voilà, got the summary!
3
5
70
1
1
11
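For illustration, here is how that setting might look with the Hugging Face transformers API (the model id and the `sliding_window` config field are assumptions; check the equivalent knob in your inference stack):

```python
# Sketch: widen the sliding window so attention covers the full 32K
# context instead of stopping at 4K (field name and model id assumed).
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
config.sliding_window = 32768  # a 4096 window silently drops long-range context
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1", config=config)
```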
@alexsablay
Alex Sablayrolles
3 years
We are hosting a research internship in differential privacy / membership inference with @PierreStock at Facebook AI Paris next summer (2022). If you are currently pursuing a PhD and are interested in these topics, let’s get in touch!
0
2
9
@alexsablay
Alex Sablayrolles
5 years
We are at #NeurIPS2019 for the whole week, come check out our poster on Product Key Memory Layers on Thursday at 5pm! Spotlight is at 4:20pm.
@GuillaumeLample
Guillaume Lample @ ICLR 2024
5 years
We are at #NeurIPS2019 this week to present our two papers on Product Key Memory Layers, and Cross-lingual Language Model Pretraining. Please stop by our posters, Thursday at 5pm! Spotlight presentations are at 4:20 and 4:40pm with @alexsablay and @alex_conneau
[2 images attached]
0
17
114
1
3
9
@alexsablay
Alex Sablayrolles
5 years
For example, if someone merges your data with a 100x larger training set, you can predict with 99.99% confidence that the model was trained using your data. (2/7)
2
0
8
@alexsablay
Alex Sablayrolles
1 year
@gabrielpeyre Isn't this missing additive / multiplicative constants? If I remember correctly the estimator stems from approximating the density by balls around each point that extend to its nearest neighbor (and log(Volume) = C_1 + d * log(distance))
1
0
7
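For reference, the estimator under discussion is presumably the Kozachenko-Leonenko nearest-neighbor entropy estimator, whose standard form makes those constants explicit:

$$
\hat{H} = \psi(N) - \psi(1) + \log V_d + \frac{d}{N}\sum_{i=1}^{N}\log r_i,
\qquad V_d = \frac{\pi^{d/2}}{\Gamma(d/2+1)},
$$

where $r_i$ is the distance from point $i$ to its nearest neighbor; the $\psi$ terms and $\log V_d$ are exactly the additive constants in question, and $d \cdot \log r_i$ matches the $\log(\text{Volume}) = C_1 + d\log(\text{distance})$ approximation above.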
@alexsablay
Alex Sablayrolles
4 years
@arthurmensch It'd be nice if we could set up bullet points as reviewers such that authors can answer below each bullet point (with LaTeX formatting à la MathExchange) and a budget for the total number of characters in the answers.
1
0
7
@alexsablay
Alex Sablayrolles
3 years
Just accepted reviewer invitation for TMLR. Can’t believe it is not called Transactions in Learning Deep Research
0
0
6
@alexsablay
Alex Sablayrolles
4 years
@Crimir4 @PTruq Another proof: the expectation is the series of P(T>k) = P(sum_{i=1}^k x_i < 1) = 1/k! (the volume of the simplex in dimension k)
1
0
6
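Writing out the series referred to above:

$$
\mathbb{E}[T] = \sum_{k \ge 0} P(T > k)
             = \sum_{k \ge 0} P\Big(\textstyle\sum_{i=1}^{k} x_i < 1\Big)
             = \sum_{k \ge 0} \frac{1}{k!} = e,
$$

since $\{x \in [0,1]^k : \sum_i x_i < 1\}$ is the standard simplex, whose volume is $1/k!$.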
@alexsablay
Alex Sablayrolles
5 years
Here's how it works: for each class, we sample a random direction in the feature space (the carrier), and modify pixel values so that the features of each image of this class move in the direction of the carrier. (3/7)
1
0
4
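A much-simplified sketch of this marking step (all names illustrative; the paper additionally penalizes visible distortion and handles data augmentation):

```python
import torch

def add_radioactive_mark(images, labels, feature_extractor, carriers,
                         steps=10, lr=0.1, eps=8 / 255):
    """Nudge pixels so each image's features drift toward its class carrier.

    carriers: (num_classes, d) random unit vectors in feature space.
    The perturbation is clipped to an L-infinity ball of radius eps.
    """
    marked = images.clone().requires_grad_(True)
    opt = torch.optim.SGD([marked], lr=lr)
    for _ in range(steps):
        feats = feature_extractor(marked)                      # (B, d)
        loss = -(feats * carriers[labels]).sum(dim=1).mean()   # align w/ carrier
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                                  # stay imperceptible
            marked.data = torch.min(torch.max(marked.data, images - eps),
                                    images + eps).clamp(0, 1)
    return marked.detach()
```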
@alexsablay
Alex Sablayrolles
5 years
Even if you retrain a model *from scratch* on these images, we can match the feature spaces, and observe that the classifier will be aligned in the direction of the carrier. This also works across architectures, and with different datasets. (5/7)
1
0
4
@alexsablay
Alex Sablayrolles
5 years
Since the carriers are chosen randomly, we can compute the probability that the classifier aligns with the carrier "by chance" (i.e. the p-value), and show that it is very low (10^{-4} with only 1% of radioactive data). (6/7)
1
0
4
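That probability has a closed form for random directions on the sphere; a small sketch of the computation (a standard result, not code from the paper):

```python
from scipy.special import betainc

def cosine_pvalue(c, d):
    """P(cos(u, w) >= c) for u uniform on the unit sphere in R^d, with c >= 0.

    This is the chance that a random carrier aligns at least this well
    "by chance"; it decays quickly as the dimension d grows.
    """
    return 0.5 * betainc((d - 1) / 2, 0.5, 1.0 - c * c)
```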
@alexsablay
Alex Sablayrolles
10 months
@RakshitNaidu The original Déjà vu!
0
0
4
@alexsablay
Alex Sablayrolles
4 years
@Crimir4 The concept of entropy, for my part.
2
0
2
@alexsablay
Alex Sablayrolles
5 years
@JiliJeanlouis Very interesting topic, cf. recent work
0
0
3
@alexsablay
Alex Sablayrolles
5 years
If you train a classifier on top of these radioactive images, the classifier will align to the carrier direction as it is correlated with the class label. This works even if a very small part of the training data is radioactive (1%). (4/7)
1
0
3
@alexsablay
Alex Sablayrolles
4 years
@BayesReality That's the plot of the latest season of Baron noir
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@TLesort @JiliJeanlouis I'd tend to say that if you have a generative model p(x) you can just measure -log p(x) on your new data. You can also measure -log p(y|x) if you only have p(y|x), which should give you a "weak indication"
1
0
3
@alexsablay
Alex Sablayrolles
5 years
@DanielOberski Differential privacy will degrade into group privacy, so you would need a very small epsilon to protect against radioactivity (on the order of 1/n_radioactive)
0
1
2
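To unpack the order-of-magnitude claim: ε-DP for individual samples degrades to kε-DP for groups of size k, so hiding the joint effect of all marked samples requires

$$
n_{\text{radioactive}} \cdot \varepsilon \lesssim 1
\quad\Longleftrightarrow\quad
\varepsilon \lesssim \frac{1}{n_{\text{radioactive}}}.
$$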
@alexsablay
Alex Sablayrolles
2 years
@thegautamkamath Where do you draw the line for tools? I regularly use a thesaurus to find synonyms and expect to use chatGPT for stuff like that in the future (i.e. minor rewritings), but I feel like reporting the use of chatGPT makes it sound like it's writing entire sections of the paper...
1
0
2
@alexsablay
Alex Sablayrolles
2 years
@bozavlado @giffmana Great point, now I can't unsee a world of weirdly shaped furniture that keeps breaking... but I believe similar effects come into play (more value to recommendations from people you know or from people who have a reputation for making good DIY tutorials)
1
0
2
@alexsablay
Alex Sablayrolles
5 years
@CarTrawler We rented a car from Recordgo through your services but they were unable to deliver it (they wanted us to pay €120 extra in insurance because the deposit "did not work"). We would like a refund; how should we proceed?
1
0
0
@alexsablay
Alex Sablayrolles
11 months
@dhuynh95 @huggingface This measure captures both the capacity of the model and true memorization, right? Like if I prompt "The solution to x^2-2=0 is", a model that answers "±sqrt(2)" may actually be solving the problem rather than memorizing the training set.
0
0
2
@alexsablay
Alex Sablayrolles
2 years
@francoisfleuret That works for a verbatim text (I think there are roughly 2 nats per token to hide the watermark, assuming you don't do top-k/top-p), but is this robust to people slightly modifying the text afterwards (like adding/removing or changing a few words)?
1
0
2
@alexsablay
Alex Sablayrolles
4 years
@shortstein If you have a 50% prior that a sample is in the training set, the posterior is bounded by 0.5 + ε/4. My rule: make DP non-vacuous by having ε<2
0
0
2
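Spelling out where 0.5 + ε/4 comes from: with a uniform prior, Bayes' rule under ε-DP gives

$$
P(\text{member} \mid \text{output}) \le \frac{e^{\varepsilon}}{1 + e^{\varepsilon}}
= \frac{1}{2} + \frac{\varepsilon}{4} - O(\varepsilon^{3}),
$$

so at ε = 2 the posterior can already reach about 0.88, which is why larger ε is hard to interpret.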
@alexsablay
Alex Sablayrolles
2 years
ExpandedWeights have been added to Opacus. Simply stated, it creates a virtual "expanded" weight whose first dimension is the batch size, so that each element of the batch has its own corresponding weight. (2/5)
1
0
1
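A minimal illustration of the per-sample gradients involved, using Opacus' hook-based GradSampleModule (ExpandedWeights computes the same quantity through a different mechanism; attribute names here are assumed from the Opacus docs):

```python
# Sketch: per-sample gradients in Opacus; one gradient per batch element.
import torch
import torch.nn as nn
import torch.nn.functional as F
from opacus import GradSampleModule

model = GradSampleModule(nn.Linear(10, 2))
x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
F.cross_entropy(model(x), y).backward()

# grad_sample has shape (batch, *weight.shape): here (4, 2, 10)
print(model._module.weight.grad_sample.shape)
```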
@alexsablay
Alex Sablayrolles
2 years
@thegautamkamath I'm also interested in this. So far my impression is that 1) gradients are seen as much more obfuscated than they actually are and 2) FL updates have this "low bandwidth" feeling, like those "anonymized" computer crash reports
0
0
1
@alexsablay
Alex Sablayrolles
4 years
@MonniauxD @ahcohen Meeting other researchers? Especially for younger researchers who are new to the field. But the carbon footprint could definitely be reduced.
0
0
0
@alexsablay
Alex Sablayrolles
5 years
@SNCF We had to buy a new ticket for the Bern - Basel leg from the CFF (and were able to take the TGV Basel - Paris)
1
0
1
@alexsablay
Alex Sablayrolles
8 years
@ADssx The sample variance?
0
0
1
@alexsablay
Alex Sablayrolles
2 years
@giffmana The ancient meaning of "vit" in French is not really better 😅
1
0
1
@alexsablay
Alex Sablayrolles
2 years
@mrdrozdov Isn't `` robust to context? Regarding compression I'd argue that most of the size is probably the model weights which shouldn't be compressed?
1
0
1
@alexsablay
Alex Sablayrolles
7 years
@drivyFR My profile was verified, then I received an email saying it no longer was... your support service is not responding, what should I do?
0
0
1
@alexsablay
Alex Sablayrolles
4 years
@Aaroth @DorotheaBaur @mkearnsupenn What about getting back to something similar to "randomized response", where you randomly add fake cases & contacts?
1
0
1
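For readers unfamiliar with the reference, a toy version of classical randomized response (the general mechanism, not a concrete contact-tracing proposal):

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Answer truthfully with probability p, otherwise flip a fair coin.

    Any single answer is deniable, yet the population rate t is still
    estimable from E[yes] = p * t + (1 - p) / 2.
    """
    if random.random() < p:
        return truth
    return random.random() < 0.5
```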
@alexsablay
Alex Sablayrolles
2 years
@mrdrozdov Sounds like you just can’t kill the beast
1
0
1
@alexsablay
Alex Sablayrolles
1 year
@BenTheEgg KoLeo is from this lab as well! Although it comes from retrieval / NN search, not SSL
1
0
1
@alexsablay
Alex Sablayrolles
4 years
@dohmatobelvis Rearrangement inequality?
1
0
1
@alexsablay
Alex Sablayrolles
5 years
@quasimondo Not necessarily if the directions are in a high enough dimension.
0
0
1
@alexsablay
Alex Sablayrolles
2 years
@ChSimonSU @RauxJF I don't know whether chatGPT has been trained on many French legal texts; in my opinion there is good room for improvement in fine-tuning it on French statutes, rulings, etc.
2
0
1
@alexsablay
Alex Sablayrolles
4 years
@ADssx Besides, if it's an epidemic that spreads through clusters, it's ineffective (even if we didn't know that before)
1
0
1
@alexsablay
Alex Sablayrolles
5 years
@roydanroy @frankmcsherry Not an expert but they seem to be missing references to
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@deliprao You can also craft the compressed image such that its decompressed version will be radioactive (using a differentiable operator that approximates JPEG); this is similar to what we do in the paper for data augmentation
0
0
1
@alexsablay
Alex Sablayrolles
4 years
@xbresson I'm actually surprised by Table 8 in : PE are not *that* important
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@freakonometrics For #2 you end up with an exponential distribution, right? (At least when n is large)
0
0
1
@alexsablay
Alex Sablayrolles
2 years
@giffmana I tend to think/hope that if you can fabricate news articles, their value will just go to zero (nice writing is no longer "proof of work") and the model will shift back to trusted news sources (reputation/proof of stake)
1
0
1
@alexsablay
Alex Sablayrolles
4 years
@ABelgo_optimum @adelaigue There is also a bias toward things we need much less during lockdown (e.g. car manufacturers), on top of the short-term/long-term bias (we need new cars all the less given the existing fleet)
0
0
1
@alexsablay
Alex Sablayrolles
2 years
@francoisfleuret This one almost fooled me. Bring back Giscardpunk!
0
0
1
@alexsablay
Alex Sablayrolles
2 years
Functorch is also available in Opacus. Functorch is the equivalent of JAX in the PyTorch ecosystem. One way to use functorch is through the "no-op" GradSampleModule: Opacus relies on users to provide the grad_samples, but still takes care of the rest. (3/5)
1
0
1
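A sketch of the per-sample gradients functorch makes easy, which the "no-op" mode would then consume (names follow the functorch API of that era):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from functorch import make_functional, vmap, grad

model = nn.Linear(10, 2)
fmodel, params = make_functional(model)

def loss_fn(params, x, y):
    # single-sample loss; vmap maps it over the batch dimension
    logits = fmodel(params, x.unsqueeze(0))
    return F.cross_entropy(logits, y.unsqueeze(0))

x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
per_sample_grads = vmap(grad(loss_fn), in_dims=(None, 0, 0))(params, x, y)
print(per_sample_grads[0].shape)  # torch.Size([4, 2, 10]): weight grad per sample
```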
@alexsablay
Alex Sablayrolles
4 years
@florian_tramer Anecdotally, I am not sure that the subset of Tiny Images is 100% "private", as it seems Carmon et al. used a model trained on CIFAR-10 to mine it.
1
0
1
@alexsablay
Alex Sablayrolles
6 years
@aympontier @OlivierSlomK the P of a Greek beach?
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@ilyaraz2 @icmlconf I’d like to chat if you have some spare time!
0
0
0
@alexsablay
Alex Sablayrolles
2 years
@_arohan_ I tried to trigger chatGPT to not answer my request but it doesn't have a problem with bypassing ethics review...
[1 image attached]
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@yoavgo @GuillaumeLample The alternative view is to see the memory as an approximation of a very large FFN (the first linear layer is approximated by a "product matrix", and the second linear layer corresponds to the set of values). 1/2
1
0
1
@alexsablay
Alex Sablayrolles
4 years
@Theo_Lacombe_ @gabrielpeyre If you admit continuity of the roots, can't you say it's the preimage of an open set under a continuous function?
1
0
1
@alexsablay
Alex Sablayrolles
4 years
@tonyduan_ We haven't yet. The code is available online to play with if you are interested!
0
0
1
@alexsablay
Alex Sablayrolles
2 years
@thegautamkamath @florian_tramer Yes I meant from scratch on a public dataset! 100% agree that we need reproducible research.
0
0
1
@alexsablay
Alex Sablayrolles
5 years
@dohmatobelvis Independently of the variance s^2?
1
0
1