![Alexandre Ramé Profile](https://pbs.twimg.com/profile_images/1021673222796967936/k-zAF8Jj_x96.jpg)
Alexandre Ramé
@ramealexandre
Followers
2K
Following
3K
Statuses
560
Research scientist @GoogleDeepMind. PhD @Sorbonne_Univ_. Merging and aligning models.
Joined May 2011
An AI will win a Nobel prize someday✨. Yet currently, alignment reduces creativity. Our new @GoogleDeepMind paper "diversity-rewarded CFG distillation" improves quality AND diversity for music, via distillation of test-time compute, RL with a diversity reward, and model merging. arxiv: website:
3
19
155
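A minimal sketch of how a quality reward and a diversity reward might be combined during RL fine-tuning, to make the idea of a "diversity reward" concrete. The embedding-based diversity measure and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def diversity_reward(embeddings: np.ndarray) -> np.ndarray:
    """Reward each generation by its mean distance to the other generations
    in the batch (an assumed embedding-based diversity measure)."""
    # Pairwise Euclidean distances between generation embeddings.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    n = len(embeddings)
    return dists.sum(axis=1) / (n - 1)

def combined_reward(quality: np.ndarray, embeddings: np.ndarray,
                    lam: float = 0.1) -> np.ndarray:
    """Quality score plus a weighted diversity bonus (lam is a free choice)."""
    return quality + lam * diversity_reward(embeddings)

# Toy usage: 4 generations with scalar quality scores and 8-dim embeddings.
rng = np.random.default_rng(0)
quality = rng.uniform(size=4)
embeddings = rng.normal(size=(4, 8))
print(combined_reward(quality, embeddings))
```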
RT @cwolferesearch: The trajectory of research for open LLMs and open reasoning models has been shockingly similar, but there are still man…
0
6
0
RT @Yoshua_Bengio: While the AI Action Summit was the scene of important discussions, notably about innovations in health and environment,…
0
68
0
RT @fly51fly: [LG] On the Difficulty of Constructing a Robust and Publicly-Detectable Watermark J Fairoze, G Ortiz-Jiménez, M Vecerik, S Jh…
0
5
0
RT @danie1marczak: 🚀 What happens when you modify the spectrum of singular values of the merged task vector? 🤔 Apparently, you achieve 🚨st…
0
30
0
@maxzimmerberlin Yes, all the parameters are averaged: there is nothing specific to the transformer layers.
2
0
0
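To make the "all the parameters are averaged" point concrete, here is a minimal uniform-averaging sketch in PyTorch. The state-dict traversal is generic, with nothing special-cased for transformer layers; the checkpoint paths and model in the usage comment are placeholders.

```python
import torch

def average_checkpoints(state_dicts):
    """Uniformly average every parameter/buffer across checkpoints
    (they must share the same architecture and state-dict keys)."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(dim=0)
    return avg

# Usage sketch (paths and model are placeholders):
# sds = [torch.load(p, map_location="cpu") for p in paths]
# model.load_state_dict(average_checkpoints(sds))
```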
RT @CRSegerie: If you're in Paris tomorrow and don't know what to do after the summit, you can join us for an official side event at Sorbon…
0
2
0
@maxzimmerberlin This is definitely surprising at first, but you get used to it aha. More seriously, the linear mode connectivity relies on a shared pre-training; see more in "What is being transferred in transfer learning?" or in "Model soups".
1
0
0
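A hedged sketch of the linear mode connectivity check discussed in those papers: interpolate between two checkpoints fine-tuned from the same pre-training and evaluate along the path. The `evaluate` function and the checkpoint state dicts in the usage comment are placeholders.

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha):
    """Linear interpolation (1 - alpha) * A + alpha * B over all weights."""
    return {k: (1 - alpha) * sd_a[k].float() + alpha * sd_b[k].float()
            for k in sd_a}

# Sketch: with a shared pre-training, the loss along the linear path tends
# to stay low (linear mode connectivity); without it, a barrier typically
# appears. `sd_a`, `sd_b`, `model`, and `evaluate` are placeholders.
# for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
#     model.load_state_dict(interpolate_state_dicts(sd_a, sd_b, alpha))
#     print(alpha, evaluate(model))
```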
RT @qberthet: 🚨 New paper on regression and classification! Adding to the discussion on using least-squares or cross-entropy, regression o…
0
61
0
RT @fly51fly: [LG] Loss Functions and Operators Generated by f-Divergences V Roulet, T Liu, N Vieillard, M E. Sander... [Google DeepMind] (…
0
6
0
RT @TheTuringPost: Collective Monte Carlo Tree Search (CoMCTS) This method helps MLLMs think step by step therefore enhances their o1-like…
0
23
0
@gabriberton @_philschmid @GoogleDeepMind The 2022 paper is "A good teacher is patient and consistent" :) and consistent here means what you said, that logits should be computed on the augmented image (so online). In contrast, our focus is on which data we should distill, and we show this data should be generated online.
0
0
1
@gabriberton @_philschmid @GoogleDeepMind That's an interesting connection. Yet, afaik, the 2022 paper shows that the target logits (=output distrib) from the teacher should be "consistent", preventing precomputing logits. In our language task, we actually show that the samples (=input distrib) should be generated online
1
0
0
Thanks for the highlight, notably the focus on this fun experiment where increasing the dataset size progressively reduces teacher hacking.
Keep this in mind when you do model distillation! New paper from @GoogleDeepMind confirms teacher hacking in model distillation with offline datasets!
Offline distillation refers to generating synthetic data (logits) from offline, fixed prompts with a teacher model and then training a student model on it. Teacher hacking is when the student optimizes for imitating the teacher's imperfect behavior rather than learning the true underlying task, reducing its generalization on unseen tasks.
Solutions:
1️⃣ Online (recommended): in online or on-policy knowledge distillation, the student learns from the teacher during training by minimizing a forward KL term between teacher and student on samples dynamically generated by the student.
2️⃣ Diverse offline (if online is infeasible): create a diverse offline dataset. Prioritize a wide variety of prompts. If prompt variety is limited, generate multiple responses per prompt from the teacher. Avoid small, static datasets with single responses.
Insights:
🔍 Teacher hacking emerges when using fixed offline datasets.
🌐 Online data generation effectively prevents teacher hacking by maintaining response diversity.
🎯 Higher prompt diversity reduces teacher hacking more than multiple responses per prompt.
⏰ Limiting training epochs helps avoid teacher hacking with offline data.
📈 Using online data generation (teacher or student responses generated during training) prevents teacher hacking.
🔄 Multiple completions per prompt (2x-3x) bridges the gap between offline and online performance.
🔎 Teacher hacking can be detected by monitoring proxy metrics (student-teacher distance).
0
3
27
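A minimal sketch of the on-policy recipe described above: let the student generate completions during training and minimize a forward KL from teacher to student on those same samples. It assumes Hugging Face-style `generate`/`logits` interfaces; the models, optimizer, and hyperparameters are placeholders, not the paper's exact setup, and prompt-token masking is skipped for brevity.

```python
import torch
import torch.nn.functional as F

def online_distillation_step(student, teacher, prompt_ids, optimizer,
                             max_new_tokens=64):
    """One on-policy distillation step: the student generates the data,
    the teacher only provides target logits on those same tokens."""
    # 1) Student generates completions for the current prompts (online data).
    with torch.no_grad():
        sequences = student.generate(prompt_ids,
                                     max_new_tokens=max_new_tokens,
                                     do_sample=True)

    # 2) Teacher and student both score the student-generated sequences.
    with torch.no_grad():
        teacher_logits = teacher(sequences).logits
    student_logits = student(sequences).logits

    # 3) Forward KL(teacher || student): summed over positions and vocab,
    #    averaged over the batch. For simplicity, prompt tokens are not masked.
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)
    loss = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```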
RT @_philschmid: Keep this in mind when you do Model distillation! New paper from @GoogleDeepMind confirms that teacher hacking in model di…
0
39
0
RT @arthurmensch: Self-qualifying oneself as heavyweight while shipping nothing of significance looks like hubris to me
0
103
0
RT @mblondel_ml: What I find interesting is that a paper whose potential impact didn't get noticed can become major 10 years later due to a…
0
8
0
RT @dtiapkin: 1/ If you’re familiar with RLHF, you likely heard of reward hacking —where over-optimizing the imperfect reward model leads t…
0
12
0