Alexandre Ramé

@ramealexandre

Followers
2K
Following
3K
Statuses
560

Research scientist @GoogleDeepMind. PhD @Sorbonne_Univ_. Merging and aligning models.

Joined May 2011
@ramealexandre
Alexandre Ramé
4 months
An AI will win a Nobel prize someday✨. Yet currently, alignment reduces creativity. Our new @GoogleDeepMind paper "diversity-rewarded CFG distillation" improves quality AND diversity for music, via distillation of test-time compute, RL with a diversity reward, and model merging. arxiv: website:
3
19
155
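A minimal sketch of one way such a diversity reward over a batch of generations could be computed, assuming a hypothetical embedding step that maps each generation to a unit-norm vector (e.g. from some pretrained encoder); this is an illustration, not the paper's exact formulation.

import numpy as np

def diversity_reward(embeddings: np.ndarray) -> float:
    """Reward a batch of generations for being mutually dissimilar.

    embeddings: (n, d) array of unit-norm embeddings, one per generation
    (the encoder producing them is assumed, not defined here).
    Returns the average pairwise cosine distance; higher = more diverse.
    """
    sims = embeddings @ embeddings.T            # (n, n) cosine similarities
    n = len(embeddings)
    off_diag = sims.sum() - np.trace(sims)      # drop self-similarities
    mean_sim = off_diag / (n * (n - 1))
    return 1.0 - mean_sim                       # distance = 1 - similarity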
@ramealexandre
Alexandre Ramé
1 day
RT @cwolferesearch: The trajectory of research for open LLMs and open reasoning models has been shockingly similar, but there are still man…
0
6
0
@ramealexandre
Alexandre Ramé
2 days
RT @neilzegh: Great night hitting the club with @honualx and Hibiki!
0
5
0
@ramealexandre
Alexandre Ramé
2 days
RT @Yoshua_Bengio: While the AI Action Summit was the scene of important discussions, notably about innovations in health and environment,…
0
68
0
@ramealexandre
Alexandre Ramé
3 days
RT @fly51fly: [LG] On the Difficulty of Constructing a Robust and Publicly-Detectable Watermark J Fairoze, G Ortiz-Jiménez, M Vecerik, S Jh…
0
5
0
@ramealexandre
Alexandre Ramé
3 days
RT @danie1marczak: 🚀 What happens when you modify the spectrum of singular values of the merged task vector? 🤔 Apparently, you achieve 🚨st…
0
30
0
@ramealexandre
Alexandre Ramé
3 days
@maxzimmerberlin Yes, all the parameters are averaged: there is nothing specific to the transformer layers.
2
0
0
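A minimal sketch of the uniform parameter averaging described in the reply above, over PyTorch state dicts: every parameter is averaged, with nothing specific to the transformer layers. The model objects and loading code are placeholders.

import torch

def average_state_dicts(state_dicts):
    """Uniformly average every parameter across models.

    Nothing is specific to transformer layers: embeddings, attention,
    MLPs and layer norms are all treated the same way.
    """
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# merged = average_state_dicts([model_a.state_dict(), model_b.state_dict()])
# model.load_state_dict(merged)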
@ramealexandre
Alexandre Ramé
3 days
RT @CRSegerie: If you're in Paris tomorrow and don't know what to do after the summit, you can join us for an official side event at Sorbon…
0
2
0
@ramealexandre
Alexandre Ramé
3 days
@maxzimmerberlin This is definitely surprising at first, but you get used to it aha. More seriously, the linear mode connectivity relies on a shared pre-training; see more in "What is being transferred in transfer learning" or in "Model soups".
1
0
0
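A minimal sketch of the linear mode connectivity mentioned in the reply above: two checkpoints fine-tuned from the same pre-trained weights are linearly interpolated and the loss along the path is checked. The state dicts and the evaluate step are placeholders.

import torch

def interpolate(sd_a, sd_b, alpha):
    """Linearly interpolate two fine-tuned checkpoints that share a pre-training."""
    return {k: (1 - alpha) * sd_a[k].float() + alpha * sd_b[k].float() for k in sd_a}

# If both models were fine-tuned from the same pre-trained weights, the loss
# along this path is expected to stay low (no barrier).
# for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
#     model.load_state_dict(interpolate(sd_a, sd_b, alpha))
#     evaluate(model)   # evaluate() is a placeholder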
@ramealexandre
Alexandre Ramé
3 days
RT @qberthet: 🚨 New paper on regression and classification! Adding to the discussion on using least-squares or cross-entropy, regression o…
0
61
0
@ramealexandre
Alexandre Ramé
4 days
RT @sophiamyang: Breaking - our first AI cluster and it's in France🇫🇷!!!
0
175
0
@ramealexandre
Alexandre Ramé
4 days
RT @fly51fly: [LG] Loss Functions and Operators Generated by f-Divergences V Roulet, T Liu, N Vieillard, M E. Sander... [Google DeepMind] (…
0
6
0
@ramealexandre
Alexandre Ramé
4 days
RT @TheTuringPost: Collective Monte Carlo Tree Search (CoMCTS) This method helps MLLMs think step by step therefore enhances their o1-like…
0
23
0
@ramealexandre
Alexandre Ramé
4 days
@gabriberton @_philschmid @GoogleDeepMind The 2022 paper is "A good teacher is patient and consistent" :) and consistent here means what you said, that logits should be computed on the augmented image (so online). In contrast, our focus is on which data we should distill, and we show this data should be generated online.
0
0
1
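A minimal sketch of the "consistent teacher" idea from that 2022 paper as described in the reply above: the teacher's logits are recomputed online on the same augmented view the student sees, rather than precomputed on clean images. The student, teacher, augment and optimizer objects are placeholders.

import torch
import torch.nn.functional as F

def consistent_distillation_step(student, teacher, image, augment, optimizer, T=2.0):
    """One distillation step where teacher and student share the augmented input."""
    x = augment(image)                                  # shared augmented view
    with torch.no_grad():
        teacher_logits = teacher(x)                     # computed online, not cached
    student_logits = student(x)
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()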
@ramealexandre
Alexandre Ramé
4 days
@gabriberton @_philschmid @GoogleDeepMind That's an interesting connection. Yet, afaik, the 2022 paper shows that the target logits (=output distrib) from the teacher should be "consistent", preventing precomputing logits. In our language task, we actually show that the samples (=input distrib) should be generated online
1
0
0
@ramealexandre
Alexandre Ramé
5 days
Thanks for the highlight, notably the focus on this fun experiment where increasing the dataset size progressively reduces teacher hacking.
@_philschmid
Philipp Schmid
5 days
Keep this in mind when you do model distillation! New paper from @GoogleDeepMind confirms teacher hacking in model distillation with offline datasets!

Offline distillation refers to generating synthetic data (logits) from offline, fixed prompts with a teacher model and then training a student model on it. Teacher hacking is when the student optimizes for imitating the teacher's imperfect behavior rather than learning the true underlying task, reducing its generalization on unseen tasks.

Solutions:
1️⃣ Online (recommended): in online or on-policy knowledge distillation, the student learns from the teacher during training by minimizing a forward KL term between teacher and student on samples dynamically generated by the student.
2️⃣ Diverse offline (if online is infeasible): create a diverse offline dataset. Prioritize a wide variety of prompts. If prompt variety is limited, generate multiple responses per prompt from the teacher. Avoid small, static datasets with single responses.

Insights:
🔍 Teacher hacking emerges when using fixed offline datasets.
🌐 Online data generation effectively prevents teacher hacking by maintaining response diversity.
🎯 Higher prompt diversity reduces teacher hacking more than multiple responses per prompt.
⏰ Limiting training epochs helps avoid teacher hacking with offline data.
📈 Using online data generation (teacher or student responses generated during training) prevents teacher hacking.
🔄 Multiple completions per prompt (2x-3x) bridge the gap between offline and online performance.
🔎 Teacher hacking can be detected by monitoring proxy metrics (student-teacher distance).
0
3
27
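A minimal sketch of the online (on-policy) recipe summarized above, assuming Hugging Face-style generate and logits interfaces: the student generates completions during training and is then trained to match the teacher with a forward KL term on those samples. This is a simplified illustration, not the paper's exact training loop.

import torch
import torch.nn.functional as F

def online_distillation_step(student, teacher, prompt_ids, optimizer, max_new_tokens=64):
    """One on-policy distillation step: distill on samples the student just generated."""
    # 1) The student generates its own completions for the prompts (online data).
    with torch.no_grad():
        sequences = student.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=True)

    # 2) Teacher and student both score the generated sequences.
    with torch.no_grad():
        teacher_logits = teacher(sequences).logits
    student_logits = student(sequences).logits

    # 3) Forward KL(teacher || student), averaged over completion positions only
    #    (the next-token shift is omitted here for brevity).
    completion_mask = torch.zeros_like(sequences, dtype=torch.bool)
    completion_mask[:, prompt_ids.shape[1]:] = True
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    kl = (p_teacher * (p_teacher.clamp_min(1e-9).log() - log_p_student)).sum(-1)
    loss = kl[completion_mask].mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()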
@ramealexandre
Alexandre Ramé
5 days
RT @_philschmid: Keep this in mind when you do Model distillation! New paper from @GoogleDeepMind confirms that teacher hacking in model di…
0
39
0
@ramealexandre
Alexandre Ramé
5 days
RT @arthurmensch: Self-qualifying oneself as heavyweight while shipping nothing of significance looks like hubris to me
0
103
0
@ramealexandre
Alexandre Ramé
5 days
RT @mblondel_ml: What I find interesting is that a paper whose potential impact didn't get noticed can become major 10 years later due to a…
0
8
0
@ramealexandre
Alexandre Ramé
5 days
RT @dtiapkin: 1/ If you’re familiar with RLHF, you likely heard of reward hacking —where over-optimizing the imperfect reward model leads t…
0
12
0
@ramealexandre
Alexandre Ramé
6 days
Discover more in the paper:
0
0
2