![Alexandre Ramé Profile](https://pbs.twimg.com/profile_images/1021673222796967936/k-zAF8Jj_x96.jpg)
Alexandre Ramé
@ramealexandre
Followers
2K
Following
3K
Statuses
560
Research scientist @GoogleDeepMind. PhD @Sorbonne_Univ_. Merging and aligning models.
Joined May 2011
An AI will win a Nobel prize someday✨. Yet currently, alignment reduces creativity. Our new @GoogleDeepMind paper "diversity-rewarded CFG distillation" improves quality AND diversity for music, via distillation of test-time compute, RL with a diversity reward, and model merging. arxiv: website:
3
19
155
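A minimal sketch of how a quality reward and a diversity reward might be combined during RL fine-tuning, to make the idea of a "diversity reward" concrete. The embedding-based diversity measure and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def diversity_reward(embeddings: np.ndarray) -> np.ndarray:
    """Reward each generation by its mean distance to the other generations
    in the batch (an assumed embedding-based diversity measure)."""
    # Pairwise Euclidean distances between generation embeddings.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    n = len(embeddings)
    return dists.sum(axis=1) / (n - 1)

def combined_reward(quality: np.ndarray, embeddings: np.ndarray,
                    lam: float = 0.1) -> np.ndarray:
    """Quality score plus a weighted diversity bonus (lam is a free choice)."""
    return quality + lam * diversity_reward(embeddings)

# Toy usage: 4 generations with scalar quality scores and 8-dim embeddings.
rng = np.random.default_rng(0)
quality = rng.uniform(size=4)
embeddings = rng.normal(size=(4, 8))
print(combined_reward(quality, embeddings))
```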
RT @cwolferesearch: The trajectory of research for open LLMs and open reasoning models has been shockingly similar, but there are still man…
0
6
0
RT @Yoshua_Bengio: While the AI Action Summit was the scene of important discussions, notably about innovations in health and environment,…
0
68
0
RT @fly51fly: [LG] On the Difficulty of Constructing a Robust and Publicly-Detectable Watermark J Fairoze, G Ortiz-Jiménez, M Vecerik, S Jh…
0
5
0
RT @danie1marczak: 🚀 What happens when you modify the spectrum of singular values of the merged task vector? 🤔 Apparently, you achieve 🚨st…
0
30
0
@maxzimmerberlin Yes, all the parameters are averaged: there is nothing specific to the transformer layers.
2
0
0
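To make the "all the parameters are averaged" point concrete, here is a minimal uniform-averaging sketch in PyTorch. The state-dict traversal is generic, with nothing special-cased for transformer layers; the checkpoint paths and model in the usage comment are placeholders.

```python
import torch

def average_checkpoints(state_dicts):
    """Uniformly average every parameter/buffer across checkpoints
    (they must share the same architecture and state-dict keys)."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(dim=0)
    return avg

# Usage sketch (paths and model are placeholders):
# sds = [torch.load(p, map_location="cpu") for p in paths]
# model.load_state_dict(average_checkpoints(sds))
```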
RT @CRSegerie: If you're in Paris tomorrow and don't know what to do after the summit, you can join us for an official side event at Sorbon…
0
2
0
@maxzimmerberlin This is definitely surprising at first, but you get used to it aha. More seriously, the linear mode connectivity relies on a shared pre-training; see more in "What is being transferred in transfer learning?" or in "Model soups".
1
0
0
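A hedged sketch of the linear mode connectivity check discussed in those papers: interpolate between two checkpoints fine-tuned from the same pre-training and evaluate along the path. The `evaluate` function and the checkpoint state dicts in the usage comment are placeholders.

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha):
    """Linear interpolation (1 - alpha) * A + alpha * B over all weights."""
    return {k: (1 - alpha) * sd_a[k].float() + alpha * sd_b[k].float()
            for k in sd_a}

# Sketch: with a shared pre-training, the loss along the linear path tends
# to stay low (linear mode connectivity); without it, a barrier typically
# appears. `sd_a`, `sd_b`, `model`, and `evaluate` are placeholders.
# for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
#     model.load_state_dict(interpolate_state_dicts(sd_a, sd_b, alpha))
#     print(alpha, evaluate(model))
```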
RT @qberthet: 🚨 New paper on regression and classification! Adding to the discussion on using least-squares or cross-entropy, regression o…
0
61
0
RT @fly51fly: [LG] Loss Functions and Operators Generated by f-Divergences V Roulet, T Liu, N Vieillard, M E. Sander... [Google DeepMind] (…
0
6
0
RT @TheTuringPost: Collective Monte Carlo Tree Search (CoMCTS) This method helps MLLMs think step by step therefore enhances their o1-like…
0
23
0
@gabriberton @_philschmid @GoogleDeepMind The 2022 paper is "A good teacher is patient and consistent" :) and consistent here means what you said, that logits should be computed on the augmented image (so online). In contrast, our focus is on which data we should distill, and we show this data should be generated online.
0
0
1
@gabriberton @_philschmid @GoogleDeepMind That's an interesting connection. Yet, afaik, the 2022 paper shows that the target logits (=output distrib) from the teacher should be "consistent", preventing precomputing logits. In our language task, we actually show that the samples (=input distrib) should be generated online
1
0
0
Thanks for the highlight, notably the focus on this fun experiment where increasing the dataset size progressively reduces teacher hacking.
Keep this in mind when you do model distillation! New paper from @GoogleDeepMind confirms teacher hacking in model distillation with offline datasets!
Offline distillation refers to generating synthetic data (logits) from offline, fixed prompts with a teacher model and then training a student model on it. Teacher hacking is when the student optimizes for imitating the teacher's imperfect behavior rather than learning the true underlying task, reducing its generalization on unseen tasks.
Solutions:
1️⃣ Online (recommended): in online or on-policy knowledge distillation, the student learns from the teacher during training by minimizing a forward KL term between teacher and student on samples dynamically generated by the student.
2️⃣ Diverse offline (if online is infeasible): create a diverse offline dataset. Prioritize a wide variety of prompts. If prompt variety is limited, generate multiple responses per prompt from the teacher. Avoid small, static datasets with single responses.
Insights:
🔍 Teacher hacking emerges when using fixed offline datasets.
🌐 Online data generation effectively prevents teacher hacking by maintaining response diversity.
🎯 Higher prompt diversity reduces teacher hacking more than multiple responses per prompt.
⏰ Limiting training epochs helps avoid teacher hacking with offline data.
📈 Using online data generation (teacher or student responses generated during training) prevents teacher hacking.
🔄 Multiple completions per prompt (2x-3x) bridges the gap between offline and online performance.
🔎 Teacher hacking can be detected by monitoring proxy metrics (student-teacher distance).
0
3
27
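A minimal sketch of the on-policy recipe described above: let the student generate completions during training and minimize a forward KL from teacher to student on those same samples. It assumes Hugging Face-style `generate`/`logits` interfaces; the models, optimizer, and hyperparameters are placeholders, not the paper's exact setup, and prompt-token masking is skipped for brevity.

```python
import torch
import torch.nn.functional as F

def online_distillation_step(student, teacher, prompt_ids, optimizer,
                             max_new_tokens=64):
    """One on-policy distillation step: the student generates the data,
    the teacher only provides target logits on those same tokens."""
    # 1) Student generates completions for the current prompts (online data).
    with torch.no_grad():
        sequences = student.generate(prompt_ids,
                                     max_new_tokens=max_new_tokens,
                                     do_sample=True)

    # 2) Teacher and student both score the student-generated sequences.
    with torch.no_grad():
        teacher_logits = teacher(sequences).logits
    student_logits = student(sequences).logits

    # 3) Forward KL(teacher || student): summed over positions and vocab,
    #    averaged over the batch. For simplicity, prompt tokens are not masked.
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)
    loss = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```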
RT @_philschmid: Keep this in mind when you do Model distillation! New paper from @GoogleDeepMind confirms that teacher hacking in model di…
0
39
0
RT @arthurmensch: Self-qualifying oneself as heavyweight while shipping nothing of significance looks like hubris to me
0
103
0
RT @mblondel_ml: What I find interesting is that a paper whose potential impact didn't get noticed can become major 10 years later due to a…
0
8
0
RT @dtiapkin: 1/ If you’re familiar with RLHF, you likely heard of reward hacking —where over-optimizing the imperfect reward model leads t…
0
12
0