Bittersweet goodbye to the Farm 🌲
Successfully defended my PhD thesis 🤺 Grateful to my advisors
@tsachyw
@sanmikoyejo
and everyone I met along the way for the amazing journey at Stanford.
I have joined
@GoogleAI
as a research scientist. I will continue to work on efficient and trustworthy AI, LLMs, safety, and privacy.
Stay tuned for updates 👀
In 2009, Google created the PhD Fellowship Program to recognize and support outstanding graduate students pursuing exceptional research in computer science and related fields. Today, we congratulate the recipients of the 2023 Google PhD Fellowship!
Very excited to share the paper from my last
@GoogleAI
internship: Scaling Laws for Downstream Task Performance of LLMs.
w/ Natalia Ponomareva,
@hazimeh_h
, Dimitris Paparas, Sergei Vassilvitskii, and
@sanmikoyejo
1/6
Excited to share Lottery Ticket Adaptation (LoTA)! We propose a sparse adaptation method that finetunes only a sparse subset of the weights. LoTA mitigates catastrophic forgetting and enables model merging by breaking the destructive interference between tasks.
🧵👇
“Sparse Random Networks for Communication-Efficient Federated Learning” has been accepted at
#ICLR2023
! Code coming soon.
Looking forward to seeing many of you
@iclr_conf
in Rwanda.
Happy to share the second paper from my
@GoogleAI
internship: Sandwiched Video Compression with Neural Wrappers.
The sandwich framework is more efficient than most other neural video compression methods (details below 👇). 1/3
Excited to share our
@NeurIPSConf
'23 paper "Exact Optimality of Communication-Privacy-Utility Tradeoffs in Distributed Mean Estimation":
Looking forward to presenting it in person and seeing many of you in New Orleans! 🙂🎷🎶
Details 👇
The first paper from my Google internship has been accepted to Frontiers in Signal Processing. This is the first work to compress volumetric functions represented by local coordinate-based neural networks.
Paper link:
Code coming soon.
LoRA is great. It’s fast, it’s (mostly) accurate. But is the efficiency a free lunch? Do side effects surface in the fine-tuned model?
We didn’t quite know, so we experimented with ViT/Swin/Llama/Mistral and focused on subgroup fairness.
🧵: takeaways below
📄:
@srush_nlp
In more recent work, we show that scaling laws for downstream behavior depend strongly on (1) the evaluation metric, (2) the 'alignment' between the pretraining and finetuning data, and (3) the size of the finetuning data.
paper:
a quick highlight 👇
Excited to share our new work with
@FrancescoPase
,
@DenizGunduz1
,
@sanmikoyejo
, Tsachy Weissman, and Michele Zorzi.
We reduce the communication cost in FL by exploiting side information that is correlated with the local updates and available to the server. 1/3
I will be at AISTATS and ICLR in the following weeks. Let me know if you'd like to chat about efficient and trustworthy ML.
Also, check out our work:
- [AISTATS, May 3rd 5 pm Valencia] Adaptive Compression in Federated Learning via Side Information:
1/2
Excited to share our
#AISTATS2022
paper titled "An Information-Theoretic Justification for Model Pruning":
Come say hi at the conference during our poster session on Wednesday, March 30th, 8:30-10 am PST.
1/6
I will be
@icmlconf
for the whole week. Text me if you want to meet up! (Papers 👇)
PS: Don't forget to stop by our workshop
@neural_compress
on Saturday.
Looking forward to the Neural Compression Workshop
@icmlconf
this year. Please consider attending and submitting your latest work. Deadline is May 27th.
Submissions due in one week!
We welcome submissions on efficient & responsible foundation models and the principled foundations of large models.
CfP:
See you in Vienna in July
@icmlconf
!
🚨 Submissions due on May 29! 🚨
Do you have exciting work on efficient & responsible foundation models or the principled foundations of large models? Submit your work now!
We welcome submissions of work recently published or currently under review at other ML venues.
@icmlconf
I will be at
#NeurIPS2023
all week. Text me if you'd like to chat about trustworthy & responsible AI at scale! I'll present two works:
Tue afternoon: Exact Optimality of Communication-Privacy-Utility Tradeoffs in Distributed Mean Estimation ()
👇
Please join our social at Maui Brewing Co. Waikiki at 6pm after the workshop. Everyone, especially compression and information theory enthusiasts, is welcome!
@icmlconf
Excited to share the program and list of accepted papers for our
@icmlconf
workshop
@tf2m_workshop
:
Looking forward to discussing efficiency, responsibility, and principled foundations of foundation models in Vienna soon!
We are excited to announce that 58 excellent papers will be presented at the
@icmlconf
TF2M Workshop. List of accepted papers:
You can find the detailed schedule on our website (and below 👇):
A must-read for supervisors and managers👇
Sexual harassment is far more common than discussed because victims often experience fear, not anger, and may freeze rather than confront.
I will give an in-person talk on our work "Efficient Federated Random Subnetwork Training" at the NeurIPS Federated Learning Workshop.
Looking forward to seeing many of you in New Orleans. Drop me a message if you want to meet up!
#neurips2022
Check out our new paper titled “Learning under Storage and Privacy Constraints”. We propose a novel data pre-processing framework, LCoN, which simultaneously boosts data efficiency, privacy, accuracy, and robustness. 1/4
#compression
#privacy
#learning
We will be at the
#NeurIPS2020
WiML and Deep Learning through Information Geometry workshops with our work on neural network compression for noisy storage systems:
We are thrilled to announce that the
#DMLRWorkshop
on "Datasets for Foundation Models" will take place at the
@icmlconf
in July!
This marks the 5th edition of our
#DMLR
workshop series! Join the DMLR community at
We are excited to announce that Workshop on Information-Theoretic Methods for Rigorous, Responsible, and Reliable Machine Learning will take place
@icmlconf
. We have an excellent lineup of speakers, including a recent Shannon Award winner!
More details:
"Neural Network Compression for Noisy Storage Devices" will appear at the ACM Transactions on Embedded Computing Systems (TECS):
We propose ways to provide robustness to neural networks against noise present in storage or communication environments.
1/3
Scaling Laws for Downstream Task Performance of Large Language Models
Studies how the choice of the pretraining data and its size affect downstream cross-entropy and BLEU score
Scaling Laws for Downstream Task Performance of Large Language Models
paper page:
Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for
Registration and poster abstract submissions for the Stanford Compression Workshop 2021 are now being accepted!
Date: 25-26th February 2021
Website:
Poster abstract submission deadline: 21 Feb 2021
Lottery Ticket Adaptation (LoTA) is a new adaptation method that achieves best-in-class performance on challenging tasks, mitigates catastrophic forgetting, and enables model merging across different tasks.
Paper:
Code:
Tomorrow at the FLOW seminar, I will talk about our
@iclr_conf
2023 paper "Sparse Random Networks for Communication-Efficient Federated Learning".
Looking forward to your feedback and questions. 🙌
📢: The 99th FLOW talk is on Wednesday (22nd March) at **5 pm UTC**.
Berivan Isik (Stanford) will discuss "Sparse Random Networks for Communication-Efficient Federated Learning."
Sign up for our mailing list:
@tianle_cai
Very cool work! 💫 We have a NeurIPS 2023 workshop paper with a similar idea and observations: the delta between the finetuned and pretrained model is extremely compressible with quantization, and even with simple magnitude-based sparsification:
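A toy sketch of the idea (hypothetical weights and a simple top-k magnitude sparsifier; not the exact procedure from the paper):

```python
import numpy as np

def sparsify_delta(pretrained, finetuned, keep_ratio=0.01):
    """Keep only the largest-magnitude entries of the finetuning delta."""
    delta = finetuned - pretrained
    k = max(1, int(keep_ratio * delta.size))
    # threshold = k-th largest absolute value in the delta
    thresh = np.partition(np.abs(delta).ravel(), -k)[-k]
    return np.where(np.abs(delta) >= thresh, delta, 0.0)

rng = np.random.default_rng(0)
w0 = rng.normal(size=(256, 256))             # "pretrained" weights (toy)
w1 = w0 + 0.01 * rng.normal(size=w0.shape)   # "finetuned" weights (toy)
d = sparsify_delta(w0, w1, keep_ratio=0.01)
print(np.mean(d != 0))  # fraction of entries kept, ≈ 0.01
```

Storing only this sparse delta (plus the shared pretrained checkpoint) is what makes the finetuned model cheap to keep around.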
The framework consists of a neural pre- and post-processor with a standard video codec between them. The networks are trained jointly to optimize a rate-distortion loss function with the goal of significantly improving over the standard codec in various compression scenarios. 2/3
We can reach up to 99% sparsity without any costly mask-search procedure. The mask comes from just one dense training step on a small fraction of the dataset, followed by magnitude thresholding to find the most important weights for the task. This way,
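In code, the mask-finding and sparse update steps might look roughly like this (toy NumPy sketch; the function names and the stand-in "one dense step" weights are illustrative):

```python
import numpy as np

def magnitude_mask(w, sparsity=0.99):
    """Binary mask keeping the top (1 - sparsity) fraction by magnitude."""
    k = max(1, int((1.0 - sparsity) * w.size))
    thresh = np.partition(np.abs(w).ravel(), -k)[-k]
    return (np.abs(w) >= thresh).astype(w.dtype)

def masked_step(w, grad, mask, lr=0.1):
    """Sparse finetuning step: only the masked weights get updated."""
    return w - lr * grad * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128))   # weights after one short dense step (toy)
mask = magnitude_mask(w, sparsity=0.99)
w_new = masked_step(w, rng.normal(size=w.shape), mask)
print(mask.mean())  # mask density ≈ 0.01
```

All weights outside the mask stay exactly at their pretrained values, which is what keeps the resulting task vector sparse.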
@miniapeur
There is a (not very tight) upper bound on the output distortion when pruning a single connection, which helps with adjusting layer-wise sparsity in a greedy manner:
LoTA is also incredibly helpful for model merging. Existing model merging methods mostly do post-hoc sparsification to their dense adapters, which usually hurts the performance. LoTA does not require this post-hoc sparsification since the task vectors are already sparse. 4/5
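A minimal illustration of why already-sparse task vectors merge cleanly (toy vectors with disjoint supports; names are assumptions, not the paper's code):

```python
import numpy as np

def merge_task_vectors(base, task_vectors):
    """Merge by summing sparse task vectors onto the base weights."""
    merged = base.copy()
    for tv in task_vectors:
        merged += tv
    return merged

base = np.zeros(1000)
tv1 = np.zeros(1000); tv1[:10] = 0.5      # sparse update for task 1
tv2 = np.zeros(1000); tv2[10:20] = -0.3   # sparse update for task 2
merged = merge_task_vectors(base, [tv1, tv2])

# Disjoint supports -> each task's update survives merging untouched
overlap = np.mean((tv1 != 0) & (tv2 != 0))
print(overlap)  # 0.0
```

With dense adapters, the supports fully overlap and the sums interfere; here nothing needs to be thrown away post hoc.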
LoTA successfully mitigates catastrophic forgetting since sparse updates overlap less than dense or LoRA updates. We can push this even further by restricting the updates of future tasks to weights that do not overlap with those of previous tasks, eliminating interference between
We also developed a novel model compression method (called SuRP), guided by this information-theoretic formulation, which indeed outputs a sparse model without an explicit pruning step.
Come say hi during our poster sessions if you're interested:
Monday 12:30-2:30 pm PST (WiML)
Wednesday 4-5 am PST (WiML)
Saturday 5-6:30 pm PST (DL-IG)
@united
agent after 17 hours: The bags are with Swiss Airlines in your departure point
- I didn’t fly with Swiss Airlines
@united
: I know, but they hold your bags because your flight was canceled
- My flight was not canceled?!
What’s going on?
@united
@FrancescoPase
@DenizGunduz1
@sanmikoyejo
We show that there exist highly natural choices of pre-data distribution (side information at the server) and post-data distribution (local updates at the clients) in FL that can be used to reduce the communication cost significantly -- by up to 50x compared to the baselines. 2/3
Compared to other neural video compression methods, the sandwich framework is much more efficient as it requires pre- and post-processors formed by modestly-parameterized, lightweight networks.
Joint work with Philip A. Chou, Onur Guleryuz, Danhang Tang, and Jonathan Taylor. 3/3
TLDR: The size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. 3/6
We propose Federated Probabilistic Mask Training (FedPM) that does not update the randomly initialized weights at all. Instead, FedPM freezes the weights at their initial random values and learns how to sparsify the random network for the best performance. 2/6
I will present two papers at the Federated Learning Workshop:
1) Exact Optimality of Communication-Privacy-Utility Tradeoffs in Distributed Mean Estimation:
2) Communication-Efficient Federated Learning through Importance Sampling:
We derived the information-theoretic limit of model compression and showed that this limit can only be achieved when the reconstructed model is sparse (pruned).
@srush_nlp
Cross-entropy (CE) loss always improves with more pretraining data, regardless of the degree of alignment. But BLEU/COMET/ROUGE scores on the downstream task sometimes drop with more pretraining data when alignment is not sufficient.
@FrancescoPase
@DenizGunduz1
@sanmikoyejo
We also show how to adaptively adjust the bitrate across the model parameters and training rounds to achieve the fundamental communication cost -- the KL divergence between the pre-data and post-data distributions. 3/3
@abeirami
@savvyRL
And there is a simple way to boost the student’s performance by pruning the teacher network before distilling (which acts as a regularizer):
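A rough sketch of that recipe (toy logits and a standard temperature-softened target; the pruning level and names are illustrative, not the paper's exact setup):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of the teacher's weights."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > thresh, w, 0.0)

def soft_targets(logits, temperature=2.0):
    """Softened teacher outputs used as distillation targets."""
    z = logits / temperature
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
teacher_w = rng.normal(size=(16, 10))
pruned_w = magnitude_prune(teacher_w, sparsity=0.5)  # regularized teacher
x = rng.normal(size=(4, 16))
targets = soft_targets(x @ pruned_w)                 # student fits these
print(np.mean(pruned_w == 0))  # ≈ 0.5
```

The student then trains against `targets` as usual; the pruning acts as the regularizer on the teacher side.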
FedPM reduces the communication cost to less than 1 bit per parameter (bpp), reaches higher accuracy with faster convergence than the relevant baselines, outputs a final model with size less than 1 bpp, and can potentially amplify privacy. 4/6
However, there are also cases where moderate misalignment causes the BLEU score to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. 5/6
To this end, the clients collaborate in training a stochastic binary mask to find the optimal sparse random network within the original one. At the end of the training, the final model is a sparse network with random weights – or a subnetwork inside the dense random network. 3/6
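A toy sketch of this mask-over-frozen-weights idea (NumPy, assumed names; the real FedPM aggregates mask probabilities across clients, which is omitted here):

```python
import numpy as np

def sample_mask(probs, rng):
    """Sample a binary mask from per-weight Bernoulli probabilities."""
    return (rng.random(probs.shape) < probs).astype(np.float32)

def masked_forward(x, frozen_w, probs, rng):
    """Forward pass: frozen random weights, sparsified by a sampled mask."""
    return x @ (frozen_w * sample_mask(probs, rng))

rng = np.random.default_rng(0)
frozen_w = rng.normal(size=(8, 4)).astype(np.float32)  # never updated
probs = np.full((8, 4), 0.2)                           # learned scores (toy)
x = rng.normal(size=(2, 8)).astype(np.float32)
y = masked_forward(x, frozen_w, probs, rng)
print(y.shape)  # (2, 4)
```

Only `probs` would be trained and communicated; the weights themselves stay at their random initialization, which is what drives the bitrate below 1 bit per parameter.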
- [ICLR DMFM & ME-FoMo] Scaling Laws for Downstream Task Performance of Large Language Models:
- [ICLR SeT LLM, Me-FoMo, R2-FM, PML4LRS] On Fairness Implications and Evaluations of Low-Rank Adaptation of Large Models:
2/2
Throughout the manuscript, we highlighted the advantages of a stochastic mask-training approach over a deterministic one in terms of accuracy, bitrate, and privacy. 5/6
We use an analog storage technology (PCM) as an example to show that the noise added by the PCM cells is detrimental to the performance of neural networks and that we can recover full accuracy with our robust coding strategies.
2/3
We study the mean estimation problem under communication and local differential privacy constraints. As opposed to the order-optimal solutions in prior work, we characterize exact optimality conditions and develop an algorithm that is exact-optimal for a large family of codebooks.
We study the scaling behavior in a transfer learning setting, where LLMs are finetuned for translation tasks, and investigate how the choice of the pretraining data and its size affect downstream performance as judged by two metrics: downstream cross-entropy and BLEU score. 2/6
We investigated the theoretical tradeoff between the compression ratio and the output perturbation of neural network models, and found that the rate-distortion-theoretic formulation provides a theoretical foundation for pruning. 2/6