Introducing Zero-Shot Tokenizer Transfer (ZeTT) ⚡
ZeTT frees language models from their tokenizer, allowing you to use any model with any tokenizer, with little or no extra training.
Super excited to (finally!) share the first project of my PhD🧵
I am delighted to share that I will be joining
@EdinburghNLP
at
@EdinburghUni
from 2022 as a lecturer in Natural Language Processing. I am currently recruiting PhD students, so if you are passionate... (1/6)
We scaled sparse fine-tuning (SFT) to LLMs (such as Llama 2) by making it both parameter- and memory-efficient!
(q)SFT instruction tuning performance is often better than (q)LoRA with comparable speed and memory load.
Paper:
Code:
Today I am joining
@nvidia
part-time as a visiting professor
I could not imagine a better place to explore new efficient architectures for LLMs and diffusion
I am looking forward to collaborating with so many talented researchers!
Multitask learning by decomposing tasks into sets of fine-grained skills (discrete, reusable, and autonomous facets of knowledge).
New work with Yoshua Bengio
@sivareddyg
from
@Mila_Quebec
and
@murefil
from
@MSFTResearch
📘:
💻:
I am still looking for PhD students starting in September 2024! The deadline to apply for the CDT in NLP is the 11th of March.
If you wish to do research in modular and efficient LLMs, here are some highlights of my lab's research from the past year ⬇️🧵
Interested in training with future leaders in NLP to engage with the cutting edge of the technical, social, design, and legal aspects of these systems? Then apply for our new Centre for Doctoral Training in Designing Responsible NLP! Deadline 11 March 2024
We connect the inaccuracies of merging fine-tuned models to the mismatch between their gradients (through a target model); minimising this mismatch directly improves performance.
New paper with
@ndaheim_
@tmoellenhoff
@IGurevych
@EmtiyazKhan
Large language models often generate hallucinated responses.
We introduce Elastic Weight Removal (EWR), a novel method for faithful *and* abstractive dialogue.
📃
💻 +other methods!
🧑‍🔬
@ndaheim_
@nouhadziri
@IGurevych
@mrinmayasachan
I am looking for PhD students to join my group at
@EdinburghNLP
@EdinburghUni
and work on modular NLP, grounding, and typology!
The deadline for international applicants is Nov 25th for fully funded PhD programmes at CDT NLP and ILCC.
For more info:
A new method for the adaptation of pre-trained models that is modular, expressive, and parameter-efficient: Lottery Ticket Sparse Fine-Tuning
👨‍🔬 Alan Ansell, me,
@licwu
, and
@annalkorhonen
📄
👩‍💻
Can we increase the efficiency *and* performance of auto-regressive models?
We introduce dynamic-pooling Transformers, which jointly perform language modelling and token segmentation.
@p_nawrot
*
@AdrianLancucki
@JChorowski
📜
🧑‍💻
Can open-source LLMs execute *chains of instructions* in a single query? Not so well, we found.
However, they can learn this ability by:
- augmenting examples from public SFT mixtures with chains of instructions automatically
- performing *sequential instruction tuning* on them.
I am attending
#ACL2024
in Bangkok and I am giving a keynote talk at RepL4NLP on Thursday (15 Aug), "Efficiency as an Inductive Bias for Language Models"
Here is a preview with some hot takes and ideas!
Just passed my viva with minor corrections! Many thanks to my examiners, my supervisors
@annalkorhonen
and
@licwu
, and all those who supported me throughout the PhD
Polytropon is now available on the
@huggingface
peft library!
Consider using it for better generalisation when instruction tuning your LLM
Minimal example here (multi-task learning):
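A minimal sketch of what this can look like, assuming peft's Poly (Polytropon) tuner and its PolyConfig; argument names may differ from the released API, so check the peft docs:
from transformers import AutoModelForSeq2SeqLM
from peft import PolyConfig, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
config = PolyConfig(
    task_type="SEQ_2_SEQ_LM",
    r=8,          # rank of each LoRA-style skill module
    n_tasks=4,    # number of training tasks
    n_skills=8,   # size of the skill inventory shared across tasks
    n_splits=1,   # partitions of each weight matrix routed independently
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
# at train time, each batch also carries task ids so the model can index
# the task-skill allocation matrix (see the peft docs for the exact API)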
Many thanks to
@taosunvoyage
for the implementation!
Adapter parameters are all you need in modular LLMs!
You can *build* inventories of experts by clustering tasks based on their LoRA params
You can *reuse* experts by routing zero-shot based on right singular vectors of their LoRA params
Towards Modular LLMs by Building and Reusing a Library of LoRAs
The growing number of parameter-efficient adaptations of a base large language model (LLM) calls for studying whether we can reuse such trained adapters to improve performance for new tasks. We study how to…
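To make the zero-shot routing concrete, here is a minimal sketch (my own reconstruction, not the paper's code): each LoRA expert ΔW = B·A is summarised by its top right singular vector, and a hidden state is routed towards the experts whose vectors it aligns with most strongly.
import torch

def expert_prototype(A, B):
    # LoRA delta: delta_W = B @ A, with A: (r, d_in) and B: (d_out, r).
    # Its top *right* singular vector lives in the input space (d_in,).
    _, _, Vh = torch.linalg.svd(B @ A, full_matrices=False)
    return Vh[0]

def route(h, prototypes):
    # zero-shot routing: score each expert by |v_e . h|, normalise by softmax
    scores = torch.stack([torch.abs(v @ h) for v in prototypes])
    return torch.softmax(scores, dim=0)

d_in, d_out, r = 32, 32, 4
experts = [(torch.randn(r, d_in), torch.randn(d_out, r)) for _ in range(3)]
protos = [expert_prototype(A, B) for A, B in experts]
weights = route(torch.randn(d_in), protos)  # mixture weights over the experts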
New preprint:
To promote generalisation to new tasks, modular LLMs reuse and adapt previously acquired skills.
We propose a more expressive “multi-head” routing strategy, which achieves consistent gains.
Code:
Paper:
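A rough sketch of the idea (my own toy version, with hypothetical shapes): instead of one routing distribution over the whole skill inventory, the representation is split into heads and each head learns its own mixture of skills.
import torch

n_skills, n_heads, d = 8, 4, 32
logits = torch.nn.Parameter(torch.zeros(n_heads, n_skills))       # per-head routing
skills = torch.nn.Parameter(torch.randn(n_skills, d // n_heads))  # skill blocks

x = torch.randn(d).view(n_heads, d // n_heads)  # split the input into heads
mix = logits.softmax(dim=-1)                    # (n_heads, n_skills)
out = (x + mix @ skills).reshape(d)             # per-head mixtures, recombined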
We introduce the idea of zero-shot *tokenizer* transfer
Our vision is to combine your favourite LLM with an arbitrary tokenizer on the fly
This means
- More efficient encoding for non-English text
- Mix experts with different tokenizers
Check
@bminixhofer
's thread for details!
We have created XCOPA, a dataset for commonsense reasoning and knowledge transfer across 11 languages (including Quechua and Haitian Creole).
@gg42554
O Majewska
@qianchul
@licwu
@annalkorhonen
Download:
Paper:
We have re-opened 2 PhD studentships for *2023/24* at
@EdinburghNLP
(1 home, 1 international), please send me a message by tomorrow if you are interested in this opportunity!
Join us today at 9:20am (Irish time) for
@MML_WKSP
, the first Multilingual Multimodal Workshop at
#acl2022nlp
! We have a fantastic line-up of speakers:
During the workshop on efficient generative AI at
@InfAtEd
,
we discussed methods to reduce AI's energy costs and environmental impact while fostering AI democratisation and scientific discovery.
Here are some lessons I learned from the speakers: 🧵
Corpus-based measures reliably discriminate morphological inflection and derivation cross-linguistically!
@colemanhaley22
is presenting today at
@sig_typ
the first large-scale computational study (26 languages from
@unimorph_
) on this topic
The applications for the
@ELLISforEurope
PhD programme are now open! If you'd like to join
@EdinburghNLP
and do research on modular deep learning (parameter-efficient fine-tuning, routing in mixture-of-experts, model merging, ...) or computational typology, drop me a message!
We retrofit LLMs by learning to compress their memory dynamically
I find this idea very promising as it creates a middle ground between vanilla Transformers and SSMs in terms of memory/performance trade-offs
I'd like to give a shout-out to
@p_nawrot
and
@AdrianLancucki
for the…
The memory in Transformers grows linearly with the sequence length at inference time.
In SSMs it is constant, but often at the expense of performance.
We introduce Dynamic Memory Compression (DMC) where we retrofit LLMs to compress their KV cache while preserving performance
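A toy sketch of the core decision (not the paper's implementation): at every step, a learned gate alpha decides whether the new key/value pair opens a fresh cache slot or is averaged into the last one, so the cache grows sublinearly.
import torch

def dmc_step(keys, values, k_new, v_new, alpha):
    if alpha > 0.5:  # append: the cache grows by one entry
        keys = torch.cat([keys, k_new[None]])
        values = torch.cat([values, v_new[None]])
    else:            # merge: the cache size stays constant
        keys[-1] = (keys[-1] + k_new) / 2      # DMC uses a weighted running
        values[-1] = (values[-1] + v_new) / 2  # average; plain mean for brevity
    return keys, values

keys, values = torch.randn(4, 8), torch.randn(4, 8)
keys, values = dmc_step(keys, values, torch.randn(8), torch.randn(8), alpha=0.2)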
In our new survey “Modular Deep Learning”, we provide a unified taxonomy of the building blocks of modular neural nets and connect disparate threads of research.
📄
📢
🌐
w/
@PfeiffJo
@licwu
@PontiEdoardo
Tomorrow at
@icmlconf
, together with
@PontiEdoardo
and
@AdrianLancucki
, we'll present an updated version of "Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference".
You can find the updated paper at . Among other updates: 1) we trained DMC to…
Multilingual task-oriented dialogue is authentic if it displays natural fluency 🌊 and familiar entities 🛥️. In Cross-Lingual Outline-based Dialogue (COD 🐟), we set out to achieve exactly this!
💻
📝
"Differentiable Generative Phonology", in collaboration with
@EzraWu
and
@ryandcotterell
, is finally out!
Tired: Asking linguists to posit discrete underlying forms
Wired: learning continuous underlying forms end-to-end
Interested in integrating deep learning with symbolic algorithms, knowledge bases, and programs?
Apply for a 2-year postdoc position with me,
@PMinervini
, and
@tetraduzione
at ELIAI
@EdinburghUni
on gradient-based learning of complex latent structures.
This paper required a Herculean effort, but it was worth it! The aspect that I like the most is that it enables transfer learning along 3 different axes: languages, tasks, and modalities
Voilà IGLUE🧊 The Image-Grounded Language Understanding Evaluation benchmark 📈
IGLUE brings together 4 vision-and-language tasks across 20 languages
And, brr, is it cold outside the Anglosphere 🥶
📄
👩‍💻
🌐
If you are curious to discover more about Dynamic Memory Compression, I will give a preview during my keynote talk at the MOOMIN workshop
@eaclmeeting
See you on Thursday, March 21st at 9:30 AM!
The best part is: you can adapt models from
@huggingface
with our SFTs in just 3 lines of code:
from sft import SFT

# load a published sparse fine-tuning by its identifier, then apply its
# sparse deltas in place to a pre-trained Hugging Face model
sft_model = SFT(sft_model_name)    # e.g. an SFT checkpoint name on the Hub
sft_model.apply(pretrained_model)
I am committed to selecting a diverse set of candidates with high potential, as the various communities of speakers around the world should also find representation in the NLP & ML scientific communities
@Khipu_AI
@DeepIndaba
@MasakhaneNLP
(5/6)
...about multilingual and low-resource NLP, sample-efficient and modular machine learning, computational typology, or grounded language learning, consider applying to my group! (2/6)
Given the paucity of annotated data, how can we perform sample-efficient generalization on unseen task-language combinations? Possible solution: a generative model of the neural parameter space, factorized into variables for several languages and tasks. 1/2
Many *fully funded* studentships (from September 2022) are available:
👩🏻‍🎓 12 for a 4-year PhD with integrated study from the NLP CDT:
👨🏿‍🎓 10 for a 3-year PhD from ILCC:
(3/6)
Grammatical markers are implicitly aligned in pre-trained multilingual encoders by encoding the same grammatical functions through the same subset of neurons across languages. This may help explain the "unreasonable" effectiveness of zero-shot cross-lingual transfer.
A little gem from my student
@p_nawrot
: nanoT5, or how to pre-train T5 on 1 GPU, in less than 1 day, in PyTorch.
Now it is more important than ever to keep research accessible and reproducible.
He conceived the idea and executed it all by himself, quite a remarkable feat!
Introducing *nanoT5*
Inspired by
@jonasgeiping
's Cramming and
@karpathy
's nanoGPT, we fill the gap of a repository for pre-training T5-style "LLMs" under a limited budget (1xA100 GPU, ~20 hours) in PyTorch
🧑‍💻
@EdinburghNLP
Fantastic work from my student
@yifuqiu98
:
- the first metric to measure hallucinations in generated text for *any* language
- an empirical study of how cross-lingual transfer amplifies hallucinations
- a new method of "soft filtering" / loss weighting to promote faithfulness
[1/5] Our paper "Detecting and Mitigating Hallucinations for Multilingual Summarisation" is currently available on Arxiv!
📃
💻
🤝
@YftahZ
@annalkorhonen
@PontiEdoardo
and Shay B. Cohen
Meet me at
@eaclmeeting
! At 11:15 I am presenting Polytropon, a method for multi-task modular adaptation of LLMs
The code is part of a 🚨new repo for multi-task transfer learning 🚨developed with
@LucasPCaccia
@murefil
@tallinzen
If anything, there is increasing evidence to the contrary. For instance, LLMs lack self-consistent world models as they believe contradicting timelines to be true:
For any enquiry, feel free to reach out to me via email or talk to me virtually at
#EMNLP2021
(and attend our team's best paper award presentation!). I hope there will be a chance to meet some of you and discuss exciting research directions! (6/6)
Third (and last) paper at
#EMNLP2018
(actually TACL):
@dasgerz
and
@licwu
carefully explaining our novel Language Modeling architecture with output matrix refinement
Are you working on Natural Language Understanding? Then have a look here:
@CambridgeLTL
has just released the post-specialised word embeddings for GloVe, fastText, and SGNS. Pre-trained models to specialise new (cross-lingual) WEs are also available!
The school of informatics at the University of Edinburgh and DeepMind are offering an ML PhD scholarship for students who identify as gender/racial/ethnic minorities in 2022/23. See thread for details. (1/n)
By the way,
@AlanAnsell5
(the first author) is graduating from
@Cambridge_Uni
and will be on the job market soon.
He did amazing research on PEFT and multilingual NLP, make sure to reach out to him if you have a position open!
Don't miss the tutorial at
@emnlp2019
with
@licwu
,
@gg42554
, and me for the latest developments in semantic specialization (knowledgeable unsupervised pretraining, cross-lingual transfer, and more). Registration is now open:
Fantastic new work from
@nouhadziri
: data-centric + modelling solutions can remove most hallucinations from knowledge-grounded dialogue and increase its quality (e.g. abstractiveness)!
I have just accepted an offer for a position as an
#ML
/
#NLP
Engineering Intern at
#Apple
in Cupertino, California. Looking forward to this new adventure! (And curious to admire Norman Foster's
#applepark
)
You can easily load XCOPA from
@huggingface
's dataset library:
from datasets import load_dataset

# pick one of the 11 language configs, e.g. Haitian Creole ('ht')
xcopa = load_dataset('xcopa', 'ht')
It contains extremely under-documented languages like Southern Quechua and Haitian Creole.
Come and meet me at the
#EMNLP2020
Q&A session (6B) about XCOPA, a novel multilingual dataset for common-sense reasoning.
When? Nov 17, 9:00 UTC (tomorrow!)
Data and leaderboard:
To keep the memory load proportional to the PEFT size instead, we alternate among:
1) updating deltas wrt LLM weights
2) dropping old indices based on their magnitude of change
3) growing new indices based on newly introduced criteria: AG (accumulated gradients) and MA (momentum approximation)
This alternation is inspired… (a toy sketch of the drop-and-grow cycle is below)
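A toy sketch of one drop-and-grow step (my reconstruction, not the released code; names like update_sparse_mask are hypothetical):
import torch

def update_sparse_mask(delta, grow_scores, k):
    # 1) drop: keep the k/2 active deltas with the largest magnitude of change
    keep = torch.topk(delta.abs().flatten(), k // 2).indices
    # 2) grow: re-open up to k/2 indices with the highest growth criterion
    #    (e.g. accumulated gradients for AG)
    grow = torch.topk(grow_scores.flatten(), k // 2).indices
    return torch.unique(torch.cat([keep, grow]))

delta = torch.randn(64, 64) * (torch.rand(64, 64) < 0.05)  # sparse deltas
scores = torch.randn(64, 64).abs()                         # growth criterion
active = update_sparse_mask(delta, scores, k=128)          # new active set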
Until 50 years ago, CO₂ emissions developed in lockstep with economic growth in France.
Since the early 1970s, the opposite has been true: emissions declined as people in France got richer.
Momentum Approximation (SFT-MA) for even higher memory efficiency
- reuses approximate momenta from efficient optimizers like
@_arohan_
's SM3
- performs a dot product between row-wise and column-wise weight statistics
- selects the arg top-k subset of indices for growth
MA is… (a minimal sketch below)
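Following the description above, a minimal sketch of the momentum approximation (placeholder statistics, not the actual optimizer state):
import torch

m, n, k = 512, 512, 64
row_stats = torch.randn(m).abs()  # one statistic per row of the weight matrix
col_stats = torch.randn(n).abs()  # one statistic per column
scores = row_stats[:, None] * col_stats[None, :]    # rank-1 outer product
flat_idx = torch.topk(scores.flatten(), k).indices  # arg top-k indices to grow
rows, cols = flat_idx // n, flat_idx % n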
We compare different methods to learn an auto-regressive boundary predictor:
- end-to-end (Gumbel)
- supervision from subword tokenizers (Unigram)
- data boundaries (Whitespaces)
We also propose a new segmentation method based on the entropy spikes of the model’s prediction.
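A minimal sketch of the entropy-spike criterion (my reconstruction; the paper's exact thresholding may differ):
import torch

def entropy_spike_boundaries(logits, threshold=0.5):
    # place a boundary wherever the entropy of the next-token
    # distribution rises sharply with respect to the previous step
    probs = logits.softmax(dim=-1)                            # (seq, vocab)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # (seq,)
    spikes = entropy[1:] - entropy[:-1] > threshold
    return torch.cat([torch.tensor([True]), spikes])  # a segment always opens

boundaries = entropy_spike_boundaries(torch.randn(10, 100))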
A group of researchers from
@AIMS_Next
AMMI has devised a promising research project on modelling text and speech in 10 Ghanaian languages. Are you aware of any source of funding (in addition to
@LacunaFund
) they could apply to for this project?
@nlpnoah
@yoavgo
The very concept of business / occupation in Latin and Ancient Greek is defined only by negation: "negotium" and "ἀσχολία" literally mean "not leisure" :)
How well do neural models generalise to new image domains, concepts, and languages?
Check out MaRVL, a benchmark for grounded language learning created to better reflect the world's cultural and linguistic diversity.
🌐
SFT (bottom) scatter-adds a sparse matrix to the LLM pre-trained weights
LoRA adds a low-rank matrix (top).
While SFT is more expressive and composable, the memory it needed (in DiffPruning, FISH Mask, Lottery Ticket) previously scaled with the model size.
This made SFT… (the two update types are sketched below)
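A side-by-side toy example of the two update types for a weight matrix W (standard torch ops; shapes are arbitrary):
import torch

d_out, d_in, r, k = 64, 64, 4, 128
W = torch.randn(d_out, d_in)

# LoRA: add a dense low-rank matrix
A, B = torch.randn(r, d_in), torch.randn(d_out, r)
W_lora = W + B @ A

# SFT: scatter-add k sparse real-valued deltas at selected indices
idx = torch.randperm(W.numel())[:k]
deltas = torch.randn(k)
W_sft = W.flatten().scatter_add(0, idx, deltas).view(d_out, d_in)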
Let's talk research over a glass of wine if you're curious about my lab's recent work on dynamic memory compression, sparse fine-tuning, mixtures of adapters, and zero-shot tokenizer transfer!
Thanks to
@ShiweiLiu9
@zhengyuan_nlp
@oanacamb
@annalkorhonen
for the invites!
@NandoDF
My lab proposed an auto-regressive Transformer architecture that dynamically merges tokens in intermediate layers
Promising for multimodal data as 1) tokenizer-free, 2) discards uninformative bits, 3) can learn abstractions at different granularities
We are excited to announce that on May 24th and 25th,
@InfAtEd
will host the *International Workshop on Efficient Generative AI*
The event will feature invited talks, panels, posters, and networking sessions.
Website and programme:
LT-SFT achieves large gains over adapters (such as MAD-X) in zero-shot transfer to unseen and low-resource languages, including African and American languages
@MasakhaneNLP
@AmericasNLP
In ordering events, even SOTA GPT-4 lags behind human performance *and* TemporalBART, a small-scale LM fine-tuned on abundant data for this task
Still, conversational tuning of LLaMA 2, instruction tuning with Alpaca, and RLHF are broadly helpful to temporal reasoning
In particular, we learn end-to-end how to 1) allocate subsets of latent skills to multiple tasks; 2) specialise an inventory of parameter-efficient model sub-networks towards individual skills; 3) combine these to dense pre-trained or randomly initialised models.
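In toy form, the three steps might look like this (a sketch under my own simplifications, not the released code):
import torch

n_tasks, n_skills, d = 8, 4, 16
Z = torch.nn.Parameter(torch.zeros(n_tasks, n_skills))  # 1) task-skill allocation logits
skills = torch.nn.Parameter(torch.randn(n_skills, d))   # 2) inventory of skill params

alloc = torch.sigmoid(Z)                     # relaxed binary mask (Gumbel-sigmoid in training)
alloc = alloc / alloc.sum(-1, keepdim=True)  # normalise per task
task_params = alloc @ skills                 # 3) per-task combination of skills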
I am in Hong Kong for
@emnlp2019
, feel free to get in touch if you are interested in few-shot (multilingual) learning, Bayesian neural models, or semantic specialization: I'd be curious to hear your opinions! On a related note, I have a couple of talks on these topics tomorrow 👇
How to suppress negative (or encourage positive) behaviours with EWR?
- Create task vectors as the change between (anti)experts fine-tuned on behaviour exemplars and initialisation
- Subtract (or add) the task vectors from a pre-trained model, weighted by their Fisher Information
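For a single weight tensor, the recipe reads roughly as follows (a sketch; EWR's exact Fisher weighting may differ):
import torch

def remove_behaviour(theta, theta_init, theta_antiexpert, fisher, lam=1.0):
    tau = theta_antiexpert - theta_init  # task vector for the negative behaviour
    weight = fisher / fisher.mean()      # per-parameter Fisher importance
    return theta - lam * weight * tau    # subtract to suppress (add to encourage)

theta = torch.randn(32, 32)
theta_new = remove_behaviour(theta, theta - 0.1, theta + 0.1, torch.rand(32, 32))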
The gains from dynamic pooling Transformers do not vanish with higher numbers of layers.
Hence, they hold promise to further facilitate scaling in language models.
@TonyZador
Brilliant paper! The core idea is reminiscent of Konrad Lorenz's 'Behind the Mirror'. In AI, the inductive bias can also be conceived as a prior over neural parameters, e.g. for learning languages (to appear at
@emnlp2019
):
We can’t probe temporal grounding directly as LLMs are incapable of action or perception.
So we probe LLMs on textual tasks that require an implicit temporal model:
- commonsense knowledge about events
- ordering events along a timeline
- self-consistency in the temporal model
As baselines, we adapt a series of techniques to faithful dialogue generation and we offer their first systematic comparison
Task Arithmetic, CaPE, Quark, DExperts, and CTRL are all available in our repository!
We welcome external contributions and plan to add more techniques
Inflection and derivation are crucial comparative concepts; yet, their definition is contentious.
Linguists proposed several criteria: e.g., Plank (1994) lists 28, which yield contradictory results.
@haspelmath
even argued that their distinction carries no theoretical weight.
Tomorrow I am giving a talk on AI and language at my alma mater
@unipv
, in a conference hosted by
#CollegiodelMaino
. If you are in Pavia, hope to see you there!
On Thursday the conference “Prospettive dell’Intelligenza Artificiale” will take place, with the aim of comparing different applied perspectives on the same, now highly topical, phenomenon of
#IntelligenzaArtificiale
#CollegiodelMaino
#unipv
@nathanbenaich
We had a similar finding with Dynamic Memory Compression: the KV cache in deeper layers can be compressed to a high degree without degrading performance. Very fascinating!
In natural languages, units of meaning (such as words) vary in size.
Our model predicts their boundaries, average-pools representations in the same unit, and processes them more efficiently.
For a shortening factor K of the input length, attention complexity reduces by K^2.
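A compact sketch of boundary-based pooling (my own toy version): given predicted unit boundaries, representations within each unit are averaged before the shortened sequence is processed further.
import torch

def pool_by_boundaries(x, boundaries):
    # x: (T, d); boundaries: (T,) bool, True where a new unit starts
    seg_id = boundaries.long().cumsum(0) - 1  # unit index for each token
    n_units = int(seg_id.max().item()) + 1
    pooled = torch.zeros(n_units, x.size(1)).index_add_(0, seg_id, x)
    counts = torch.zeros(n_units).index_add_(0, seg_id, torch.ones(len(x)))
    return pooled / counts[:, None]  # attention over T/K units costs (T/K)^2

x = torch.randn(10, 8)
b = torch.tensor([1, 0, 0, 1, 0, 1, 0, 0, 0, 1], dtype=torch.bool)
units = pool_by_boundaries(x, b)  # (4, 8): 10 tokens pooled into 4 units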
@karpathy
A solution would be to swap tokenizer on the fly to avoid glitch tokens.
We've just released our work on zero-shot tokenizer transfer which (coincidentally) does exactly this!
@nouhadziri
Also, it contains the largest-scale audit of gold-standard benchmarks to date, revealing that e.g. 71.4% of turns in Wizard of Wikipedia are hallucinated. Even worse, language models tend to not only 🦜 but even amplify this noise.
We train linear and MLP classifiers on these features and recover most (86% and 90%, respectively) of the classes of the constructions in
@unimorph_
(which we take to reflect the intuitions of linguists on what constitutes inflection and derivation)
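The probing setup itself is simple; a hedged sketch with placeholder features and labels (the real X comes from the corpus-based measures, y from the UniMorph constructions):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X = np.random.randn(200, 16)      # placeholder corpus-based features
y = np.random.randint(0, 2, 200)  # placeholder inflection/derivation labels
for clf in (LogisticRegression(max_iter=1000), MLPClassifier(max_iter=500)):
    print(type(clf).__name__, cross_val_score(clf, X, y).mean())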
A simple modification of
@jefrankle
and
@mcarbin
's algorithm to find "winning tickets" allows for composing (rather than pruning) pre-trained models with sparse, real-valued masks that represent different facets of knowledge (languages, tasks, ...)