🚨 New
@GoogleDeepMind
paper 🚨
We trained Foundational Large Autorater Models (FLAMe) on extensive human evaluations, achieving the best RewardBench perf. among generative models trained solely on permissive data, surpassing both GPT-4 & 4o.
📰:
🧵:👇
🚨 New
@GoogleAI
paper:
🤖 LLMs are game-changers, but can they help us navigate a constantly changing world? 🤔
As of now, our work shows that LLMs, no matter their size, struggle when it comes to fast-changing knowledge & false premises.
📰:
👇
Enormous LMs like GPT-3 exhibit impressive few-shot performance, but w/ self-training, a BERT-base-sized model can achieve much better results! W/ a new implementation of STraTA, we were able to get ~93% acc on SciTail w/ 8 examples per class! Check out our recent work
@GoogleAI
👇
📢 🌟PhD Openings🌟:
I am recruiting PhD students this cycle at Virginia Tech. If you want to dive into:
- in-context learning & tool-use LLMs
- instruction tuning
- parameter-efficient transfer learning
- few-shot learning
please apply by Dec 15!
👉
Sharing my internship work
@GoogleAI
: 1) w/ Soft Prompt Transfer, Prompt Tuning matches or significantly outperforms Model Tuning across model sizes, 2) tasks can help each other via their prompts & task prompts can be used as task embeddings to formalize task similarity.
🧵 1/8
📢📢 I am looking for a student researcher to work with me and my colleagues at
@GoogleAI
Research on instruction-based text embedding representations and evaluation. Please apply () and reach out to me (ttvu
@google
.com) if interested.
Great advice for early-career PhD students from the awesome
@mrdrozdov
. Really liked the saying: “The typical PhD takes 5-7 years to complete, but if you really focus, ignore your friends and family, work late into the night, and dedicate your whole self to your work then it only
🌟 PhD Thesis Defended 🌟
1️⃣ Title: Unlocking Natural Language Generalization through Adaptive Retrieval-based Methods
2️⃣ Joining Databricks as a Research Scientist w. focus on generative retrieval / RAG
3️⃣ New Blog Post: Advice for PhD Students
Excited to announce our
#EMNLP2021
paper that shows how to turn a pre-trained language model or even a randomly initialized model into a strong few-shot learner.
Paper:
w/ amazing collaborators:
@lmthang
,
@quocleix
,
@GradySimon
,
@MohitIyyer
1/9👇
I successfully defended my Ph.D. thesis. A special thank you to the members of my thesis committee: my wonderful advisor
@MohitIyyer
,
@MajiSubhransu
,
@HamedZamani
,
@lmthang
, and
@colinraffel
for their insightful feedback and advice on my research and career plans.
Based on our latest evaluation, LLMs today still struggle to dynamically adapt to our ever-changing world. Strikingly, open-source LLMs such as Mixtral 8x7B, when combined w/ FreshPrompt, can be competitive with closed-source models and commercial APIs on search-augmented QA.
Excited to share our
@emnlp2020
paper on task transferability:
1) a large-scale empirical study w/ over 3,000 combinations of NLP tasks and data regimes within and across different classes of problems
2) task embedding methods to predict task transferability
1/12👇
I somehow missed this great paper by
@tuvuumass
et al.: They learn "task embeddings" (a la task2vec) for NLP tasks and show how they can be used to predict the effectiveness of intermediate-task transfer. Lots of experiments and a promising direction!
Q: As of today, what's the best “open-source” LLM for both few-shot prompting & fine-tuning?
A: I’d recommend FLAN-T5 if it fits your budget.
Q: What if I want to train my own model?
A: You should fine-tune it on the FLAN dataset collection!
Check out our new work
@GoogleAI
👇
✨New Paper✨What’s the best completely public competitor to
#ChatGPT
?
Flan-T5 beats all public models we tested:
Flan-T5 3B ▶️ T0++ 3B ▶️ OPT-IML 175B ▶️ GLM-130B ▶️ Flan 2021 3B ▶️ NIv2 3B
We release the
@GoogleAI
🌟Flan Collection🌟data + methods for Instruction Tuning!
1/
While parameter-efficient tuning methods were originally proposed to reduce computation & storage costs, it turns out they can help overcome catastrophic forgetting and thus improve performance on zero-shot cross-lingual generation. Check out our work
@GoogleAI
@emnlpmeeting
👇1/10
I will also be co-hosting a summer research intern at Google Bard with
@TsendeeMTS
working on long-context modeling. Please reach out to me (ttvu
@google
.com) if interested.
Please help repost!
My team
@GoogleAI
is looking for a research scientist. Our focus areas are multimodal/multilingual/multipod models. Past projects incl. prompt tuning, mT5/ByT5/sentence-T5/longT5, universal sentence encoders.
Email me (ttvu
@google
.com) if interested
#NLProc
AlphaGeometry's results are groundbreaking, yet I find
@thtrieu_
's hard work and dedication to the project even more impressive. It's rare for a PhD student to persistently work on a single project for 4 years. AFAIK, Trieu is on the job market, so hire him before it's too late.
🌟LLMs' token-level probabilities are well-calibrated. So why not sample multiple responses (e.g., A, B, C) from an LLM and ask it to choose a single letter?
@jessierenjie
found this method improved both performance & calibration for open-ended generation. Check out our work!👇
Struggling with LLM calibration for open-ended generation?
Check out our methods (Sample & Select / Sample & Eval) that reformulate open-ended generation into multiple choice or true/false evaluation to leverage LLMs’ better calibration at the token level.
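The Sample & Select idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `sample_answers` and `option_logits` are stand-ins for real LLM calls (sampling k generations, then scoring the option letters A/B/C with the model's token-level logits).

```python
import math

def sample_answers(question, k=3):
    # Stand-in for k sampled generations from an LLM.
    return [f"candidate_{i}" for i in range(k)]

def option_logits(prompt, options):
    # Stand-in for the LLM's logits over the option letters A, B, C, ...
    # Here we deterministically prefer the first option for illustration.
    return [1.0 if i == 0 else 0.0 for i in range(len(options))]

def sample_and_select(question, k=3):
    # 1) Sample k candidate answers.
    candidates = sample_answers(question, k)
    # 2) Reformulate open-ended generation as multiple choice.
    letters = [chr(ord("A") + i) for i in range(k)]
    prompt = question + "\n" + "\n".join(
        f"{letter}) {cand}" for letter, cand in zip(letters, candidates)
    )
    # 3) Use the (better-calibrated) token-level probabilities over the
    #    option letters to pick a single answer and a confidence score.
    logits = option_logits(prompt, letters)
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(k), key=lambda i: probs[i])
    return candidates[best], probs[best]

answer, confidence = sample_and_select("What is the capital of France?")
```

The point of the reformulation is that the softmax over a single option token yields a usable confidence, unlike sequence-level probabilities of free-form text.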
Moving forward, I will be splitting my time as a research scientist at
@GoogleAI
and an assistant professor
@VT_CS
.
I will also be recruiting Ph.D. students starting in Fall 2024 to work on effective and efficient transfer learning in the era of LLMs, please come join me!
Happy to share our soft prompt transfer (SPoT) paper made it to
#ACL2022
🎉.
On the SuperGLUE leaderboard, SPoT is the first parameter-efficient approach that is competitive with methods that tune billions of parameters.
w/
@blester125
,
@noahconst
,
@aboSamoor
,
@daniel_m_cer
Excited to share our
#acl2019nlp
paper () which improves paragraph classification by pretraining the encoder on unlabeled data using our sentence content objective. Work done with my advisor
@MohitIyyer
. Code:
. Summary below [1/5]
📢 Want to adapt your outdated LLM to our ever-changing world? 🌏
Check out our code for FreshPrompt at .
Colab: .
🙏 We are grateful to
@serp_api
for their generous sponsorship of 5000 searches for FreshPrompt's users.
💡Let's raise the bar for LLM's factuality! 🚀
We introduce FreshQA:
📚 a dynamic QA benchmark w/ 600 diverse questions, incl. those testing real-time knowledge and debunking false premises.
🔎 a two-mode eval procedure: relaxed & strict (no hallucinations or outdated info).
Check out
@ContextualAI
's great work on RAG 2.0 that trains a RAG system end-to-end. I'm glad to see more and more work using freshness (w/ FreshQA) as one of the evaluation criteria.
Our first set of RAG 2.0 models, Contextual Language Models (CLMs), significantly improve performance over current systems across axes critical for enterprise work: open-domain question answering, faithfulness, and freshness.
2022:
- oh nooo!!! you can't run language models on cpu! you need an expensive nvidia GPU and special CUDA kernels and–
- *one bulgarian alpha chad sits down and writes some c++ code to run LLMs on cpu*
- code works fine (don't need a GPU), becomes llama.cpp
2023:
- oh noo!!
I would also like to thank all of my labmates
@UMass_NLP
and friends at
@UMassAmherst
, my mentors and collaborators at
@GoogleAI
and
@MSFTResearch
, and my family and friends all over the world who gave me support and encouragement throughout my Ph.D. journey.
Glad to see FreshLLMs/FreshQA got mentioned in
@perplexity_ai
&
@youSearchEngine
's recent blogs (, )
To facilitate future work, we've developed FreshEval, a reliable automatic evaluation metric for FreshQA
👉
We present FreshPrompt: Improving LLMs on FreshQA 🚀:
📚 Incorporates all relevant & up-to-date evidence from Google Search, incl. evidence from related questions
💡 Sorts the evidence chronologically
🧠 Reasons over the evidence to figure out the most relevant & current answer
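The prompt-construction steps above can be sketched as follows. This is a minimal, hypothetical illustration of the recipe (not the released FreshPrompt code): the evidence tuples and the exact instruction wording are made up for the example.

```python
from datetime import date

def build_fresh_prompt(question, evidence):
    # Each piece of evidence: (date_published, source, snippet).
    # Sort chronologically, oldest first, so the most recent snippet
    # sits closest to the question at the end of the prompt.
    ordered = sorted(evidence, key=lambda e: e[0])
    lines = [f"[{d.isoformat()}] {src}: {snip}" for d, src, snip in ordered]
    return (
        "Answer the question using the most relevant and up-to-date "
        "evidence below. Reason step by step before answering.\n\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )

evidence = [
    (date(2023, 9, 1), "news", "The record was broken again in September."),
    (date(2021, 5, 3), "wiki", "The record was first set in 2021."),
]
prompt = build_fresh_prompt("What is the current record?", evidence)
```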
Large-scale instruction tuning is the key to unlocking the power of Mixture of Experts (MoEs) models.
Check out our recent work led by the awesome
@shengs1123
and
@Hou_Le
👇
A Winning Combination for Large Language Models
TL;DR: Did you find MoE models generalize worse than dense models on downstream tasks? Not anymore in the age of instruction tuning!
Surprisingly, we see the “1 + 1 > 2” effect when it comes to MoE + Instruction Tuning. [1/4]
Note that FreshQA has been updated weekly since its release, and our new autorater FreshEval allows for quick evaluation and comparison
🚨 50K human judgments to evaluate LLM factuality:
💥No surprise: bigger models ≠ reliable gains on fast-changing facts.
📉flat scaling curves on false-premise questions, though some LLMs can debunk false premises if prompted to VERIFY first 🤯🔍.
💥 Insights from FreshPrompt's analysis:
🧐 The number of retrieved evidence snippets and their order shape the correctness of the LLM's answers.
🧐 Encouraging concise answers = less hallucination (less is more for precision!)
We show that task prompts can be interpreted as task embeddings to construct a semantic space of tasks and formalize the similarity between tasks (see Figure 3 👇). 6/8
🚀📊📈Our experiments show that FreshPrompt substantially boosts the performance of an LLM on FreshQA, outperforming both competing search-engine-augmented prompting methods such as Self-Ask and commercial systems such as Perplexity AI.
STraTA starts with task augmentation that uses unlabeled texts from the target domain to synthesize a large amount of in-domain training data for an auxiliary task (i.e., natural language inference), which is then used for intermediate fine-tuning (see the figure below).
Finally, this work was done with a few hundred thousand GPU jobs in several months. We couldn’t have completed it without the awesome GPU cluster operating on renewable energy at
@umasscs
. So, please consider doing a Ph.D. here. 😀
Finally, we propose a simple yet efficient retrieval algorithm that measures task embedding similarity, allowing practitioners to identify source tasks that are likely to yield positive transferability for a given novel target task (see Figure 2 👆, right). 7/8
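The retrieval algorithm can be sketched as a cosine-similarity ranking over prompt-based task embeddings. This is a hypothetical toy example: the 3-dimensional embeddings and task names below are invented for illustration (real prompt embeddings live in the model's embedding space).

```python
def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def rank_source_tasks(target_emb, source_embs):
    # source_embs: {task_name: embedding}; most similar source task first.
    return sorted(
        source_embs,
        key=lambda task: cosine(target_emb, source_embs[task]),
        reverse=True,
    )

# Toy task embeddings (hypothetical): prompts learned on each source task.
source_embs = {
    "mnli": [0.9, 0.1, 0.0],
    "squad": [0.1, 0.9, 0.0],
    "cola": [0.0, 0.1, 0.9],
}
target = [0.8, 0.2, 0.0]  # hypothetical target-task prompt embedding
ranking = rank_source_tasks(target, source_embs)
```

A practitioner would then transfer prompts from the top-ranked source tasks.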
Paper:
w/ awesome collaborators Aditya Barua,
@blester125
,
@daniel_m_cer
,
@MohitIyyer
, and
@noahconst
We also release LM-adapted mT5 checkpoints, which we hope will spur more research into multilingual prompt-based learning. 10/10
Scale is not necessary for Prompt Tuning to match Model Tuning's performance: Prompt Tuning w/ SPoT yields competitive or significantly better results than Model Tuning across all model sizes while being more parameter-efficient (up to 20Kx fewer task-specific parameters). 4/8
To explicitly tackle catastrophic forgetting, we present two approaches: 1) mixing in unlabeled multilingual data during learning the task, and (2) factoring soft prompts into “task” and “language” components that can be recombined in novel pairings at inference time. 7/10
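The second approach (factored prompts) can be sketched as simple concatenation of components. This is a hypothetical illustration of the idea, not the paper's code: prompts are toy lists of embedding vectors, and the task/language names are made up.

```python
# Separately learned soft-prompt components (toy 2-d embedding vectors).
task_prompts = {"summarization": [[0.1, 0.2], [0.3, 0.4]]}
lang_prompts = {"en": [[1.0, 0.0]], "th": [[0.0, 1.0]]}

def compose_prompt(task, lang):
    # Recombine a "language" component with a "task" component at
    # inference time; the combined soft prompt would be prepended to the
    # frozen model's input embeddings.
    return lang_prompts[lang] + task_prompts[task]

# Novel pairing never seen in training: English-trained summarization
# prompt combined with a Thai language prompt.
prompt = compose_prompt("summarization", "th")
```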
Lester et al. (2021) show that, as model size increases, Prompt Tuning (which learns soft prompts to condition a frozen model to perform tasks) becomes competitive with Model Tuning (a.k.a. fine-tuning). However, there are still large gaps between them at small model sizes. 2/8
SpanBERT: a new pre-training objective that predicts the content of masked spans of text, significantly outperforming BERT on span selection tasks e.g., question answering and coreference resolution
We show that standard model fine-tuning (Model Tuning) and parameter-efficient Prompt Tuning methods suffer from catastrophic forgetting on a novel zero-shot cross-lingual summarization task, causing them to often generate text in the wrong language. 3/10
Additionally, we conduct a large-scale and systematic study on task transferability with 26 NLP tasks and 160 combinations of source-target tasks, which demonstrates that tasks can often benefit each other via prompt transfer. 5/8
Can current transfer learning methods extend successfully to a zero-shot cross-lingual generation (XGen) setting that requires a multilingual model to learn a generative task from labeled data in one language and then perform this task in another language at inference time? 2/10
Through qualitative analysis, we find that Prompt Tuning tends to stay within the target language, whereas Model Tuning is more prone to code-switching between English and the target language. 9/10
We propose SPoT: Soft Prompt Transfer, a novel prompt-based transfer learning approach that first learns a prompt on one or more source tasks and then uses it to initialize the prompt for a target task (see Figure 2 👆, left). 3/8
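The transfer step can be sketched as follows. This is a minimal, hypothetical sketch under stated assumptions: a soft prompt is represented as a list of embedding vectors, and `train_prompt` is a stub standing in for real prompt tuning against a frozen model.

```python
import copy

def train_prompt(init_prompt, task):
    # Stand-in for prompt tuning on `task`: nudges every weight slightly
    # so we can observe that training happened.
    return [[w + 0.01 for w in vec] for vec in init_prompt]

# 1) Learn a soft prompt on a source task, starting from scratch.
source_prompt = train_prompt([[0.0, 0.0]] * 4, task="mnli")

# 2) SPoT-style transfer: initialize the target task's prompt from the
#    tuned source prompt instead of from scratch, then keep tuning.
target_init = copy.deepcopy(source_prompt)
target_prompt = train_prompt(target_init, task="rte")
```

Only the prompt vectors are updated throughout; the backbone model stays frozen, which is what keeps the approach parameter-efficient.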
We show that both of our approaches can help prevent catastrophic forgetting and provide substantially better results when there is severe catastrophic forgetting, suggesting that robust zero-shot cross-lingual generation is within reach. 8/10
Starting from FLAN-T5 not only confers improved downstream performance but also provides a significant speedup during fine-tuning. So, we would highly recommend FLAN-T5 as a starting point for your own specific task.
Finally, if you want to use your own model (e.g., a smaller pre-trained LM), we would recommend fine-tuning it on the FLAN 2022 data collection with 1.8K datasets phrased as instructions:
We find an interesting “paradox of capacity” for Prompt Tuning. On the one hand, greater capacity (longer prompts) helps to better learn the summarization task. On the other hand, the greater the capacity to learn from English, the more the model forgets other languages. 6/10
Our experiments show that for both held-in and held-out tasks, fine-tuning FLAN-T5 significantly outperforms fine-tuning the vanilla T5, and even FLAN-T5 without fine-tuning can confer improved performance at times.
We show that task augmentation alone can significantly improve downstream performance across different tasks, generally outperforming other competing fine-tuning approaches in both high- and low-data regimes.
Other interesting results:
1) randomly initialized model + STraTA outperforms BERT_BASE by a large margin on SST-2 while being competitive on SciTail.
2) BERT_BASE + STraTA substantially outperforms BERT_LARGE on both SST-2 and SciTail.
New paper on adapting pretrained language models to downstream tasks by
@mattthemathman
,
@seb_ruder
, and
@nlpnoah
, showing that the effectiveness of fine-tuning depends on the language model architecture and the similarity of the pretraining/target tasks.
Interestingly, Prompt Tuning can confer a significant boost in performance over Model Tuning during zero-shot inference on languages that are less related to English, e.g., non-Latin script languages like Russian and Thai. 4/10
Our experiments show that positive transfer can occur in a diverse array of settings. Contrary to the common wisdom, transfer gains are possible even when the source dataset is small. Also, out-of-class transfer succeeds in many cases, some of which are unintuitive.
We propose STraTA, which stands for Self-Training with Task Augmentation, an approach that combines two complementary methods, task augmentation and self-training, to effectively leverage task-specific unlabeled data, which is comparatively cheaper to obtain.
@SongWang_SW
Great survey!! Our recent work aligns with this theme. We inject factual and up-to-date knowledge into LLMs through few-shot in-context learning.
Introducing UDA, our new work on "Unsupervised data augmentation" for semi-supervised learning (SSL) with Qizhe Xie, Zihang Dai, Eduard Hovy, &
@quocleix
. SOTA results on IMDB (with just 20 labeled examples!), SSL Cifar10 & SVHN (30% error reduction)!
Despite their strong performance on many tasks, large-scale pre-trained language models do not perform as well when limited labeled data is available (e.g., on small datasets or in few-shot settings). Collecting more labeled data can help but can also be prohibitively expensive.
STraTA further uses the auxiliary-task model created by task augmentation as a base model for self-training, where it is fine-tuned on the available labeled data for the target task and is then used to infer predictions (pseudo labels) on unlabeled data for subsequent training.
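The self-training loop described above can be sketched as follows. This is a hypothetical toy illustration, not the STraTA implementation: the `Model` class stands in for a real fine-tunable classifier, and its trivial keyword rule replaces actual inference.

```python
class Model:
    def __init__(self):
        self.seen = 0  # toy proxy for how much data the model has fit

    def fine_tune(self, examples):
        self.seen += len(examples)

    def predict(self, text):
        # Stand-in prediction used to pseudo-label unlabeled examples.
        return "pos" if "good" in text else "neg"

def self_train(base_model, labeled, unlabeled, rounds=2):
    # Start from the auxiliary-task model produced by task augmentation.
    model = base_model
    data = list(labeled)
    for _ in range(rounds):
        # Fine-tune on the current training set (labeled + pseudo-labeled).
        model.fine_tune(data)
        # Pseudo-label the unlabeled pool and fold it back in.
        data = list(labeled) + [(x, model.predict(x)) for x in unlabeled]
    return model

model = self_train(
    Model(),
    labeled=[("good movie", "pos")],
    unlabeled=["bad plot", "good cast"],
)
```

In practice one would also filter pseudo-labels by model confidence before adding them to the training set.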
@MohitIyyer
Our sentence content objective substantially boosts accuracy and generalization: on Yelp, with only 500 labeled examples, it outperforms training from scratch on 200× more data, which we hope will spur more linguistically-informed research into paragraph embedding methods. [5/5]
With STraTA, we are able to substantially improve sample efficiency across 12 NLP benchmark datasets. Remarkably, when given only 8 labeled examples per class from the SST-2 sentiment dataset, our approach is competitive with standard fine-tuning on all 67K labeled examples.
@alexjc
Thanks for the question! If you are curious about the performance of larger models, here are the results on MMLU:
Flan-T5 XL (3B): 52.4
Flan-T5 XXL (11B): 55.1
Flan-PaLM (540B): 73.5
Flan-U-PaLM (540B): 74.1
@swartchris8
@umasscs
Thanks,
@swartchris8
! Yeah, it could be a potential direction. Just found out that a follow-up work has tried out-of-class transfer from natural language inference to biomedical QA and observed a considerable boost in performance.
@MohitIyyer
How well do paragraph embeddings encode whether or not a given sentence appears in the paragraph? We extend the notion of probe tasks to the paragraph level and formulate a sentence content task to probe for this basic linguistic property. [2/5]