Mengzhou Xia Profile

Mengzhou Xia (@xiamengzhou)
Followers: 3,143 · Following: 715 · Media: 25 · Statuses: 228

PhD student @princeton_nlp, MS @CarnegieMellon, Undergrad at Fudan.

Princeton, NJ · Joined May 2015
Pinned Tweet
@xiamengzhou
Mengzhou Xia
1 month
🌟 Exciting update! Gemma2-9b + SimPO ranks at the top of AlpacaEval 2 (❗LC 72.4) and leads the WildBench leaderboard among similar-sized models 🚀 SimPO is at least as competitive as (and often outperforms) DPO across all benchmarks, despite its simplicity. ✨ Recipe: on-policy
Tweet media (four images)
7
42
175
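For reference, the SimPO objective behind these numbers scores each response with a length-normalized implicit reward and requires the chosen response to beat the rejected one by a target margin, with no reference model. A sketch in the paper's notation (β is the reward scale, γ the target reward margin):

```latex
\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x)
      - \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x)
      - \gamma \right)\right]
```

Unlike DPO, no reference policy appears in the loss, which is where the simplicity and memory savings come from.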
@xiamengzhou
Mengzhou Xia
11 months
We release the strongest public 1.3B and 3B models so far – the Sheared-LLaMA series. Structured pruning of a large model down to a small one is far more cost-effective (only 3% of the compute!) than pre-training the small model from scratch! Check out our paper and models at: [1/n]
Tweet media one
20
142
767
@xiamengzhou
Mengzhou Xia
7 months
Lots of instruction tuning data out there...but how to best adapt LLMs for specific queries? Don’t use ALL of the data, use LESS! 5% beats the full dataset. Can even use one small model to select data for others! Paper: Code: [1/n]
Tweet media one
13
97
437
@xiamengzhou
Mengzhou Xia
2 years
How do language models of different sizes learn during the course of pre-training? We study training trajectories using checkpoints of language models ranging from 125M to 175B parameters for a better understanding! Check out our new paper 📜: (1/N)
11
75
402
@xiamengzhou
Mengzhou Xia
6 months
I am honored to receive the Apple Scholars in AIML fellowship! Very grateful to my advisor, mentors and collaborators along the way :) Excited to keep exploring the Pareto-frontier of capabilities and efficiency of foundation models!
@PrincetonCS
Princeton Computer Science
6 months
Congrats to @xiamengzhou on receiving an Apple Scholars in AIML fellowship! 🎉🍏 The fellowship recognizes graduate students doing innovative and cutting-edge research in machine learning. Xia is part of @princeton_nlp , advised by @danqi_chen .
Tweet media one
0
13
46
16
4
208
@xiamengzhou
Mengzhou Xia
3 months
We train and evaluate extensively with various offline preference optimization algorithms, including DPO, KTO, ORPO, RDPO, and more. Hyperparameter tuning significantly impacts algorithm effectiveness. DPO performs consistently well, but SimPO is better!
@yumeng0818
Yu Meng
3 months
Introducing SimPO: Simpler & more effective Preference Optimization!🎉 Significantly outperforms DPO w/o a reference model!📈 Llama-3-8B-SimPO ranked among top on leaderboards!💪 ✅44.7% LC win rate on AlpacaEval 2 ✅33.8% win rate on Arena-Hard 🧵[1/n]
Tweet media one
9
79
440
1
27
194
@xiamengzhou
Mengzhou Xia
2 years
Check out our #acl2022 paper on CoFi☕️! Structured pruning is competitive with knowledge distillation but requires much less training time and no unlabeled data. Joint work w/ @ZexuanZhong, @danqi_chen Paper: Code: (1/5)
Tweet media one
5
36
145
@xiamengzhou
Mengzhou Xia
10 months
🌟We release the code for training Sheared-LLaMA here at . We're excited to see even stronger sheared models emerging in the future! 🤩 For more details, check out our preprint at .
@xiamengzhou
Mengzhou Xia
11 months
We release the strongest public 1.3B and 3B models so far – the Sheared-LLaMA series. Structured pruning of a large model down to a small one is far more cost-effective (only 3% of the compute!) than pre-training the small model from scratch! Check out our paper and models at: [1/n]
Tweet media one
20
142
767
2
34
146
@xiamengzhou
Mengzhou Xia
2 years
Check out our preprint on Prompting ELECTRA! We show that discriminative models like ELECTRA outperform generative MLMs like BERT and RoBERTa on zero-shot and few-shot prompting. Joint work w/ @artetxem , @JefferyDuu , @danqi_chen , @vesko_st Paper:
5
21
146
@xiamengzhou
Mengzhou Xia
2 years
I'm pleased and honored to receive the fellowship, and thankful to @TechAtBloomberg for supporting my research 😀
@TechAtBloomberg
Tech At Bloomberg
2 years
Congratulations to @PrincetonCS + @princeton_nlp 's @xiamengzhou on being named one of the 2022-2023 @Bloomberg #DataScience Ph.D. Fellows! Learn more about her research focus and the other Fellows in our newest cohort: #AI #ML #NLProc
Tweet media one
0
5
58
6
3
125
@xiamengzhou
Mengzhou Xia
1 year
Our LLM trajectory paper got accepted to #ACL2023 😊! Code and results are at . Looking forward to future work analyzing trajectories not only in pre-training but also in the more accessible yet mysterious process of instruction tuning with human feedback.
@xiamengzhou
Mengzhou Xia
2 years
How do language models of different sizes learn during the course of pre-training? We study training trajectories using checkpoints of language models ranging from 125M to 175B parameters for a better understanding! Check out our new paper 📜: (1/N)
11
75
402
3
16
114
@xiamengzhou
Mengzhou Xia
9 months
This is my first time attending #NeurIPS 🥳 I’d love to chat about efficient approaches for LLMs, learning dynamics/trajectories and more! DM me to grab a coffee together :)
5
2
90
@xiamengzhou
Mengzhou Xia
2 months
Excited to release CharXiv, a new benchmark that effectively reveals multimodal language models' true capabilities in understanding charts! Check out the fun video for a brief overview 🧵!
@zwcolin
Zirui "Colin" Wang
2 months
🤨 Are Multimodal Large Language Models really as 𝐠𝐨𝐨𝐝 at 𝐜𝐡𝐚𝐫𝐭 𝐮𝐧𝐝𝐞𝐫𝐬𝐭𝐚𝐧𝐝𝐢𝐧𝐠 as existing benchmarks such as ChartQA suggest? 🚫 Our ℂ𝕙𝕒𝕣𝕏𝕚𝕧 benchmark suggests NO! 🥇Humans achieve ✨𝟖𝟎+% correctness. 🥈Sonnet 3.5 outperforms GPT-4o by 10+ points,
7
33
143
3
10
80
@xiamengzhou
Mengzhou Xia
16 days
More pruned models come out from @NVIDIAAI 🦙! Structured pruning provides a highly compute-efficient way to create competitive small models from larger ones without training them from scratch, and its effectiveness could be amplified when paired with the right data 🌟!
@PavloMolchanov
Pavlo Molchanov
16 days
🚀 We've pruned LLaMa3.1 down to 4B parameters, delivering a smaller and more efficient model! Based on our recent paper: 📖 Learn all about it in our blog: 🔗 META's announcement: 👐 Checkpoints at HF this
Tweet media one
8
91
313
1
2
48
@xiamengzhou
Mengzhou Xia
1 month
Sometimes users want to retrieve documents beyond semantic meanings🤔, such as searching for - a document that presents a side argument to a question - a math problem that employs the same underlying theorem as another problem - a code snippet that utilizes a similar algorithm.
@hongjin_su
Hongjin Su
1 month
Retrieval benchmarks saturated? Introducing BRIGHT✨, a realistic and challenging benchmark that requires intensive reasoning to retrieve relevant documents. 🧠📚 Key features: 🔍Reasoning-intensive: Low keyword and semantic overlap between queries and documents. Intensive
Tweet media one
4
53
190
0
3
47
@xiamengzhou
Mengzhou Xia
11 months
Our Sheared-LLaMA-3B model, pruned from the LLaMA2-7B model and further pre-trained on 50B tokens, outperforms the strongest open-source 3B model, OpenLLaMA-v2, which was trained on 1T tokens. Sheared-LLaMA can further improve with more compute. [2/n]
Tweet media one
5
5
45
@xiamengzhou
Mengzhou Xia
10 months
❓Ever wondered whether an article or a book snippet was part of an LLM's pre-training data? We develop Min-K% Prob to find out! 📚 We show strong evidence that text-davinci-003 has been trained on copyrighted books! 🤗Absolutely enjoyed working together on this project!
@WeijiaShi2
Weijia Shi
10 months
Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔 We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs). Check out our project: [1/n]
Tweet media one
16
139
663
1
4
45
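For readers curious how Min-K% Prob works: a candidate text is scored by the average log-likelihood of its k% least-probable tokens under the model; member texts tend to have fewer extreme low-probability outliers, so a higher score suggests the text was seen in pre-training. A minimal sketch assuming a Hugging Face causal LM (function name and the 20% default are illustrative):

```python
import torch

def min_k_prob_score(model, tokenizer, text, k=0.2):
    """Average log-prob of the k% least-probable tokens; a higher score suggests
    the text was more likely seen during pre-training (gist of Min-K% Prob)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                       # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs[torch.arange(ids.size(1) - 1), ids[0, 1:]]  # log p(x_t | x_<t)
    n = max(1, int(k * token_lp.numel()))
    return torch.topk(token_lp, n, largest=False).values.mean().item()
```

In practice a threshold on this score (chosen on a calibration set such as WikiMIA) separates likely-seen from likely-unseen texts.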
@xiamengzhou
Mengzhou Xia
11 months
Simply changing the generation setting (top-p, top-k, temperature) breaks the alignment of open-source safety tuned models like LLaMA2-Chat 😰
@YangsiboHuang
Yangsibo Huang
11 months
Are open-source LLMs (e.g. LLaMA2) well aligned? We show how easy it is to exploit their generation configs for CATASTROPHIC jailbreaks ⛓️🤖⛓️ * 95% misalignment rates * 30x faster than SOTA attacks * insights for better alignment Paper & code at: [1/8]
Tweet media one
7
45
342
0
2
35
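The quoted attack boils down to sampling with non-default decoding configurations. A hedged sketch of such a sweep using the standard generate API (model id and prompt are placeholders, not the paper's exact setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"   # placeholder: any safety-tuned chat model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "..."                               # evaluation prompt (elided)
inputs = tok(prompt, return_tensors="pt")

# Sweep decoding configurations instead of relying on the default generation config.
for temperature in (0.7, 1.0, 1.5):
    for top_p in (0.9, 1.0):
        out = model.generate(**inputs, do_sample=True, temperature=temperature,
                             top_p=top_p, max_new_tokens=64)
        print(temperature, top_p, tok.decode(out[0], skip_special_tokens=True))
```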
@xiamengzhou
Mengzhou Xia
8 months
Excited to see that the performant 2.7B VinaLLaMA model is pruned from its 7B counterpart! And it was developed with the structured pruning technique used by Sheared-LLaMA! ☺️
@stablequan
qnguyen3
9 months
Today, I released my first paper, VinaLLaMA. The state-of-the-art LLM for Vietnamese, based on LLaMA-2. Continued pretrain and SFT 100% with synthetic data. Special thanks to @Teknium1 & @ldjconfirmed . Their OpenHermes and Capybara datasets helped me a lot
12
21
137
0
1
32
@xiamengzhou
Mengzhou Xia
5 months
Fine-tuning on benign data (e.g., List 3 planets in our solar system) significantly breaks model safety 😨
@LuxiHeLucy
Luxi (Lucy) He
5 months
Fine-tuning on benign data (e.g. Alpaca) can jailbreak models unexpectedly. We study this problem through a data-centric perspective and find that some seemingly benign data could be more harmful than explicitly malicious data! ⚠️🚨‼️ Paper: [1/n]
Tweet media one
6
32
156
0
4
32
@xiamengzhou
Mengzhou Xia
11 months
How do we get there? Our pruning algorithm LLM-Shearing has two components: 1) Targeted structured pruning, where we prune the source model to a specified target architecture (e.g., that of an existing LM) while maximizing its performance. [4/n]
Tweet media one
3
2
27
@xiamengzhou
Mengzhou Xia
11 months
2) Dynamic batch loading, where we dynamically load data from each domain to enable efficient use of pre-training data. The procedure adds no overhead compared to standard pre-training! [5/n]
Tweet media one
2
0
24
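A rough sketch of the dynamic batch loading idea (not the exact implementation): domains whose validation loss still exceeds a per-domain reference loss are upweighted before the next data-loading step.

```python
import numpy as np

def update_domain_weights(weights, cur_loss, ref_loss):
    """Rough sketch of dynamic batch loading: upweight domains whose validation loss
    still exceeds a reference loss, then renormalize into sampling proportions."""
    weights = np.asarray(weights, dtype=float)
    delta = np.maximum(np.asarray(cur_loss, dtype=float) - np.asarray(ref_loss, dtype=float), 0.0)
    weights = weights * np.exp(delta)          # multiplicative update toward lagging domains
    return weights / weights.sum()

# Called every m training steps with per-domain validation losses (numbers illustrative):
w = update_domain_weights([1/3, 1/3, 1/3],
                          cur_loss=[2.10, 1.95, 2.40],
                          ref_loss=[2.00, 2.00, 2.20])
```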
@xiamengzhou
Mengzhou Xia
1 year
Analyzing training trajectories is a fascinating way to gain insights into LLMs, and thanks to the open-sourced checkpoints and data from @AiEleuther, it has become easier to explore!
@BlancheMinerva
Stella Biderman
1 year
Have you ever wanted to do an experiment on LLMs and found that none of the existing model suites met your needs? At @AiEleuther we got tired of this happening and so designed a model suite that centers enabling scientific research as its primary goal
12
183
888
1
0
23
@xiamengzhou
Mengzhou Xia
1 month
Strong llama-3 based long-context models made by my amazing labmates!! Carefully curated data recipes lead to consistently strong performance across the board 🤩
@gaotianyu1350
Tianyu Gao
1 month
Meet ProLong, a Llama-3 based long-context chat model! (64K here, 512K coming soon) ProLong uses a simple recipe (short/long pre-training data + short UltraChat, no synthetic instructions) and achieves top performance on a series of long-context tasks.
Tweet media one
4
24
140
0
0
22
@xiamengzhou
Mengzhou Xia
11 months
Our Sheared-LLaMA series also outperforms existing models when instruction-tuned on ShareGPT, demonstrating that pruning does not compromise long-text generation or instruction-following ability. [3/n]
Tweet media one
1
1
20
@xiamengzhou
Mengzhou Xia
11 months
It was an awesome collaboration with @gaotianyu1350 @ZhiyuanZeng_ @danqi_chen ! Stay tuned for the codebase, which is built on top of the Composer package for efficiency. We’d like to extend our sincere gratitude to the engineers at @mosaicml for their help :) [n/n]
1
0
20
@xiamengzhou
Mengzhou Xia
9 months
Yangsibo is an absolutely awesome researcher and amazing collaborator 🤩. She works in the important field of AI safety and security. Consider hiring her!
@YangsiboHuang
Yangsibo Huang
9 months
I am at #NeurIPS2023 now. I am also on the academic job market, and humbled to be selected as a 2023 EECS Rising Star✨. I work on ML security, privacy & data transparency. Appreciate any reposts & happy to chat in person! CV+statements: Find me at ⬇️
3
32
132
0
0
20
@xiamengzhou
Mengzhou Xia
2 years
Not all tokens' perplexity decreases during pre-training! We find that the perplexity of about 10% of next-token predictions surprisingly increases during training for the 1.3B model; on the same set of tokens, larger models show a double-descent trend where perplexity first increases and then decreases. (2/N)
Tweet media one
1
1
18
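A sketch of how such a per-token analysis can be set up: score the same evaluation text under several checkpoints or model sizes and track which tokens' log-probabilities fall. The OPT model ids below are stand-ins for the intermediate training checkpoints used in the paper, which are not public:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_logprobs(model_id, text):
    """Log-probability of each next token under one model/checkpoint."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    lp = torch.log_softmax(logits[0, :-1], dim=-1)
    return lp[torch.arange(ids.size(1) - 1), ids[0, 1:]]   # log p(x_t | x_<t)

checkpoints = ["facebook/opt-125m", "facebook/opt-1.3b"]    # stand-ins; the study uses intermediate checkpoints
text = "A fixed evaluation passage scored under every checkpoint."
trajectories = {c: per_token_logprobs(c, text) for c in checkpoints}
# Tokens whose log-prob drops across checkpoints/scales are the "perplexity increases" cases.
```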
@xiamengzhou
Mengzhou Xia
5 months
Check out our new blog post on data selection! Important future problems to think about:
- How to select synthetic data? Agent trajectory data?
- How to select data for any differentiable objective?
@SadhikaMalladi
Sadhika Malladi
5 months
Dataset choice is crucial in today's ML training pipeline. We ( @xiamengzhou and I) introduce desiderata for "good" data and explain how our recent algorithm, LESS, fits into the picture. Huge review of data selection algs for pre-training and fine-tuning!
Tweet media one
2
53
203
0
1
19
@xiamengzhou
Mengzhou Xia
7 months
LESS doesn’t rely on heuristics, so it doesn’t fall for superficial similarities! It identifies datapoints with the same reasoning type as the provided examples. Given a Bengali QA example, LESS selects an English QA example! (Other methods select Bengali examples from other tasks.) [4/n]
Tweet media one
2
0
17
@xiamengzhou
Mengzhou Xia
11 months
Correction: We made an error in our original post - the latest StableLM-3B outperforms our ShearedLLaMA-3B on the Open LLM Leaderboard, but we were not aware of it at the time of writing. Additionally, BTLM-3B achieves similar results to ours. Thanks for pointing this out!
1
0
18
@xiamengzhou
Mengzhou Xia
11 months
We show interesting results in the paper about:
- comparing to further fine-tuning an existing LLM
- coding and math abilities of the models
- comparing to other pruning techniques
- different source models to prune from
… [6/n]
1
0
17
@xiamengzhou
Mengzhou Xia
4 months
@OhadRubin We are on it!! Stay tuned :) It will be named as Llama3-Sheared though
1
1
14
@xiamengzhou
Mengzhou Xia
10 months
Min-K% Prob also helps identify questions or book snippets that recent works were thought to have removed, but that still persist within LLMs! 📚🔍
@YangsiboHuang
Yangsibo Huang
10 months
Microsoft's recent work () shows how LLMs can unlearn copyrighted training data via strategic finetuning: They made Llama2 unlearn Harry Potter's magical world. But our Min-K% Prob () found some persistent “magical traces”!🔮 [1/n]
Tweet media one
4
50
244
0
1
13
@xiamengzhou
Mengzhou Xia
4 months
Congratulations to @shuyanzhxyc ! Can't wait to see what you will build next 🤩
@shuyanzhxyc
Shuyan Zhou
4 months
I am thrilled to announce that I will be joining @DukeU @dukecompsci as an Assistant Professor in summer 2025. Super excited for the next chapter! Stay tuned for the launch of my lab 🧠🤖
Tweet media one
111
29
558
1
0
13
@xiamengzhou
Mengzhou Xia
7 months
LESS outperforms baselines! [3/n]
Tweet media one
1
0
11
@xiamengzhou
Mengzhou Xia
10 days
@jefrankle For criterion-based pruning solutions, both LLM-Pruner (…) and wanda () are very easy to use. For a learning-based pruning solution, we have a repo written with an old version of composer! …
0
0
11
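For context on the criterion-based route mentioned here, Wanda scores each weight by its magnitude times the norm of the corresponding input activation and drops the lowest-scoring weights within each output row. A minimal sketch of that criterion, not the official implementation:

```python
import torch

def wanda_style_prune(weight, act_norm, sparsity=0.5):
    """Score each weight by |W_ij| * ||X_j||_2 and zero the lowest-scoring
    weights within each output row (the Wanda criterion, roughly)."""
    score = weight.abs() * act_norm.unsqueeze(0)             # (out_dim, in_dim)
    k = int(sparsity * weight.size(1))
    drop = torch.topk(score, k, dim=1, largest=False).indices
    mask = torch.ones_like(weight)
    mask.scatter_(1, drop, 0.0)
    return weight * mask
```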
@xiamengzhou
Mengzhou Xia
2 years
Finally, we look at in-context learning on downstream tasks! We evaluate 74 BIG-Bench tasks on intermediate model checkpoints of up to 175B parameters, and find that validation perplexity is a better predictor of in-context learning ability than FLOPs! (5/N)
Tweet media one
1
1
11
@xiamengzhou
Mengzhou Xia
11 months
@Teknium1 We weren't aware of the release of Cerebras 3B or StableLM when the work was done. The field is moving too fast to catch up! But as @HanchungLee pointed out, the point of our work is to show that structured pruning can be an efficient approach to producing strong small-scale LLMs.
1
0
10
@xiamengzhou
Mengzhou Xia
2 years
Recently, @AiEleuther released hundreds of intermediate checkpoints of language models with up to 13B parameters, and we think it will be exciting to continue research on understanding language models using these open-sourced checkpoints! (7/N)
1
0
10
@xiamengzhou
Mengzhou Xia
7 months
4 easy and efficient steps. The crucial component: a theoretically motivated influence formulation, specialized to instruction tuning with Adam. [2/n]
Tweet media one
1
0
10
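The gist of the influence formulation, heavily simplified: rank each candidate training example by the similarity between its (projected, optimizer-aware) gradient and the average gradient of a few target-task examples. A hedged sketch with illustrative dimensions (the real method uses LoRA gradients and Adam preconditioning):

```python
import torch
import torch.nn.functional as F

def influence_score(train_grad, val_grads, proj):
    """Gist of gradient-based selection: cosine similarity between a randomly projected
    training gradient and the mean projected validation gradient."""
    t = proj @ train_grad
    v = torch.stack([proj @ g for g in val_grads]).mean(dim=0)
    return F.cosine_similarity(t, v, dim=0).item()

# Dimensions illustrative: project full gradients down to a small feature size.
n_params, d = 10_000, 128
proj = torch.randn(d, n_params) / d ** 0.5
score = influence_score(torch.randn(n_params), [torch.randn(n_params) for _ in range(4)], proj)
```

The top-scoring ~5% of the candidate pool is then kept for fine-tuning.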
@xiamengzhou
Mengzhou Xia
7 months
Joint work with my awesome collaborators @SadhikaMalladi (equal contribution) @ssgrn @prfsanjeevarora @danqi_chen at @princeton_nlp , @PrincetonPLI , and @uwnlp ! [5/n]
0
0
10
@xiamengzhou
Mengzhou Xia
2 years
We decode such texts and find them grammatical but hallucinated. They do follow an inverse scaling trend on the final model checkpoints (left)! The trajectories of the small model (125M, blue) and the other, larger models diverge as training goes on (right). (4/N)
Tweet media one
2
0
9
@xiamengzhou
Mengzhou Xia
2 years
The perplexity of human-generated sequences decreases as the model scale increases. Even texts with noise and factually wrong prompts still follow this scaling pattern. LMs' probability assignment is a zero-sum game; what texts do small models favor more than large ones? (3/N)
Tweet media one
1
0
9
@xiamengzhou
Mengzhou Xia
1 month
@srush_nlp Here are some preliminary ablations! For SimPO, a learning rate between 6e-7 and 8e-7 appears relatively safe, while DPO requires a range between 3e-7 and 5e-7. The llama3-instruct models seem more brittle, showing significant variance across different lrs.
Tweet media one
2
1
8
@xiamengzhou
Mengzhou Xia
1 month
@srush_nlp Hi @srush_nlp , we noticed that this issue results from training llama3-instruct with a large learning rate. When training llama3-instruct with a smaller lr, this issue could be mitigated at the cost of reducing the chat scores. But we find that gemma models present much less
Tweet media (three images)
1
0
7
@xiamengzhou
Mengzhou Xia
3 months
@fe1ixxu Hi Haoran, thank you for bringing this issue to our attention! We would like to clarify that SimPO and CPO differ significantly. SimPO includes a length normalization term and a target reward margin, which are the two major designs of our objective. Ablation studies in Table 5
0
0
7
@xiamengzhou
Mengzhou Xia
2 years
Our experiments are mainly conducted on intermediate checkpoints of OPT models. There is much more in our paper, and please check it out for details if you are interested! 😀😀😀 (6/N)
1
0
6
@xiamengzhou
Mengzhou Xia
11 months
@ocolegro We didn't compare against Phi primarily due to its reliance on non-public datasets that may have data leakage concerns.
1
0
5
@xiamengzhou
Mengzhou Xia
11 months
@Teknium1 @HanchungLee We added a correction post, thanks for bringing this to our attention!
1
0
5
@xiamengzhou
Mengzhou Xia
2 years
Analysis shows that MLMs like RoBERTa could assign comparable probabilities to antonyms like terrible and great, which in turn feeds the ideal negatives to ELECTRA for contrastive training. (4/4)
Tweet media one
1
0
5
@xiamengzhou
Mengzhou Xia
2 years
We propose CoFi ☕️ (Coarse- and Fine-grained) pruning to prune heads, intermediate dimensions, hidden dimensions, multi-head attention layers and feed-forward layers all together! We also propose a dynamic layer distillation loss to further guide the pruning process. (2/5)
Tweet media one
1
0
5
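To make the "prune everything jointly" idea concrete, here is a toy sketch of mask-based pruning on a single FFN block; in CoFi the masks are learned with L0 regularization rather than fixed as below:

```python
import torch

def masked_ffn(x, W1, W2, z_int, z_ffn):
    """Sketch of CoFi-style masking on an FFN block: z_int masks individual intermediate
    dimensions (fine-grained), z_ffn gates the whole FFN sublayer (coarse-grained)."""
    h = torch.relu(x @ W1) * z_int       # prune intermediate dimensions
    return (h @ W2) * z_ffn              # prune the entire FFN layer

x = torch.randn(4, 768)
W1, W2 = torch.randn(768, 3072), torch.randn(3072, 768)
z_int = (torch.rand(3072) > 0.5).float()    # stand-in for a learned mask
out = masked_ffn(x, W1, W2, z_int, z_ffn=torch.tensor(1.0))
```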
@xiamengzhou
Mengzhou Xia
2 years
ELECTRA is pre-trained to discriminate if tokens are original or replaced. We adapt this objective to do template-based prompting. The higher the probability of the token to be original, the more likely it is the right answer. (1/4)
Tweet media one
1
0
3
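A minimal sketch of this scoring scheme using the public ELECTRA discriminator (checkpoint, template, and single-token verbalizers are illustrative assumptions, not the paper's exact setup):

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

model_id = "google/electra-base-discriminator"     # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = ElectraForPreTraining.from_pretrained(model_id)

template = "The movie was thrilling from start to finish. It was {}."
candidates = ["great", "terrible"]                 # assumed single-token verbalizers

scores = {}
for word in candidates:
    enc = tok(template.format(word), return_tensors="pt")
    word_id = tok(word, add_special_tokens=False).input_ids[0]
    pos = (enc.input_ids[0] == word_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**enc).logits[0]            # per-token "replaced" logits
    scores[word] = torch.sigmoid(-logits[pos]).item()   # probability the token is original

print(max(scores, key=scores.get))   # pick the verbalizer ELECTRA finds most "original"
```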
@xiamengzhou
Mengzhou Xia
11 months
@_joaogui1 I think we are explaining from a development point of view: if strong large models already exist, the more cost-efficient way to build small LMs is to prune them and then continue pre-training.
1
0
4
@xiamengzhou
Mengzhou Xia
3 months
@natolambert Thanks for sharing the insights! What would you suggest is the best way to check model abilities at this point?
1
0
3
@xiamengzhou
Mengzhou Xia
11 months
@SebastienBubeck @ocolegro Thanks for the reply and sorry about the inaccurate expression in the original thread. We meant to compare to models trained on public web data. Training on high quality synthetic data is orthogonal to our approach. We believe combining both can lead to stronger smaller models :)
0
0
3
@xiamengzhou
Mengzhou Xia
2 years
With the base sized model, ELECTRA outperforms BERT and RoBERTa by 7.9 and 3.5 points on zero-shot prediction and 10.2 and 3.1 points on few-shot prompt-based fine-tuning (√ denotes with prompt). The margin on standard finetuning is 3.3 and 1.2 points. (2/4)
Tweet media one
1
0
3
@xiamengzhou
Mengzhou Xia
10 months
@LChoshen @srush_nlp @sebschu @boazbaraktcs @artetxem @LukeZettlemoyer @vesko_st I recently worked on a project to prune a Llama2-7b model to 1.3B and 2.7B and also have intermediate checkpoints of those pruned models during the continued pre-training stage. I would like to share them with you if you think they could be helpful.
1
0
3
@xiamengzhou
Mengzhou Xia
8 months
@ssgrn @Meta Congrats!!! 🎊🎉🥳
0
0
2
@xiamengzhou
Mengzhou Xia
2 years
CoFi outperforms a series of distillation and pruning baselines when compared under the same speedup ratio or model size, especially in the high-sparsity regime! (3/5)
Tweet media one
1
0
2
@xiamengzhou
Mengzhou Xia
2 years
We find that for highly compressed models, the middle layers are more likely to be pruned but the first and last few layers are largely retained. (5/5)
Tweet media one
0
0
2
@xiamengzhou
Mengzhou Xia
2 years
We also extend the framework to tasks with multi-token options by aggregating either the representations or the output probabilities and find that ELECTRA outperforms RoBERTa as well. (3/4)
Tweet media (two images)
1
0
2
@xiamengzhou
Mengzhou Xia
11 months
@mrm8488 @BramVanroy We are still working on the repo, stay tuned!
0
0
2
@xiamengzhou
Mengzhou Xia
2 years
In particular, CoFi closes the gap between structured pruning and knowledge distillation with much less computation and only task-specific data. (4/5)
Tweet media one
1
0
2
@xiamengzhou
Mengzhou Xia
11 months
@mermolenko Thanks for your interest! We plan to release the repo soon.
0
0
2
@xiamengzhou
Mengzhou Xia
4 years
@anas_ant @VolgenauSchool Congrats!!! 👏👏👏
0
0
1
@xiamengzhou
Mengzhou Xia
11 months
@taolei15949106 @gaotianyu1350 We didn’t try other objectives, as the min-max objective is surprisingly robust and leads to good results. Would be nice to see if other objectives could possibly be more compute/data efficient for pruning!
0
0
1
@xiamengzhou
Mengzhou Xia
11 months
@max_paperclips Thanks for your interest — We will release the code soon, stay tuned :)
0
0
1
@xiamengzhou
Mengzhou Xia
10 months
@BlancheMinerva Will also add the FPR soon!
0
0
1
@xiamengzhou
Mengzhou Xia
11 months
@AlbalakAlon We use a held-out validation set of 2M tokens per domain; RP has 7 domains, which amounts to 14M tokens in total. When continuing to pre-train the pruned 3B model on 16 GPUs, validation only takes around 1% of the wall-clock time, about 0.9 min each time.
1
0
1
@xiamengzhou
Mengzhou Xia
1 year
@BlancheMinerva @AiEleuther The entire code base is unfortunately not compatible with HF transformers. But I can help sort out the key model-free functions!
0
0
1
@xiamengzhou
Mengzhou Xia
2 years
@cindyxinyiwang @gneubig @seb_ruder Congratulations Cindy!!! 🎈🎉🎊
0
0
1
@xiamengzhou
Mengzhou Xia
11 months
@tngc029 @Tim_Dettmers We are still working on the repo! Will publicize it soon :)
0
0
1
@xiamengzhou
Mengzhou Xia
7 months
@taiwei_shi Actually, AlpaGasus and LESS operate in different settings (blog coming soon!). We study a transfer setting, whereas AlpaGasus (and KNN-based methods) can access ample in-domain data and resemble quality filtering for coreset selection.
0
0
1
@xiamengzhou
Mengzhou Xia
1 month
@WenhuChen Thanks @WenhuChen , MMLU-Pro looks amazing, will give it a try!
1
0
1
@xiamengzhou
Mengzhou Xia
7 months
@taiwei_shi Our experiments show instruction tuning boosts performance on MMLU and BBH. Also, LESS is easy to adapt to new tasks, so we will try others in the future!
0
0
1
@xiamengzhou
Mengzhou Xia
11 months
@alignment_lab Sequence length is the same as LLaMA2, which is 4K.
0
0
1
@xiamengzhou
Mengzhou Xia
11 months
@Euclaise_ We added a correction post, thanks for pointing this out!
0
0
1
@xiamengzhou
Mengzhou Xia
10 months
@BlancheMinerva Thanks for your interest in our new work :) We greatly appreciate Pythia's openness in data transparency, model sharing, and checkpoint accessibility. And we will make sure to include discussions around it in our next revision!
0
0
1
@xiamengzhou
Mengzhou Xia
5 years
@anas_ant Sweet sweet
0
0
1
@xiamengzhou
Mengzhou Xia
4 years
@belinda_nlp Congrats Belinda!
0
0
1