Zexuan Zhong Profile
Zexuan Zhong (@ZexuanZhong)
Followers: 1,549 · Following: 635 · Media: 28 · Statuses: 128
@ZexuanZhong
Zexuan Zhong
16 days
Grok-2 is here! 🚀 It's been incredibly exciting working with the brightest minds since joining. So proud of the team @xAI!
Quoted post from @xai (xAI) · 17 days
@ZexuanZhong
Zexuan Zhong
2 years
Very excited to share a preprint, “Training Language Models with Memory Augmentation”! We propose a new training objective, TRIME, for language modeling—inspired by contrastive learning—which aligns the model's representations with both token embeddings and *in-batch memories*. 1/n
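To make the objective concrete, here is a minimal PyTorch-style sketch of a TRIME-like loss written from the description above, not the released code; the tensor names (hidden, token_emb, mem_keys, mem_targets) and the temperature tau are placeholders of mine. The idea: treat the gold token's output embedding and any in-batch memory that predicts the same token as positives, and maximize the total probability mass assigned to them.

import torch
import torch.nn.functional as F

def trime_like_loss(hidden, token_emb, targets, mem_keys, mem_targets, tau=1.0):
    # hidden:      (B, D) hidden state at each prediction position
    # token_emb:   (V, D) output token embeddings
    # targets:     (B,)   gold next-token ids
    # mem_keys:    (M, D) in-batch memory representations (hidden states of earlier tokens)
    # mem_targets: (M,)   the token each memory predicts
    vocab_logits = hidden @ token_emb.t() / tau            # similarity to every token embedding
    mem_logits = hidden @ mem_keys.t() / tau               # similarity to every in-batch memory
    log_probs = F.log_softmax(torch.cat([vocab_logits, mem_logits], dim=-1), dim=-1)
    # positives: the gold token embedding plus memories whose target matches the gold token
    vocab_pos = F.one_hot(targets, num_classes=token_emb.size(0)).bool()
    mem_pos = mem_targets.unsqueeze(0) == targets.unsqueeze(1)
    pos = torch.cat([vocab_pos, mem_pos], dim=-1)
    # maximize the total probability assigned to all positives
    return -torch.logsumexp(log_probs.masked_fill(~pos, float("-inf")), dim=-1).mean()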
@ZexuanZhong
Zexuan Zhong
4 months
Introducing Lory, a fully-differentiable MoE architecture for decoder LM pre-training! Lory merges expert FFNs by computing a weighted average in parameter space and computes the output through the merged FFN. But naive training is infeasible; how do we make it work? Details in 🧵
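A minimal sketch of the parameter-space merging idea as described in this tweet (not the paper's implementation; module and tensor names are mine): instead of routing each token to discrete experts, compute soft routing weights and average the expert FFN weight matrices themselves, then run a single merged FFN.

import torch
import torch.nn as nn

class SoftMergedFFN(nn.Module):
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.w_in = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)   # expert up-projections
        self.w_out = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)  # expert down-projections
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x, routing_repr):
        # x: (B, L, d_model); routing_repr: (B, d_model), one routing vector per example/segment
        gate = torch.softmax(self.router(routing_repr), dim=-1)       # (B, n_experts)
        w_in = torch.einsum('be,efd->bfd', gate, self.w_in)           # weighted average of expert weights
        w_out = torch.einsum('be,edf->bdf', gate, self.w_out)
        h = torch.relu(torch.einsum('bld,bfd->blf', x, w_in))         # run the single merged FFN
        return torch.einsum('blf,bdf->bld', h, w_out)

Because the average is taken over weights rather than over expert outputs, everything stays differentiable; the open questions the thread addresses are how to route causally and how to batch data so the experts specialize.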
@ZexuanZhong
Zexuan Zhong
3 years
Dense retrieval models (e.g. DPR) achieve SOTA on various datasets. Does this really mean dense models are better than sparse models (e.g. BM25)? No! Our #EMNLP2021 paper shows that dense retrievers fail even on simple entity-centric questions. (1/6)
@ZexuanZhong
Zexuan Zhong
1 year
If we use model editors to update the British Prime Minister from Boris Johnson to Rishi Sunak, can the edited LMs answer “Who is married to the British Prime Minister?” Releasing MQuAKE to assess knowledge editing methods on multi-hop questions! Paper: [1/n]
@ZexuanZhong
Zexuan Zhong
10 months
💡 You can do speculative decoding without a small LM or any additional training! Check out Retrieval-Based Speculative Decoding (REST)! Paper: Blog: Code:
@tianle_cai
Tianle Cai
10 months
If training's got you in a stew, take a REST and speed right through! 😎 Thrilled to introduce Retrieval-Based Speculative Decoding (REST), a plug-and-play method for accelerating language model decoding. 👇
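A toy sketch of the retrieval idea, written as a brute-force paraphrase of the tweet (the actual REST system uses an efficient datastore rather than a linear scan, and the function name and parameters here are made up): find the longest suffix of the current context inside a reference corpus and propose the tokens that followed it as a draft, which the target LM then verifies as in standard speculative decoding.

def retrieve_draft(context_ids, corpus_ids, max_suffix=16, min_suffix=2, max_draft=8):
    # context_ids / corpus_ids: lists of token ids
    for n in range(min(max_suffix, len(context_ids)), min_suffix - 1, -1):
        suffix = context_ids[-n:]
        for i in range(len(corpus_ids) - n):
            if corpus_ids[i:i + n] == suffix:
                # propose the continuation that followed this suffix in the corpus
                return corpus_ids[i + n:i + n + max_draft]
    return []   # no match: fall back to ordinary decoding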
@ZexuanZhong
Zexuan Zhong
3 years
Excited to share our #NAACL2021 paper on factual probing! “Factual Probing is [MASK]: Learning vs. Learning to Recall” Paper: Code: Joint work with @danfriedman0 and @danqi_chen .
@ZexuanZhong
Zexuan Zhong
9 months
At #EMNLP2023 🇸🇬! I will be presenting our projects on benchmarking knowledge editing () and attacking dense retrievers (). DM me if you want to grab a coffee; happy to chat about anything interesting!
@ZexuanZhong
Zexuan Zhong
2 years
TRIME has been accepted at #emnlp2022! 😃 The updated version includes new/stronger results on domain adaptation, MT, etc. We have made our code and pre-trained models publicly available! Paper: Code: w/ @taolei15949106 @danqi_chen
@ZexuanZhong
Zexuan Zhong
2 years
Heading to Abu Dhabi to attend #emnlp2022! Can’t wait to meet new and old friends!!
@ZexuanZhong
Zexuan Zhong
4 months
Two key techniques:
1) Causal segment routing
⚠️ Merging per token is too expensive
✅ We merge experts per segment and keep the autoregressive property
2) Similarity-based batching
⚠️ Training on concatenated random docs leads to bad experts
✅ We concatenate similar docs to form training instances
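A rough sketch of what causal segment routing might look like, based only on the description above (not the paper's code; the names and the mean-pooled summary are assumptions of mine): the merging weights for segment s are computed from a summary of segment s-1, so no token ever depends on future tokens.

import torch

def causal_segment_gates(hidden, router, segment_len):
    # hidden: (B, L, D); router: callable mapping (..., D) -> (..., n_experts), e.g. torch.nn.Linear(D, n_experts)
    B, L, D = hidden.shape
    n_seg = L // segment_len
    segs = hidden[:, :n_seg * segment_len].view(B, n_seg, segment_len, D)
    summaries = segs.mean(dim=2)                          # one pooled summary per segment
    gates = torch.softmax(router(summaries), dim=-1)      # (B, n_seg, n_experts)
    uniform = torch.full_like(gates[:, :1], 1.0 / gates.size(-1))
    # segment s uses the gates computed from segment s-1; segment 0 falls back to uniform mixing
    return torch.cat([uniform, gates[:, :-1]], dim=1)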
@ZexuanZhong
Zexuan Zhong
4 months
Please find more details in our preprint: Shoutout to my amazing collaborators @xiamengzhou @danqi_chen @ml_perception. This was done during my internship at Meta. Excited to finally share it!!
@ZexuanZhong
Zexuan Zhong
2 years
Joint work with @taolei15949106 and @danqi_chen . Code and models coming soon! n/n
@ZexuanZhong
Zexuan Zhong
2 years
We also devise novel ways of batching data and constructing training memories, so that our models can leverage *long-range contexts* and an *external datastore* effectively. 3/n
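One way to picture the batching idea (a simplified sketch under my own assumptions, not the paper's exact recipe): pack consecutive segments of the same document into one batch, so the in-batch memories that the objective aligns with come from nearby context in the same long document; memories drawn from an external datastore can be appended the same way.

def consecutive_segment_batches(doc_token_ids, seg_len, batch_size):
    # split one document into consecutive segments ...
    segs = [doc_token_ids[i:i + seg_len]
            for i in range(0, len(doc_token_ids) - seg_len + 1, seg_len)]
    # ... and group neighbouring segments into the same batch so that each segment's
    # in-batch memories come from the same document's long-range context
    return [segs[i:i + batch_size] for i in range(0, len(segs), batch_size)]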
@ZexuanZhong
Zexuan Zhong
2 years
We show that simply replacing the standard language modeling objective with ours can improve perplexity significantly! 2/n
@ZexuanZhong
Zexuan Zhong
4 months
Exciting results from Lory models! Trained from scratch on 150B tokens, Lory models with 0.3B/1.5B active parameters and up to 32 experts reach the same loss as dense models in 2.5x fewer steps!
@ZexuanZhong
Zexuan Zhong
4 months
What’s more? Lory not only excels in performance but also learns *domain-level expert specialization*, while previous token-level MoEs rely on shallow features for routing! 🤔 They may be complementary -- more possibilities ahead!
@ZexuanZhong
Zexuan Zhong
4 months
How does Lory compare to other MoE models? Despite using segment-level routing, Lory achieves competitive performance compared to SoTA conventional token-level MoEs, such as expert-choice (EC)!
@ZexuanZhong
Zexuan Zhong
2 years
We show significant gains over kNN-LM/kNN-MT and models that explicitly leverage long-range context (e.g., Transformer-XL) — all we change is the training objective and data batching! 4/n
@ZexuanZhong
Zexuan Zhong
4 months
Lory also achieves great downstream performance with ICL!
@ZexuanZhong
Zexuan Zhong
9 months
I have been working with Yangsibo on several projects. She is super strong -- don't miss her if you are hiring!
@YangsiboHuang
Yangsibo Huang
9 months
I am at #NeurIPS2023 now. I am also on the academic job market, and humbled to be selected as a 2023 EECS Rising Star✨. I work on ML security, privacy & data transparency. Appreciate any reposts & happy to chat in person! CV+statements: Find me at ⬇️
@ZexuanZhong
Zexuan Zhong
1 year
A big shout out to my collaborators! This is joint work with @ZhengxuanZenWu, @chrmanning, @ChrisGPotts, and @danqi_chen. @princeton_nlp @stanfordnlp [n/n]
@ZexuanZhong
Zexuan Zhong
1 year
We build MQuAKE by first creating a knowledge graph based on Wikidata, encompassing entities and relations among them. We then create multi-hop questions based on chains of facts from the knowledge graph along with edits. [2/n]
@ZexuanZhong
Zexuan Zhong
3 years
(2) Data-driven methods are able to exploit this information. We design control experiments where we apply these methods to randomly initialized models. We show that even with random initialization, these methods can find prompts that recover a non-trivial number of “facts”.
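The control setup can be reproduced in a few lines with Hugging Face Transformers (a sketch; the probed architecture, bert-base-cased, is my assumption): build the same architecture with freshly initialized weights and run the prompt-search method against it.

from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("bert-base-cased")
random_model = AutoModelForMaskedLM.from_config(config)   # same architecture, randomly initialized weights
# run the same prompt-optimization procedure on random_model: any "facts" it
# recovers cannot have come from pre-trained knowledge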
@ZexuanZhong
Zexuan Zhong
3 years
We fine-tune DPR on these simple questions and find that updating the passage encoder is particularly crucial for good results. Our visualization also shows that gold passage vectors for these questions are clustered together, making them difficult to discriminate. (4/6)
@ZexuanZhong
Zexuan Zhong
3 years
We decouple two distinct aspects of these questions: the entities and the question patterns. We find that dense retrieval models can only generalize to common entities or to question patterns that have been observed during training. (3/6)
@ZexuanZhong
Zexuan Zhong
3 years
Our results suggest that one should not simply interpret the accuracy of a data-driven prompt on LAMA as a lower bound on how much knowledge a language model stores. The control experiments allow us to form a more detailed understanding of the behavior of different probes.
@ZexuanZhong
Zexuan Zhong
9 months
@danqi_chen oops.. thanks!
@ZexuanZhong
Zexuan Zhong
3 years
We study two simple techniques aimed at fixing the issue. We find that (1) data augmentation is unable to consistently improve performance on new questions; (2) fixing a robust passage index and specializing the question encoder leads to memory-efficient transfer to new domains. (5/6)
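In code, the second recipe amounts to something like this (a minimal PyTorch sketch with assumed variable names, not the released training script): keep the passage encoder, and hence the existing passage index, frozen, and fine-tune only the question encoder on the new domain.

import torch

def build_transfer_optimizer(question_encoder, passage_encoder, lr=1e-5):
    # freeze the passage encoder so the existing passage index stays valid
    for p in passage_encoder.parameters():
        p.requires_grad = False
    # only the question encoder is updated on the new domain
    return torch.optim.AdamW(question_encoder.parameters(), lr=lr)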
@ZexuanZhong
Zexuan Zhong
3 years
We construct EntityQuestions, consisting of simple, entity-rich questions such as “Where was Arve Furset born?”. We find dense retrieval models drastically underperform sparse models! (2/6)
@ZexuanZhong
Zexuan Zhong
4 months
@FlyingKid16 The difference is that SMEAR only works for encoder models: it fine-tunes T5 (with adapters) on text classification tasks, where instance-level routing decisions arise naturally. We pre-train a decoder model and manage to handle per-token prediction in an autoregressive way.
@ZexuanZhong
Zexuan Zhong
1 year
Surprisingly, existing knowledge editing methods can inject facts and recall them accurately (high edit-wise accuracy), but they fail catastrophically on multi-hop questions (low multi-hop accuracy)! [4/n]
@ZexuanZhong
Zexuan Zhong
1 year
We propose a simple yet effective method that serves as a strong baseline for future work! MeLLo requires no training and stores edits in a memory that can be accessed by any retriever. It prompts LLMs to self-check and perform model editing on the fly! [5/n]
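A schematic of a MeLLo-style loop as described here (my paraphrase, not the released prompts or code; llm and retrieve are placeholder callables): decompose the multi-hop question into subquestions, answer each tentatively with the base LM, retrieve the most relevant edited fact from memory, and let the model self-check whether that fact contradicts its tentative answer.

def mello_style_answer(question, edited_facts, llm, retrieve, max_hops=4):
    answer = None
    for _ in range(max_hops):
        sub_q = llm(f"Question: {question}\nAnswer so far: {answer}\n"
                    f"Next subquestion (or DONE):").strip()
        if sub_q == "DONE":
            break
        tentative = llm(f"Answer briefly: {sub_q}").strip()
        fact = retrieve(sub_q, edited_facts)              # most relevant edited fact from memory
        # self-check: if the retrieved edit contradicts the tentative answer, use the edit instead
        answer = llm(f"Fact: {fact}\nTentative answer: {tentative}\n"
                     f"If the fact contradicts the answer, give the corrected answer; "
                     f"otherwise repeat the answer:").strip()
    return answer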
@ZexuanZhong
Zexuan Zhong
1 year
To evaluate a knowledge editing technique, we allow it to see a set of edited facts (e.g., “The current British Prime Minister is Rishi Sunak”). Then, we ask multi-hop questions that are related to the edited fact (e.g., “Who's married to the British Prime Minister?”). [3/n]
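For illustration, a made-up record in the spirit of this setup (not an actual MQuAKE entry): an edit plus a multi-hop question whose answer must reflect the edit.

edit = {
    "subject": "United Kingdom",
    "relation": "head of government",
    "old_object": "Boris Johnson",
    "new_object": "Rishi Sunak",
}
multi_hop_question = {
    "question": "Who is married to the British Prime Minister?",
    "fact_chain": [("United Kingdom", "head of government", "?x"), ("?x", "spouse", "?y")],
    "answer_after_edit": "Akshata Murty",
}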
@ZexuanZhong
Zexuan Zhong
4 months
@rosinality Good question. It does not back-prop to all FFNs for each token, but only for each segment; the gradients of the merged FFN are aggregated before back-propagating to each expert. Only the communication cost might become an issue as models get larger (see solutions in Sec. H).
@ZexuanZhong
Zexuan Zhong
9 months
@ShunyuYao12 lol if you book a flight to Singapore now, we can do it on Sunday!
@ZexuanZhong
Zexuan Zhong
1 year
MeLLo shows great performance across different settings, and outperforms SoTA methods. It works well with LMs at different scales! [6/n]