Yuhui Xu
@xyh6666
Followers
27
Following
83
Statuses
17
Excited to introduce Reward-Guided Speculative Decoding (RSD)—a novel framework designed to enhance the efficiency of large language model (LLM) inference by strategically balancing computational cost and output quality.
Check out our work on Reward-Guided Speculative Decoding! 🚀
• Use PRM for reward-guided sampling — a mixture distribution
• Prove binary weighting is optimal under budget constraints
• Saves 4.4× FLOPs in STEM
• Outperform speculative decoding 🔥💡
3
1
3
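The RSD tweet above describes the idea at a high level: a small draft model proposes steps, a process reward model (PRM) scores them, and a binary weighting rule decides whether to keep the draft step or hand the step to the large target model. The snippet below is a minimal sketch of that control flow only; the names (draft_step, target_step, prm_score, threshold) are illustrative placeholders, not the authors' actual API.

```python
# Sketch of reward-guided speculative decoding (RSD) as described in the tweet:
# high-reward draft steps are accepted outright (binary weight = 1), low-reward
# steps fall back to the expensive target model (binary weight = 0).
# All callables here are hypothetical placeholders.

def rsd_generate(prompt, draft_step, target_step, prm_score,
                 threshold=0.7, max_steps=32):
    """Generate a response step by step under a reward-guided mixture policy."""
    context = prompt
    for _ in range(max_steps):
        candidate = draft_step(context)          # cheap proposal from the small model
        reward = prm_score(context, candidate)   # PRM judges the partial output

        if reward >= threshold:
            # High-reward draft step: accept as-is, skipping the target model.
            step = candidate
        else:
            # Low-reward step: regenerate with the large target model.
            step = target_step(context)

        context += step
        if step.strip().endswith("<eos>"):       # illustrative stop condition
            break
    return context
```

The hard threshold is what the tweet calls binary weighting: rather than mixing draft and target distributions with fractional weights, each step is taken entirely from one model, which is what lets high-reward regions avoid target-model FLOPs altogether.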
RT @hendrydong: Check out our work on Reward-Guided Speculative Decoding! 🚀 • Use PRM for reward-guided sampling — a mixture distribution •…
0
17
0
RT @SFResearch: 💡 We revamped ThinK! 💡 Want to run bigger LLM batches on your GPU? 📎 Paper: 💻 Code: https://t.co/…
0
7
0
RT @NobelPrize: BREAKING NEWS The Royal Swedish Academy of Sciences has decided to award the 2024 #NobelPrize in Chemistry with one half to…
0
9K
0
RT @NobelPrize: BREAKING NEWS The Royal Swedish Academy of Sciences has decided to award the 2024 #NobelPrize in Physics to John J. Hopfiel…
0
14K
0
RT @silviocinguetta: Long sequences can be the Achilles' heel of LLMs. The ThinK method's 20% memory reduction without performance loss red…
0
4
0
RT @SFResearch: Increase #AIEfficiency with ThinK: the first channel pruning method designed for KV cache. By pruning 40-50% of key cache…
0
11
0
RT @CaimingXiong: It is very important to reduce KV cache memory consumption during long context inference. We introduce ThinK, a method t…
0
22
0
RT @ZeyuanAllenZhu: Incredibly honored and humbled by the overwhelming response to my tutorial, and thank you everyone who attended in pers…
0
188
0
Thanks for introducing our recent KV cache optimization method. The low-rank structure of attention weights is well known; building on it, we find that a large portion of the Key cache channels are redundant.
This work proposes an approach to address inefficiencies in KV cache memory consumption, focusing on long-context scenarios at inference time. It presents a query-dependent KV cache pruning method that selectively prunes the least significant Key channels while minimizing the loss in attention weights. "Our approach not only maintains or enhances model accuracy but also achieves a reduction in memory costs by over 20% compared with vanilla KV cache eviction methods."
0
3
2
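As a rough illustration of the query-dependent channel pruning described in the ThinK tweets above, the sketch below scores each Key-cache channel by its contribution to the attention logits for recent queries and keeps only the top fraction. This is a reconstruction of the idea under my own assumptions, not the released ThinK implementation; the function name, score definition, and keep_ratio are illustrative.

```python
# Sketch: query-dependent Key-cache channel pruning in the spirit of ThinK.
# Channels whose query-key products contribute least to the attention logits
# are dropped, shrinking the Key cache along the head dimension.

import torch

def prune_key_channels(queries, keys, keep_ratio=0.6):
    """
    queries: (num_queries, head_dim) recent query vectors for one head
    keys:    (seq_len, head_dim) cached Key vectors for the same head
    Returns pruned keys of shape (seq_len, kept_dim) and the kept channel indices.
    """
    # Per-channel importance: total magnitude of elementwise query-key products,
    # i.e. how much each channel contributes to Q·K^T (query-dependent score).
    channel_score = torch.einsum("qd,kd->d", queries.abs(), keys.abs())

    kept_dim = max(1, int(keep_ratio * keys.shape[-1]))
    kept_idx = torch.topk(channel_score, kept_dim).indices.sort().values

    return keys[:, kept_idx], kept_idx

# Usage note: attention is then computed with queries[:, kept_idx] against the
# pruned keys, so Key-cache memory shrinks roughly in proportion to keep_ratio.
```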