![Hanshi Sun Profile](https://pbs.twimg.com/profile_images/1548448355054804993/l7tPPGWw_x96.jpg)
Hanshi Sun
@preminstrel
Followers: 76
Following: 142
Statuses: 47
MLSys | MS @CMU_ECE | Research Intern @BytedanceOSS
Pittsburgh, US
Joined December 2019
RT @BeidiChen: 🆘 Worrying about long sequence encoding time? Wanna do prefix- or RAG caching but autoregressive nature of LLMs requires re-…
RT @InfiniAILab: ❓Struggling with serving high-throughput long-context LLMs? 📢 Introducing ShadowKV! 🚀 Achieve high-throughput long-contex…
RT @BeidiChen: Come and join us at poster #15!!! Also I’ll be here Mon-Thur #COLM2024 Excited to chat about recent research with old and…
Visit our poster at @COLM_conf on Monday morning to discover our lossless acceleration method 𝑻𝒓𝒊𝑭𝒐𝒓𝒄𝒆 for long sequence generation using speculative decoding! Feel free to DM me if you want to chat during the conference. Excited to connect with new and familiar faces!
❓Wanna host a Llama2-7B-128K (14GB weight + 64GB KV cache) at home🤔
📢 Introducing TriForce! 🚀Lossless Ultra-Fast Long Seq Generation — training-free Spec Dec! 🌟
🔥 TriForce serves with 0.1s/token on 2 RTX4090s + CPU – only 2x slower than an A100 (~55ms on chip), 8x faster than baseline.
💡 TriForce outperforms DeepSpeed by 5x on a single RTX 4090 and boosts Llama2-7B-128K by 2.3x on an A100.
👇 Curious to dive deeper? Explore more about TriForce! 🌍
🔗 Blog:
📜 Paper:
💻 Code:
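For context, here is a minimal sketch of the lossless verification step that training-free speculative decoding relies on. This is an illustrative implementation of the standard speculative sampling rule, not TriForce's released code; all names and shapes are assumptions.

```python
# Illustrative sketch of lossless speculative decoding verification
# (standard speculative sampling; hypothetical names, not TriForce's code).
import torch

def verify(draft_tokens, draft_probs, target_probs):
    """Accept/reject draft tokens so the output matches the target model.

    draft_tokens: (gamma,) token ids proposed by the draft model
    draft_probs:  (gamma, vocab) draft distributions at each position
    target_probs: (gamma + 1, vocab) target distributions (one extra for the bonus token)
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i, tok]
        q = draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(int(tok))                     # accept the draft token
        else:
            # resample from the residual distribution max(p - q, 0)
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            return accepted                               # stop at the first rejection
    # all drafts accepted: take one bonus token from the target model
    accepted.append(int(torch.multinomial(target_probs[-1], 1)))
    return accepted
```

Because rejected positions are resampled from the residual distribution, the generated sequence follows the target model's distribution exactly, which is what makes the acceleration lossless.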
RT @InfiniAILab: 🌕🌑Introducing Sirius: Contextual Sparsity with Correction for Efficient LLMs 🚫🚫🚫Do you know that Sparse LLMs struggle wit…
@EigendorfVon @BeidiChen @chenzhuoming911 @Xinyu2ML @tydsh @AIatMeta I suggest you read this blog: which will give you a good understanding and some background. Then you can delve into speculative decoding to see what kind of bottleneck it tackles. Finally, you will be able to read our work smoothly! 😃
@EigendorfVon @BeidiChen @chenzhuoming911 @Xinyu2ML @tydsh @AIatMeta Hello! Thank you for your interest in our work. You can find our code and detailed usage instructions, along with related resources such as our paper and blog posts, on our GitHub page:
RT @BeidiChen: 📢 Our new work LESS leverages the observation that pretrained LLMs Attention has intrinsically sparse+lowrank structure. ☝️S…
@YouJiacheng I think it is related to the writing: before Sec 3.2 we have not yet introduced the chunk retrieval method, so at that point it is only an observation.
@YouJiacheng Oh yes. Then we only retrieve once here. However, as you say, it is only a recovery rate, not an acceptance rate.
@YouJiacheng Yes. However, this would lead to growing drafting latency, especially for the on-chip experiments. Since we use CUDA graphs to speed up drafting, we want the cache budget to stay static; dynamic budgets would need some engineering work.
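A rough illustration of why the static budget matters (hypothetical shapes and buffer names, not the actual TriForce code): CUDA graph capture records fixed tensor shapes and addresses, so the retrieved KV cache has to live in a pre-allocated buffer of constant size that is overwritten in place.

```python
# Sketch: CUDA graphs require static shapes, so the drafting cache is a
# fixed-size buffer updated in place (illustrative example; warm-up
# iterations before capture are omitted for brevity).
import torch

BUDGET, HEADS, DIM = 4096, 32, 128          # fixed KV budget per layer (assumed)
k_buf = torch.zeros(1, HEADS, BUDGET, DIM, device="cuda", dtype=torch.float16)
v_buf = torch.zeros_like(k_buf)
query = torch.zeros(1, HEADS, 1, DIM, device="cuda", dtype=torch.float16)

def draft_step(q, k, v):
    # one attention step over the fixed-budget cache
    attn = torch.softmax(q @ k.transpose(-1, -2) / DIM ** 0.5, dim=-1)
    return attn @ v

# capture once with static buffers ...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    out = draft_step(query, k_buf, v_buf)

# ... then replay every drafting step: new data is copied into the same
# buffers, so their shapes (and hence the cache budget) can never change.
k_buf.copy_(torch.randn_like(k_buf))
v_buf.copy_(torch.randn_like(v_buf))
query.copy_(torch.randn_like(query))
graph.replay()
```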
@YouJiacheng Basically, you do not need to select chunks for each query. Just like in the prefill phase, you can simply use the latest query.
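A hedged sketch of what "just use the latest query" could look like (illustrative tensors and function names, not the released implementation): even when a step produces several queries, only the most recent one ranks the chunks, exactly as one would at the end of prefill.

```python
# Sketch: rank chunks once with the latest query instead of per query
# (hypothetical shapes; not the actual TriForce code).
import torch

def select_chunks(chunk_reps, queries, top_k):
    """chunk_reps: (num_chunks, dim) one representative vector per chunk
    queries:      (num_queries, dim) queries from the current step
    Only the latest query is used to rank chunks, as in the prefill phase."""
    scores = chunk_reps @ queries[-1]             # (num_chunks,)
    return torch.topk(scores, k=top_k).indices

chunk_reps = torch.randn(512, 128)                # e.g. 512 context chunks
queries    = torch.randn(4, 128)                  # several queries this step
kept = select_chunks(chunk_reps, queries, top_k=64)
# `kept` stays fixed for the whole drafting window, so no per-token retrieval.
```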
@YouJiacheng Wait, are you talking about batching? Let's make sure we are on the same page.
@YouJiacheng The reason for chunking is that if you want to retrieve several times, it is better to keep an averaged K cache, which reduces your retrieval latency.