Hanshi Sun

@preminstrel

Followers: 76 · Following: 142 · Statuses: 47

MLSys | MS @CMU_ECE | Research Intern @BytedanceOSS

Pittsburgh, US
Joined December 2019
@preminstrel
Hanshi Sun
1 day
RT @BeidiChen: 🆘 Worrying about long sequence encoding time? Wanna do prefix- or RAG caching but autoregressive nature of LLMs requires re-…
0
13
0
@preminstrel
Hanshi Sun
3 months
RT @InfiniAILab: ❓Struggling with serving high-throughput long-context LLMs? 📢 Introducing ShadowKV! 🚀 Achieve high-throughput long-contex…
0
11
0
@preminstrel
Hanshi Sun
4 months
RT @BeidiChen: Come and join us at poster #15!!! Also I’ll be here Mon-Thur #COLM2024 Excited to chat about recent research with old and…
0
3
0
@preminstrel
Hanshi Sun
4 months
Visit our poster at @COLM_conf on Monday morning to discover our lossless acceleration method 𝑻𝒓𝒊𝑭𝒐𝒓𝒄𝒆 for long sequence generation using speculative decoding! Feel free to DM me if you want to chat during the conference. Excited to connect with new and familiar faces!
@BeidiChen
Beidi Chen
10 months
❓Wanna host a Llama2-7B-128K (14GB weight + 64GB KV cache) at home🤔 📢 Introducing TriForce! 🚀Lossless Ultra-Fast Long Seq Generation — training-free Spec Dec! 🌟 🔥 TriForce serves with 0.1s/token on 2 RTX4090s + CPU – only 2x slower on an A100 (~55ms on chip), 8x faster than baseline. 💡 TriForce outperforms DeepSpeed by 5x on a single RTX 4090 and boosts Llama2-7B-128K by 2.3x on an A100. 👇 Curious to dive deeper? Explore more about TriForce! 🌍 🔗 Blog: 📜 Paper: 💻 Code:
0
4
16
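As background for the announcement above, here is a minimal sketch of the greedy speculative-decoding loop that training-free methods like TriForce build on: a cheap draft pass proposes a few tokens, and the target model verifies them in a single forward pass. The `draft_step` and `target_logits` callables and the greedy verification rule are illustrative assumptions, not TriForce's actual hierarchical-cache implementation.

```python
import torch

def speculative_decode(target_logits, draft_step, prompt_ids,
                       gamma=4, max_new_tokens=64):
    """Greedy speculative decoding: the draft proposes `gamma` tokens,
    the target verifies them in one forward pass, and we keep the longest
    prefix both models agree on plus one corrected token from the target."""
    ids = list(prompt_ids)
    while len(ids) - len(prompt_ids) < max_new_tokens:
        # 1. Draft: cheaply propose gamma candidate tokens (greedy).
        draft_ids, proposals = list(ids), []
        for _ in range(gamma):
            nxt = int(torch.argmax(draft_step(draft_ids)))
            proposals.append(nxt)
            draft_ids.append(nxt)

        # 2. Verify: a single target forward over prompt + proposals.
        #    target_logits returns next-token logits for every position.
        logits = target_logits(ids + proposals)              # [len+gamma, vocab]
        verified = torch.argmax(logits[len(ids) - 1:], dim=-1)

        # 3. Accept the longest agreeing prefix, then append the target's
        #    own token at the first disagreement (or as a bonus token).
        n_accept = 0
        for i, tok in enumerate(proposals):
            if tok != int(verified[i]):
                break
            n_accept += 1
        ids.extend(proposals[:n_accept])
        ids.append(int(verified[n_accept]))
    return ids
```

Per the tweets above, TriForce keeps this loop lossless while making the draft stage cheap for long contexts; the specifics of its retrieval-based hierarchical KV cache are in the linked paper and blog.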
@preminstrel
Hanshi Sun
5 months
RT @InfiniAILab: 🌕🌑Introducing Sirius: Contextual Sparsity with Correction for Efficient LLMs 🚫🚫🚫Do you know that Sparse LLMs struggle wit…
0
10
0
@preminstrel
Hanshi Sun
8 months
@jxyintheflesh Played until 5 a.m. last night 😭 and ended up late for work this morning
0
0
0
@preminstrel
Hanshi Sun
8 months
@EigendorfVon @BeidiChen @chenzhuoming911 @Xinyu2ML @tydsh @AIatMeta I suggest you read this blog: which will give you a good understanding and some background. Then you can delve into speculative decoding to see what kind of bottleneck it tackles. Finally, you will be able to read our work smoothly! 😃
1
0
1
@preminstrel
Hanshi Sun
8 months
@EigendorfVon @BeidiChen @chenzhuoming911 @Xinyu2ML @tydsh @AIatMeta Hello! Thank you for your interest in our work. You can find our code and detailed usage instructions, along with related resources such as our paper and blog posts, on our GitHub page:
1
0
1
@preminstrel
Hanshi Sun
10 months
RT @Xinyu2ML: Excited to announce the Workshop on Foundation Models in the Wild at @icmlconf 2024 (hybrid workshop). We welcome submi…
0
17
0
@preminstrel
Hanshi Sun
10 months
RT @BeidiChen: 📢 Our new work LESS leverages the observation that pretrained LLMs Attention has intrinsically sparse+lowrank structure. ☝️S…
0
29
0
@preminstrel
Hanshi Sun
10 months
@YouJiacheng I think it is related to the writing: before Sec. 3.2 we have not yet introduced the chunk retrieval method, so at that point it is only an observation.
2
0
0
@preminstrel
Hanshi Sun
10 months
@YouJiacheng Oh yes. Then we only retrieve once here. However, as you say, it is only a recovery rate, not an acceptance rate.
0
0
1
@preminstrel
Hanshi Sun
10 months
@YouJiacheng Yes. However, this will lead to growing drafting latency, especially for the on-chip experiments. Since we are using CUDA graphs to speed up drafting, we want the cache budget to be static. Supporting dynamic budgets would need some engineering work.
0
0
1
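To illustrate the point about a static cache budget, here is a sketch under assumptions (the `StaticKVCache` class, shapes, and layout are hypothetical, not the paper's code): a drafting KV cache pre-allocated to a fixed budget is only ever written in place, which is a precondition for capturing the drafting step as a CUDA graph. A fully graph-compatible path would also keep the attended length fixed (e.g., via masking), since captured graphs cannot change tensor shapes between replays.

```python
import torch

class StaticKVCache:
    """KV cache pre-allocated to a fixed token budget.
    The buffer is created once and only written in place, so the drafting
    forward pass sees stable tensor addresses and a fixed maximum shape."""

    def __init__(self, n_layers, n_heads, budget, head_dim,
                 device="cuda", dtype=torch.float16):
        # Layout: [layer, K or V, batch=1, head, token slot, head dim]
        self.buf = torch.zeros(n_layers, 2, 1, n_heads, budget, head_dim,
                               device=device, dtype=dtype)
        self.budget = budget
        self.pos = 0  # token slots filled so far

    def write(self, layer, k, v):
        # k, v: [1, n_heads, t, head_dim]; in-place copy, never resizes.
        t = k.shape[2]
        assert self.pos + t <= self.budget, "static budget exhausted"
        self.buf[layer, 0, :, :, self.pos:self.pos + t].copy_(k)
        self.buf[layer, 1, :, :, self.pos:self.pos + t].copy_(v)

    def advance(self, t):
        # Call once per decoding step, after every layer has written.
        self.pos += t
```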
@preminstrel
Hanshi Sun
10 months
@YouJiacheng Basically, you do not need to select chunks for each query. Just like in the prefill phase, you can simply use the latest query.
1
0
1
@preminstrel
Hanshi Sun
10 months
@YouJiacheng No, every decoding step. Therefore it’s a theoretical upper bound.
1
0
0
@preminstrel
Hanshi Sun
10 months
@YouJiacheng Wait, are you talking about batching? Let's make sure we are on the same page.
1
0
0
@preminstrel
Hanshi Sun
10 months
@YouJiacheng I think every query's KV cache is independent.
0
0
0
@preminstrel
Hanshi Sun
10 months
@YouJiacheng The reason for chunking is that if you want to retrieve several times, it is better to keep an averaged key cache, which reduces your retrieval latency.
1
0
0
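A minimal sketch of the chunked retrieval described in this thread, under illustrative assumptions (chunk size, budget, and the single-head layout are mine): keep one averaged key per chunk, score chunks against the latest query as in a prefill-style selection, and gather only the top-scoring chunks' KV for drafting.

```python
import torch

def select_kv_chunks(keys, values, query, chunk_size=16, chunks_to_keep=32):
    """keys, values: [seq_len, head_dim]; query: [head_dim], the latest decoding query.
    Returns the KV entries of the chunks whose averaged key scores highest."""
    seq_len, head_dim = keys.shape
    n_chunks = seq_len // chunk_size
    usable = n_chunks * chunk_size

    # One averaged key per chunk: retrieval scales with n_chunks, not seq_len.
    chunk_keys = keys[:usable].reshape(n_chunks, chunk_size, head_dim).mean(dim=1)

    # Score every chunk with the latest query only (no per-token retrieval).
    scores = chunk_keys @ query                                   # [n_chunks]
    top = torch.topk(scores, min(chunks_to_keep, n_chunks)).indices
    top = top.sort().values                                       # keep positional order

    # Expand chunk indices back to token positions and gather their KV.
    offsets = torch.arange(chunk_size, device=keys.device)
    idx = (top[:, None] * chunk_size + offsets).reshape(-1)
    return keys[idx], values[idx]
```

Averaging keys per chunk trades a little retrieval precision for a much smaller score matrix, which is the latency argument made in the reply above; a fixed `chunks_to_keep` also keeps the drafting cache at the static budget discussed earlier in the thread.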