![Hanshi Sun Profile](https://pbs.twimg.com/profile_images/1548448355054804993/l7tPPGWw_x96.jpg)
Hanshi Sun
@preminstrel
Followers: 76
Following: 142
Statuses: 47
MLSys | MS @CMU_ECE | Research Intern @BytedanceOSS
Pittsburgh, US
Joined December 2019
RT @BeidiChen: 🆘 Worrying about long sequence encoding time? Wanna do prefix- or RAG caching but autoregressive nature of LLMs requires re-…
RT @InfiniAILab: ❓Struggling with serving high-throughput long-context LLMs? 📢 Introducing ShadowKV! 🚀 Achieve high-throughput long-contex…
RT @BeidiChen: Come and join us at poster #15!!! Also I’ll be here Mon-Thur #COLM2024 Excited to chat about recent research with old and…
Visit our poster at @COLM_conf on Monday morning to discover our lossless acceleration method 𝑻𝒓𝒊𝑭𝒐𝒓𝒄𝒆 for long sequence generation using speculative decoding! Feel free to DM me if you want to chat during the conference. Excited to connect with new and familiar faces!
❓Wanna host a Llama2-7B-128K (14GB weight + 64GB KV cache) at home🤔
📢 Introducing TriForce! 🚀Lossless Ultra-Fast Long Seq Generation — training-free Spec Dec! 🌟
🔥 TriForce serves with 0.1s/token on 2 RTX4090s + CPU – only 2x slower than an A100 (~55ms on chip), 8x faster than baseline.
💡 TriForce outperforms DeepSpeed by 5x on a single RTX 4090 and boosts Llama2-7B-128K by 2.3x on an A100.
👇 Curious to dive deeper? Explore more about TriForce! 🌍
🔗 Blog:
📜 Paper:
💻 Code:
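For context, here is a minimal sketch of the lossless verification step that training-free speculative decoding relies on. This is an illustrative implementation of the standard speculative sampling rule, not TriForce's released code; all names and shapes are assumptions.

```python
# Illustrative sketch of lossless speculative decoding verification
# (standard speculative sampling; hypothetical names, not TriForce's code).
import torch

def verify(draft_tokens, draft_probs, target_probs):
    """Accept/reject draft tokens so the output matches the target model.

    draft_tokens: (gamma,) token ids proposed by the draft model
    draft_probs:  (gamma, vocab) draft distributions at each position
    target_probs: (gamma + 1, vocab) target distributions (one extra for the bonus token)
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i, tok]
        q = draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(int(tok))                     # accept the draft token
        else:
            # resample from the residual distribution max(p - q, 0)
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            return accepted                               # stop at the first rejection
    # all drafts accepted: take one bonus token from the target model
    accepted.append(int(torch.multinomial(target_probs[-1], 1)))
    return accepted
```

Because rejected positions are resampled from the residual distribution, the generated sequence follows the target model's distribution exactly, which is what makes the acceleration lossless.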
RT @InfiniAILab: 🌕🌑Introducing Sirius: Contextual Sparsity with Correction for Efficient LLMs 🚫🚫🚫Do you know that Sparse LLMs struggle wit…
@EigendorfVon @BeidiChen @chenzhuoming911 @Xinyu2ML @tydsh @AIatMeta I suggest you read this blog: which will give you a good understanding and some background. Then you can delve into speculative decoding to see what kind of bottleneck it tackles. Finally, you will be able to read our work smoothly! 😃
@EigendorfVon @BeidiChen @chenzhuoming911 @Xinyu2ML @tydsh @AIatMeta Hello! Thank you for your interest in our work. You can find our code and detailed usage instructions, along with related resources such as our paper and blog posts, on our GitHub page:
RT @BeidiChen: 📢 Our new work LESS leverages the observation that pretrained LLMs Attention has intrinsically sparse+lowrank structure. ☝️S…
@YouJiacheng I think it is related to the writing: before Sec 3.2 we have not yet introduced the chunk retrieval method, so at that point it is only an observation.
@YouJiacheng Oh yes. Then we only retrieve once here. However, as you say, it is only a recovery rate, not an acceptance rate.
@YouJiacheng Yes. However, this would lead to growing drafting latency, especially for the on-chip experiments. Since we use CUDA graphs to speed up drafting, we want the cache budget to stay static; dynamic budgets would need some engineering work.
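A rough illustration of why the static budget matters (hypothetical shapes and buffer names, not the actual TriForce code): CUDA graph capture records fixed tensor shapes and addresses, so the retrieved KV cache has to live in a pre-allocated buffer of constant size that is overwritten in place.

```python
# Sketch: CUDA graphs require static shapes, so the drafting cache is a
# fixed-size buffer updated in place (illustrative example; warm-up
# iterations before capture are omitted for brevity).
import torch

BUDGET, HEADS, DIM = 4096, 32, 128          # fixed KV budget per layer (assumed)
k_buf = torch.zeros(1, HEADS, BUDGET, DIM, device="cuda", dtype=torch.float16)
v_buf = torch.zeros_like(k_buf)
query = torch.zeros(1, HEADS, 1, DIM, device="cuda", dtype=torch.float16)

def draft_step(q, k, v):
    # one attention step over the fixed-budget cache
    attn = torch.softmax(q @ k.transpose(-1, -2) / DIM ** 0.5, dim=-1)
    return attn @ v

# capture once with static buffers ...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    out = draft_step(query, k_buf, v_buf)

# ... then replay every drafting step: new data is copied into the same
# buffers, so their shapes (and hence the cache budget) can never change.
k_buf.copy_(torch.randn_like(k_buf))
v_buf.copy_(torch.randn_like(v_buf))
query.copy_(torch.randn_like(query))
graph.replay()
```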
@YouJiacheng Basically, you do not need to select chunks for each query. Just like in the prefill phase, you can simply use the latest query.
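A hedged sketch of what "just use the latest query" could look like (illustrative tensors and function names, not the released implementation): even when a step produces several queries, only the most recent one ranks the chunks, exactly as one would at the end of prefill.

```python
# Sketch: rank chunks once with the latest query instead of per query
# (hypothetical shapes; not the actual TriForce code).
import torch

def select_chunks(chunk_reps, queries, top_k):
    """chunk_reps: (num_chunks, dim) one representative vector per chunk
    queries:      (num_queries, dim) queries from the current step
    Only the latest query is used to rank chunks, as in the prefill phase."""
    scores = chunk_reps @ queries[-1]             # (num_chunks,)
    return torch.topk(scores, k=top_k).indices

chunk_reps = torch.randn(512, 128)                # e.g. 512 context chunks
queries    = torch.randn(4, 128)                  # several queries this step
kept = select_chunks(chunk_reps, queries, top_k=64)
# `kept` stays fixed for the whole drafting window, so no per-token retrieval.
```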
@YouJiacheng Wait, are you talking about batching? Let's make sure we are on the same page.
@YouJiacheng The reason for chunking is that if you want to retrieve several times, it is better to keep an averaged K cache, which reduces your retrieval latency.