Serving multiple LLMs has emerged as a crucial and costly demand.
Want to co-serve multiple LLMs with better utilization?
Introducing MuxServe
- flexible spatial-temporal multiplexing
- up to 1.8x higher throughput
Blog:
Paper:
People often see LLMs as sequential decoders, but we show they can be easily adapted as fast parallel decoders!🔥🚀
Announcing consistency LLMs: teaching LLMs to predict the fixed point from any point on its Jacobi decoding trajectory
- LLMs can fast-forward token generation.
Still optimizing throughput for LLM Serving?
Think again: Goodput might be a better choice!
Splitting prefill from decode to different GPUs yields
- up to 4.48x goodput
- up to 10.2x stricter latency criteria
Blog:
Paper:
In the human cognitive process, we have the ability to form complete sentences in our minds before articulating them word by word. Can LLMs be taught to acquire this capability as well?
Our results show the answer is YES! ✅
We teach CLLMs to map any point on a Jacobi trajectory to the fixed point.
To encourage CLLMs to output the fixed point sequence from any point on a Jacobi trajectory, we minimize:
- consistency loss: quantifies the distance between the two points.
- auto-regressive loss: prevents the output distribution from drifting away from the pre-trained model's generated results.
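The two training terms above can be sketched in pure Python. This is a toy illustration, not the CLLM training code: real training computes these losses from model logits over full sequences, and all names here (`cllm_loss`, `cross_entropy`, the weight `w`) are illustrative.

```python
import math

def cross_entropy(logits, target_ids):
    # Mean negative log-likelihood over positions; `logits` is a list of
    # per-position score lists, `target_ids` the gold token ids.
    total = 0.0
    for scores, t in zip(logits, target_ids):
        logz = math.log(sum(math.exp(s) for s in scores))
        total += logz - scores[t]
    return total / len(target_ids)

def cllm_loss(student_logits, fixed_point_ids, ar_logits, ar_ids, w=1.0):
    # Consistency loss: push the model's predictions on an intermediate
    # Jacobi point toward the converged fixed-point tokens.
    consistency = cross_entropy(student_logits, fixed_point_ids)
    # AR loss: standard next-token loss on the pre-trained model's own
    # trajectory, to keep the output distribution from drifting.
    ar = cross_entropy(ar_logits, ar_ids)
    return consistency + w * ar
```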
We also identify that such acceleration is likely to stem from the existence of:
- Fast forwarding, where multiple consecutive tokens are correctly predicted in a single forward pass.
- Stationary tokens, which are correctly predicted and remain unaltered through subsequent iterations.
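For intuition, here is a minimal sketch of Jacobi (fixed-point) decoding. The "model" is a trivial deterministic rule so the example runs without an LLM; with a real model, each pass would score the full prefix, and fast-forwarding/stationary tokens are what make convergence take fewer passes than autoregressive decoding. All names are illustrative.

```python
def model_step(tokens):
    # Stand-in for an LLM forward pass: for every position i, "predict"
    # token i+1. Here the rule is simply next = (prev + 1) % 100.
    return [(t + 1) % 100 for t in tokens]

def jacobi_decode(prompt, n_new, max_iters=50):
    # Initialize the n_new future tokens with a guess (zeros), then
    # iterate: each pass updates ALL positions in parallel from the
    # previous iterate. A fixed point of this map equals the sequence
    # greedy autoregressive decoding would produce.
    seq = list(prompt) + [0] * n_new
    for it in range(max_iters):
        preds = model_step(seq[:-1])  # preds[i] predicts seq[i + 1]
        new_seq = list(prompt) + preds[len(prompt) - 1:]
        if new_seq == seq:            # converged: fixed point reached
            return seq[len(prompt):], it
        seq = new_seq
    return seq[len(prompt):], max_iters
```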
CLLM training cost is moderate:
- it’s a one-time overhead.
- CLLMs introduce no extra memory cost at inference time while achieving comparable or even better speedup in comparison with Medusa2/Eagle.
In cases where the dataset size is large, for example for
The paper is accepted to ICML '24 and is available on:
Check out our repo where you can find training recipes and CLLM checkpoints:
The work is done with amazing collaborators: @kou_siqi, Zhijie Deng, @Lanxiang_Hu, @haozhangml
Our experiments show CLLMs are up to 3.4x faster than the pre-trained models across a variety of benchmarks. In comparison with other SOTA methods like Medusa2, CLLMs achieve roughly the same speedup as Medusa2, with comparable scores on MT-bench.
However, CLLMs offer higher memory efficiency, since they add no extra parameters at inference time.
High throughput ≠ High goodput. Systems optimizing throughput can have low goodput under certain latency criteria.
With a simple latency constraint, a system with throughput = 10 requests per second may serve a goodput of only 3 requests per second.
That is over 3x worse QoS!
Why do existing systems fail to achieve high goodput? Interference!
Existing systems use continuous batching to colocate prefill and decode requests on the same GPU. This causes significant delays for decoding requests and also increases the latency of prefill requests.
We introduce disaggregation to fundamentally eliminate interference by splitting prefill from decode onto different GPUs. A request first enters a prefill worker to generate the first token. It then migrates to a decode worker and generates tokens one by one until it finishes.
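The handoff above can be sketched with two workers and a migration queue. This is a toy simulation, not DistServe's implementation: a real system transfers KV caches between GPUs, which we only mimic with plain dicts, and all names here are made up for illustration.

```python
from queue import Queue

prefill_q, decode_q, done = Queue(), Queue(), []

def prefill_worker():
    # Process the whole prompt in one pass, emit the first token plus a
    # (simulated) KV cache, then hand the request off to the decode side.
    while not prefill_q.empty():
        req = prefill_q.get()
        req["tokens"] = [len(req["prompt"])]   # stand-in for the first token
        req["kv"] = {"ctx": req["prompt"]}     # stand-in for the KV cache
        decode_q.put(req)

def decode_worker(max_new=3):
    # Generate token by token until finished; no prefill requests ever
    # run on this worker, so decoding latency stays interference-free.
    while not decode_q.empty():
        req = decode_q.get()
        while len(req["tokens"]) < max_new:
            req["tokens"].append(req["tokens"][-1] + 1)
        done.append(req)

prefill_q.put({"prompt": "hello"})
prefill_worker()
decode_worker()
```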
Goodput = # completed requests per second within latency criteria
It measures both cost💰 and UX🩷.
Apps today have diverse latency criteria. The most important ones are:
- TTFT (time to first token): initial response time (aka prefill)
- TPOT (time per output token): average time between subsequent output tokens (aka decode)
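Concretely, a request only counts toward goodput if it meets both SLOs. A minimal sketch (the SLO values and request data below are made up for illustration):

```python
def goodput(requests, ttft_slo, tpot_slo, duration_s):
    # A request counts toward goodput only if it meets BOTH latency
    # criteria: time-to-first-token AND time-per-output-token.
    ok = sum(1 for r in requests
             if r["ttft"] <= ttft_slo and r["tpot"] <= tpot_slo)
    return ok / duration_s

reqs = [
    {"ttft": 0.15, "tpot": 0.04},  # meets both SLOs
    {"ttft": 0.90, "tpot": 0.03},  # misses TTFT
    {"ttft": 0.20, "tpot": 0.12},  # misses TPOT
    {"ttft": 0.10, "tpot": 0.05},  # meets both
]
# With a 0.2s TTFT SLO and 0.05s TPOT SLO over a 1-second window,
# throughput is 4 req/s but goodput is only 2 req/s.
```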
Other techniques such as dynamic SplitFuse (i.e., chunked prefill + piggybacking) still suffer from prefill-decode interference. Compared to chunked prefill + piggybacking, disaggregation does not force applications to trade off TTFT against TPOT; they can adhere to both.
Our research prototype DistServe achieves
- up to 4.48x goodput
- up to 10.2x tighter latency requirements
compared to vLLM across different workloads with distinct latency requirements.
Colocation introduces a costly tradeoff. Companies have to choose between:
- Loosening latency criteria -> less satisfied users 💔
- Over-provisioning more resources -> more money spent 💰💰💰
@zraytam
Yes. Essentially, the takeaway is that LLMs can be trained with a new loss (consistency loss + AR loss) to make parallel decoding much more efficient.
@thomasahle
We follow causal masks and the same architecture & design as decoder-only autoregressive LLMs to validate our idea, because it requires minimal training cost and it works.
BERT-like architectures might require more extensive training experiments, but could be an interesting future direction.
@zraytam
No. For now we follow the same exact architecture, causal mask and decoding process as auto-regressive models. Other architectures could be a future direction.
@thomasahle
We use a customized forward function where we feed the tokens back in. Feeding hidden states back could be an interesting idea, but we haven't tried it.