Hao AI Lab

@haoailab

Followers: 638 · Following: 185 · Media: 22 · Statuses: 38

Hao AI Lab at UCSD. Our mission is to democratize large machine learning models, algorithms, and their underlying systems.

Joined March 2024
Pinned Tweet
@haoailab
Hao AI Lab
21 days
Serving multiple LLMs has emerged as a crucial and costly demand. Want to co-serve multiple LLMs with better utilization? Introducing MuxServe:
- flexible spatial-temporal multiplexing
- up to 1.8x higher throughput
Blog: Paper:
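For intuition on why multiplexing raises utilization, here is a back-of-the-envelope sketch. The per-model demand numbers are made up for illustration, and the calculation ignores MuxServe's actual spatial (GPU compute partitioning) and temporal (interleaving) mechanisms; it only shows the packing effect.

```python
# Back-of-the-envelope utilization with and without multiplexing.
# The per-model demand numbers are purely illustrative, not from MuxServe.
import math

demand = {"llm_a": 0.6, "llm_b": 0.25, "llm_c": 0.15}   # fraction of one GPU each model needs

# Dedicated serving: one GPU per model, each GPU mostly idle.
dedicated_util = sum(demand.values()) / len(demand)

# Multiplexed serving: pack all three models onto just enough GPUs.
gpus_needed = math.ceil(sum(demand.values()))
muxed_util = sum(demand.values()) / gpus_needed

print(f"dedicated:   {dedicated_util:.0%} average utilization on {len(demand)} GPUs")
print(f"multiplexed: {muxed_util:.0%} average utilization on {gpus_needed} GPU(s)")
```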
@haoailab
Hao AI Lab
2 months
People often see LLMs as sequential decoders, but we show they can easily be adapted into fast parallel decoders! 🔥🚀 Announcing consistency LLMs (CLLMs): teaching LLMs to predict the fixed point from any point on their Jacobi decoding trajectory, so the LLM can fast-forward token generation.
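To make "parallel decoding via Jacobi iteration" concrete, here is a minimal, self-contained sketch. The toy next_token function stands in for an LLM's greedy argmax; it is not the lab's code.

```python
# Minimal Jacobi decoding on a toy deterministic "model" (illustrative only).
def next_token(prefix):
    """Stand-in for an LLM's greedy argmax over a prefix."""
    return (sum(prefix) * 7 + 3) % 50

def autoregressive_decode(prompt, n):
    out = []
    for _ in range(n):                              # n sequential forward passes
        out.append(next_token(prompt + out))
    return out

def jacobi_decode(prompt, n, max_iters=100):
    guess = [0] * n                                 # arbitrary initial n-token guess
    for it in range(1, max_iters + 1):
        # One parallel pass: every position is re-predicted from the prompt
        # plus the current guess of the tokens before it.
        new = [next_token(prompt + guess[:i]) for i in range(n)]
        if new == guess:                            # fixed point reached
            return guess, it
        guess = new
    return guess, max_iters

prompt = [5, 1, 9]
jacobi, iters = jacobi_decode(prompt, n=8)
assert jacobi == autoregressive_decode(prompt, n=8)  # fixed point == AR output
print(f"fixed point reached after {iters} parallel iterations")
# An untrained model needs roughly n iterations; CLLMs are trained so far fewer suffice.
```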
@haoailab
Hao AI Lab
4 months
Still optimizing throughput for LLM serving? Think again: goodput might be a better choice! Splitting prefill and decode onto different GPUs yields:
- up to 4.48x goodput
- up to 10.2x stricter latency criteria
Blog: Paper:
@haoailab
Hao AI Lab
4 months
🎉 Excited to announce that our paper DistServe was accepted to #OSDI #OSDI24 ! Congrats to our great team: @YinminZhong , Shengyu Liu, @Junda_Chen_ , Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, @haozhangml
@haoailab
Hao AI Lab
2 months
In the human cognitive process, we can form complete sentences in our minds before articulating them word by word. Can LLMs be taught to acquire this capability as well? Our results show the answer is YES! ✅ We teach CLLMs to map any point on a Jacobi trajectory to the fixed point.
@haoailab
Hao AI Lab
2 months
To encourage CLLMs to output the fixed-point sequence from any point on a Jacobi trajectory, we minimize:
- a consistency loss, which quantifies the distance between the two points.
- an auto-regressive loss, which prevents the output distribution from drifting away from the generated result from …
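A schematic of how such a two-term objective could be written in PyTorch. The function name, the KL-based distance, and the weighting alpha are illustrative assumptions, not the paper's exact formulation.

```python
# Schematic CLLM-style objective in PyTorch (illustrative, not the paper's exact code).
import torch
import torch.nn.functional as F

def cllm_style_loss(logits_at_point, logits_at_fixed_point, ar_logits, ar_targets, alpha=1.0):
    # Consistency term: pull predictions made from an intermediate Jacobi point
    # toward the model's own predictions at the trajectory's fixed point.
    consistency = F.kl_div(
        F.log_softmax(logits_at_point, dim=-1),
        F.softmax(logits_at_fixed_point.detach(), dim=-1),
        reduction="batchmean",
    )
    # Auto-regressive term: ordinary next-token cross-entropy, keeping the
    # output distribution close to standard AR generation.
    ar = F.cross_entropy(ar_logits.reshape(-1, ar_logits.size(-1)), ar_targets.reshape(-1))
    return ar + alpha * consistency

B, T, V = 2, 4, 10                                   # toy batch/sequence/vocab sizes
loss = cllm_style_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                       torch.randn(B, T, V), torch.randint(0, V, (B, T)))
print(float(loss))
```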
@haoailab
Hao AI Lab
2 months
We also identify that such acceleration likely stems from:
- Fast forwarding, where multiple consecutive tokens are correctly predicted in a single forward pass.
- Stationary tokens, which are correctly predicted and remain unaltered through subsequent iterations.
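A small illustration of how these two effects could be counted when comparing consecutive Jacobi iterates against the fixed point; the bookkeeping below is hypothetical, not the paper's metric.

```python
# Illustrative bookkeeping for the two effects (not the paper's exact metrics).
def fast_forward_count(prev_correct_prefix, new_iterate, fixed_point):
    """How many additional consecutive tokens became correct in one forward pass."""
    i = prev_correct_prefix
    while i < len(fixed_point) and new_iterate[i] == fixed_point[i]:
        i += 1
    return i - prev_correct_prefix

def stationary_positions(prev_iterate, new_iterate, fixed_point):
    """Positions that were already correct and stayed unchanged after this pass."""
    return [i for i, (p, n, f) in enumerate(zip(prev_iterate, new_iterate, fixed_point))
            if p == f and n == f]

fixed = [4, 8, 15, 16, 23, 42]        # the Jacobi fixed point (= AR output)
prev  = [4, 8,  0,  0, 23,  0]        # correct prefix of length 2, lucky hit at position 4
new   = [4, 8, 15, 16, 23,  7]        # one pass fast-forwards 3 consecutive tokens

print(fast_forward_count(2, new, fixed))       # 3  (positions 2, 3, 4 all became correct)
print(stationary_positions(prev, new, fixed))  # [0, 1, 4]
```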
@haoailab
Hao AI Lab
2 months
CLLM training cost is moderate:
- It is a one-time overhead.
- CLLMs introduce no extra memory cost at inference time, while achieving comparable or even better speedup than Medusa2/Eagle.
In cases where the dataset size is large, for example for …
@haoailab
Hao AI Lab
2 months
The paper is accepted to ICML '24 and is available on: . Check out our repo, where you can find training recipes and CLLM checkpoints: The work is done with amazing collaborators: @kou_siqi, Zhijie Deng, @Lanxiang_Hu, @haozhangml
@haoailab
Hao AI Lab
2 months
Our experiments show CLLMs are up to 3.4x faster than the pre-trained models across a variety of benchmarks. Compared with other SOTA methods like Medusa2, CLLMs achieve roughly the same speedup, with comparable scores on MT-bench. However, CLLMs offer higher …
@haoailab
Hao AI Lab
4 months
Flexible GPU allocation. Disaggregation enables flexible GPU allocation for prefill and decode. Here, 2 GPUs for prefill + 1 GPU for decode (right) achieve 3.3 rps per GPU, compared to 1.6 rps with colocation (left). Simple disaggregation yields 2x goodput!
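A back-of-the-envelope bottleneck model shows why the allocation ratio matters; the per-GPU prefill and decode capacities below are hypothetical, not DistServe's measurements.

```python
# Bottleneck model of a disaggregated deployment (capacities below are made up).
def per_gpu_goodput(n_prefill, n_decode, prefill_rps_per_gpu, decode_rps_per_gpu):
    system_rps = min(n_prefill * prefill_rps_per_gpu, n_decode * decode_rps_per_gpu)
    return system_rps / (n_prefill + n_decode)

# Hypothetical capacities: a prefill GPU sustains 5 rps, a decode GPU 10 rps,
# each while meeting its respective latency target.
print(per_gpu_goodput(1, 1, 5, 10))   # 2.5  rps per GPU: the decode GPU sits half idle
print(per_gpu_goodput(2, 1, 5, 10))   # 3.33 rps per GPU: the 2P1D ratio balances both stages
```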
@haoailab
Hao AI Lab
4 months
High throughput ≠ high goodput. Systems that optimize throughput can have low goodput under certain latency criteria. With a simple latency constraint, a system with throughput = 10 requests per second may only serve goodput = 3 requests per second. That is 3x fewer requests delivered within the QoS target!
@haoailab
Hao AI Lab
4 months
Why do existing systems fail to achieve high goodput? Interference! Existing systems use continuous batching to colocate prefill and decode requests on the same GPU. This causes significant delays for decoding requests and also increases the latency of prefill requests.
@haoailab
Hao AI Lab
4 months
We introduce disaggregation to fundamentally eliminate interference by splitting prefill and decode onto different GPUs. A request first enters a prefill worker to generate the first token. It then migrates to a decode worker and generates tokens one by one until it finishes.
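A toy end-to-end sketch of this request flow, with stand-in prefill/decode functions (hypothetical names; not DistServe's code):

```python
# Schematic prefill/decode disaggregation (toy stand-ins, not DistServe's code).
from queue import Queue

def fake_prefill(prompt):                 # compute-heavy: whole prompt in one pass
    kv_cache = list(prompt)               # toy "KV cache"
    return kv_cache, [len(prompt)]        # cache + first generated token

def fake_decode_step(tokens, kv_cache):   # memory-bound: one token per step
    return tokens[-1] - 1, kv_cache       # toy rule: count down toward EOS (0)

decode_q, finished = Queue(), []

# Prefill worker (its own GPU): one pass per request, then migrate to decode.
for prompt in [[3, 1, 4], [2, 7]]:
    kv, toks = fake_prefill(prompt)
    decode_q.put({"kv": kv, "tokens": toks})

# Decode worker (separate GPU): token-by-token until EOS.
while not decode_q.empty():
    req = decode_q.get()
    while req["tokens"][-1] != 0:
        tok, req["kv"] = fake_decode_step(req["tokens"], req["kv"])
        req["tokens"].append(tok)
    finished.append(req["tokens"])

print(finished)                           # [[3, 2, 1, 0], [2, 1, 0]]
```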
@haoailab
Hao AI Lab
4 months
Goodput = # of completed requests per second that meet the latency criteria. It measures both cost 💰 and UX 🩷. Apps today have diverse latency criteria; the most important ones are:
- TTFT: initial response time (i.e., prefill)
- TPOT: average time between subsequent output tokens (i.e., decode)
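Putting the definition into code: a request counts toward goodput only if it meets both the TTFT and TPOT criteria. The SLO thresholds below are illustrative; the 10-vs-3 numbers mirror the example in the earlier tweet.

```python
# Measuring goodput from per-request latencies (SLO thresholds are illustrative).
def goodput(requests, duration_s, ttft_slo_s=0.2, tpot_slo_s=0.05):
    """Completed requests per second that meet BOTH the TTFT and TPOT criteria."""
    good = sum(1 for r in requests
               if r["ttft"] <= ttft_slo_s and r["tpot"] <= tpot_slo_s)
    return good / duration_s

# 10 requests completed in 1 second -> throughput = 10 rps,
# but only 3 of them satisfy both SLOs -> goodput = 3 rps.
reqs = ([{"ttft": 0.15, "tpot": 0.04}] * 3 +   # meet both SLOs
        [{"ttft": 0.60, "tpot": 0.04}] * 4 +   # first token too slow
        [{"ttft": 0.15, "tpot": 0.12}] * 3)    # decoding too slow
print(len(reqs) / 1.0, goodput(reqs, duration_s=1.0))   # 10.0 3.0
```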
@haoailab
Hao AI Lab
4 months
Super glad to learn @dzhulgakov and @FireworksAI_HQ are already deploying it 😆
@haoailab
Hao AI Lab
4 months
Other techniques such as dynamic SplitFuse (i.e., chunked prefill + piggybacking) still suffer from prefill-decode interference. Compared to chunked prefill + piggybacking, disaggregation does not require the application to trade off between TTFT and TPOT; it can adhere to both.
@haoailab
Hao AI Lab
4 months
Our research prototype DistServe achieves:
- up to 4.48x goodput
- up to 10.2x tighter latency requirements
compared to vLLM, across different workloads with distinct latency requirements.
@haoailab
Hao AI Lab
4 months
Colocation introduces a costly tradeoff. Companies have to choose between:
- Loosening latency criteria -> less satisfied users 💔
- Over-provisioning more resources -> more money spent 💰💰💰
@haoailab
Hao AI Lab
2 months
@zraytam Yes. Essentially, the takeaway is that LLMs can be trained with a new loss (consistency loss + AR loss) to make parallel decoding much more efficient.
@haoailab
Hao AI Lab
2 months
@MrCatid CLLMs use Jacobi decoding, Medusa uses extra LM heads + tree-based verification.
@haoailab
Hao AI Lab
2 months
@thomasahle We follow causal masks and the same architecture & design as decoder-only autoregressive LLMs to validate our idea, because it requires minimal training cost and it works. A BERT-like architecture might require more extensive training experiments, but could be a future direction.
@haoailab
Hao AI Lab
2 months
@zraytam No. For now we follow the same exact architecture, causal mask and decoding process as auto-regressive models. Other architectures could be a future direction.
@haoailab
Hao AI Lab
2 months
@thomasahle We use a customized forward function where we feed the tokens back in. Feeding hidden states could be an interesting idea, but we haven't tried it.