Serving multiple LLMs has emerged as a crucial and costly demand.
Want to co-serve multiple LLMs with better utilization?
Introducing MuxServe
- flexible spatial-temporal multiplexing
- up to 1.8x higher throughput
Blog:
Paper:
People often see LLMs as sequential decoders, but we show they can be easily adapted as fast parallel decoders!🔥🚀
Announcing consistency LLMs: teaching LLMs to predict the fixed point from any point on its Jacobi decoding trajectory
- LLMs can fast-forward token generation.
Still optimizing throughput for LLM Serving?
Think again: Goodput might be a better choice!
Splitting prefill from decode to different GPUs yields
- up to 4.48x goodput
- up to 10.2x stricter latency criteria
Blog:
Paper:
In the human cognitive process, we have the ability to form complete sentences in our minds before articulating them word by word. Can LLMs be taught to acquire this capability as well?
Our results show the answer is YES! ✅
We teach CLLMs to map any point on a Jacobi trajectory to the fixed point.
To encourage CLLMs to output the fixed point sequence from any point on a Jacobi trajectory, we minimize:
- consistency loss: quantifies the distance between the two points.
- auto-regressive loss: prevents the output distribution from drifting away from the pre-trained model's generated results.
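The two training terms above can be sketched in pure Python. This is a toy illustration, not the CLLM training code: real training computes these losses from model logits over full sequences, and all names here (`cllm_loss`, `cross_entropy`, the weight `w`) are illustrative.

```python
import math

def cross_entropy(logits, target_ids):
    # Mean negative log-likelihood over positions; `logits` is a list of
    # per-position score lists, `target_ids` the gold token ids.
    total = 0.0
    for scores, t in zip(logits, target_ids):
        logz = math.log(sum(math.exp(s) for s in scores))
        total += logz - scores[t]
    return total / len(target_ids)

def cllm_loss(student_logits, fixed_point_ids, ar_logits, ar_ids, w=1.0):
    # Consistency loss: push the model's predictions on an intermediate
    # Jacobi point toward the converged fixed-point tokens.
    consistency = cross_entropy(student_logits, fixed_point_ids)
    # AR loss: standard next-token loss on the pre-trained model's own
    # trajectory, to keep the output distribution from drifting.
    ar = cross_entropy(ar_logits, ar_ids)
    return consistency + w * ar
```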
We also identify that such acceleration is likely to stem from the existence of:
- Fast forwarding, where multiple consecutive tokens are correctly predicted in a single forward pass.
- Stationary tokens, which are correctly predicted and remain unaltered through subsequent iterations.
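For intuition, here is a minimal sketch of Jacobi (fixed-point) decoding. The "model" is a trivial deterministic rule so the example runs without an LLM; with a real model, each pass would score the full prefix, and fast-forwarding/stationary tokens are what make convergence take fewer passes than autoregressive decoding. All names are illustrative.

```python
def model_step(tokens):
    # Stand-in for an LLM forward pass: for every position i, "predict"
    # token i+1. Here the rule is simply next = (prev + 1) % 100.
    return [(t + 1) % 100 for t in tokens]

def jacobi_decode(prompt, n_new, max_iters=50):
    # Initialize the n_new future tokens with a guess (zeros), then
    # iterate: each pass updates ALL positions in parallel from the
    # previous iterate. A fixed point of this map equals the sequence
    # greedy autoregressive decoding would produce.
    seq = list(prompt) + [0] * n_new
    for it in range(max_iters):
        preds = model_step(seq[:-1])  # preds[i] predicts seq[i + 1]
        new_seq = list(prompt) + preds[len(prompt) - 1:]
        if new_seq == seq:            # converged: fixed point reached
            return seq[len(prompt):], it
        seq = new_seq
    return seq[len(prompt):], max_iters
```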
CLLM training cost is moderate:
- it’s a one-time overhead.
- CLLMs introduce no extra memory cost at inference time while achieving comparable or even better speedup in comparison with Medusa2/Eagle.
In cases where the dataset size is large, for example for
The paper is accepted to ICML '24 and is available on:
Check out our repo where you can find training recipes and CLLM checkpoints:
The work is done with amazing collaborators: @kou_siqi, Zhijie Deng, @Lanxiang_Hu, @haozhangml
Our experiments show CLLMs are up to 3.4x faster than the pre-trained models across a variety of benchmarks. In comparison with other SOTA methods like Medusa2, CLLMs achieve roughly the same speedup as Medusa2, with comparable scores on MT-bench.
However, CLLMs offer higher memory efficiency, since they add no extra parameters at inference time.
High throughput ≠ High goodput. Systems optimizing throughput can have low goodput under certain latency criteria.
With a simple latency constraint, a system with throughput = 10 requests per second may serve a goodput of only 3 requests per second.
That is over 3x worse QoS!
Why do existing systems fail to achieve high goodput? Interference!
Existing systems use continuous batching to colocate prefill and decode requests on the same GPU. This causes significant delays for decoding requests and also increases the latency of prefill requests.
We introduce disaggregation to fundamentally eliminate interference by splitting prefill from decode onto different GPUs. A request first enters a prefill worker to generate the first token. It then migrates to a decode worker and generates tokens one by one until it finishes.
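The handoff above can be sketched with two workers and a migration queue. This is a toy simulation, not DistServe's implementation: a real system transfers KV caches between GPUs, which we only mimic with plain dicts, and all names here are made up for illustration.

```python
from queue import Queue

prefill_q, decode_q, done = Queue(), Queue(), []

def prefill_worker():
    # Process the whole prompt in one pass, emit the first token plus a
    # (simulated) KV cache, then hand the request off to the decode side.
    while not prefill_q.empty():
        req = prefill_q.get()
        req["tokens"] = [len(req["prompt"])]   # stand-in for the first token
        req["kv"] = {"ctx": req["prompt"]}     # stand-in for the KV cache
        decode_q.put(req)

def decode_worker(max_new=3):
    # Generate token by token until finished; no prefill requests ever
    # run on this worker, so decoding latency stays interference-free.
    while not decode_q.empty():
        req = decode_q.get()
        while len(req["tokens"]) < max_new:
            req["tokens"].append(req["tokens"][-1] + 1)
        done.append(req)

prefill_q.put({"prompt": "hello"})
prefill_worker()
decode_worker()
```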
Goodput = # completed requests per second within latency criteria
It measures both cost💰 and UX🩷.
Apps today have diverse latency criteria. The most important ones are:
- TTFT (time to first token): initial response time (aka prefill)
- TPOT (time per output token): average time between subsequent output tokens (aka decode)
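Concretely, a request only counts toward goodput if it meets both SLOs. A minimal sketch (the SLO values and request data below are made up for illustration):

```python
def goodput(requests, ttft_slo, tpot_slo, duration_s):
    # A request counts toward goodput only if it meets BOTH latency
    # criteria: time-to-first-token AND time-per-output-token.
    ok = sum(1 for r in requests
             if r["ttft"] <= ttft_slo and r["tpot"] <= tpot_slo)
    return ok / duration_s

reqs = [
    {"ttft": 0.15, "tpot": 0.04},  # meets both SLOs
    {"ttft": 0.90, "tpot": 0.03},  # misses TTFT
    {"ttft": 0.20, "tpot": 0.12},  # misses TPOT
    {"ttft": 0.10, "tpot": 0.05},  # meets both
]
# With a 0.2s TTFT SLO and 0.05s TPOT SLO over a 1-second window,
# throughput is 4 req/s but goodput is only 2 req/s.
```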
Other techniques such as dynamic SplitFuse (i.e., chunked prefill + piggybacking) still suffer from prefill-decode interference. Compared to chunked prefill + piggybacking, disaggregation does not force applications to trade off TTFT against TPOT; they can adhere to both.
Our research prototype DistServe achieves
- up to 4.48x goodput
- up to 10.2x tighter latency requirements
compared to vLLM across different workloads with distinct latency requirements.
Colocation introduces a costly tradeoff. Companies have to choose between:
- Loosening latency criteria -> less satisfied users 💔
- Over-provisioning more resources -> more money spent 💰💰💰
@zraytam
Yes. Essentially, the takeaway is that LLMs can be trained with a new loss (consistency loss + AR loss) to make parallel decoding much more efficient.
@thomasahle
We follow causal masks and the same architecture & design as decoder-only autoregressive LLMs to validate our idea, because it requires minimal training cost and it works.
BERT-like architectures might require more extensive training experiments, but could be an interesting future direction.
@zraytam
No. For now we follow the same exact architecture, causal mask and decoding process as auto-regressive models. Other architectures could be a future direction.
@thomasahle
We use a customized forward function where we feed the tokens back in. Feeding hidden states back could be an interesting idea, but we haven't tried it.