Please welcome Deepti Raghavan, joining us this fall as assistant professor! A recent graduate of @Stanford, she works on operating systems, networking, and machine learning systems:
Stanford CS is running an application support program for underrepresented students. If you’re considering applying to the Computer Science PhD program at Stanford, we'll do our best to give one round of feedback on your application. Apply by October 29:
We’re excited to share our ongoing work on ALTO, a network orchestrator for efficiently serving compound AI systems. ALTO can improve serving throughput for a complex chatbot verification pipeline by 3x while reducing tail latency by 1.8x.
I will be recruiting PhD students this upcoming cycle, to start in fall 2025. Please apply if you are interested in operating systems, networking, or machine learning systems.
I am extremely grateful to my family, friends, collaborators, and mentors -- @matei_zaharia, Phil Levis, and @schemeprincess -- for helping me get to this point. Thank you for making my PhD both fun and rewarding!
Generative language models emit tokens incrementally, which is especially useful in the context of compound AI systems because we can stream partial outputs between pipeline stages to overlap computation.
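Not from our paper -- just a quick sketch to build intuition, with illustrative stage names and token-level granularity: a downstream stage can start consuming partial outputs before the upstream LM call finishes, overlapping the two stages' work.

```python
import asyncio

# Hypothetical sketch: stream partial LM outputs to the next stage
# instead of waiting for the full generation to finish.

async def lm_stage(prompt):
    # Stand-in for an incremental LM; yields one token at a time.
    for token in ["The", " sky", " is", " blue", "."]:
        await asyncio.sleep(0.05)  # simulate per-token decode latency
        yield token

async def downstream_stage(token_stream):
    # Starts working as soon as the first token arrives,
    # overlapping its computation with the upstream decode.
    buffer = []
    async for token in token_stream:
        buffer.append(token)
        print(f"processing partial output: {''.join(buffer)!r}")

asyncio.run(downstream_stage(lm_stage("describe the sky")))
```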
Many compound AI systems combine generative LMs with external tools like retrievers in a pipeline to solve challenging tasks. This work studies how to effectively distribute and parallelize these pipelines at scale.
Some stages are stateful, meaning they have to aggregate partial outputs across a stream. For such cases we need to route all stream data through a consistent stage instance to ensure correct aggregation. We call this aggregation-aware routing.
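A rough illustration of the constraint (the hashing scheme and stream IDs here are assumptions, not ALTO's design): partial outputs belonging to the same stream must all land on the same replica of a stateful stage, while stateless stages can be load balanced chunk by chunk.

```python
import hashlib

# Hypothetical sketch of aggregation-aware routing: every chunk carries
# the ID of the stream it belongs to, and chunks from the same stream
# are pinned to one replica of a stateful stage.

def route(stream_id: str, replicas: list, stateful: bool, rr_counter: list) -> str:
    if stateful:
        # Consistent choice per stream: all partial outputs of this
        # stream are aggregated by the same instance.
        h = int(hashlib.sha256(stream_id.encode()).hexdigest(), 16)
        return replicas[h % len(replicas)]
    # Stateless stages can be load balanced freely.
    rr_counter[0] += 1
    return replicas[rr_counter[0] % len(replicas)]

replicas = ["agg-0", "agg-1", "agg-2"]
rr = [0]
for chunk_id in range(4):
    print(route("prompt-42", replicas, stateful=True, rr_counter=rr))  # same replica every time
```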
In our paper, we discuss preliminary ideas for an interface that enables localized aggregation-aware routing, as well as a distributed prompt-aware scheduling algorithm.
Each LM stage can produce dynamic fan-out, depending on how many partial outputs it emits and how frequently it is invoked. To handle this fan-out and load balance efficiently across stages, we need distributed prompt-aware scheduling.
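One way to picture it (a simplified sketch, not the algorithm from the paper; the per-prompt bookkeeping is an assumption): the scheduler tracks how much work each prompt already has in flight downstream and dispatches accordingly, so a single high-fan-out prompt cannot monopolize a stage.

```python
from collections import defaultdict

# Hypothetical sketch of prompt-aware scheduling: dispatch the ready
# chunk whose prompt currently has the least work in flight.

outstanding = defaultdict(int)   # prompt_id -> chunks currently in flight downstream
ready = []                       # (prompt_id, chunk) pairs waiting to run

def submit(prompt_id, chunk):
    ready.append((prompt_id, chunk))

def dispatch():
    # Choose the prompt with the fewest in-flight chunks.
    idx = min(range(len(ready)), key=lambda i: outstanding[ready[i][0]])
    prompt_id, chunk = ready.pop(idx)
    outstanding[prompt_id] += 1
    return prompt_id, chunk

def complete(prompt_id):
    outstanding[prompt_id] -= 1

submit("p1", "chunk-a"); submit("p1", "chunk-b"); submit("p2", "chunk-a")
print(dispatch())  # ('p1', 'chunk-a')  both prompts start with nothing in flight
print(dispatch())  # ('p2', 'chunk-a')  p1 already has a chunk in flight
```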
We are actively looking for more applications to serve on top of ALTO. If you’re working on compound AI systems and want to serve them at scale, please reach out!
@jaewon_chung_cs @matei_zaharia Streaming happens when a downstream stage requires some part of the LM output from the previous stage (not always at the granularity of a token). For example, claim extraction streams individual sentences to the query generation stage, which generates a query per sentence.
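A toy version of that sentence-granularity streaming (the sentence splitter and query format are illustrative assumptions): claim extraction forwards each complete sentence as soon as it appears, and query generation consumes them one at a time.

```python
import re

# Hypothetical sketch: stream the upstream LM output to the next
# stage one sentence at a time instead of one token at a time.

def claim_extraction(token_stream):
    # Buffer tokens and emit a chunk whenever a full sentence is seen.
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = re.search(r"(.+?[.!?])\s*", buffer)
            if not match:
                break
            yield match.group(1)
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()

def query_generation(sentence_stream):
    # One query per sentence, produced as sentences arrive upstream.
    for sentence in sentence_stream:
        yield f"search: {sentence}"

tokens = ["Cats ", "are ", "mammals. ", "They ", "purr."]
for query in query_generation(claim_extraction(tokens)):
    print(query)
```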