Zhihao Jia (@JiaZhihao)
Assistant professor of Computer Science at Carnegie Mellon University. Research on systems and machine learning.
Joined August 2012 · 2,379 Followers · 554 Following · 15 Media · 95 Statuses
Pinned Tweet
Zhihao Jia (@JiaZhihao) · 5 months
🚀Introducing Mirage, a superoptimizer that automatically discovers highly-optimized GPU implementations for LLMs (and beyond). For certain attention operators, the fastest programs found by Mirage are 2x faster than existing expert-designed implementations such as FlashAttention
Zhihao Jia (@JiaZhihao) · 1 year
Announcing FlexFlow Serve! A low-latency, high-performance distributed system for serving LLMs, reducing inference latency by 1.8-2.2x compared to existing systems. Easy to install: pip install flexflow. Repo: . A few reasons you may want to try it. 1/n
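For readers who want to try it, here is a rough sketch of what serving a model with FlexFlow Serve's Python interface can look like. This is a hedged example based on the project's README; the parameter names, values, and model identifier below are assumptions and may differ across versions:

```python
# Hedged sketch of FlexFlow Serve usage; all argument names, values, and the
# model id below are assumptions and may not match your installed version.
import flexflow.serve as ff

ff.init(num_gpus=4, memory_per_gpu=14000, zero_copy_memory_per_node=30000,
        tensor_parallelism_degree=4, pipeline_parallelism_degree=1)

llm = ff.LLM("meta-llama/Llama-2-7b-hf")       # a HuggingFace-style model id
config = ff.GenerationConfig(do_sample=False)  # greedy decoding
llm.compile(config)
print(llm.generate("What is speculative inference?"))
```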
Zhihao Jia (@JiaZhihao) · 1 year
Generative LLMs are slow and expensive to serve. Their much smaller, distilled versions are faster and cheaper but achieve suboptimal generative performance. We show it is possible to achieve the best of both worlds. Code: Paper:
Zhihao Jia (@JiaZhihao) · 4 years
I am excited to share that I will join the Computer Science Department @CSDatCMU at Carnegie Mellon University @SCSatCMU as an assistant professor in Fall 2021! I am grateful to my advisors Alex Aiken and @matei_zaharia, colleagues, family, and friends for their invaluable support.
Zhihao Jia (@JiaZhihao) · 8 months
LLMs are very expensive to train and serve 💸. OpenAI's Sora for video/image generation faces even higher expenses. Our recent research shows the possibility of making LLMs 10x cheaper by leveraging spot instances on the cloud🚀. A thread 1/n
Zhihao Jia (@JiaZhihao) · 2 months
#ICML2024 Join us for a 2-hour tutorial on Monday, July 22, focusing on advanced algorithms and systems for efficient LLM serving. The session will include our recent research on: ✨ Mirage: Auto-gen performant GPU kernels for LLMs 💸 SpotServe: Cost-effective LLMs on spot instances…
Zhihao Jia (@JiaZhihao) · 1 year
Existing ML compilers only consider program transformations between a few pre-defined ML operators (e.g., fusing two MatMuls sharing an input into one MatMul). Tomorrow at #OSDI23, Liyan will show that EinNet can discover many more optimizations and accelerate DNNs by up to 2.7x.
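To make the parenthetical example concrete, here is a small NumPy check (illustrative only, not EinNet or TASO code) that two MatMuls sharing an input are equivalent to one MatMul over concatenated weights:

```python
# Illustrative NumPy check: two MatMuls sharing an input X equal one MatMul
# over the concatenated weights. Shapes here are arbitrary example values.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
W1 = rng.standard_normal((16, 32))
W2 = rng.standard_normal((16, 32))

fused = X @ np.concatenate([W1, W2], axis=1)  # single fused MatMul
Y1, Y2 = fused[:, :32], fused[:, 32:]         # split the fused output

assert np.allclose(Y1, X @ W1) and np.allclose(Y2, X @ W2)
```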
Zhihao Jia (@JiaZhihao) · 5 years
TASO is a Tensor Algebra SuperOptimizer for deep learning. TASO optimizes the computation graphs of DNN architectures using automatically generated and verified graph transformations, outperforming existing DNN graph optimizers by up to 3x: .
Zhihao Jia (@JiaZhihao) · 5 months
Excited to share three ML/LLM systems from CMU Catalyst lab, all of which will be presented at #ASPLOS 24. They optimize different aspects of ML/LLM systems and are all open source. We will be at ASPLOS next week. Please reach out if you are interested. A thread for them (1/n).
Zhihao Jia (@JiaZhihao) · 1 year
Congrats to @yzhao062 for joining the @CSatUSC faculty. It has been a great pleasure to work with, advise, and learn from him. Yue has been doing amazing work on developing systems and algorithms for unsupervised ML. If you are interested in these topics, definitely talk to him.
Zhihao Jia (@JiaZhihao) · 5 months
For a DNN, Mirage automatically explores various tensor programs that are mathematically equivalent to the specified DNN and discovers those with the best performance. For the group-query attention of LLAMA-3-70B, Mirage discovers 69 different programs, including today's manually designed implementations.
Zhihao Jia (@JiaZhihao) · 3 years
I am super excited to present our recent work on building automated ML systems at the @euromlsys workshop tomorrow.
Quoted tweet: EuroMLSys (@euromlsys) · 3 years
EuroMLSys '21 takes place tomorrow. The program with links to video presentations and manuscripts is available at . Two excellent keynote speakers, @JiaZhihao and @annadgoldie, are also lined up. We're very excited to welcome everybody!
Zhihao Jia (@JiaZhihao) · 8 months
1. For serving, SpotServe [ASPLOS'24] is an LLM serving system on spot instances. It handles instance preemptions with dynamic parallelization, promises low tail latency, and reduces serving cost by 54%. Paper: Code:
Zhihao Jia (@JiaZhihao) · 5 months
Mirage is more than an attention optimizer and can optimize general DNNs. As another example, for LoRA, which involves three matrix multiplications, Mirage discovers programs that merge all three matmuls into a single GPU kernel, outperforming Triton's implementation by 3x.
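For reference, here are LoRA's three matmuls written out in NumPy (an illustrative sketch with made-up shapes, not Mirage-generated code); a fused kernel computes the same math without round-tripping intermediates through GPU memory:

```python
# LoRA's three matmuls in NumPy (illustrative; shapes are example values).
# A fused kernel produces the same result in one pass.
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 16                      # hidden size and LoRA rank (examples)
X = rng.standard_normal((4, d))     # activations
W = rng.standard_normal((d, d))     # frozen base weight
A = rng.standard_normal((d, r))     # LoRA down-projection
B = rng.standard_normal((r, d))     # LoRA up-projection

out = X @ W + (X @ A) @ B           # matmuls 1, 2, and 3
assert np.allclose(out, X @ (W + A @ B))  # an algebraically equivalent form
```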
Zhihao Jia (@JiaZhihao) · 2 years
Hui Guan @guanh01 and I are co-chairing the MLSys'23 artifact evaluation committee, and are seeking self-nominations from early-career researchers willing to serve on the committee: . Please retweet to help us maximize visibility. #mlsys @MLSysConf
Zhihao Jia (@JiaZhihao) · 9 months
We at CMU can host postdocs through Carnegie Bosch Fellowships () to work on building efficient, scalable, and affordable systems for ML (especially LLMs). Applications due Jan 26. Feel free to reach out if you are interested.
Zhihao Jia (@JiaZhihao) · 5 months
@tianle_cai Yes, we advocate for a future where programmers only need to describe the mathematical computation of the program (e.g., two matmuls and a softmax for attention), and superoptimizers such as Mirage can automatically generate high-performance GPU implementations, so you no longer need to hand-write GPU kernels.
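In that spirit, here is roughly what a purely mathematical attention specification looks like: two matmuls and a softmax in NumPy. This is an illustration of the idea, not Mirage's actual input format:

```python
# A purely mathematical attention "spec": two matmuls plus a softmax.
# Illustrative only; this is not Mirage's actual input format.
import numpy as np

def attention(Q, K, V):
    S = Q @ K.T                                    # matmul 1
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
    P = P / P.sum(axis=-1, keepdims=True)
    return P @ V                                   # matmul 2
```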
Zhihao Jia (@JiaZhihao) · 1 year
2. Multi-node LLM inference. FlexFlow Serve can serve an LLM on multiple nodes by combining tensor model parallelism (within a node) and pipeline model parallelism (across nodes). This allows FlexFlow Serve to support very large LLMs on cheap GPU clusters. 3/n
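As a worked example of that layout (hypothetical numbers, not FlexFlow Serve's configuration format): with two nodes of four GPUs each, one natural mapping shards every layer's weights four ways inside a node and splits the layer stack across the nodes:

```python
# Hypothetical 2-node x 4-GPU mapping (not FlexFlow Serve's config format):
# tensor model parallelism within a node, pipeline parallelism across nodes.
nodes, gpus_per_node, num_layers = 2, 4, 80

tensor_parallel_degree = gpus_per_node    # shard each weight 4 ways per node
pipeline_parallel_degree = nodes          # split the layer stack across nodes
layers_per_stage = num_layers // pipeline_parallel_degree

print(f"TP={tensor_parallel_degree}, PP={pipeline_parallel_degree}, "
      f"{layers_per_stage} layers per pipeline stage")
```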
Zhihao Jia (@JiaZhihao) · 1 year
3. Offloading-based LLM inference. FlexFlow Serve can support large LLMs on a few GPUs by leveraging CPU DRAM to store an LLM's parameters. FlexFlow Serve is 3x faster than FlexGen. 4/n
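The general offloading idea, as a minimal PyTorch-style sketch (how such systems work in outline, not FlexFlow Serve's actual implementation): keep parameters in CPU DRAM and stream each layer to the GPU just before it runs.

```python
# Generic offloading sketch (not FlexFlow Serve's implementation): the model's
# layers live in CPU DRAM and are copied to the GPU only while they execute.
import torch

def forward_with_offloading(layers, x):
    x = x.to("cuda")
    for layer in layers:      # layers start on the CPU
        layer.to("cuda")      # stream this layer's weights in
        x = layer(x)
        layer.to("cpu")       # evict to make room for the next layer
    return x
```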
Zhihao Jia (@JiaZhihao) · 8 months
2. For training, Parcae [NSDI'24] is a system for cheap, fast, and scalable LLM training on spot instances. The key idea is a _proactive_, liveput-optimized approach to boosting preemption-aware throughput, reducing cost by 10x. Paper:
Zhihao Jia (@JiaZhihao) · 3 years
Looking forward to sharing two ML systems we recently built, PET and Dorylus, both of which are open source and will appear at #OSDI21 this week. Details in thread.
Zhihao Jia (@JiaZhihao) · 1 year
1. Low-latency LLM inference. FlexFlow Serve accelerates LLM serving using correctness-preserving speculative inference and supports both greedy and stochastic decoding. This combination reduces LLM decoding steps by 3-4x while preserving generative quality. 2/n
Zhihao Jia (@JiaZhihao) · 1 year
Solution: SpecInfer combines multiple collectively boost-tuned small speculative models (SSMs) to jointly predict an LLM’s output. The predictions are organized as a tree of tokens, and their correctness is verified by the LLM using a fast tree-based parallel decoding algorithm.
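Here is a toy sketch of the acceptance rule (illustrative; not SpecInfer's code, and the real system verifies all tree nodes in a single parallel decoding pass rather than one LLM call per token): walk the token tree and keep the longest path that matches the LLM's own greedy choices.

```python
# Toy greedy verification of a speculated token tree. Illustrative only:
# SpecInfer scores every tree node in one parallel pass instead of looping.
class Node:
    def __init__(self, token, children=()):
        self.token, self.children = token, list(children)

def verify(root_children, llm_next_token, prefix):
    """prefix: list of prompt tokens; llm_next_token(seq) -> LLM's greedy token."""
    accepted, children = [], root_children
    while children:
        target = llm_next_token(prefix + accepted)  # what the LLM would emit here
        match = next((c for c in children if c.token == target), None)
        if match is None:                           # speculation diverged
            break
        accepted.append(target)
        children = match.children
    return accepted                                 # longest verified path
```

Because every accepted token is exactly what the LLM would have produced anyway, the speedup preserves the model's output, which is what "correctness-preserving" refers to in the greedy case.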
Zhihao Jia (@JiaZhihao) · 1 year
We acknowledge that @huggingface also introduced assisted generation last week to leverage speculative inference (). We discuss the main differences between HF's assistant model and SpecInfer at .
Zhihao Jia (@JiaZhihao) · 1 year
Derivation-based transformations enable a much larger optimization space for tensor programs that includes the transformations considered by prior work as special cases. This results in up to 2.7x speedup on GPUs. Paper: Code:
Zhihao Jia (@JiaZhihao) · 4 years
I will work as a research scientist at Facebook during my gap year. If you are interested in systems and machine learning, I will be excited to talk to you.
Zhihao Jia (@JiaZhihao) · 1 year
EinNet is an ML compiler that can discover transformations between **general** tensor algebra expressions using mathematical derivations. For a simple 3x3 convolution, EinNet finds up to 10^8 equivalent expressions, some of which are 2x faster than the original convolution.
Zhihao Jia (@JiaZhihao) · 1 year
Kudos to the students who have contributed significantly to SpecInfer: Xupeng Miao, @GabrieleOliaro , @Jackfram2 , @chengxinhao1 , Zeyu Wang, Rae Ying Yee Wong, @chenzhuoming911 , Daiyaan Arfeen, and Reyna Abhyankar.
Zhihao Jia (@JiaZhihao) · 2 years
We at CMU can host postdocs through Carnegie Bosch Fellowships () to work on building efficient, scalable, and performant systems for ML (especially LLMs). Applications due Feb 28 (in a week). Feel free to reach out if you are interested.
Zhihao Jia (@JiaZhihao) · 1 year
TLDR: SpecInfer is a system that accelerates generative LLM serving with speculative inference and token tree verification. The key idea is to use an LLM as a token tree verifier instead of an incremental decoder. We show that this reduces LLM inference latency by 2.8x.
Zhihao Jia (@JiaZhihao) · 1 year
Background: existing LLM systems use incremental decoding, which iteratively decodes a new token using all preceding tokens in an autoregressive fashion. This approach needs all parameters of an LLM to decode a token, and its performance is limited by slow GPU memory accesses.
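For contrast, here is the incremental decoding loop in its simplest generic form (pseudocode for the standard approach, not any particular system): each new token requires one full forward pass over all of the model's weights, which is why the loop is bound by GPU memory bandwidth.

```python
# Generic incremental (autoregressive) decoding: one full pass over all model
# parameters per generated token, so memory bandwidth dominates latency.
def incremental_decode(model, tokens, max_new_tokens, eos_id):
    for _ in range(max_new_tokens):
        logits = model(tokens)                 # touches every parameter
        next_token = int(logits[-1].argmax())  # greedy pick for last position
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```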
Zhihao Jia (@JiaZhihao) · 5 months
1. SpecInfer reduces LLM inference latency by 1.5-3.5x using tree-based speculative inference. Its key advantage over existing sequence-based methods is its much higher speculation success rate (i.e., 52% -> 96%). Paper: Code:
Zhihao Jia (@JiaZhihao) · 5 months
@main_horse These are all great questions! Mirage's optimizations apply to both single- and multi-GPU scenarios. For example, the experimental results shown in the figure assume tensor model parallelism across 4 A100 GPUs for serving LLAMA-3-70B in half precision. We are working…
Zhihao Jia (@JiaZhihao) · 5 months
3. Korch is a tensor program optimizer that discovers **optimal** kernel orchestration plans for DNNs. It outperforms existing optimizers by up to 1.6x on A100 GPUs, while preserving end-to-end equivalence. Paper: Code:
Zhihao Jia (@JiaZhihao) · 5 months
@Yuchenj_UW Definitely. We are working with the @ApacheTVM team to integrate Mirage into TVM. Stay tuned.
Zhihao Jia (@JiaZhihao) · 5 years
More details available in our SOSP paper:
Zhihao Jia (@JiaZhihao) · 5 months
2. SpotServe is an LLM serving system on spot instances. It handles instance preemptions with dynamic parallelization, promises low tail latency, and reduces serving cost by 54%. Paper: Code:
Zhihao Jia (@JiaZhihao) · 3 years
@luisceze @lindsey @tqchenml @chris_deli @samps @tqchenml and I, together with other faculty members at CMU, are building a new lab for developing automated ML systems and are actively looking for students:
Zhihao Jia (@JiaZhihao) · 3 years
@FrancisYan_ Thanks for virtually visiting us! The talk was extremely well received: students were so interested in the topics and eager to know more about your research! We look forward to hosting your in-person visit soon. :-)
Zhihao Jia (@JiaZhihao) · 3 years
Dorylus enables affordable, scalable, and accurate GNN training using distributed CPU servers and serverless threads. Through computation separation, Dorylus allows serverless computing to provide a scalable, efficient, and low-cost scheme for GNN training.
Zhihao Jia (@JiaZhihao) · 5 months
@main_horse This is the per-token decoding latency of the different Triton programs generated by Mirage. We also include FlashDecoding's and FlashInfer's latencies as references.
Zhihao Jia (@JiaZhihao) · 5 months
@d_haziza @tri_dao Thanks a lot for your interest in Mirage! It's great that the optimization in Fig 4c is already available in FlashAttn. In the post's figure, all baselines perform this optimization by reshaping the query tensor, and the main performance difference between Mirage and…
Zhihao Jia (@JiaZhihao) · 5 months
@semiDL Thanks for your interest! Yes, we hope that ML/LLM superoptimizers can help improve the productivity of ML programmers: they can focus on describing the mathematical specification of a model, and superoptimizers such as Mirage automatically search for potential GPU…
Zhihao Jia (@JiaZhihao) · 4 years
@huanchenzhang Thanks Huanchen! Welcome to Tsinghua IIIS! I am sure we can have exciting collaborations in the future. :)
Zhihao Jia (@JiaZhihao) · 1 year
@RashmiKVinayak Congratulations Rashmi!
Zhihao Jia (@JiaZhihao) · 3 years
@qi2peng2 Congrats!!!
Zhihao Jia (@JiaZhihao) · 3 years
PET optimizes DNN computation by automatically discovering performance-improving partially equivalent transformations, then recovers the model's end-to-end functionality by automatically correcting the computation, unlocking previously missed optimization opportunities.
Zhihao Jia (@JiaZhihao) · 3 years
1. Existing ML frameworks optimize DNN computation by considering only fully equivalent transformations. This approach preserves models' functionality but misses transformations that improve DNN performance while only maintaining partial equivalence.
Zhihao Jia (@JiaZhihao) · 3 years
Paper: Code:
Zhihao Jia (@JiaZhihao) · 4 months
@shwestrick @NYU_Courant Congratulations Sam!
Zhihao Jia (@JiaZhihao) · 5 months
@AlpinDale Great question! We have a CUTLASS backend that directly interprets a μGraph and invokes the corresponding CUTLASS/PTX functions: . We are working on directly emitting PTX/SASS code from Mirage. Stay tuned.
Zhihao Jia (@JiaZhihao) · 4 years
@tqchenml @atalwalkar Congratulations!
Zhihao Jia (@JiaZhihao) · 5 months
@eric_alcaide Thanks for the suggestion! We will give it a try. :-)
Zhihao Jia (@JiaZhihao) · 1 year
@1a1a11a It supports both online and offline inference. The demo shows offline inference, though.
Zhihao Jia (@JiaZhihao) · 5 months
@StartupYou Thanks a lot for your interest! We believe that superoptimizing systems are a promising direction for implementing ML systems, where users only need to specify the mathematical computation of their programs, and superoptimizers such as Mirage can…
Zhihao Jia (@JiaZhihao) · 3 years
2. Training graph neural networks is challenging, since the neural network computation relies on expensive high-end GPUs but the limited GPU memory cannot scale to today's billion-edge graphs.
Zhihao Jia (@JiaZhihao) · 5 months
@samlakig Great questions! Mirage finds mathematically equivalent programs (and may introduce numerical instability due to floating-point arithmetic). We are working on analyzing this as future work.
Zhihao Jia (@JiaZhihao) · 2 years
@ElaineRShi Lol, the email looks totally fine — not that private.😂
Zhihao Jia (@JiaZhihao) · 5 months
@eric_alcaide Thanks for the suggestion! The default search configuration for Mirage is tailored to inference workloads. You can update the search configuration to make it generate backward kernels. We will update the repo to include such demos.