Zhihao Jia (@JiaZhihao)
Assistant professor of Computer Science at Carnegie Mellon University. Research on systems and machine learning.
Joined August 2012 · 2,379 Followers · 554 Following · 15 Media · 95 Statuses
Pinned Tweet
Zhihao Jia (@JiaZhihao) · 5 months
🚀Introducing Mirage, a superoptimizer that automatically discovers highly-optimized GPU implementations for LLMs (and beyond). For certain attention operators, the fastest programs found by Mirage are 2x faster than existing expert-designed implementations such as FlashAttention
Zhihao Jia (@JiaZhihao) · 1 year
Announcing FlexFlow Serve! A low-latency, high-performance distributed system for serving LLMs, reducing inference latency by 1.8-2.2x compared to existing systems. Easy to install: pip install flexflow. Repo: . A few reasons you may want to try it. 1/n
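For readers who want to try it, here is a rough sketch of what serving a model with FlexFlow Serve's Python interface can look like. This is a hedged example based on the project's README; the parameter names, values, and model identifier below are assumptions and may differ across versions:

```python
# Hedged sketch of FlexFlow Serve usage; all argument names, values, and the
# model id below are assumptions and may not match your installed version.
import flexflow.serve as ff

ff.init(num_gpus=4, memory_per_gpu=14000, zero_copy_memory_per_node=30000,
        tensor_parallelism_degree=4, pipeline_parallelism_degree=1)

llm = ff.LLM("meta-llama/Llama-2-7b-hf")       # a HuggingFace-style model id
config = ff.GenerationConfig(do_sample=False)  # greedy decoding
llm.compile(config)
print(llm.generate("What is speculative inference?"))
```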
Zhihao Jia (@JiaZhihao) · 1 year
Generative LLMs are slow and expensive to serve. Their much smaller, distilled versions are faster and cheaper but achieve suboptimal generative performance. We show it is possible to achieve the best of both worlds. Code: Paper:
Zhihao Jia (@JiaZhihao) · 4 years
I am excited to share that I will join the Computer Science Department @CSDatCMU at Carnegie Mellon University @SCSatCMU as an assistant professor in Fall 2021! I am grateful to my advisors Alex Aiken and @matei_zaharia, colleagues, family, and friends for their invaluable support.
Zhihao Jia (@JiaZhihao) · 8 months
LLMs are very expensive to train and serve 💸. OpenAI's Sora for video/image generation faces even higher expenses. Our recent research shows the possibility of making LLMs 10x cheaper by leveraging spot instances on the cloud🚀. A thread 1/n
Zhihao Jia (@JiaZhihao) · 2 months
#ICML2024 Join us for a 2-hour tutorial on Monday, July 22, focusing on advanced algorithms and systems for efficient LLM serving. The session will include our recent research on: ✨ Mirage: Auto-gen performant GPU kernels for LLMs 💸 SpotServe: Cost-effective LLMs on spot instances…
Zhihao Jia (@JiaZhihao) · 1 year
Existing ML compilers only consider program transformations between a few pre-defined ML operators (e.g., fusing two MatMuls sharing an input into one MatMul). Tomorrow at #OSDI23, Liyan will show that EinNet can discover many more optimizations and accelerate DNNs by up to 2.7x.
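To make the parenthetical example concrete, here is a small NumPy check (illustrative only, not EinNet or TASO code) that two MatMuls sharing an input are equivalent to one MatMul over concatenated weights:

```python
# Illustrative NumPy check: two MatMuls sharing an input X equal one MatMul
# over the concatenated weights. Shapes here are arbitrary example values.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
W1 = rng.standard_normal((16, 32))
W2 = rng.standard_normal((16, 32))

fused = X @ np.concatenate([W1, W2], axis=1)  # single fused MatMul
Y1, Y2 = fused[:, :32], fused[:, 32:]         # split the fused output

assert np.allclose(Y1, X @ W1) and np.allclose(Y2, X @ W2)
```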
Zhihao Jia (@JiaZhihao) · 5 years
TASO is a Tensor Algebra SuperOptimizer for deep learning. TASO optimizes the computation graphs of DNN architectures using automatically generated and verified graph transformations, outperforming existing DNN graph optimizers by up to 3x: .
Zhihao Jia (@JiaZhihao) · 5 months
Excited to share three ML/LLM systems from CMU Catalyst lab, all of which will be presented at #ASPLOS 24. They optimize different aspects of ML/LLM systems and are all open source. We will be at ASPLOS next week. Please reach out if you are interested. A thread for them (1/n).
Zhihao Jia (@JiaZhihao) · 1 year
Congrats to @yzhao062 for joining the @CSatUSC faculty. It has been a great pleasure to work with, advise, and learn from him. Yue has been doing amazing work on developing systems and algorithms for unsupervised ML. If you are interested in these topics, definitely talk to him.
Zhihao Jia (@JiaZhihao) · 5 months
For a DNN, Mirage automatically explores various tensor programs that are mathematically equivalent to the specified DNN and discovers those with the best performance. For the group-query attention of LLAMA-3-70B, Mirage discovers 69 different programs, including today's manually designed implementations.
Zhihao Jia (@JiaZhihao) · 3 years
I am super excited to present our recent work on building automated ML systems at the @euromlsys workshop tomorrow.
Quoted tweet: EuroMLSys (@euromlsys) · 3 years
EuroMLSys '21 takes place tomorrow. The program with links to video presentations and manuscripts is available at . Two excellent keynote speakers, @JiaZhihao and @annadgoldie, are also lined up. We're very excited to welcome everybody!
Zhihao Jia (@JiaZhihao) · 8 months
1. For serving, SpotServe [ASPLOS'24] is an LLM serving system on spot instances. It handles instance preemptions with dynamic parallelization, promises low tail latency, and reduces serving cost by 54%. Paper: Code:
Zhihao Jia (@JiaZhihao) · 5 months
Mirage is more than an attention optimizer and can optimize general DNNs. As another example, for LoRA, which involves three matrix multiplications, Mirage discovers programs that merge all three matmuls into a single GPU kernel, outperforming Triton's implementation by 3x.
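For reference, here are LoRA's three matmuls written out in NumPy (an illustrative sketch with made-up shapes, not Mirage-generated code); a fused kernel computes the same math without round-tripping intermediates through GPU memory:

```python
# LoRA's three matmuls in NumPy (illustrative; shapes are example values).
# A fused kernel produces the same result in one pass.
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 16                      # hidden size and LoRA rank (examples)
X = rng.standard_normal((4, d))     # activations
W = rng.standard_normal((d, d))     # frozen base weight
A = rng.standard_normal((d, r))     # LoRA down-projection
B = rng.standard_normal((r, d))     # LoRA up-projection

out = X @ W + (X @ A) @ B           # matmuls 1, 2, and 3
assert np.allclose(out, X @ (W + A @ B))  # an algebraically equivalent form
```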
Zhihao Jia (@JiaZhihao) · 2 years
Hui Guan @guanh01 and I are co-chairing the MLSys'23 artifact evaluation committee, and are seeking self-nominations from early-career researchers willing to serve on the committee: . Please retweet to help us maximize visibility. #mlsys @MLSysConf
Zhihao Jia (@JiaZhihao) · 9 months
We at CMU can host postdocs through Carnegie Bosch Fellowships () to work on building efficient, scalable, and affordable systems for ML (especially LLMs). Applications due Jan 26. Feel free to reach out if you are interested.
Zhihao Jia (@JiaZhihao) · 5 months
@tianle_cai Yes, we advocate for a future where programmers only need to describe the mathematical computation of the program (e.g., two matmuls and a softmax for attention), and superoptimizers such as Mirage can automatically generate high-performance GPU implementations, so you no longer need to hand-write GPU kernels.
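In that spirit, here is roughly what a purely mathematical attention specification looks like: two matmuls and a softmax in NumPy. This is an illustration of the idea, not Mirage's actual input format:

```python
# A purely mathematical attention "spec": two matmuls plus a softmax.
# Illustrative only; this is not Mirage's actual input format.
import numpy as np

def attention(Q, K, V):
    S = Q @ K.T                                    # matmul 1
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
    P = P / P.sum(axis=-1, keepdims=True)
    return P @ V                                   # matmul 2
```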
Zhihao Jia (@JiaZhihao) · 1 year
2. Multi-node LLM inference. FlexFlow Serve can serve an LLM on multiple nodes by combining tensor model parallelism (within a node) and pipeline model parallelism (across nodes). This allows FlexFlow Serve to support very large LLMs on cheap GPU clusters. 3/n
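As a worked example of that layout (hypothetical numbers, not FlexFlow Serve's configuration format): with two nodes of four GPUs each, one natural mapping shards every layer's weights four ways inside a node and splits the layer stack across the nodes:

```python
# Hypothetical 2-node x 4-GPU mapping (not FlexFlow Serve's config format):
# tensor model parallelism within a node, pipeline parallelism across nodes.
nodes, gpus_per_node, num_layers = 2, 4, 80

tensor_parallel_degree = gpus_per_node    # shard each weight 4 ways per node
pipeline_parallel_degree = nodes          # split the layer stack across nodes
layers_per_stage = num_layers // pipeline_parallel_degree

print(f"TP={tensor_parallel_degree}, PP={pipeline_parallel_degree}, "
      f"{layers_per_stage} layers per pipeline stage")
```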
Zhihao Jia (@JiaZhihao) · 1 year
3. Offloading-based LLM inference. FlexFlow Serve can support large LLMs on a few GPUs by leveraging CPU DRAM to store an LLM's parameters. FlexFlow Serve is 3x faster than FlexGen. 4/n
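The general offloading idea, as a minimal PyTorch-style sketch (how such systems work in outline, not FlexFlow Serve's actual implementation): keep parameters in CPU DRAM and stream each layer to the GPU just before it runs.

```python
# Generic offloading sketch (not FlexFlow Serve's implementation): the model's
# layers live in CPU DRAM and are copied to the GPU only while they execute.
import torch

def forward_with_offloading(layers, x):
    x = x.to("cuda")
    for layer in layers:      # layers start on the CPU
        layer.to("cuda")      # stream this layer's weights in
        x = layer(x)
        layer.to("cpu")       # evict to make room for the next layer
    return x
```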
Zhihao Jia (@JiaZhihao) · 8 months
2. For training, Parcae [NSDI'24] is a system for cheap, fast, and scalable LLM training on spot instances. The key idea is a _proactive_, liveput-optimized approach to boosting preemption-aware throughput, reducing cost by 10x. Paper:
Zhihao Jia (@JiaZhihao) · 3 years
Looking forward to sharing two ML systems we recently built, PET and Dorylus, both of which are open source and will appear at #OSDI21 this week. Details in thread.
Zhihao Jia (@JiaZhihao) · 1 year
1. Low-latency LLM inference. FlexFlow Serve accelerates LLM serving using correctness-preserving speculative inference and supports both greedy and stochastic decoding. This combination reduces LLM decoding steps by 3-4x while preserving generative quality. 2/n
Zhihao Jia (@JiaZhihao) · 1 year
Solution: SpecInfer combines multiple collectively boost-tuned small speculative models (SSMs) to jointly predict an LLM’s output. The predictions are organized as a tree of tokens, and their correctness is verified by the LLM using a fast tree-based parallel decoding algorithm.
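Here is a toy sketch of the acceptance rule (illustrative; not SpecInfer's code, and the real system verifies all tree nodes in a single parallel decoding pass rather than one LLM call per token): walk the token tree and keep the longest path that matches the LLM's own greedy choices.

```python
# Toy greedy verification of a speculated token tree. Illustrative only:
# SpecInfer scores every tree node in one parallel pass instead of looping.
class Node:
    def __init__(self, token, children=()):
        self.token, self.children = token, list(children)

def verify(root_children, llm_next_token, prefix):
    """prefix: list of prompt tokens; llm_next_token(seq) -> LLM's greedy token."""
    accepted, children = [], root_children
    while children:
        target = llm_next_token(prefix + accepted)  # what the LLM would emit here
        match = next((c for c in children if c.token == target), None)
        if match is None:                           # speculation diverged
            break
        accepted.append(target)
        children = match.children
    return accepted                                 # longest verified path
```

Because every accepted token is exactly what the LLM would have produced anyway, the speedup preserves the model's output, which is what "correctness-preserving" refers to in the greedy case.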
Zhihao Jia (@JiaZhihao) · 1 year
We acknowledge that @huggingface also introduced assisted generation last week to leverage speculative inference (). We discuss the main differences between HF's assistant model and SpecInfer at .
Zhihao Jia (@JiaZhihao) · 1 year
Derivation-based transformations enable a much larger optimization space for tensor programs that includes the transformations considered by prior work as special cases. This results in up to 2.7x speedup on GPUs. Paper: Code:
Zhihao Jia (@JiaZhihao) · 4 years
I will work as a research scientist at Facebook during my gap year. If you are interested in systems and machine learning, I will be excited to talk to you.
Zhihao Jia (@JiaZhihao) · 1 year
EinNet is an ML compiler that can discover transformations between **general** tensor algebra expressions using mathematical derivations. For a simple 3x3 convolution, EinNet finds up to 10^8 equivalent expressions, some of which are 2x faster than the original convolution.
Zhihao Jia (@JiaZhihao) · 1 year
Kudos to the students who have contributed significantly to SpecInfer: Xupeng Miao, @GabrieleOliaro , @Jackfram2 , @chengxinhao1 , Zeyu Wang, Rae Ying Yee Wong, @chenzhuoming911 , Daiyaan Arfeen, and Reyna Abhyankar.
Zhihao Jia (@JiaZhihao) · 2 years
We at CMU can host postdocs through Carnegie Bosch Fellowships () to work on building efficient, scalable, and performant systems for ML (especially LLMs). Applications due Feb 28 (in a week). Feel free to reach out if you are interested.
Zhihao Jia (@JiaZhihao) · 1 year
TLDR: SpecInfer is a system that accelerates generative LLM serving with speculative inference and token tree verification. The key idea is to use an LLM as a token tree verifier instead of an incremental decoder. We show that this reduces LLM inference latency by 2.8x.
Zhihao Jia (@JiaZhihao) · 1 year
Background: existing LLM systems use incremental decoding, which iteratively decodes a new token using all preceding tokens in an autoregressive fashion. This approach needs all parameters of an LLM to decode a token, and its performance is limited by slow GPU memory accesses.
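For contrast, here is the incremental decoding loop in its simplest generic form (pseudocode for the standard approach, not any particular system): each new token requires one full forward pass over all of the model's weights, which is why the loop is bound by GPU memory bandwidth.

```python
# Generic incremental (autoregressive) decoding: one full pass over all model
# parameters per generated token, so memory bandwidth dominates latency.
def incremental_decode(model, tokens, max_new_tokens, eos_id):
    for _ in range(max_new_tokens):
        logits = model(tokens)                 # touches every parameter
        next_token = int(logits[-1].argmax())  # greedy pick for last position
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```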
Zhihao Jia (@JiaZhihao) · 5 months
1. SpecInfer reduces LLM inference latency by 1.5-3.5x using tree-based speculative inference. Its key advantage over existing sequence-based methods is its much higher speculation success rate (i.e., 52% -> 96%). Paper: Code:
Zhihao Jia (@JiaZhihao) · 5 months
@main_horse These are all great questions! Mirage's optimizations apply to both single- and multi-GPU scenarios. For example, the experimental results shown in the figure assume tensor model parallelism across 4 A100 GPUs for serving LLAMA-3-70B in half precision. We are working…
Zhihao Jia (@JiaZhihao) · 5 months
3. Korch is a tensor program optimizer that discovers **optimal** kernel orchestration plans for DNNs. It outperforms existing optimizers by up to 1.6x on A100 GPUs, while preserving end-to-end equivalence. Paper: Code:
Zhihao Jia (@JiaZhihao) · 5 months
@Yuchenj_UW Definitely. We are working with the @ApacheTVM team to integrate Mirage into TVM. Stay tuned.
Zhihao Jia (@JiaZhihao) · 5 years
More details available in our SOSP paper:
Zhihao Jia (@JiaZhihao) · 5 months
2. SpotServe is an LLM serving system on spot instances. It handles instance preemptions with dynamic parallelization, promises low tail latency, and reduces serving cost by 54%. Paper: Code:
Zhihao Jia (@JiaZhihao) · 3 years
@luisceze @lindsey @tqchenml @chris_deli @samps @tqchenml and I, together with other faculty members at CMU, are building a new lab for developing automated ML systems and are actively looking for students:
Zhihao Jia (@JiaZhihao) · 3 years
@FrancisYan_ Thanks for virtually visiting us! The talk was extremely well received: students were so interested in the topics and eager to know more about your research! We look forward to hosting your in-person visit soon. :-)
Zhihao Jia (@JiaZhihao) · 3 years
Dorylus enables affordable, scalable, and accurate GNN training using distributed CPU servers and serverless threads. Through computation separation, Dorylus allows serverless computing to provide a scalable, efficient, and low-cost scheme for GNN training.
Zhihao Jia (@JiaZhihao) · 5 months
@main_horse This is the per-token decoding latency of the different Triton programs generated by Mirage. We also include FlashDecoding's and FlashInfer's latencies as references.
Zhihao Jia (@JiaZhihao) · 5 months
@d_haziza @tri_dao Thanks a lot for your interest in Mirage! It's great that the optimization in Fig 4c is already available in FlashAttn. In the post's figure, all baselines perform this optimization by reshaping the query tensor, and the main performance difference between Mirage and…
Zhihao Jia (@JiaZhihao) · 5 months
@semiDL Thanks for your interest! Yes, we hope that ML/LLM superoptimizers can help improve the productivity of ML programmers: they can focus on describing the mathematical specification of a model, and superoptimizers such as Mirage automatically search for potential GPU…
Zhihao Jia (@JiaZhihao) · 4 years
@huanchenzhang Thanks Huanchen! Welcome to Tsinghua IIIS! I am sure we can have exciting collaborations in the future. :)
Zhihao Jia (@JiaZhihao) · 1 year
@RashmiKVinayak Congratulations Rashmi!
Zhihao Jia (@JiaZhihao) · 3 years
@qi2peng2 Congrats!!!
Zhihao Jia (@JiaZhihao) · 3 years
PET optimizes DNN computation by automatically discovering performance-improving partially equivalent transformations, then recovers the model's end-to-end functionality by automatically correcting the computation, unlocking previously missed optimization opportunities.
Zhihao Jia (@JiaZhihao) · 3 years
1. Existing ML frameworks optimize DNN computation by considering only fully equivalent transformations. This approach preserves models' functionality but misses transformations that improve DNN performance while only maintaining partial equivalence.
Zhihao Jia (@JiaZhihao) · 3 years
Paper: Code:
Zhihao Jia (@JiaZhihao) · 4 months
@shwestrick @NYU_Courant Congratulations Sam!
Zhihao Jia (@JiaZhihao) · 5 months
@AlpinDale Great question! We have a CUTLASS backend that directly interprets a μGraph and invokes the corresponding CUTLASS/PTX functions: . We are working on directly emitting PTX/SASS code from Mirage. Stay tuned.
Zhihao Jia (@JiaZhihao) · 4 years
@tqchenml @atalwalkar Congratulations!
Zhihao Jia (@JiaZhihao) · 5 months
@eric_alcaide Thanks for the suggestion! We will give it a try. :-)
Zhihao Jia (@JiaZhihao) · 1 year
@1a1a11a It supports both online and offline inference. The demo shows offline inference, though.
Zhihao Jia (@JiaZhihao) · 5 months
@StartupYou Thanks a lot for your interest! We believe that superoptimizing systems are a promising direction for implementing ML systems, where users only need to specify the mathematical computation of their programs, and superoptimizers such as Mirage can…
Zhihao Jia (@JiaZhihao) · 3 years
2. Training graph neural networks is challenging, since the neural network computation relies on expensive high-end GPUs but the limited GPU memory cannot scale to today's billion-edge graphs.
Zhihao Jia (@JiaZhihao) · 5 months
@samlakig Great questions! Mirage finds mathematically equivalent programs (and may introduce numerical instability due to floating-point arithmetic). We are working on analyzing this as future work.
Zhihao Jia (@JiaZhihao) · 2 years
@ElaineRShi Lol, the email looks totally fine — not that private.😂
Zhihao Jia (@JiaZhihao) · 5 months
@eric_alcaide Thanks for the suggestion! The default search configuration for Mirage is tailored to inference workloads. You can update the search configuration to make it generate backward kernels. We will update the repo to include such demos.