Dan Zhang Profile
Dan Zhang

@DZhang50

Followers
1,965
Following
811
Media
33
Statuses
770

Researcher @ Google DeepMind | ML for Systems | Systems for ML | Computer Architecture PhD @ UT Austin🤘 | Opinions stated here are my own.

SF Bay Area, CA
Joined November 2014
Pinned Tweet
@DZhang50
Dan Zhang
3 years
Currently, datacenter ML training and inference uses commodity TPU and GPU devices optimized for a wide range of workloads. Given the extreme scale of large datacenter deployments, would it be practical to build custom accelerators optimized for specific workloads? (1/4)
7
17
154
@DZhang50
Dan Zhang
2 years
@ZoeSchiffer He left a bit after 4 years, which means the real reason is that his stock cliff hit.
6
8
229
@DZhang50
Dan Zhang
9 months
@paulg That's not what the article actually says though. If you read it, it says DEI-related job postings dropped by 44% in 2023.
6
1
119
@DZhang50
Dan Zhang
2 years
EA is full of 10-page studies that come to obvious conclusions they could have reached with a 10-minute conversation with an actual expert in the field 😉
@MariusHobbhahn
Marius Hobbhahn
2 years
We modeled the performance of FET-based GPUs assuming that transistor miniaturization will hit a limit before reaching the size of a silicon atom. Our model predicts that performance will plateau between 2027 and 2035 at ~1e14 to 1e15 FLOP/s in FP32. 1/n
Tweet media one
11
38
188
14
5
114
@DZhang50
Dan Zhang
3 years
I wrote a paper with a few Google colleagues about FAST, a new technique for building specialized ML hardware accelerators that improve computer vision inference performance by up to 6x relative to TPU-v3! (1/5)
2
29
109
@DZhang50
Dan Zhang
2 years
The current hottest commodity: G-Research (not Google Research) boat party invite!
Tweet media one
5
4
95
@DZhang50
Dan Zhang
2 years
In the age of ChatGPT/Galactica, the value of unverified 10-page essays drops to zero. What matters now, more than ever, is fact checking. To remain grounded in reality, Effective Altruists must work with actual domain experts. Stop paying novices, and start paying experts.
5
8
81
@DZhang50
Dan Zhang
2 years
@minimaxir Google already won it when someone sacrificed their career on behalf of LaMDA 🤔
0
0
74
@DZhang50
Dan Zhang
3 years
@JeffDean 's terrific keynote at DAC 2021, "The Potential for Machine Learning for Hardware Design," features the work of many Googlers, including our work on FAST. The whole talk is worth watching; FAST is discussed from 30:51 to 39:45 (4/4)
1
7
71
@DZhang50
Dan Zhang
7 months
I'm surprised by the recent Groq discourse. One point I think many are missing is that this analysis typically compares hundreds of Groq chips with a couple of GPUs. A proper analysis should normalize for the Total Cost of Ownership (TCO) of the platform. 1/
@madiator
Mahesh Sathiamoorthy
7 months
Ok so a few days back when I posted this, nobody noticed. And now my timeline is full of Groq. Look at this throughput! I think the founders are ex-TPU folks, a great testament to Google engineering. :)
Tweet media one
7
10
130
12
9
66
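A minimal sketch of the TCO-normalized comparison in the tweet above; the throughputs, chip counts, prices, and power draws below are all hypothetical placeholders, not figures from the thread.

```python
# Hypothetical TCO-normalized throughput comparison (illustrative numbers only).

def tco_per_year(unit_price, num_units, watts_per_unit,
                 dollars_per_kwh=0.10, amortization_years=3):
    """Rough yearly TCO: amortized capital cost plus energy cost."""
    capex = unit_price * num_units / amortization_years
    opex = num_units * watts_per_unit / 1000 * 24 * 365 * dollars_per_kwh
    return capex + opex

# Deployment A: hundreds of small chips (hypothetical numbers).
a_tokens_per_s = 500 * 300    # tokens/s per chip * number of chips
a_tco = tco_per_year(unit_price=20_000, num_units=300, watts_per_unit=200)

# Deployment B: a handful of big GPUs (hypothetical numbers).
b_tokens_per_s = 10_000 * 8
b_tco = tco_per_year(unit_price=30_000, num_units=8, watts_per_unit=700)

# Raw throughput favors A; normalizing by TCO can flip the ranking.
print("A tokens/s per TCO dollar:", a_tokens_per_s / a_tco)
print("B tokens/s per TCO dollar:", b_tokens_per_s / b_tco)
```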
@DZhang50
Dan Zhang
2 years
@typedfemale And once by Jurgen Schmidhuber 😉
1
0
58
@DZhang50
Dan Zhang
1 year
Google just launched Cloud TPU v5e! I helped with the performance methodology for the launch 🙂
@SumitGup
Sumit Gupta
1 year
Google Cloud announces our next-generation TPU: Cloud TPU v5e. TPU v5e delivers up to 2x higher training performance per dollar and up to 2.5x higher inference performance per dollar for LLMs and gen AI models compared to Cloud TPU v4.
0
6
36
3
0
60
@DZhang50
Dan Zhang
2 years
@alepiad Exactly. College teaches you how to learn while balancing multiple projects under pressure and time constraints. The exact material frequently doesn't matter, since the goal is simply repetition until you learn how to learn.
4
1
55
@DZhang50
Dan Zhang
2 years
@karpathy Why doesn't the compiler just automatically pad to the nearest multiple of 64? It seems that this padding decision should be part of the compiler's cost estimation.
4
0
55
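For context, a sketch of the round-up padding being proposed, assuming it is applied at the tensor-shape level; a well-known example is padding a GPT-2-style vocabulary of 50257 up to 50304.

```python
def pad_to_multiple(dim, multiple=64):
    """Round a tensor dimension up to the nearest multiple (here, of 64)."""
    return ((dim + multiple - 1) // multiple) * multiple

# A cost-model-driven compiler could weigh the wasted FLOPs of padding
# against the better tile/memory alignment of the padded shape.
assert pad_to_multiple(50257) == 50304   # 50304 = 786 * 64
```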
@DZhang50
Dan Zhang
1 year
@MattNiessner This seems inaccurate. Another major incentive is the advantage in infrastructure, data, and other resources. And of course, the sheer number of world class colleagues!
1
0
54
@DZhang50
Dan Zhang
2 years
@typedfemale Usually that's a simple class project for a graduate-level computer architecture course
3
0
41
@DZhang50
Dan Zhang
1 year
The NVIDIA H100 (on the left) sells for around $40-50K. Guess the MSRP of the chip on the right 🤔
Tweet media one
Tweet media two
7
5
46
@DZhang50
Dan Zhang
2 years
@CkLorentzen Empirically, I've found that self-identified polymaths tend not to be all that they seem.
Tweet media one
0
0
42
@DZhang50
Dan Zhang
7 years
@ingridavendano Alex Gulakov went to my school @UTAustin and was banned from our CS fb group for harassment. He has a long history of awful behavior and has gotten even more extreme since.
0
7
41
@DZhang50
Dan Zhang
1 year
Announcing the 6th ML for Systems workshop at @NeurIPSConf ! This year, we’re also interested in LLMs for systems, and ML for solving large-scale training/serving problems. Four-page extended abstracts are due on Sep 29, 2023. CFP: #mlforsystems
0
12
42
@DZhang50
Dan Zhang
10 months
Google is arguably the only company with both SOTA ML models and ML accelerator hardware. What if we could take advantage of this through automated HW/SW codesign? 🤔😉
@JeffDean
Jeff Dean (@🏡)
10 months
I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks,
Tweet media one
Tweet media two
273
3K
13K
3
1
40
@DZhang50
Dan Zhang
1 year
@O42nl This is me except I eat free Google food on weekdays instead of cat food 😅
0
0
32
@DZhang50
Dan Zhang
2 years
@typedfemale OpenAI: *refuses to provide GPT-4 details due to competitive pressure* Alphabet: *releases YouTube Shorts*
0
0
32
@DZhang50
Dan Zhang
3 years
Our work on Full-Stack Accelerator Search (FAST), accepted at ASPLOS 2022, shows that inference accelerators optimized for SOTA models like EfficientNet and BERT can achieve ~4x Perf/TDP relative to TPUv3 and can be ROI-positive at moderate datacenter deployment sizes. (2/4)
2
1
31
@DZhang50
Dan Zhang
2 years
@typedfemale I think this is evidence that EA doesn't actually believe capabilities research is bad. This is a great example of stated preferences vs revealed preferences.
3
0
27
@DZhang50
Dan Zhang
1 year
@francoisfleuret The complexity isn't the ML accelerator hardware itself, it's in the compiler. Also, accelerator architects frequently shoot themselves in the foot by offloading too much complexity to the compiler.
3
1
29
@DZhang50
Dan Zhang
2 years
@pgolding @alepiad All classes teach how to learn, assuming you don't already know the material. These principles can also be learned outside of college, but many people find the structure that college provides to be useful.
0
0
25
@DZhang50
Dan Zhang
2 years
In general though, not to pick on this work specifically, it exemplifies many characteristics of EA work: a lack of domain expertise, resulting in major errors and misunderstandings - yet sometimes coincidentally arriving at the right conclusion 😉 9/
2
1
25
@DZhang50
Dan Zhang
2 years
Another key characteristic of EA work: an attempt at credibility via obfuscation. This characteristic is demonstrated here via overly complex (and entirely unnecessary) mathematical models. If it has math, it surely must be correct, right? 😉 10/
0
0
23
@DZhang50
Dan Zhang
2 years
@typedfemale "Due to competitive pressure, the movie doesn't actually contain any details about large language models"
1
1
24
@DZhang50
Dan Zhang
2 years
@ESYudkowsky Did you previously believe that the algorithmic bias people didn't have a point?
1
1
24
@DZhang50
Dan Zhang
2 years
The claim that chip design experts are all "hiding behind walls of NDAs and can only vaguely gesture at things" is incorrect. As shown above, I was able to produce technical details using only publicly available sources 🙂 6/
1
0
20
@DZhang50
Dan Zhang
2 years
Great keynote by @JeffDean at the ML for Systems workshop at NeurIPS!
Tweet media one
1
2
24
@DZhang50
Dan Zhang
2 years
Technology node shrinks are effectively a series of one-off improvements. The big upcoming improvement is Gate All Around. After that, things are hazy. That's why Moore's Law is always claimed to be over in 10 years - bc we haven't invented enough one-off improvements yet. 3/
Tweet media one
3
1
20
@DZhang50
Dan Zhang
10 months
I'll be attending NeurIPS next week, as a co-organizer for the ML for Systems workshop. Please feel free to reach out if you're interested in LLM serving efficiency, or HW/SW codesign opportunities in GDM!
0
1
22
@DZhang50
Dan Zhang
2 years
@thegautamkamath Generally speaking, post-tenure stress is self-imposed by defining too-aggressive goals for yourself. The obvious solution is to define less-aggressive goals for yourself, but I think a lot of people don't wish to do so.
0
0
20
@DZhang50
Dan Zhang
2 years
The claim that we're ~10 years from hitting the limit of a silicon atom is wrong. The NVIDIA H100 (TSMC 4nm) has 80B transistors on an 814mm^2 die, for a density of ~100M/mm^2. A silicon atom has a radius of 0.2nm; a grid of 1nm*1nm transistors would have a density of 1,000,000M/mm^2. 4/
2
0
17
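Spelling out the tweet's arithmetic:

```python
# Density of the H100, using the figures cited above.
transistors = 80e9           # 80B transistors
die_area_mm2 = 814           # 814 mm^2 die
print(transistors / die_area_mm2 / 1e6)   # ~98, i.e. ~100M transistors/mm^2

# A hypothetical grid of 1nm x 1nm transistors:
nm2_per_mm2 = (1e6) ** 2     # 1 mm = 1e6 nm, so 1e12 nm^2 per mm^2
print(nm2_per_mm2 / 1e6)     # 1,000,000M transistors/mm^2, ~10,000x denser
```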
@DZhang50
Dan Zhang
2 years
@MichaelTrazzi @jackclarkSF The NVIDIA work is great, but is limited to optimizing parallel prefix trees. Here's some work from Google Brain that uses ML to design optimized ML accelerators (hardware+scheduler) targeting emerging ML workloads!
1
2
18
@DZhang50
Dan Zhang
9 months
@xanderai @paulg Good point! The 44% reduction stat is based on data from Indeed, measured from mid-2022 to mid-2023. I found a similar stat below. You can see that overall job postings peaked in mid-2022, implying that this is a deceptive stat measuring from historical highs to now.
Tweet media one
2
0
17
@DZhang50
Dan Zhang
2 years
@typedfemale NVIDIA was slightly annoyed that I used the official DGX A100 MSRP in my paper. Like bruh, I can only use publicly available data. If you want your numbers to look better, don't be greedy and set a lower MSRP 😅
0
0
16
@DZhang50
Dan Zhang
2 years
Getting wined and dined by Jane Street 😳
Tweet media one
4
0
19
@DZhang50
Dan Zhang
2 years
People who are really serious about ML should make their own hardware
Tweet media one
1
2
17
@DZhang50
Dan Zhang
1 year
Google DeepMind, ex-Brain
@demishassabis
Demis Hassabis
1 year
The phenomenal teams from Google Research’s Brain and @DeepMind have made many of the seminal research advances that underpin modern AI, from Deep RL to Transformers. Now we’re joining forces as a single unit, Google DeepMind, which I’m thrilled to lead!
158
654
4K
1
1
18
@DZhang50
Dan Zhang
2 years
I'll be attending NeurIPS for the first time! Hmu if you're interested in chatting about future ML accelerators 🤠
1
0
17
@DZhang50
Dan Zhang
3 years
Building custom accelerators has traditionally required large engineering teams working for many years. ML has the potential to change the game, simultaneously reducing engineering time while improving quality of results across the entire chip design stack. (3/4)
1
2
17
@DZhang50
Dan Zhang
2 years
@lucy_guo FANG employee alarm 😎
Tweet media one
1
0
16
@DZhang50
Dan Zhang
2 years
@cis_female Usually, only about 10-20% of a GPU consists of arithmetic units.
1
0
15
@DZhang50
Dan Zhang
11 months
Are you interested in using ML to improve chip design? Come join us at Google DeepMind!
0
3
19
@DZhang50
Dan Zhang
10 months
The ML for Systems workshop is kicking off! First, our special invited speaker Bill Dally, Chief Scientist and SVP of Research at NVIDIA!
Tweet media one
Tweet media two
0
1
17
@DZhang50
Dan Zhang
2 years
The "obvious conclusion" in this case is that process technology advances will probably run out of steam somewhere around 2035. However, it's not because we'll hit the size of a silicon atom. Instead, it's because we're currently out of new ideas. 2/
2
1
14
@DZhang50
Dan Zhang
2 years
@Scobleizer They probably couldn't get the ML compiler fully working in time 😉
1
0
15
@DZhang50
Dan Zhang
1 year
@abhi_venigalla @boborado GH200 isn't designed for LLMs. It's designed for large embedding models.
1
0
14
@DZhang50
Dan Zhang
2 years
@micsolana Here are the members of the FTX polycule
Tweet media one
Tweet media two
1
1
14
@DZhang50
Dan Zhang
2 years
The claim that peak FP32 performance will be capped at 1e15 (1 petaflop/s) is wrong. In reality, we could hit that very soon. For example, the NVIDIA H100 supports ~500 TFLOPS (0.5PFLOPS) TF32, 1PFLOPS BF16, 2PFLOPS FP8. Not FP32, but we could theoretically design one. 5/
1
0
11
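Expressed in FLOP/s (using the figures cited in the tweet), the comparison with the claimed 1e15 cap is direct:

```python
# H100 throughput figures as cited above, in FLOP/s.
tf32 = 0.5e15   # ~500 TFLOP/s TF32
bf16 = 1.0e15   # ~1 PFLOP/s BF16
fp8  = 2.0e15   # ~2 PFLOP/s FP8
claimed_cap = 1e15
print(bf16 >= claimed_cap, fp8 >= claimed_cap)   # the "2027-2035 plateau", already shipped
```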
@DZhang50
Dan Zhang
2 years
Coincidentally though, the high level conclusions are all correct. "The current paradigm of field-effect transistor-based GPUs will plateau sometime between 2027 and 2035" is true (for now) bc we're currently out of ideas beyond then, as described above. 7/
2
0
11
@DZhang50
Dan Zhang
2 years
@cis_female AMD has a strong focus on scientific computing, which requires FP64.
2
0
14
@DZhang50
Dan Zhang
7 months
@PandaAshwinee You already answered your own question. Groq is setting the API price artificially low for marketing reasons. If they run out of capacity on their initial deployment you'll get rate limited, since they're not making money off this.
1
1
14
@DZhang50
Dan Zhang
1 year
ANSWER: the chip on the right is the AMD Radeon VII, a consumer graphics card built on TSMC 7nm and released in 2019 at an official MSRP of $699.
2
2
13
@DZhang50
Dan Zhang
10 months
@MatthewJBar This doesn't seem to be the correct takeaway. I think Gemini is a case study demonstrating that, even when caught off-guard, Google can match or exceed whatever OpenAI launches in less than a year (GPT-4 launched on 3/14) through good execution.
5
0
13
@DZhang50
Dan Zhang
2 years
@trevorycai You are both wrong (or both right). We need to co-optimize hardware and software. Keeping one constant while only optimizing the other is the wrong approach.
3
0
13
@DZhang50
Dan Zhang
2 years
@srush_nlp I believe that TPUs, rather than GPUs, are fully deterministic. Therefore, if you were to run the same model on a TPU then you should have full determinism. Perhaps you should lobby OpenAI to switch to TPUs 😉
2
0
12
@DZhang50
Dan Zhang
2 years
@zacharynado This appears to be from 2016
1
0
11
@DZhang50
Dan Zhang
3 years
TLDR: we made specialized hardware chips that can do machine learning faster :)
1
0
12
@DZhang50
Dan Zhang
1 year
@8teAPi @TheXeophon Two of the three authors are now at OpenAI 🤔
0
0
10
@DZhang50
Dan Zhang
3 years
@giladrom @aBarnes94 @Jason I remember Apple did have free apples, but they removed them near the end of my internship
0
0
10
@DZhang50
Dan Zhang
2 years
@cis_female Not sure, but probably yes. SRAM is usually about 50% of total chip area. My guess is that the H100 will have higher peak performance but lower utilization vs the A100.
3
0
8
@DZhang50
Dan Zhang
2 years
@0w3nl Now measure the runtime
2
0
10
@DZhang50
Dan Zhang
3 years
FAST extends previous work by expanding the design search space to up to 10^2300 points, covering not just the hardware datapath but also software scheduling and compiler decisions, including padding and fusion. Fusion is key since it addresses memory bandwidth bottlenecks. (2/5)
1
0
10
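A toy illustration of how a co-design space reaches an exponent in the thousands: the number of candidate designs is the product of the choices available for every datapath, scheduling, padding, and fusion decision. The option counts below are entirely hypothetical; only the 10^2300 figure comes from the tweet.

```python
import math

# Hypothetical option counts: a few datapath parameters with many settings,
# plus many per-op scheduling/padding/fusion decisions with a few settings each.
decisions = [8] * 20 + [4] * 500 + [2] * 5000
log10_space = sum(math.log10(n) for n in decisions)
print(f"~10^{log10_space:.0f} candidate designs")   # ~10^1824
```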
@DZhang50
Dan Zhang
3 years
I'll be giving my first-ever keynote at ISVLSI this Friday, July 9th at 8:30AM PST, titled "Transforming Chip Design in the Age of Machine Learning," which will discuss work from the ML for Systems team within Google Brain. Registration is free! (1/2)
1
1
10
@DZhang50
Dan Zhang
2 years
@ethanCaballero Google is still hiring!
1
0
9
@DZhang50
Dan Zhang
1 year
@typedfemale What's next, SSMs? 🤔
0
0
9
@DZhang50
Dan Zhang
2 years
They're not letting anyone in 🤗
Tweet media one
1
1
9
@DZhang50
Dan Zhang
2 years
@kipperrii The bay area Haskell cults should count as math cults
0
0
9
@DZhang50
Dan Zhang
3 years
Accelerating ML inference is important because low latency and high throughput are required to launch models in production at scale. If an application is sufficiently important and high-volume, it can make sense to customize a chip for this purpose. (3/5)
1
0
9
@DZhang50
Dan Zhang
2 years
@typedfemale It's fast on FPGAs
1
0
9
@DZhang50
Dan Zhang
1 year
@tszzl The simplest answer is that even if Dojo has completely identical performance, Tesla avoids paying NVIDIA's ridiculous ~10x profit margin. This means you can buy ~10x chips for the same budget.
1
0
8
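The budget arithmetic under the tweet's ~10x-margin assumption, with illustrative prices:

```python
# Illustrative only: a fixed budget buys ~10x more chips at cost
# if a ~10x vendor margin is avoided.
budget = 100e6                      # hypothetical $100M accelerator budget
vendor_price = 30_000               # hypothetical vendor chip price
cost_to_build = vendor_price / 10   # the tweet's ~10x margin assumption
print(budget / vendor_price)        # ~3,333 chips bought from the vendor
print(budget / cost_to_build)       # ~33,333 chips built in-house
```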
@DZhang50
Dan Zhang
2 years
RIP
@intelnews
Intel News
2 years
Gordon Moore, Intel Co-Founder who set the course for the future of the semiconductor industry, has passed away at the age of 94.
87
941
2K
0
0
8
@DZhang50
Dan Zhang
1 year
Is this an aligned LLM?
@KennethCassel
Kenneth Cassel
1 year
Palantir is building a chat LLM interface for war (full vid here)
96
161
1K
1
1
6
@DZhang50
Dan Zhang
2 years
@ethanCaballero @MitchellAGordon The problem with this graph is obviously that the x-axis measures FLOPs rather than step time. The attention component doesn't contain many FLOPs but takes a lot of time due to low compute utilization.
2
0
8
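A sketch of the point with hypothetical numbers: estimated runtime is FLOPs divided by achieved FLOP/s, so a component with few FLOPs but low utilization can still dominate step time.

```python
peak_flops = 1e15   # hypothetical accelerator peak FLOP/s

def est_time_s(flops, utilization):
    """Estimated runtime given achieved utilization of peak throughput."""
    return flops / (peak_flops * utilization)

matmul_flops, attn_flops = 9e14, 1e14        # attention: ~10% of total FLOPs...
t_matmul = est_time_s(matmul_flops, 0.50)    # dense matmuls run efficiently
t_attn   = est_time_s(attn_flops, 0.05)      # attention: low utilization
print(t_matmul, t_attn)                      # 1.8s vs 2.0s: comparable step time
```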
@DZhang50
Dan Zhang
2 years
@cHHillee Very cool! Can Triton be easily extended to target other devices, e.g. AMD GPUs or even TPUs?
1
0
8
@DZhang50
Dan Zhang
3 years
Finally, techniques like this are important because they can shorten the chip design process from years to potentially months. By broadening the set of workloads FAST targets, it can also be used to automatically design general-purpose ML accelerators. (5/5)
1
0
8
@DZhang50
Dan Zhang
1 year
@typedfemale I think it's actually the opposite: if your idea works, you are quickly promoted. If your idea is bad and doesn't work, you can still publicly spin your work as being good and switch to a different company.
0
0
7
@DZhang50
Dan Zhang
2 years
@jbhuang0604 "Rankings are good only if they rank us highly" 🙂
1
0
7
@DZhang50
Dan Zhang
2 years
They're probably still recovering from food poisoning 🤔
@mobav0
Mo Bavarian
2 years
What’s the deal with people with @NeurIPS in their handle. Have they gotten stuck in New Orleans for this whole time?
2
2
72
1
0
7
@DZhang50
Dan Zhang
1 year
@typedfemale But do alignment researchers get more compute than capabilities researchers? 🤔
2
0
7
@DZhang50
Dan Zhang
2 years
"Offering a performance of between 1e14 and 1e15 FLOP/s in FP32" feels reasonably correct for ML hardware because people don't care as much about true FP32 performance. Instead, we care about BF16 (or lower), which is already at 1e15 FLOPS/s for the H100. 8/
1
0
5
@DZhang50
Dan Zhang
2 years
@typedfemale Transformers Are All You Need
0
0
7
@DZhang50
Dan Zhang
3 years
A key benefit of the work is that our specialized accelerators can still run other ML workloads - just not as efficiently. Relative to building a chip that can only handle a single workload, this gives engineers the flexibility to still change the model in production. (4/5)
1
0
7
@DZhang50
Dan Zhang
3 years
@giffmana This is true due to limitations in existing accelerator hardware. Our work, FAST, analyzes EfficientNet bottlenecks and shows a framework capable of automatically designing custom accelerators with 4x Perf/TDP on EfficientNet-B7 relative to TPU-v3.
2
1
7
@DZhang50
Dan Zhang
2 years
@typedfemale They really wrote 10 pages just to say "don't put all your eggs in one basket" 🤔
0
0
6
@DZhang50
Dan Zhang
1 year
@finbarrtimbers It's a pretty simple calculation, with an unsurprising conclusion (NVIDIA accelerators are too expensive). I walk through this calculation in Section 5.1: "The Economics of Specialized Accelerators":
1
0
7
@DZhang50
Dan Zhang
2 years
@_jasonwei Judging by my publication count, I must do great work 🥲
0
0
7
@DZhang50
Dan Zhang
1 year
@finbarrtimbers Just use TPUs 🥰
3
0
7
@DZhang50
Dan Zhang
2 years
@Tim_Dettmers Very cool! Do you think this technique can be easily scaled to int4?
1
0
7