Dan Zhang Profile
Dan Zhang

@DZhang50

Followers
1,965
Following
811
Media
33
Statuses
770

Researcher @ Google DeepMind | ML for Systems | Systems for ML | Computer Architecture PhD @ UT Austin🤘 | Opinions stated here are my own.

SF Bay Area, CA
Joined November 2014
Pinned Tweet
@DZhang50
Dan Zhang
3 years
Currently, datacenter ML training and inference uses commodity TPU and GPU devices optimized for a wide range of workloads. Given the extreme scale of large datacenter deployments, would it be practical to build custom accelerators optimized for specific workloads? (1/4)
7
17
154
@DZhang50
Dan Zhang
2 years
@ZoeSchiffer He left a bit after 4 years, which means the real reason is that his stock cliff hit.
6
8
229
@DZhang50
Dan Zhang
9 months
@paulg That's not what the article actually says though. If you read it, it says DEI-related job postings dropped by 44% in 2023.
6
1
119
@DZhang50
Dan Zhang
2 years
EA is full of 10-page studies that come to obvious conclusions they could have reached with a 10-minute conversation with an actual expert in the field 😉
@MariusHobbhahn
Marius Hobbhahn
2 years
We modeled the performance of FET-based GPUs assuming that transistor miniaturization will hit a limit before reaching the size of a silicon atom. Our model predicts that performance will plateau between 2027 and 2035 at ~1e14 to 1e15 FLOP/s in FP32. 1/n
Tweet media one
11
38
188
14
5
114
@DZhang50
Dan Zhang
3 years
I wrote a paper with a few Google colleagues about FAST, a new technique for building specialized ML hardware accelerators that improve computer vision inference performance by up to 6x relative to TPU-v3! (1/5)
2
29
109
@DZhang50
Dan Zhang
2 years
The current hottest commodity: G-Research (not Google Research) boat party invite!
Tweet media one
5
4
95
@DZhang50
Dan Zhang
2 years
In the age of ChatGPT/Galactica, the value of unverified 10-page essays drops to zero. What matters now, more than ever, is fact checking. To remain grounded in reality, Effective Altruists must work with actual domain experts. Stop paying novices, and start paying experts.
5
8
81
@DZhang50
Dan Zhang
2 years
@minimaxir Google already won it when someone sacrificed their career on behalf of LaMDA 🤔
0
0
74
@DZhang50
Dan Zhang
3 years
@JeffDean 's terrific keynote at DAC 2021, "The Potential for Machine Learning for Hardware Design," features the work of many Googlers, including our work on FAST. The whole talk is worth watching; FAST is discussed from 30:51 to 39:45 (4/4)
1
7
71
@DZhang50
Dan Zhang
7 months
I'm surprised by the recent Groq discourse. One point I think many are missing is that this analysis typically compares hundreds of Groq chips with a couple of GPUs. A proper analysis should normalize for the Total Cost of Ownership (TCO) of the platform. 1/
@madiator
Mahesh Sathiamoorthy
7 months
Ok so a few days back when I posted this, nobody noticed. And now my timeline is full of Groq. Look at this throughput! I think the founders are ex-TPU folks, a great testament to Google engineering. :)
Tweet media one
7
10
130
12
9
66
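A minimal sketch of the TCO-normalized comparison in the tweet above; the throughputs, chip counts, prices, and power draws below are all hypothetical placeholders, not figures from the thread.

```python
# Hypothetical TCO-normalized throughput comparison (illustrative numbers only).

def tco_per_year(unit_price, num_units, watts_per_unit,
                 dollars_per_kwh=0.10, amortization_years=3):
    """Rough yearly TCO: amortized capital cost plus energy cost."""
    capex = unit_price * num_units / amortization_years
    opex = num_units * watts_per_unit / 1000 * 24 * 365 * dollars_per_kwh
    return capex + opex

# Deployment A: hundreds of small chips (hypothetical numbers).
a_tokens_per_s = 500 * 300    # tokens/s per chip * number of chips
a_tco = tco_per_year(unit_price=20_000, num_units=300, watts_per_unit=200)

# Deployment B: a handful of big GPUs (hypothetical numbers).
b_tokens_per_s = 10_000 * 8
b_tco = tco_per_year(unit_price=30_000, num_units=8, watts_per_unit=700)

# Raw throughput favors A; normalizing by TCO can flip the ranking.
print("A tokens/s per TCO dollar:", a_tokens_per_s / a_tco)
print("B tokens/s per TCO dollar:", b_tokens_per_s / b_tco)
```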
@DZhang50
Dan Zhang
2 years
@typedfemale And once by Jurgen Schmidhuber 😉
1
0
58
@DZhang50
Dan Zhang
1 year
Google just launched Cloud TPU v5e! I helped with the performance methodology for the launch 🙂
@SumitGup
Sumit Gupta
1 year
Google Cloud announces our next-generation TPU: Cloud TPU v5e. TPU v5e delivers up to 2x higher training performance per dollar and up to 2.5x higher inference performance per dollar for LLMs and gen AI models compared to Cloud TPU v4.
0
6
36
3
0
60
@DZhang50
Dan Zhang
2 years
@alepiad Exactly. College teaches you how to learn while balancing multiple projects under pressure and time constraints. The exact material frequently doesn't matter, since the goal is simply repetition until you learn how to learn.
4
1
55
@DZhang50
Dan Zhang
2 years
@karpathy Why doesn't the compiler just automatically pad to the nearest multiple of 64? It seems that this padding decision should be part of the compiler's cost estimation.
4
0
55
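For context, a sketch of the round-up padding being proposed, assuming it is applied at the tensor-shape level; a well-known example is padding a GPT-2-style vocabulary of 50257 up to 50304.

```python
def pad_to_multiple(dim, multiple=64):
    """Round a tensor dimension up to the nearest multiple (here, of 64)."""
    return ((dim + multiple - 1) // multiple) * multiple

# A cost-model-driven compiler could weigh the wasted FLOPs of padding
# against the better tile/memory alignment of the padded shape.
assert pad_to_multiple(50257) == 50304   # 50304 = 786 * 64
```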
@DZhang50
Dan Zhang
1 year
@MattNiessner This seems inaccurate. Another major incentive is the advantage in infrastructure, data, and other resources. And of course, the sheer number of world class colleagues!
1
0
54
@DZhang50
Dan Zhang
2 years
@typedfemale Usually that's a simple class project for a graduate-level computer architecture course
3
0
41
@DZhang50
Dan Zhang
1 year
The NVIDIA H100 (on the left) sells for around $40-50K. Guess the MSRP of the chip on the right 🤔
Tweet media one
Tweet media two
7
5
46
@DZhang50
Dan Zhang
2 years
@CkLorentzen Empirically, I've found that self-identified polymaths tend not to be all that they seem.
Tweet media one
0
0
42
@DZhang50
Dan Zhang
7 years
@ingridavendano Alex Gulakov went to my school @UTAustin and was banned from our CS fb group for harassment. He has a long history of awful behavior and has gotten even more extreme since.
0
7
41
@DZhang50
Dan Zhang
1 year
Announcing the 6th ML for Systems workshop at @NeurIPSConf ! This year, we’re also interested in LLMs for systems, and ML for solving large-scale training/serving problems. Four-page extended abstracts are due on Sep 29, 2023. CFP: #mlforsystems
0
12
42
@DZhang50
Dan Zhang
10 months
Google is arguably the only company with both SOTA ML models and ML accelerator hardware. What if we could take advantage of this through automated HW/SW codesign? 🤔😉
@JeffDean
Jeff Dean (@🏡)
10 months
I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks,
Tweet media one
Tweet media two
273
3K
13K
3
1
40
@DZhang50
Dan Zhang
1 year
@O42nl This is me except I eat free Google food on weekdays instead of cat food 😅
0
0
32
@DZhang50
Dan Zhang
2 years
@typedfemale OpenAI: *refuses to provide GPT-4 details due to competitive pressure* Alphabet: *releases YouTube Shorts*
0
0
32
@DZhang50
Dan Zhang
3 years
Our work on Full-Stack Accelerator Search (FAST), accepted at ASPLOS 2022, shows that inference accelerators optimized for SOTA models like EfficientNet and BERT can achieve ~4x Perf/TDP relative to TPUv3 and can be ROI-positive at moderate datacenter deployment sizes. (2/4)
2
1
31
@DZhang50
Dan Zhang
2 years
@typedfemale I think this is evidence that EA doesn't actually believe capabilities research is bad. This is a great example of stated preferences vs revealed preferences.
3
0
27
@DZhang50
Dan Zhang
1 year
@francoisfleuret The complexity isn't the ML accelerator hardware itself, it's in the compiler. Also, accelerator architects frequently shoot themselves in the foot by offloading too much complexity to the compiler.
3
1
29
@DZhang50
Dan Zhang
2 years
@pgolding @alepiad All classes teach how to learn, assuming you don't already know the material. These principles can also be learned outside of college, but many people find the structure that college provides to be useful.
0
0
25
@DZhang50
Dan Zhang
2 years
In general though, not to pick on this work specifically, it exemplifies many characteristics of EA work: a lack of domain expertise, resulting in major errors and misunderstandings - yet sometimes coincidentally arriving at the right conclusion 😉 9/
2
1
25
@DZhang50
Dan Zhang
2 years
Another key characteristic of EA work: an attempt at credibility via obfuscation. This characteristic is demonstrated here via overly complex (and entirely unnecessary) mathematical models. If it has math, it surely must be correct, right? 😉 10/
0
0
23
@DZhang50
Dan Zhang
2 years
@typedfemale "Due to competitive pressure, the movie doesn't actually contain any details about large language models"
1
1
24
@DZhang50
Dan Zhang
2 years
@ESYudkowsky Did you previously believe that the algorithmic bias people didn't have a point?
1
1
24
@DZhang50
Dan Zhang
2 years
The claim that chip design experts are all "hiding behind walls of NDAs and can only vaguely gesture at things" is incorrect. As shown above, I was able to produce technical details using only publicly available sources 🙂 6/
1
0
20
@DZhang50
Dan Zhang
2 years
Great keynote by @JeffDean at the ML for Systems workshop at NeurIPS!
Tweet media one
1
2
24
@DZhang50
Dan Zhang
2 years
Technology node shrinks are effectively a series of one-off improvements. The big upcoming improvement is Gate All Around. After that, things are hazy. That's why Moore's Law is always claimed to be over in 10 years - bc we haven't invented enough one-off improvements yet. 3/
Tweet media one
3
1
20
@DZhang50
Dan Zhang
10 months
I'll be attending NeurIPS next week, as a co-organizer for the ML for Systems workshop. Please feel free to reach out if you're interested in LLM serving efficiency, or HW/SW codesign opportunities in GDM!
0
1
22
@DZhang50
Dan Zhang
2 years
@thegautamkamath Generally speaking, post-tenure stress is self-imposed by defining too-aggressive goals for yourself. The obvious solution is to define less-aggressive goals for yourself, but I think a lot of people don't wish to do so.
0
0
20
@DZhang50
Dan Zhang
2 years
The claim that we're ~10 years from hitting the limit of a silicon atom is wrong. The NVIDIA H100 (TSMC 4nm) has 80B transistors on an 814mm^2 die, for a density of ~100M/mm^2. A silicon atom has a radius of 0.2nm; a grid of 1nm*1nm transistors would have a density of 1,000,000M/mm^2. 4/
2
0
17
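Spelling out the tweet's arithmetic:

```python
# Density of the H100, using the figures cited above.
transistors = 80e9           # 80B transistors
die_area_mm2 = 814           # 814 mm^2 die
print(transistors / die_area_mm2 / 1e6)   # ~98, i.e. ~100M transistors/mm^2

# A hypothetical grid of 1nm x 1nm transistors:
nm2_per_mm2 = (1e6) ** 2     # 1 mm = 1e6 nm, so 1e12 nm^2 per mm^2
print(nm2_per_mm2 / 1e6)     # 1,000,000M transistors/mm^2, ~10,000x denser
```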
@DZhang50
Dan Zhang
2 years
@MichaelTrazzi @jackclarkSF The NVIDIA work is great, but is limited to optimizing parallel prefix trees. Here's some work from Google Brain that uses ML to design optimized ML accelerators (hardware+scheduler) targeting emerging ML workloads!
1
2
18
@DZhang50
Dan Zhang
9 months
@xanderai @paulg Good point! The 44% reduction stat is based on data from Indeed, measured from mid-2022 to mid-2023. I found a similar stat below. You can see that overall job postings peaked in mid-2022, implying that this is a deceptive stat measuring from historical highs to now.
Tweet media one
2
0
17
@DZhang50
Dan Zhang
2 years
@typedfemale NVIDIA was slightly annoyed that I used the official DGX A100 MSRP in my paper. Like bruh, I can only use publicly available data. If you want your numbers to look better, don't be greedy and set a lower MSRP 😅
0
0
16
@DZhang50
Dan Zhang
2 years
Getting wined and dined by Jane Street 😳
Tweet media one
4
0
19
@DZhang50
Dan Zhang
2 years
People who are really serious about ML should make their own hardware
Tweet media one
1
2
17
@DZhang50
Dan Zhang
1 year
Google DeepMind, ex-Brain
@demishassabis
Demis Hassabis
1 year
The phenomenal teams from Google Research’s Brain and @DeepMind have made many of the seminal research advances that underpin modern AI, from Deep RL to Transformers. Now we’re joining forces as a single unit, Google DeepMind, which I’m thrilled to lead!
158
654
4K
1
1
18
@DZhang50
Dan Zhang
2 years
I'll be attending NeurIPS for the first time! Hmu if you're interested in chatting about future ML accelerators 🤠
1
0
17
@DZhang50
Dan Zhang
3 years
Building custom accelerators has traditionally required large engineering teams working for many years. ML has the potential to change the game, simultaneously reducing engineering time while improving quality of results across the entire chip design stack. (3/4)
1
2
17
@DZhang50
Dan Zhang
2 years
@lucy_guo FANG employee alarm 😎
Tweet media one
1
0
16
@DZhang50
Dan Zhang
2 years
@cis_female Usually, only about 10-20% of a GPU consists of arithmetic units.
1
0
15
@DZhang50
Dan Zhang
11 months
Are you interested in using ML to improve chip design? Come join us at Google DeepMind!
0
3
19
@DZhang50
Dan Zhang
10 months
The ML for Systems workshop is kicking off! First, our special invited speaker Bill Dally, Chief Scientist and SVP of Research at NVIDIA!
Tweet media one
Tweet media two
0
1
17
@DZhang50
Dan Zhang
2 years
The "obvious conclusion" in this case is that process technology advances will probably run out of steam somewhere around 2035. However, it's not because we'll hit the size of a silicon atom. Instead, it's because we're currently out of new ideas. 2/
2
1
14
@DZhang50
Dan Zhang
2 years
@Scobleizer They probably couldn't get the ML compiler fully working in time 😉
1
0
15
@DZhang50
Dan Zhang
1 year
@abhi_venigalla @boborado GH200 isn't designed for LLMs. It's designed for large embedding models.
1
0
14
@DZhang50
Dan Zhang
2 years
@micsolana Here are the members of the FTX polycule
Tweet media one
Tweet media two
1
1
14
@DZhang50
Dan Zhang
2 years
The claim that peak FP32 performance will be capped at 1e15 (1 petaflop/s) is wrong. In reality, we could hit that very soon. For example, the NVIDIA H100 supports ~500 TFLOPS (0.5PFLOPS) TF32, 1PFLOPS BF16, 2PFLOPS FP8. Not FP32, but we could theoretically design one. 5/
1
0
11
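Expressed in FLOP/s (using the figures cited in the tweet), the comparison with the claimed 1e15 cap is direct:

```python
# H100 throughput figures as cited above, in FLOP/s.
tf32 = 0.5e15   # ~500 TFLOP/s TF32
bf16 = 1.0e15   # ~1 PFLOP/s BF16
fp8  = 2.0e15   # ~2 PFLOP/s FP8
claimed_cap = 1e15
print(bf16 >= claimed_cap, fp8 >= claimed_cap)   # the "2027-2035 plateau", already shipped
```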
@DZhang50
Dan Zhang
2 years
Coincidentally though, the high level conclusions are all correct. "The current paradigm of field-effect transistor-based GPUs will plateau sometime between 2027 and 2035" is true (for now) bc we're currently out of ideas beyond then, as described above. 7/
2
0
11
@DZhang50
Dan Zhang
2 years
@cis_female AMD has a strong focus on scientific computing, which requires FP64.
2
0
14
@DZhang50
Dan Zhang
7 months
@PandaAshwinee You already answered your own question. Groq is setting the API price artificially low for marketing reasons. If they run out of capacity on their initial deployment you'll get rate limited, since they're not making money off this.
1
1
14
@DZhang50
Dan Zhang
1 year
ANSWER: the chip on the right is the AMD Radeon VII, a consumer graphics card built on TSMC 7nm and released in 2019 at an official MSRP of $699.
2
2
13
@DZhang50
Dan Zhang
10 months
@MatthewJBar This doesn't seem to be the correct takeaway. I think Gemini is a case study demonstrating that, even when caught off-guard, Google can match or exceed whatever OpenAI launches in less than a year (GPT-4 launched on 3/14) through good execution.
5
0
13
@DZhang50
Dan Zhang
2 years
@trevorycai You are both wrong (or both right). We need to co-optimize hardware and software. Keeping one constant while only optimizing the other is the wrong approach.
3
0
13
@DZhang50
Dan Zhang
2 years
@srush_nlp I believe that TPUs, rather than GPUs, are fully deterministic. Therefore, if you were to run the same model on a TPU then you should have full determinism. Perhaps you should lobby OpenAI to switch to TPUs 😉
2
0
12
@DZhang50
Dan Zhang
2 years
@zacharynado This appears to be from 2016
1
0
11
@DZhang50
Dan Zhang
3 years
TLDR: we made specialized hardware chips that can do machine learning faster :)
1
0
12
@DZhang50
Dan Zhang
1 year
@8teAPi @TheXeophon Two of the three authors are now at OpenAI 🤔
0
0
10
@DZhang50
Dan Zhang
3 years
@giladrom @aBarnes94 @Jason I remember Apple did have free apples, but they removed them near the end of my internship
0
0
10
@DZhang50
Dan Zhang
2 years
@cis_female Not sure, but probably yes. SRAM is usually about 50% of total chip area. My guess is that the H100 will have higher peak performance but lower utilization vs the A100.
3
0
8
@DZhang50
Dan Zhang
2 years
@0w3nl Now measure the runtime
2
0
10
@DZhang50
Dan Zhang
3 years
FAST extends previous work by expanding the design search space to up to 10^2300 points, covering not just the hardware datapath but also software scheduling and compiler decisions, including padding and fusion. Fusion is key since it addresses memory bandwidth bottlenecks. (2/5)
1
0
10
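A toy illustration of how a co-design space reaches an exponent in the thousands: the number of candidate designs is the product of the choices available for every datapath, scheduling, padding, and fusion decision. The option counts below are entirely hypothetical; only the 10^2300 figure comes from the tweet.

```python
import math

# Hypothetical option counts: a few datapath parameters with many settings,
# plus many per-op scheduling/padding/fusion decisions with a few settings each.
decisions = [8] * 20 + [4] * 500 + [2] * 5000
log10_space = sum(math.log10(n) for n in decisions)
print(f"~10^{log10_space:.0f} candidate designs")   # ~10^1824
```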
@DZhang50
Dan Zhang
3 years
I'll be giving my first-ever keynote at ISVLSI this Friday, July 9th at 8:30AM PST, titled "Transforming Chip Design in the Age of Machine Learning," which will discuss work from the ML for Systems team within Google Brain. Registration is free! (1/2)
1
1
10
@DZhang50
Dan Zhang
2 years
@ethanCaballero Google is still hiring!
1
0
9
@DZhang50
Dan Zhang
1 year
@typedfemale What's next, SSMs? 🤔
0
0
9
@DZhang50
Dan Zhang
2 years
They're not letting anyone in 🤗
Tweet media one
1
1
9
@DZhang50
Dan Zhang
2 years
@kipperrii The bay area Haskell cults should count as math cults
0
0
9
@DZhang50
Dan Zhang
3 years
Accelerating ML inference is important because low latency and high throughput are required to launch models in production at scale. If an application is sufficiently important and high-volume, it can make sense to customize a chip for this purpose. (3/5)
1
0
9
@DZhang50
Dan Zhang
2 years
@typedfemale It's fast on FPGAs
1
0
9
@DZhang50
Dan Zhang
1 year
@tszzl The simplest answer is that even if Dojo has completely identical performance, Tesla avoids paying NVIDIA's ridiculous ~10x profit margin. This means you can buy ~10x chips for the same budget.
1
0
8
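The budget arithmetic under the tweet's ~10x-margin assumption, with illustrative prices:

```python
# Illustrative only: a fixed budget buys ~10x more chips at cost
# if a ~10x vendor margin is avoided.
budget = 100e6                      # hypothetical $100M accelerator budget
vendor_price = 30_000               # hypothetical vendor chip price
cost_to_build = vendor_price / 10   # the tweet's ~10x margin assumption
print(budget / vendor_price)        # ~3,333 chips bought from the vendor
print(budget / cost_to_build)       # ~33,333 chips built in-house
```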
@DZhang50
Dan Zhang
2 years
RIP
@intelnews
Intel News
2 years
Gordon Moore, Intel Co-Founder who set the course for the future of the semiconductor industry, has passed away at the age of 94.
87
941
2K
0
0
8
@DZhang50
Dan Zhang
1 year
Is this an aligned LLM?
@KennethCassel
Kenneth Cassel
1 year
Palantir is building a chat LLM interface for war (full vid here)
96
161
1K
1
1
6
@DZhang50
Dan Zhang
2 years
@ethanCaballero @MitchellAGordon The problem with this graph is obviously that the x-axis measures FLOPs rather than step time. The attention component doesn't contain many FLOPs but takes a lot of time due to low compute utilization.
2
0
8
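A sketch of the point with hypothetical numbers: estimated runtime is FLOPs divided by achieved FLOP/s, so a component with few FLOPs but low utilization can still dominate step time.

```python
peak_flops = 1e15   # hypothetical accelerator peak FLOP/s

def est_time_s(flops, utilization):
    """Estimated runtime given achieved utilization of peak throughput."""
    return flops / (peak_flops * utilization)

matmul_flops, attn_flops = 9e14, 1e14        # attention: ~10% of total FLOPs...
t_matmul = est_time_s(matmul_flops, 0.50)    # dense matmuls run efficiently
t_attn   = est_time_s(attn_flops, 0.05)      # attention: low utilization
print(t_matmul, t_attn)                      # 1.8s vs 2.0s: comparable step time
```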
@DZhang50
Dan Zhang
2 years
@cHHillee Very cool! Can Triton be easily extended to target other devices, e.g. AMD GPUs or even TPUs?
1
0
8
@DZhang50
Dan Zhang
3 years
Finally, techniques like this are important because they can shorten the chip design process from years to potentially months. By broadening the set of workloads FAST targets, it can also be used to automatically design general-purpose ML accelerators. (5/5)
1
0
8
@DZhang50
Dan Zhang
1 year
@typedfemale I think it's actually the opposite: if your idea works, you are quickly promoted. If your idea is bad and doesn't work, you can still publicly spin your work as being good and switch to a different company.
0
0
7
@DZhang50
Dan Zhang
2 years
@jbhuang0604 "Rankings are good only if they rank us highly" 🙂
1
0
7
@DZhang50
Dan Zhang
2 years
They're probably still recovering from food poisoning 🤔
@mobav0
Mo Bavarian
2 years
What’s the deal with people with @NeurIPS in their handle. Have they gotten stuck in New Orleans for this whole time?
2
2
72
1
0
7
@DZhang50
Dan Zhang
1 year
@typedfemale But do alignment researchers get more compute than capabilities researchers? 🤔
2
0
7
@DZhang50
Dan Zhang
2 years
"Offering a performance of between 1e14 and 1e15 FLOP/s in FP32" feels reasonably correct for ML hardware because people don't care as much about true FP32 performance. Instead, we care about BF16 (or lower), which is already at 1e15 FLOPS/s for the H100. 8/
1
0
5
@DZhang50
Dan Zhang
2 years
@typedfemale Transformers Are All You Need
0
0
7
@DZhang50
Dan Zhang
3 years
A key benefit of the work is that our specialized accelerators can still run other ML workloads - just not as efficiently. Relative to building a chip that can only handle a single workload, this gives engineers the flexibility to still change the model in production. (4/5)
1
0
7
@DZhang50
Dan Zhang
3 years
@giffmana This is true due to limitations in existing accelerator hardware. Our work, FAST, analyzes EfficientNet bottlenecks and shows a framework capable of automatically designing custom accelerators with 4x Perf/TDP on EfficientNet-B7 relative to TPU-v3.
2
1
7
@DZhang50
Dan Zhang
2 years
@typedfemale They really wrote 10 pages just to say "don't put all your eggs in one basket" 🤔
0
0
6
@DZhang50
Dan Zhang
1 year
@finbarrtimbers It's a pretty simple calculation, with an unsurprising conclusion (NVIDIA accelerators are too expensive). I walk through this calculation in Section 5.1: "The Economics of Specialized Accelerators":
1
0
7
@DZhang50
Dan Zhang
2 years
@_jasonwei Judging by my publication count, I must do great work 🥲
0
0
7
@DZhang50
Dan Zhang
1 year
@finbarrtimbers Just use TPUs 🥰
3
0
7
@DZhang50
Dan Zhang
2 years
@Tim_Dettmers Very cool! Do you think this technique can be easily scaled to int4?
1
0
7