![Umar Jamil Profile](https://pbs.twimg.com/profile_images/1711253980469248000/JYDEruz-_x96.jpg)
Umar Jamil
@hkproj
Followers: 9K · Following: 2K · Statuses: 471
I used to keep GPUs hot🔥, now I make them go brrrr🔥🔥🔥 @Get_Writer Join the best AI community on Discord: https://t.co/zYH1DlgdbW Opinions my own
Milan, Lombardy
Joined February 2018
In this video, I'll be deriving and coding Flash Attention from scratch. All the code will be written in Python with Triton, but no prior knowledge of CUDA or Triton is required; I'll also explain the CUDA programming model from zero. Link to the video:

I'll explore the following topics:

* Review of Multi-Head Attention
* Safe Softmax
* Online Softmax (with proof!)
* Introduction to GPUs and the CUDA programming model
* Tensor layouts: row-major layout, stride, reshape, transpose
* Block Matrix Multiplication
* Introduction to Triton
* Forward pass of Flash Attention in Triton
* How Autograd works
* What are derivatives, gradients, and Jacobians
* Jacobian of the Matrix Multiplication operation
* Jacobian of the Softmax operation
* Backward pass of Flash Attention in Triton
* Triton tricks: software pipelining

If you find this video useful, consider subscribing to my channel and sharing the video within your network of friends and colleagues. #flashattention #triton #cuda #tutorial #python #attention #transformers #deeplearning
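The online softmax mentioned in the topic list can be sketched in a few lines of plain Python; this is a minimal single-pass version for illustration (not the Triton kernel from the video), tracking a running maximum and rescaling the running sum each time a new maximum appears:

```python
import math

def online_softmax(scores):
    """Single-pass softmax: keep a running max and a running sum of
    exponentials, rescaling the sum whenever the max is updated."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in scores:
        new_max = max(running_max, x)
        # Rescale the accumulated sum to the new maximum, then add
        # the current term. exp(-inf) is 0.0, so the first step is safe.
        running_sum = running_sum * math.exp(running_max - new_max) \
                      + math.exp(x - new_max)
        running_max = new_max
    # Final normalization uses the global max and sum found in one pass.
    return [math.exp(x - running_max) / running_sum for x in scores]
```

This recovers exactly the "safe" (max-subtracted) softmax while reading the input only once, which is the property Flash Attention exploits to fuse softmax with the attention matmuls.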
@Grad62304977 @mrsiipa Yeah, that’s why they hired Noam back. He did an amazing job scaling inference at CharacterAI. I shared the two blog posts from CharacterAI that describe the arch changes they made to reduce the cost of the KV Cache.
RT @NVIDIAHPCDev: 🌟We just learned there's a 100 Day #CUDA Challenge happening. Launched by @hkproj -- there's now more than 60 coders w…
@marksaroufim @jxmnop I think it’s also about talent and docs: early stable software -> more developers using it -> companies buy expensive HW that employees can actually use -> more senior devs on the market for said HW… In AI you don’t have 6 months to train devs on new HW; the market moves too fast.
@pientropy Compute your daily cost to the company, divide it by two: that’s roughly how much using it for research saves. In Europe it depends on where you live/work, but in the US the average tech worker who needs to do research would definitely save the company money by having it.
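The back-of-the-envelope rule above can be written out explicitly; this is just a sketch of the tweet's heuristic, and the 250-workday year and the example salary figure are my own assumptions:

```python
def daily_research_value(total_annual_cost, workdays_per_year=250):
    """Heuristic from the tweet: half of an employee's daily cost is
    what the company 'saves' when a purchase lets them do research."""
    # Daily cost of the employee to the company.
    daily_cost = total_annual_cost / workdays_per_year
    # Divide by two, per the rule of thumb.
    return daily_cost / 2

# Example: a fully loaded annual cost of $250k works out to
# $1,000/workday, so ~$500/day of research value under this rule.
value = daily_research_value(250_000)
```

The point is only scale: for a senior US tech salary, the heuristic prices a research-enabling tool at hundreds of dollars per day, which dwarfs most hardware or software costs.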
Community won. I’ll describe the pipeline parallelism method used in DeepSeek V3, starting from early pipeline parallelism designs from 2018. Of course from first principles, with pen and paper. My daughter Sofia may cry in the background from time to time.
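One first-principles quantity those early 2018-era designs (GPipe-style schedules) are built around is the pipeline "bubble": the fraction of time devices sit idle while the pipeline fills and drains. A minimal sketch of that standard formula, assuming a naive fill-drain schedule with `p` stages and `m` microbatches:

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a naive GPipe-style schedule.

    With p stages and m microbatches, the schedule spans m + p - 1
    time slots per device, of which p - 1 are fill/drain bubble slots:
    bubble = (p - 1) / (m + p - 1).
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)
```

The formula makes the design pressure obvious: with few microbatches the bubble dominates (4 stages, 1 microbatch is 75% idle), and growing `m` shrinks it, which is what later schedules (1F1B and the DeepSeek V3 variant) improve on.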
If the community wants, let’s talk about pipeline parallelism in my next video. Community votes, I deliver.