![Colfax International Profile](https://pbs.twimg.com/profile_images/985055732587347969/z566iDGE_x96.jpg)
Colfax International
@colfaxintl
Followers: 957
Following: 272
Statuses: 650
HPC & AI Solutions (https://t.co/VqfbQgi7kA) | Research (https://t.co/32b1YZTuB2) | Colfax Experience Center (https://t.co/cAlyTPEGOl)
Santa Clara, CA
Joined February 2009
The DeepSeek technical reports contain a wealth of information on performance optimization techniques for NVIDIA GPUs. In this short blog, we explain two aspects of their FP8 mixed-precision training methodology that build on the techniques we've been teaching in our earlier series on GEMM optimization, and in particular on the Hopper WGMMA instruction for fast matrix multiplication:
1) Periodic promotion of lower-precision FP8 WGMMA accumulators computed via Tensor Cores to full FP32 precision using CUDA Cores.
2) 128x128 blockwise and 1x128 groupwise scales for FP8 quantization of weights and activations.
We hope this provides greater depth on the specific changes you need to make to standard FP8 GEMMs to make them useful in a practical setting, such as the training setup used for the DeepSeek-V3 model. Finally, both (1) and (2) are now implemented in CUTLASS; see example 67 and the PR linked in our post. As always, beyond just using the CUTLASS API, it's a good idea to examine the source code to understand the nuts and bolts of performance engineering.
0
3
12
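A minimal illustration of the two ideas in the tweet above, written as plain C++ rather than the CUTLASS/WGMMA device code (the real kernel operates on tiles with Tensor Core instructions): products over each 128-wide slice of the K dimension are summed in a separate partial accumulator standing in for the FP8 WGMMA accumulator, then dequantized with the per-slice activation and weight scales and added into an FP32 accumulator, which is the promotion step performed on CUDA Cores. The constant K_INTERVAL and the names a_scale and b_scale are illustrative assumptions, not identifiers from DeepSeek or CUTLASS.

```cpp
// Sketch only: scalar dot product with periodic promotion and per-128 scales.
// Floats stand in for FP8 values; the WGMMA tile math is not modeled.
#include <cstdio>
#include <vector>

int main() {
    const int K = 1024;          // reduction (K) dimension
    const int K_INTERVAL = 128;  // promotion interval; matches the 1x128 / 128x128 scale granularity

    // Quantized operands plus one dequantization scale per 128-wide K slice.
    std::vector<float> a(K), b(K);
    std::vector<float> a_scale(K / K_INTERVAL, 0.5f);   // stand-in for 1x128 groupwise activation scales
    std::vector<float> b_scale(K / K_INTERVAL, 0.25f);  // stand-in for 128x128 blockwise weight scales
    for (int k = 0; k < K; ++k) { a[k] = float(k % 7); b[k] = float(k % 5); }

    float acc_fp32 = 0.0f;  // full-precision accumulator kept in FP32 ("CUDA Cores")
    for (int k0 = 0; k0 < K; k0 += K_INTERVAL) {
        float partial = 0.0f;  // stands in for the Tensor Core (WGMMA) accumulator
        for (int k = k0; k < k0 + K_INTERVAL; ++k) {
            partial += a[k] * b[k];
        }
        // Periodic promotion: dequantize the interval's partial sum and fold it
        // into the FP32 accumulator before starting the next interval.
        const int s = k0 / K_INTERVAL;
        acc_fp32 += partial * a_scale[s] * b_scale[s];
    }
    printf("dot product = %f\n", acc_fp32);
    return 0;
}
```

Matching the scale granularity to the promotion interval is what keeps the scheme cheap: the dequantization factor is constant within each interval, so a single multiply per interval suffices.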
Colfax now offers NVIDIA Blackwell-based servers
8U/10U servers • NVIDIA HGX™ B200 8-GPU baseboard • 2x AMD EPYC™ 9004/9005 OR 2x 4th/5th Gen Intel® Xeon® Scalable OR 2x Intel® Xeon® 6900 series
Learn more
0
2
4
CUTLASS Tutorial: Persistent Kernels and Stream-K
Final part of our three-part series on writing optimized GEMM kernels for NVIDIA GPUs using CUTLASS library abstractions.
0
1
21
In this blog post, Jay Shah, Research Scientist at Colfax International, collaborated with @character_ai to explain two techniques (INT8 Quantization and Query Head Packing for MQA/GQA) that are important for using FlashAttention-3 for inference.
0
3
15
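The blog post above is linked rather than reproduced, so the sketch below only illustrates the general shape of query head packing for MQA/GQA: when g query heads share one KV head, they can be folded into the query sequence dimension so that attention for each KV head runs over g times as many query rows. This is a hypothetical standalone C++ sketch, not FlashAttention-3 code; the shapes B, S, H_q, H_kv, D and the function name pack_query_heads are made up for illustration.

```cpp
// Hypothetical sketch of query head packing for GQA: queries of shape
// [B, S, H_q, D] with H_q = H_kv * g are rearranged to [B, S * g, H_kv, D],
// so the g query heads that attend to the same KV head become g consecutive
// "sequence" positions for that KV head. Not FlashAttention-3 code.
#include <cstddef>
#include <cstdio>
#include <vector>

std::vector<float> pack_query_heads(const std::vector<float>& q,
                                    int B, int S, int H_q, int H_kv, int D) {
    const int g = H_q / H_kv;  // query heads per KV head
    std::vector<float> packed(q.size());
    for (int b = 0; b < B; ++b)
        for (int s = 0; s < S; ++s)
            for (int h = 0; h < H_q; ++h) {
                const int h_kv = h / g;  // KV head this query head maps to
                const int i    = h % g;  // position within the head group
                for (int d = 0; d < D; ++d) {
                    // source index in [B, S, H_q, D], row-major
                    const size_t src = ((size_t(b) * S + s) * H_q + h) * D + d;
                    // destination index in [B, S * g, H_kv, D], row-major,
                    // with the group index folded into the sequence dimension
                    const size_t dst =
                        ((size_t(b) * (S * g) + (size_t(s) * g + i)) * H_kv + h_kv) * D + d;
                    packed[dst] = q[src];
                }
            }
    return packed;
}

int main() {
    const int B = 1, S = 2, H_q = 4, H_kv = 2, D = 3;
    std::vector<float> q(B * S * H_q * D);
    for (size_t i = 0; i < q.size(); ++i) q[i] = float(i);
    const auto packed = pack_query_heads(q, B, S, H_q, H_kv, D);
    printf("packed %zu elements\n", packed.size());
    return 0;
}
```

The intended benefit is better Tensor Core utilization when query lengths are short at inference time, since each KV head now sees more query rows per matrix multiply.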
In this @GPU_MODE lecture, Jay Shah, Research Scientist at Colfax International, presents his joint work on FlashAttention-3 and how to implement the main compute loop in the algorithm using CUTLASS.
0
3
18
RT @hyhieu226: A chess game typically has 3 phases: opening, middle game, and endgame. A GEMM (matmul) kernel typ…
0
28
0
RT @hyhieu226: Checkout our newest CUDA tutorial. The topic is software pipelining: overlap mem copying with compu…
0
115
0
RT @hyhieu226: @cosminnegruseri @tri_dao I am still learning it. Just like nobody knows all C++, nobody knows CUDA. For resources, I highl…
0
2
0
Introducing Colfax Access Hub
Securely validate and apply custom configurations to your systems, install specialized OS and SW, and much more – all before they are shipped to you. The service is free for all Colfax customers.
0
1
4
RT @hyhieu226: New tutorial on WGMMA (WarpGroup Matrix Multiplication and Accumulation) If you have run PyTorc…
0
52
0
Free NVIDIA® H100 NVL Test Drive
Supercharge #LLM Inference
#Colfax is offering a FREE test drive that provides you remote access to a Colfax server with 4 #NVIDIA #H100 NVL Tensor Core #GPUs.
Learn more
#AI
0
1
1
RT @togethercompute: We are thrilled to release FlashAttention-3 in partnership with @Meta , @nvidia, @Princeton, and @colfaxintl. The im…
0
31
0
RT @tri_dao: This project is a collab with Jay Shah & Ganesh Bikshandi (@colfaxintl), @ipiszy (@meta), @DROP_ALL_TABLES and @_prrama (@nvid…
0
3
0
RT @DROP_ALL_TABLES: FlashAttention-3 is released! Over the last few months, I got the opportunity to collaborate on this amazing effort to…
0
53
0
RT @tri_dao: FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage…
0
341
0
RT @PyTorch: Introducing FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Thank you to @colfaxintl, @AIat…
0
94
0
RT @hyhieu226: A tutorial to help your kernels run faster on the H100s. The H100 SXM GPU has the memory bandwidth…
0
65
0
24-Bay 2U JBOD with KIOXIA PM7 Series 24G SAS SSD Drives
A Big Leap Forward for Enterprise Storage
Access a whole new level of SSD performance. Featuring the latest KIOXIA PM7 Series 24G SAS SSDs, Colfax CX22424c-JBOD offers enterprises the performance they need to keep up with today's rapidly evolving data demands.
Learn more
KIOXIA PM7 Series (2.5-inch, 15 mm thickness)
> Enterprise SAS Mixed Use SSD (PM7-V Series) / Enterprise SAS Read Intensive SSD (PM7-R Series)
> 24G SAS interface with single/dual-port support
> 3 DWPD (PM7-V) / 1 DWPD (PM7-R) with 100% Random Write Workload
> Up to 720K random read IOPS (4 KiB) in dual-port mode
> Power Loss Protection and End-to-End Data Protection, including T10 DIF
> Capacities from 1.6 TB to 12.8 TB (PM7-V) / 1.92 TB to 30.72 TB (PM7-R)
#colfax #kioxia #storage #jbod #ssd #sas4 #24gsas
0
1
2