Colfax International Profile
Colfax International

@colfaxintl

Followers: 957 | Following: 272 | Statuses: 650

HPC & AI Solutions (https://t.co/VqfbQgi7kA) | Research (https://t.co/32b1YZTuB2) | Colfax Experience Center (https://t.co/cAlyTPEGOl)

Santa Clara, CA
Joined February 2009
@colfaxintl
Colfax International
12 days
The DeepSeek technical reports contain a wealth of information on performance optimization techniques for NVIDIA GPUs. In this short blog, we explain two aspects of their FP8 mixed-precision training methodology that build on the techniques we've been teaching in our earlier series on GEMM optimization, and in particular on the Hopper WGMMA instruction for fast matrix multiplication:
1) Periodic promotion of lower-precision FP8 WGMMA accumulators computed via Tensor Cores to full FP32 precision using CUDA Cores.
2) 128x128 blockwise and 1x128 groupwise scales for FP8 quantization of weights and activations.
We hope this provides some greater depth on the specific changes you need to make to standard FP8 GEMMs in order to make them useful in a practical setting, such as the training setup used for the DeepSeek-V3 model. Finally, both (1) and (2) are now implemented in CUTLASS; see example 67 and the PR linked in our post. As always, beyond just using the CUTLASS API, it's a good idea to examine the source code to understand the nuts and bolts of performance engineering.
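To make the two techniques concrete, here is a minimal scalar sketch of the accumulation pattern. It is an illustration only: the real kernel computes each 128-wide K-group partial with WGMMA on Tensor Cores rather than a scalar loop, and the scale-array names and layouts below are our own assumptions.

```cpp
// Illustrative scalar sketch (not the real WGMMA kernel): each thread owns one
// output element, accumulates each 128-wide K group in a short-lived partial
// (standing in for the FP8 Tensor Core accumulator), then promotes it into an
// FP32 accumulator with the dequantization scales applied. Assumes K % 128 == 0.
#include <cuda_fp8.h>

__global__ void fp8_gemm_promote_sketch(const __nv_fp8_e4m3* A,  // [M][K] row-major
                                        const __nv_fp8_e4m3* B,  // [K][N] row-major
                                        const float* scaleA,     // [M][K/128] 1x128 groupwise
                                        const float* scaleB,     // [K/128][N/128] 128x128 blockwise
                                        float* C, int M, int N, int K) {
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M || n >= N) return;

    float acc = 0.0f;                       // full-precision accumulator (CUDA cores)
    for (int kg = 0; kg < K / 128; ++kg) {
        float partial = 0.0f;               // stands in for the WGMMA accumulator
        for (int k = 0; k < 128; ++k) {
            int kk = kg * 128 + k;
            partial += float(A[m * K + kk]) * float(B[kk * N + n]);
        }
        // Periodic promotion: dequantize this K group's partial sum and fold
        // it into the FP32 accumulator before precision can degrade further.
        float s = scaleA[m * (K / 128) + kg] * scaleB[kg * (N / 128) + n / 128];
        acc += s * partial;
    }
    C[m * N + n] = acc;
}
```

The key point is that the narrow-precision partial is short-lived: it only ever spans one 128-element K group before being scaled and folded into the FP32 accumulator with ordinary CUDA-core adds.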
@colfaxintl
Colfax International
21 days
๐—–๐—ผ๐—น๐—ณ๐—ฎ๐˜… ๐—ป๐—ผ๐˜„ ๐—ผ๐—ณ๐—ณ๐—ฒ๐—ฟ๐˜€ ๐—ก๐—ฉ๐—œ๐——๐—œ๐—” ๐—•๐—น๐—ฎ๐—ฐ๐—ธ๐˜„๐—ฒ๐—น๐—น-๐—ฏ๐—ฎ๐˜€๐—ฒ๐—ฑ ๐˜€๐—ฒ๐—ฟ๐˜ƒ๐—ฒ๐—ฟ๐˜€ ๐Ÿ‘‰ 8U/10U servers โ€ข NVIDIA HGXโ„ข B200 8-GPU baseboard โ€ข 2x AMD EPYCโ„ข 9004/9005 OR 2x 4th/5th Gen Intelยฎ Xeonยฎ Scalable OR 2x Intelยฎ Xeonยฎ 6900 series Learn more
@colfaxintl
Colfax International
2 months
๐—–๐—จ๐—ง๐—Ÿ๐—”๐—ฆ๐—ฆ ๐—ง๐˜‚๐˜๐—ผ๐—ฟ๐—ถ๐—ฎ๐—น: ๐—ฃ๐—ฒ๐—ฟ๐˜€๐—ถ๐˜€๐˜๐—ฒ๐—ป๐˜ ๐—ž๐—ฒ๐—ฟ๐—ป๐—ฒ๐—น๐˜€ ๐—ฎ๐—ป๐—ฑ ๐—ฆ๐˜๐—ฟ๐—ฒ๐—ฎ๐—บ-๐—ž Final part of our three part series on writing optimized GEMM kernels for NVIDIA GPUs using CUTLASS library abstractions.
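The core idea of a persistent kernel can be sketched in a few lines. This is a minimal skeleton under our own simplifying assumptions; CUTLASS's tile schedulers and the Stream-K decomposition covered in the tutorial are considerably more involved.

```cpp
// Skeleton of the persistent-kernel pattern: a fixed grid of CTAs stays
// resident on the GPU and pulls output tiles from a global work counter
// until none remain. Zero `next_tile` (e.g. via cudaMemcpyToSymbol) before
// each launch.
#include <cuda_runtime.h>

__device__ unsigned int next_tile;

__global__ void persistent_gemm_skeleton(int num_tiles /*, GEMM operands ... */) {
    __shared__ unsigned int tile;
    while (true) {
        if (threadIdx.x == 0) tile = atomicAdd(&next_tile, 1u);
        __syncthreads();                        // broadcast the claimed tile index
        if (tile >= (unsigned)num_tiles) break; // no work left: the CTA retires

        // ... load, compute, and store the claimed output tile here ...

        __syncthreads();  // all threads done with `tile` before it is overwritten
    }
}

// Typical launch: one CTA per SM (or per resident slot), independent of problem size:
//   persistent_gemm_skeleton<<<numSMs, 256>>>(num_tiles);
```

Stream-K then goes one step further: instead of assigning whole output tiles, the tail of the work is split along the K dimension across CTAs, with partial accumulators combined in a fix-up pass so no SM idles while others finish.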
@colfaxintl
Colfax International
2 months
In this blog post, Jay Shah, Research Scientist at Colfax International, collaborated with @character_ai to explain two techniques, INT8 quantization and query head packing for MQA/GQA, that are important for using FlashAttention-3 for inference.
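As a rough illustration of the first technique, symmetric INT8 quantization boils down to a scale-round-clamp step. The sketch below is our own minimal example, not the specific scheme from the post; `scale` is assumed precomputed as absmax/127 over whatever granularity you quantize at (e.g. per head).

```cpp
// Minimal symmetric INT8 quantization kernel: map floats into [-127, 127]
// by a precomputed scale, round to nearest, clamp, and narrow to int8.
#include <cstdint>

__global__ void quantize_int8(const float* x, int8_t* q, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i] / scale;                        // scale into the int8 range
    v = fminf(fmaxf(v, -127.0f), 127.0f);          // clamp to avoid overflow
    q[i] = static_cast<int8_t>(__float2int_rn(v)); // round to nearest
}
```

At attention time, the INT8 matmul results are rescaled back to floating point by the product of the participating scales.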
@colfaxintl
Colfax International
3 months
In this @GPU_MODE lecture, Jay Shah, Research Scientist at Colfax International, presents his joint work on FlashAttention-3 and how to implement the main compute loop in the algorithm using CUTLASS.
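For context, the loop structure the lecture builds up to follows the online-softmax recurrence. The reference sketch below is scalar host code under our own simplifying assumptions (one query row, row-major K/V, 1/sqrt(d) scaling omitted), not the CUTLASS/Hopper implementation from the talk.

```cpp
// Scalar reference for attention with online softmax: stream over keys/values,
// maintain a running max m and normalizer l, and rescale old partial outputs
// whenever the max grows. The blocked GPU loop follows the same structure.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> attention_row(const float* q, const float* K, const float* V,
                                 int seq_len, int d) {
    std::vector<float> o(d, 0.0f);
    float m = -INFINITY, l = 0.0f;
    for (int j = 0; j < seq_len; ++j) {
        float s = 0.0f;                          // s = q . K_j
        for (int k = 0; k < d; ++k) s += q[k] * K[j * d + k];
        float m_new = std::max(m, s);
        float correction = std::exp(m - m_new);  // rescales previous partials
        float p = std::exp(s - m_new);
        l = l * correction + p;
        for (int k = 0; k < d; ++k) o[k] = o[k] * correction + p * V[j * d + k];
        m = m_new;
    }
    for (int k = 0; k < d; ++k) o[k] /= l;       // final softmax normalization
    return o;
}
```

On the GPU the same recurrence runs over K/V tiles rather than single keys, with the two matmuls feeding Tensor Cores; the rescale-and-accumulate steps are the softmax bookkeeping the lecture walks through.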
@colfaxintl
Colfax International
3 months
RT @hyhieu226: A chess game typically has 3 phases: opening, middle game, and endgame. A GEMM (matmul) kernel typ…
@colfaxintl
Colfax International
5 months
RT @hyhieu226: Check out our newest CUDA tutorial. The topic is software pipelining: overlap mem copying with compu…
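The idea in that tutorial, in its simplest form, is double buffering: prefetch tile i+1 while computing on tile i. Below is a toy sketch under our own assumptions (single block, blockDim.x == TILE, plain loads rather than Hopper's asynchronous copy machinery).

```cpp
// Toy double-buffered pipeline: one shared-memory buffer is consumed while
// the next tile is fetched into the other, letting the memory traffic for
// tile i+1 overlap the compute on tile i.
#define TILE 256

__global__ void pipelined_sum(const float* in, float* out, int num_tiles) {
    __shared__ float buf[2][TILE];
    int t = threadIdx.x;

    buf[0][t] = in[t];              // prologue: fetch tile 0
    __syncthreads();

    float acc = 0.0f;
    for (int i = 0; i < num_tiles; ++i) {
        int cur = i & 1;
        if (i + 1 < num_tiles)      // prefetch the next tile into the idle buffer
            buf[cur ^ 1][t] = in[(i + 1) * TILE + t];
        acc += buf[cur][t];         // compute on the current buffer
        __syncthreads();            // safe to reuse the buffers next iteration
    }
    out[t] = acc;
}
```

On Hopper the same pattern is typically expressed with asynchronous copies (e.g. TMA) and barriers so the prefetch truly runs in the background, which is what the tutorial covers.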
@colfaxintl
Colfax International
6 months
We have a few tutorials posted and a few more lined up. More here
@asdf1234_0
Haicheng Wu
6 months
CUTLASS reached 5K stars this summer with 3.5M downloads per month. Thank you for your support!
@colfaxintl
Colfax International
6 months
RT @hyhieu226: @cosminnegruseri @tri_dao I am still learning it. Just like nobody knows all C++, nobody knows CUDA. For resources, I highl…
@colfaxintl
Colfax International
6 months
๐—œ๐—ป๐˜๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐—ถ๐—ป๐—ด ๐—–๐—ผ๐—น๐—ณ๐—ฎ๐˜… ๐—”๐—ฐ๐—ฐ๐—ฒ๐˜€๐˜€ ๐—›๐˜‚๐—ฏ: Securely validate and apply custom configurations to your systems, install specialized OS and SW, and much more โ€” all before they are shipped to you The service is free for all Colfax customers.
@colfaxintl
Colfax International
6 months
RT @hyhieu226: 📚🧑‍🎓 New tutorial on WGMMA (WarpGroup Matrix Multiplication and Accumulation) If you have run PyTorc…
@colfaxintl
Colfax International
7 months
๐—™๐—ฟ๐—ฒ๐—ฒ ๐—ก๐—ฉ๐—œ๐——๐—œ๐—”ยฎ ๐—›๐Ÿญ๐Ÿฌ๐Ÿฌ ๐—ก๐—ฉ๐—Ÿ ๐—ง๐—ฒ๐˜€๐˜ ๐——๐—ฟ๐—ถ๐˜ƒ๐—ฒ Supercharge #LLM Inference #Colfax is offering a FREE test drive that provides you remote access to a Colfax server with 4 #NVIDIA #H100 NVL Tensor Core #GPUs. ๐Ÿ‘‰๐˜“๐˜ฆ๐˜ข๐˜ณ๐˜ฏ ๐˜ฎ๐˜ฐ๐˜ณ๐˜ฆ #AI
@colfaxintl
Colfax International
7 months
RT @togethercompute: We are thrilled to release FlashAttention-3 in partnership with @Meta, @nvidia, @Princeton, and @colfaxintl. The im…
@colfaxintl
Colfax International
7 months
RT @tri_dao: This project is a collab with Jay Shah & Ganesh Bikshandi (@colfaxintl), @ipiszy (@meta), @DROP_ALL_TABLES and @_prrama (@nvid…
@colfaxintl
Colfax International
7 months
RT @DROP_ALL_TABLES: FlashAttention-3 is released! Over the last few months, I got the opportunity to collaborate on this amazing effort to…
@colfaxintl
Colfax International
7 months
RT @tri_dao: FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage…
@colfaxintl
Colfax International
7 months
RT @PyTorch: Introducing FlashAttention-3 🚀 Fast and Accurate Attention with Asynchrony and Low-precision. Thank you to @colfaxintl, @AIat…
@colfaxintl
Colfax International
8 months
RT @hyhieu226: A tutorial to help your kernels run faster on the H100s. The H100 SXM GPU has the memory bandwidth…
@colfaxintl
Colfax International
9 months
24-Bay 2U JBOD with KIOXIA PM7 Series 24G SAS SSD Drives
A Big Leap Forward for Enterprise Storage
Access a whole new level of SSD performance. Featuring the latest KIOXIA PM7 Series 24G SAS SSDs, the Colfax CX22424c-JBOD offers enterprises the performance they need to keep up with today's rapidly evolving data demands. 👉 Learn more 👇
KIOXIA PM7 Series (2.5-inch, 15 mm thickness)
> Enterprise SAS Mixed Use SSD (PM7-V Series) / Enterprise SAS Read Intensive SSD (PM7-R Series)
> 24G SAS interface with single/dual-port support
> 3 DWPD (PM7-V) / 1 DWPD (PM7-R) with 100% random write workload
> Up to 720K random read IOPS (4 KiB) in dual-port mode
> Power Loss Protection and End-to-End Data Protection, including T10 DIF
> Capacities from 1.6 TB to 12.8 TB (PM7-V) / 1.92 TB to 30.72 TB (PM7-R)
#colfax #kioxia #storage #jbod #ssd #sas4 #24gsas