![Colfax International Profile](https://pbs.twimg.com/profile_images/985055732587347969/z566iDGE_x96.jpg)
Colfax International
@colfaxintl
Followers: 957
Following: 272
Statuses: 650
HPC & AI Solutions (https://t.co/VqfbQgi7kA) | Research (https://t.co/32b1YZTuB2) | Colfax Experience Center (https://t.co/cAlyTPEGOl)
Santa Clara, CA
Joined February 2009
The DeepSeek technical reports contain a wealth of information on performance optimization techniques for NVIDIA GPUs. In this short blog, we explain two aspects of their FP8 mixed-precision training methodology that build on the techniques we've been teaching in our earlier series on GEMM optimization, and in particular on the Hopper WGMMA instruction for fast matrix multiplication:
1) Periodic promotion of lower-precision FP8 WGMMA accumulators computed via Tensor Cores to full FP32 precision using CUDA Cores.
2) 128x128 blockwise and 1x128 groupwise scales for FP8 quantization of weights and activations.
We hope this provides greater depth on the specific changes you need to make to standard FP8 GEMMs to make them useful in a practical setting, such as the training setup used for the DeepSeek-V3 model. Finally, both (1) and (2) are now implemented in CUTLASS; see example 67 and the PR linked in our post. As always, beyond just using the CUTLASS API, it's a good idea to examine the source code to understand the nuts and bolts of performance engineering.
0
3
12
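A minimal illustration of the two ideas in the tweet above, written as plain C++ rather than the CUTLASS/WGMMA device code (the real kernel operates on tiles with Tensor Core instructions): products over each 128-wide slice of the K dimension are summed in a separate partial accumulator standing in for the FP8 WGMMA accumulator, then dequantized with the per-slice activation and weight scales and added into an FP32 accumulator, which is the promotion step performed on CUDA Cores. The constant K_INTERVAL and the names a_scale and b_scale are illustrative assumptions, not identifiers from DeepSeek or CUTLASS.

```cpp
// Sketch only: scalar dot product with periodic promotion and per-128 scales.
// Floats stand in for FP8 values; the WGMMA tile math is not modeled.
#include <cstdio>
#include <vector>

int main() {
    const int K = 1024;          // reduction (K) dimension
    const int K_INTERVAL = 128;  // promotion interval; matches the 1x128 / 128x128 scale granularity

    // Quantized operands plus one dequantization scale per 128-wide K slice.
    std::vector<float> a(K), b(K);
    std::vector<float> a_scale(K / K_INTERVAL, 0.5f);   // stand-in for 1x128 groupwise activation scales
    std::vector<float> b_scale(K / K_INTERVAL, 0.25f);  // stand-in for 128x128 blockwise weight scales
    for (int k = 0; k < K; ++k) { a[k] = float(k % 7); b[k] = float(k % 5); }

    float acc_fp32 = 0.0f;  // full-precision accumulator kept in FP32 ("CUDA Cores")
    for (int k0 = 0; k0 < K; k0 += K_INTERVAL) {
        float partial = 0.0f;  // stands in for the Tensor Core (WGMMA) accumulator
        for (int k = k0; k < k0 + K_INTERVAL; ++k) {
            partial += a[k] * b[k];
        }
        // Periodic promotion: dequantize the interval's partial sum and fold it
        // into the FP32 accumulator before starting the next interval.
        const int s = k0 / K_INTERVAL;
        acc_fp32 += partial * a_scale[s] * b_scale[s];
    }
    printf("dot product = %f\n", acc_fp32);
    return 0;
}
```

Matching the scale granularity to the promotion interval is what keeps the scheme cheap: the dequantization factor is constant within each interval, so a single multiply per interval suffices.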
Colfax now offers NVIDIA Blackwell-based servers
8U/10U servers • NVIDIA HGX™ B200 8-GPU baseboard • 2x AMD EPYC™ 9004/9005 OR 2x 4th/5th Gen Intel® Xeon® Scalable OR 2x Intel® Xeon® 6900 series
Learn more
0
2
4
CUTLASS Tutorial: Persistent Kernels and Stream-K
Final part of our three-part series on writing optimized GEMM kernels for NVIDIA GPUs using CUTLASS library abstractions.
0
1
21
In this blog post, Jay Shah, Research Scientist at Colfax International, collaborated with @character_ai to explain two techniques (INT8 Quantization and Query Head Packing for MQA/GQA) that are important for using FlashAttention-3 for inference.
0
3
15
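The blog post above is linked rather than reproduced, so the sketch below only illustrates the general shape of query head packing for MQA/GQA: when g query heads share one KV head, they can be folded into the query sequence dimension so that attention for each KV head runs over g times as many query rows. This is a hypothetical standalone C++ sketch, not FlashAttention-3 code; the shapes B, S, H_q, H_kv, D and the function name pack_query_heads are made up for illustration.

```cpp
// Hypothetical sketch of query head packing for GQA: queries of shape
// [B, S, H_q, D] with H_q = H_kv * g are rearranged to [B, S * g, H_kv, D],
// so the g query heads that attend to the same KV head become g consecutive
// "sequence" positions for that KV head. Not FlashAttention-3 code.
#include <cstddef>
#include <cstdio>
#include <vector>

std::vector<float> pack_query_heads(const std::vector<float>& q,
                                    int B, int S, int H_q, int H_kv, int D) {
    const int g = H_q / H_kv;  // query heads per KV head
    std::vector<float> packed(q.size());
    for (int b = 0; b < B; ++b)
        for (int s = 0; s < S; ++s)
            for (int h = 0; h < H_q; ++h) {
                const int h_kv = h / g;  // KV head this query head maps to
                const int i    = h % g;  // position within the head group
                for (int d = 0; d < D; ++d) {
                    // source index in [B, S, H_q, D], row-major
                    const size_t src = ((size_t(b) * S + s) * H_q + h) * D + d;
                    // destination index in [B, S * g, H_kv, D], row-major,
                    // with the group index folded into the sequence dimension
                    const size_t dst =
                        ((size_t(b) * (S * g) + (size_t(s) * g + i)) * H_kv + h_kv) * D + d;
                    packed[dst] = q[src];
                }
            }
    return packed;
}

int main() {
    const int B = 1, S = 2, H_q = 4, H_kv = 2, D = 3;
    std::vector<float> q(B * S * H_q * D);
    for (size_t i = 0; i < q.size(); ++i) q[i] = float(i);
    const auto packed = pack_query_heads(q, B, S, H_q, H_kv, D);
    printf("packed %zu elements\n", packed.size());
    return 0;
}
```

The intended benefit is better Tensor Core utilization when query lengths are short at inference time, since each KV head now sees more query rows per matrix multiply.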
In this @GPU_MODE lecture, Jay Shah, Research Scientist at Colfax International, presents his joint work on FlashAttention-3 and how to implement the main compute loop in the algorithm using CUTLASS.
0
3
18
RT @hyhieu226: A chess game typically has 3 phases: opening, middle game, and endgame. A GEMM (matmul) kernel typ…
0
28
0
RT @hyhieu226: Checkout our newest CUDA tutorial. The topic is software pipelining: overlap mem copying with compu…
0
115
0
RT @hyhieu226: @cosminnegruseri @tri_dao I am still learning it. Just like nobody knows all C++, nobody knows CUDA. For resources, I highl…
0
2
0
Introducing Colfax Access Hub
Securely validate and apply custom configurations to your systems, install specialized OS and SW, and much more – all before they are shipped to you. The service is free for all Colfax customers.
0
1
4
RT @hyhieu226: New tutorial on WGMMA (WarpGroup Matrix Multiplication and Accumulation) If you have run PyTorc…
0
52
0
Free NVIDIA® H100 NVL Test Drive
Supercharge #LLM Inference
#Colfax is offering a FREE test drive that provides you remote access to a Colfax server with 4 #NVIDIA #H100 NVL Tensor Core #GPUs.
Learn more
#AI
0
1
1
RT @togethercompute: We are thrilled to release FlashAttention-3 in partnership with @Meta , @nvidia, @Princeton, and @colfaxintl. The im…
0
31
0
RT @tri_dao: This project is a collab with Jay Shah & Ganesh Bikshandi (@colfaxintl), @ipiszy (@meta), @DROP_ALL_TABLES and @_prrama (@nvid…
0
3
0
RT @DROP_ALL_TABLES: FlashAttention-3 is released! Over the last few months, I got the opportunity to collaborate on this amazing effort to…
0
53
0
RT @tri_dao: FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage…
0
341
0
RT @PyTorch: Introducing FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Thank you to @colfaxintl, @AIat…
0
94
0
RT @hyhieu226: A tutorial to help your kernels run faster on the H100s. The H100 SXM GPU has the memory bandwidth…
0
65
0
24-Bay 2U JBOD with KIOXIA PM7 Series 24G SAS SSD Drives
A Big Leap Forward for Enterprise Storage
Access a whole new level of SSD performance. Featuring the latest KIOXIA PM7 Series 24G SAS SSDs, Colfax CX22424c-JBOD offers enterprises the performance they need to keep up with today's rapidly evolving data demands.
Learn more
KIOXIA PM7 Series (2.5-inch, 15 mm thickness)
> Enterprise SAS Mixed Use SSD (PM7-V Series) / Enterprise SAS Read Intensive SSD (PM7-R Series)
> 24G SAS interface with single/dual-port support
> 3 DWPD (PM7-V) / 1 DWPD (PM7-R) with 100% Random Write Workload
> Up to 720K random read IOPS (4 KiB) in dual-port mode
> Power Loss Protection and End-to-End Data Protection, including T10 DIF
> Capacities from 1.6 TB to 12.8 TB (PM7-V) / 1.92 TB to 30.72 TB (PM7-R)
#colfax #kioxia #storage #jbod #ssd #sas4 #24gsas
0
1
2