![Vijay Profile](https://pbs.twimg.com/profile_images/1232504842951757826/v8ES4JQ1_x96.jpg)
Vijay
@__tensorcore__
Followers: 1K
Following: 8K
Statuses: 1K
MLIR, CUTLASS, Tensor Core arch @NVIDIA. Mechanic @hpcgarage. Exercise of any 1st amendment rights is for none other than myself.
Joined July 2015
@salykova_ No, this isn’t the reason why. Debugging CUDA is hard for all the same reasons that debugging a massive distributed async MPI program is hard. It’s not related to templates
1
0
5
RT @asdf1234_0: CUTLASS is in the center of the CUDA Blackwell release blog. As always, we work hand in hand with CUDA team to deliver t…
0
25
0
RT @__tensorcore__: 🔥🚨 CUTLASS Blackwell is here 🚨🔥 3.8 release is loaded with support for new features of Blackwell, even an attention ke…
0
41
0
RT @cudagdb: What y'all been waitin' for has finally rolled in. Check out ex75 for all you folks who just can’t get enough of them MoEs!
0
1
0
@MalekiSaeed It should be very hackable. Code quality is really nice :) but I’ll let you be the judge. Really pushed to get this out day zero so folks like you can build on top of it rather than start from scratch. Contributions to add any and all new features are welcome
0
0
6
@MalekiSaeed We didn’t end up running it ourselves due to the time crunch, but you should just be able to build and run ex77. It’s zippy to say the least, but I’m sure there’s still some room for optimization.
1
0
3
@bfspector @aaryan04 @HazyResearch @togethercompute @nvidia PS, may wanna benchmark against example 77 now ;) I hear there might be a new fastest public FA in town
1
0
7
@cis_female @hyhieu226 CUTLASS is different things to different people. If you’re talking about only using CuTe and writing everything from scratch, restrictions on grid size don’t apply, but you’re limited to affine shapes (rough sketch of what that means below). But say you want to reuse the persistent scheduler, then the grid is restricted too
0
0
2
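A minimal CuTe sketch of the “affine shapes” point above: a layout is just a shape/stride pair, and logical coordinates map to offsets through that affine function. The 8×16 size and column-major strides here are arbitrary assumptions for illustration, not anything from the thread.

```cpp
// Host-side CuTe layout sketch (compile with nvcc and the CUTLASS include path).
#include <cute/tensor.hpp>
#include <cstdio>

int main() {
  using namespace cute;

  // An 8x16 column-major layout: shape (8,16), stride (1,8).
  // Offsets are an affine function of the coordinate, which is the
  // "limited to affine shapes" restriction mentioned above.
  auto layout = make_layout(make_shape(Int<8>{}, Int<16>{}),
                            make_stride(Int<1>{}, Int<8>{}));

  print(layout);   // prints (_8,_16):(_1,_8)
  printf("\n");

  // Logical coordinate (row=3, col=5) -> linear offset 3*1 + 5*8 = 43.
  printf("offset of (3,5) = %d\n", int(layout(make_coord(3, 5))));
  return 0;
}
```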
@cis_female @hyhieu226 And TK is nothing like CUTLASS beyond that apparent similarity. Templates are just a result of choosing C++ as the language; that doesn’t make the programming model or abstraction level similar. TK doesn’t have anywhere near as much control as CUTLASS, nor does it support the same set of features
0
0
3
@cis_female @hyhieu226 And to say that you can use all the normal CUDA affordances isn’t true either. E.g. persistent kernels on Hopper break the normal CUDA teaching of dynamically scaling grid size (sketch below). No kernel in CUTLASS lets you change its block size. Perf tuning strategy is not focused on occupancy etc
1
0
2
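To illustrate the persistent-kernel point in plain CUDA (names like process_tile are made up for the sketch, not CUTLASS APIs): the grid is sized once to the machine via the occupancy API, and each block loops over work tiles instead of the grid scaling with the problem.

```cpp
// Plain-CUDA persistent-kernel sketch; illustrative only.
#include <cuda_runtime.h>
#include <cstdio>

__device__ void process_tile(int tile_idx, float* data) {
  // Placeholder for real per-tile work.
  if (threadIdx.x == 0) data[tile_idx] = static_cast<float>(tile_idx);
}

__global__ void persistent_kernel(int num_tiles, float* data) {
  // Each resident block keeps grabbing tiles; the grid does not grow
  // with num_tiles the way a "one block per tile" launch would.
  for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
    process_tile(tile, data);
    __syncthreads();  // tile boundary within the block
  }
}

int main() {
  const int num_tiles = 4096;
  const int block_size = 128;
  float* data = nullptr;
  cudaMalloc(&data, num_tiles * sizeof(float));

  // Size the grid to the GPU, not the problem.
  int device = 0, num_sms = 0, blocks_per_sm = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, persistent_kernel,
                                                block_size, /*dynamicSMemBytes=*/0);

  persistent_kernel<<<num_sms * blocks_per_sm, block_size>>>(num_tiles, data);
  cudaDeviceSynchronize();
  cudaFree(data);
  return 0;
}
```

This is the simple grid-stride form of the idea; a persistent tile scheduler layers work-distribution logic on top, but the launch-shape consequence is the same: the grid is fixed to the hardware rather than scaled with the problem.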