Vijay

@__tensorcore__

Followers: 1K · Following: 8K · Statuses: 1K

MLIR, CUTLASS, Tensor Core arch @NVIDIA. Mechanic @hpcgarage. Exercise of any 1st amendment rights is for none other than myself.

Joined July 2015
@__tensorcore__
Vijay
8 days
@salykova_ No, this isn’t the reason why. Debugging CUDA is hard for all the same reasons debugging a massive distributed async MPI program is hard. It’s not related to templates
1
0
5
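A minimal sketch (illustrative, not from the thread) of the asynchrony behind this point, assuming a plain nvcc build: the faulting kernel launch itself reports success, and the error only surfaces at a later sync point, much like a bug in a distributed async program surfacing far from its cause. The kernel name and offset are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bad_write(int* data) {
    // Bug: writes far outside the allocation; the fault is raised asynchronously.
    data[threadIdx.x + (1 << 28)] = threadIdx.x;
}

int main() {
    int* d = nullptr;
    cudaMalloc(&d, 256 * sizeof(int));

    // The faulting launch returns immediately and typically reports "no error".
    bad_write<<<1, 256>>>(d);
    printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    // The illegal access only surfaces at a later synchronization point,
    // possibly far from the code that caused it.
    printf("after sync:   %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaFree(d);
    return 0;
}
```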
@__tensorcore__
Vijay
11 days
💚
@hyhieu226
Hieu Pham
12 days
Among many great things about the Blackwell chip, having CUTLASS as its central programming model is my favorite. That framework is so good it makes programming low-level kernels enjoyable.
0
0
14
@__tensorcore__
Vijay
12 days
RT @asdf1234_0: CUTLASS is in the center of the CUDA Blackwell release blog. As always, we work hand in hand with CUDA team to deliver t…
0
25
0
@__tensorcore__
Vijay
14 days
How to separate the wheat from the chaff 101 🤦🏽‍♂️
@shrihacker
shrihacker
15 days
Deepseek engineers so cracked they bypassed cuda
Tweet media one
Tweet media two
4
0
28
@__tensorcore__
Vijay
14 days
RT @typedfemale: don't tell on yourself by being impressed with inline PTX
0
4
0
@__tensorcore__
Vijay
19 days
@PonekudOnekom Link is in the first tweet
0
0
0
@__tensorcore__
Vijay
19 days
@PonekudOnekom So many. See the release log.
1
0
1
@__tensorcore__
Vijay
19 days
RT @__tensorcore__: 🔥🚨 CUTLASS Blackwell is here 🚨🔥 3.8 release is loaded with support for new features of Blackwell, even an attention ke…
0
41
0
@__tensorcore__
Vijay
19 days
RT @cudagdb: What y'all been waitin' for has finally rolled in. Check out ex75 for all you folks who just can’t get enough of them MoEs!
0
1
0
@__tensorcore__
Vijay
19 days
🔥🚨 CUTLASS Blackwell is here 🚨🔥 3.8 release is loaded with support for new features of Blackwell, even an attention kernel 👀 Go check it out here: Can't wait to see what y'all end up cooking with this over the next few months and years 💚
Tweet media one
0
0
3
@__tensorcore__
Vijay
19 days
@MalekiSaeed It should be very hackable. Code quality is really nice :) but I’ll let you be the judge. Really pushed to get this out day zero so folks like you can build on top of it rather than start from scratch. Contributions to add any and all new features are welcome
0
0
6
@__tensorcore__
Vijay
19 days
@MalekiSaeed We didn’t end up having the time to run it ourselves due to the time crunch, but you should just be able to build and run ex77. It’s zippy to say the least, but I’m sure there’s some more room for optimization.
1
0
3
@__tensorcore__
Vijay
19 days
@bfspector @aaryan04 @HazyResearch @togethercompute @nvidia PS, may wanna benchmark against example 77 now ;) I hear there might be a new fastest public FA in town
1
0
7
@__tensorcore__
Vijay
20 days
RT @_rozzai: An amazing read.
Tweet media one
0
1
0
@__tensorcore__
Vijay
20 days
@cis_female @hyhieu226 CUTLASS is different things to different people. If you’re talking about only using CuTe and writing everything from scratch, restrictions on grid size don’t apply but you’re limited to affine shapes. But say you want to reuse the persistent scheduler, then grid is restricted too
0
0
2
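A minimal host-side sketch of what "affine shapes" means in CuTe terms: a Layout is a shape plus strides, i.e. an affine map from logical coordinates to linear offsets. This is illustrative and not from the thread; it assumes a CUTLASS checkout on the include path and a build with nvcc -std=c++17, and the 8x4 tile values are arbitrary.

```cuda
#include <cstdio>
#include <cute/layout.hpp>

int main() {
    using namespace cute;

    // An 8x4 column-major tile: offset(i, j) = i*1 + j*8, affine in (i, j).
    auto layout = make_layout(make_shape(Int<8>{}, Int<4>{}),
                              make_stride(Int<1>{}, Int<8>{}));

    // Evaluate the affine map at a coordinate: 3*1 + 2*8 = 19.
    printf("offset of (3, 2) = %d\n", int(layout(make_coord(3, 2))));

    // Total number of elements covered by the layout: 8 * 4 = 32.
    printf("size = %d\n", int(size(layout)));
    return 0;
}
```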
@__tensorcore__
Vijay
20 days
@cis_female @hyhieu226 And TK is nothing like CUTLASS beyond that apparent similarity. Templates are just a result of choosing C++ as the language. It doesn’t make the programming model abstraction level similar. TK doesn’t have anywhere near as much control as CUTLASS, nor does it support the features
0
0
3
@__tensorcore__
Vijay
20 days
@cis_female @hyhieu226 And to say that you can use all the normal CUDA affordances isn’t true either. E.g., persistent kernels on Hopper break the normal CUDA teaching of dynamically scaling grid size. No kernel in CUTLASS lets you change its block size, and the perf tuning strategy is not focused on occupancy, etc.
1
0
2
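A generic persistent-kernel sketch of the point above (not the actual CUTLASS persistent tile scheduler): the grid is sized to the SM count rather than the problem size, and each resident block loops over work tiles, which is why the usual "scale gridDim with the problem" guidance doesn't apply. Kernel and parameter names are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void persistent_kernel(float* out, int num_tiles, int tile_elems) {
    // Each block repeatedly grabs tiles in a grid-strided fashion until all
    // tiles are consumed, instead of mapping one block per tile.
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        int base = tile * tile_elems;
        for (int i = threadIdx.x; i < tile_elems; i += blockDim.x) {
            out[base + i] = float(tile);  // stand-in for real per-tile work
        }
    }
}

int main() {
    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);

    const int num_tiles = 4096, tile_elems = 1024;
    float* out = nullptr;
    cudaMalloc(&out, size_t(num_tiles) * tile_elems * sizeof(float));

    // The grid is fixed by the hardware (one block per SM here), independent
    // of num_tiles; the loop inside the kernel absorbs the problem size.
    persistent_kernel<<<num_sms, 256>>>(out, num_tiles, tile_elems);
    printf("sync: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaFree(out);
    return 0;
}
```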