![Vijay Profile](https://pbs.twimg.com/profile_images/1232504842951757826/v8ES4JQ1_x96.jpg)
Vijay
@__tensorcore__
Followers: 1K
Following: 8K
Statuses: 1K
MLIR, CUTLASS, Tensor Core arch @NVIDIA. Mechanic @hpcgarage. Exercise of any 1st amendment rights is for none other than myself.
Joined July 2015
@salykova_ No, this isn’t the reason why. Debugging CUDA is hard for all the same reasons that debugging a massive distributed async MPI program is hard. It’s not related to templates
1
0
5
RT @asdf1234_0: CUTLASS is in the center of the CUDA Blackwell release blog. As always, we work hand in hand with CUDA team to deliver t…
0
25
0
RT @__tensorcore__: 🔥🚨 CUTLASS Blackwell is here 🚨🔥 3.8 release is loaded with support for new features of Blackwell, even an attention ke…
0
41
0
RT @cudagdb: What y'all been waitin' for has finally rolled in. Check out ex75 for all you folks who just can’t get enough of them MoEs!
0
1
0
@MalekiSaeed It should be very hackable. Code quality is really nice :) but I’ll let you be the judge. Really pushed to get this out day zero so folks like you can build on top of it rather than start from scratch. Contributions to add any and all new features are welcome
0
0
6
@MalekiSaeed We didn’t end up running it ourselves due to the time crunch, but you should just be able to build and run ex77. It’s zippy to say the least, but I’m sure there’s still some room for optimization.
1
0
3
@bfspector @aaryan04 @HazyResearch @togethercompute @nvidia PS, may wanna benchmark against example 77 now ;) I hear there might be a new fastest public FA in town
1
0
7
@cis_female @hyhieu226 CUTLASS is different things to different people. If you’re talking about only using CuTe and writing everything from scratch, restrictions on grid size don’t apply, but you’re limited to affine shapes (rough sketch of what that means below). But say you want to reuse the persistent scheduler, then the grid is restricted too
0
0
2
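A minimal CuTe sketch of the “affine shapes” point above: a layout is just a shape/stride pair, and logical coordinates map to offsets through that affine function. The 8×16 size and column-major strides here are arbitrary assumptions for illustration, not anything from the thread.

```cpp
// Host-side CuTe layout sketch (compile with nvcc and the CUTLASS include path).
#include <cute/tensor.hpp>
#include <cstdio>

int main() {
  using namespace cute;

  // An 8x16 column-major layout: shape (8,16), stride (1,8).
  // Offsets are an affine function of the coordinate, which is the
  // "limited to affine shapes" restriction mentioned above.
  auto layout = make_layout(make_shape(Int<8>{}, Int<16>{}),
                            make_stride(Int<1>{}, Int<8>{}));

  print(layout);   // prints (_8,_16):(_1,_8)
  printf("\n");

  // Logical coordinate (row=3, col=5) -> linear offset 3*1 + 5*8 = 43.
  printf("offset of (3,5) = %d\n", int(layout(make_coord(3, 5))));
  return 0;
}
```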
@cis_female @hyhieu226 And TK is nothing like CUTLASS beyond that apparent similarity. Templates are just a result of choosing C++ as the language; that doesn’t make the programming model or abstraction level similar. TK doesn’t have anywhere near as much control as CUTLASS, nor does it support the same set of features
0
0
3
@cis_female @hyhieu226 And to say that you can use all the normal CUDA affordances isn’t true either. E.g. persistent kernels on Hopper break the normal CUDA teaching of dynamically scaling grid size (sketch below). No kernel in CUTLASS lets you change its block size. Perf tuning strategy is not focused on occupancy etc
1
0
2
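To illustrate the persistent-kernel point in plain CUDA (names like process_tile are made up for the sketch, not CUTLASS APIs): the grid is sized once to the machine via the occupancy API, and each block loops over work tiles instead of the grid scaling with the problem.

```cpp
// Plain-CUDA persistent-kernel sketch; illustrative only.
#include <cuda_runtime.h>
#include <cstdio>

__device__ void process_tile(int tile_idx, float* data) {
  // Placeholder for real per-tile work.
  if (threadIdx.x == 0) data[tile_idx] = static_cast<float>(tile_idx);
}

__global__ void persistent_kernel(int num_tiles, float* data) {
  // Each resident block keeps grabbing tiles; the grid does not grow
  // with num_tiles the way a "one block per tile" launch would.
  for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
    process_tile(tile, data);
    __syncthreads();  // tile boundary within the block
  }
}

int main() {
  const int num_tiles = 4096;
  const int block_size = 128;
  float* data = nullptr;
  cudaMalloc(&data, num_tiles * sizeof(float));

  // Size the grid to the GPU, not the problem.
  int device = 0, num_sms = 0, blocks_per_sm = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, persistent_kernel,
                                                block_size, /*dynamicSMemBytes=*/0);

  persistent_kernel<<<num_sms * blocks_per_sm, block_size>>>(num_tiles, data);
  cudaDeviceSynchronize();
  cudaFree(data);
  return 0;
}
```

This is the simple grid-stride form of the idea; a persistent tile scheduler layers work-distribution logic on top, but the launch-shape consequence is the same: the grid is fixed to the hardware rather than scaled with the problem.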