Aman Salykov

@salykova_

Followers
2K
Following
358
Statuses
59

making AI inference run really fast

Vienna
Joined November 2019
@salykova_
Aman Salykov
1 month
Excited to announce!
Tweet media one
15
62
556
@salykova_
Aman Salykov
7 days
@gaunernst - this one in case you want to beat apple's Accelerate
0
0
4
@salykova_
Aman Salykov
8 days
@__tensorcore__ agree with this, but heavily templated code further complicates debugging. I remember one of your devs mentioning that this is (one of) the main reason why using cuda-gdb with CUTLASS is not recommended
0
0
0
@salykova_
Aman Salykov
8 days
Tweet media one
4
12
243
@salykova_
Aman Salykov
10 days
The modern NVCC compiler is so advanced that handwritten SASS code will likely be on par with or slower than the code generated by the compiler (assuming well-written CUDA/PTX code)
0
0
12
@salykova_
Aman Salykov
11 days
0
0
2
@salykova_
Aman Salykov
14 days
RT @awnihannun: Reminder, many institutions outside of the US and China are building amazing foundation models. many-polar world (aka even…
0
20
0
@salykova_
Aman Salykov
15 days
@AnushElangovan 👍👍👍👍
0
0
1
@salykova_
Aman Salykov
15 days
0
0
2
@salykova_
Aman Salykov
15 days
0
0
1
@salykova_
Aman Salykov
15 days
NVIDIA RTX Blackwell whitepaper is out!
Tweet media one
0
5
24
@salykova_
Aman Salykov
16 days
@PytorchToAtoms @giffmana @cHHillee btw, it depends on how you perform benchmarks: with a locked, stable clock or an unlocked one. in the latter case, sure, your results will be affected by the number of iterations, matrix size, etc., as your clock speed varies due to power limits
1
0
1
@salykova_
Aman Salykov
16 days
@PytorchToAtoms @giffmana @cHHillee yes, but it gives you relative performance among the generated kernels. you can then pick the best cutlass kernel and test it however you like
0
0
1
@salykova_
Aman Salykov
16 days
@giffmana @PytorchToAtoms @cHHillee I believe it is what @cHHillee meant by 'autotuning'
1
0
1
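
The 'autotuning' idea in the thread above (measure every generated kernel under the same conditions, then keep the fastest) can be sketched roughly as follows. This is a minimal CPU-only illustration, not CUTLASS's profiler or any library's actual API: the names `time_candidate`, `pick_best`, and `autotune` are hypothetical, and the candidates are plain Python callables standing in for kernels. On a GPU you would also need device synchronization around each timed call and, as the thread notes, a locked clock if the numbers are to be comparable across runs.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_candidate(fn: Callable[[], object], warmup: int = 2, iters: int = 10) -> float:
    """Return the median wall-clock time (seconds) of fn over `iters` timed runs."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    samples: List[float] = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def pick_best(timings: Dict[str, float]) -> str:
    """Return the name of the candidate with the smallest measured time."""
    return min(timings, key=timings.get)


def autotune(candidates: Dict[str, Callable[[], object]],
             warmup: int = 2, iters: int = 10) -> str:
    """Time every candidate with identical settings and return the fastest one's name."""
    timings = {
        name: time_candidate(fn, warmup=warmup, iters=iters)
        for name, fn in candidates.items()
    }
    return pick_best(timings)


if __name__ == "__main__":
    data = list(range(10_000))
    # Hypothetical stand-ins for "generated kernels": two ways to sum squares.
    candidates = {
        "generator": lambda: sum(x * x for x in data),
        "map_pow": lambda: sum(map(lambda x: x ** 2, data)),
    }
    print("fastest candidate:", autotune(candidates))
```

Because every candidate is timed with the same warm-up and iteration counts, the result is only a relative ranking; the chosen candidate can then be re-benchmarked however you like.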