Aman Salykov @salykova_ profile

Aman Salykov

@salykova_

Followers

2K

Following

358

Statuses

59

making AI inference run really fast

Vienna

Joined November 2019


Aman Salykov

@salykova_

1 month

Excited to announce!

15

62

556

Aman Salykov

@salykova_

7 days

@gaunernst - this one in case you want to beat apple's Accelerate

0

4

Aman Salykov

@salykova_

8 days

@__tensorcore__ agree with this, but heavily templated code further complicates debugging. I remember one of your devs mentioning that this is one of the main reasons why using cuda-gdb with CUTLASS is not recommended

0

Aman Salykov

@salykova_

8 days

4

12

243

Aman Salykov

@salykova_

10 days

The modern NVCC compiler is so advanced that handwritten SASS code will likely be on par with or slower than the code generated by the compiler (assuming well-written CUDA/PTX code)

0

12

Aman Salykov

@salykova_

11 days

@dhtikna @typedfemale

0

2

Aman Salykov

@salykova_

14 days

RT @awnihannun: Reminder, many institutions outside of the US and China are building amazing foundation models. many-polar world (aka even…

0

20

0

Aman Salykov

@salykova_

14 days

@awnihannun @MistralAI @MiniMax__AI @kyutai_labs @SakanaAILabs @nx_ai_com in Austria

0

3

Aman Salykov

@salykova_

15 days

@AnushElangovan 👍👍👍👍

0

1

Aman Salykov

@salykova_

15 days

@__ReJ__ @AgileJebrim

0

2

Aman Salykov

@salykova_

15 days

@telmin_orca

0

1

Aman Salykov

@salykova_

15 days

NVIDIA RTX Blackwell whitepaper is out!

0

5

24

Aman Salykov

@salykova_

16 days

@PytorchToAtoms @giffmana @cHHillee btw, it depends on how you run the benchmarks: with a locked, stable clock or an unlocked one. in the latter case, sure, your results will be affected by the number of iterations, matrix size, etc., as your clock speed varies due to power limits

1

0

1

Aman Salykov

@salykova_

16 days

@PytorchToAtoms @giffmana @cHHillee yes, but it gives you relative performance among the generated kernels. you can then pick the best cutlass kernel and test it however you like

0

1

Aman Salykov

@salykova_

16 days

@giffmana @PytorchToAtoms @cHHillee I believe it is what @cHHillee meant by 'autotuning'

1

0

1
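
The autotuning idea from the replies above (benchmark every generated kernel, keep the fastest, then test the winner however you like) can be sketched roughly as follows. This is only an illustration, not the CUTLASS profiler or anyone's actual tooling: `benchmark` and `autotune` are hypothetical helper names, the candidates are plain Python callables standing in for kernels, and timing uses CPU wall-clock time. Real GPU kernels would need device synchronization and GPU-side timers, and with an unlocked clock only the relative ranking is meaningful, as noted in the thread.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=20):
    """Return the median wall-clock time of fn() in seconds (hypothetical helper)."""
    # Warm-up runs are executed but not timed.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to occasional slow outliers than the mean.
    return statistics.median(samples)


def autotune(candidates, warmup=3, iters=20):
    """Time every candidate and return (name_of_fastest, all_timings).

    `candidates` maps a name to a zero-argument callable. Every candidate is
    measured with the same warm-up and iteration counts, so the timings are
    comparable with each other even if the absolute numbers drift.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    timings = {name: benchmark(fn, warmup, iters) for name, fn in candidates.items()}
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    # Stand-in "kernels": two ways of summing squares, assumed for illustration.
    data = list(range(10_000))
    candidates = {
        "generator_sum": lambda: sum(x * x for x in data),
        "explicit_loop": lambda: [x * x for x in data] and sum(x * x for x in data),
    }
    best, timings = autotune(candidates)
    for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
        print(f"{name}: {seconds * 1e6:.1f} us")
    print("fastest:", best)
```

Keeping warm-up and iteration counts identical across candidates is the design choice that matters here: it makes the comparison fair, which is what the "relative performance" point in the thread relies on.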