As in the title.
This PR is a follow-up to PR https://github.com/pytorch/pytorch/pull/112737 that addresses the bfloat16 and float32 dtype cases. The performance increases are as follows (`NVIDIA A100-SXM4-80GB`):
- bsr_scatter_mm and bfloat16
  - for blocksize 16x16, the average/maximum speedup is about 29/75%.
  - for blocksize 32x32, the average/maximum speedup is about 23/58%.
  - for blocksize 64x64, the average/maximum speedup is about 27/66%.
  - for blocksize 128x128, the average/maximum speedup is about 33/72%.
- bsr_dense_mm and bfloat16
  - for blocksize 16x16, the average/maximum speedup is about 47/61%.
  - for blocksize 32x32, the average/maximum speedup is about 29/43%.
  - for blocksize 64x64, the average/maximum speedup is about 21/41%.
  - for blocksize 128x128, the average/maximum speedup is about 12/29%.
- bsr_dense_mm and float32
  - for blocksize 16x16, the average/maximum speedup is about 35/49%.
  - for blocksize 32x32, the average/maximum speedup is about 2/5%.
  - for blocksize 64x64, the average/maximum speedup is about 2/21%.
  - for blocksize 128x128, the average/maximum speedup is about 79/84%.
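For context, below is a minimal usage sketch of the Triton-backed BSR kernel that this PR tunes. It goes through the private `torch.sparse._triton_ops` module, so the import path and call signature shown here are assumptions that may change between releases (a CUDA device and a working Triton install are required):

```python
import torch
from torch.sparse._triton_ops import bsr_dense_mm  # private API, subject to change

# A dense weight converted to a BSR tensor with 32x32 blocks, in bfloat16.
dense_weight = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
bsr_weight = dense_weight.to_sparse_bsr((32, 32))

x = torch.randn(1024, 512, dtype=torch.bfloat16, device="cuda")

# Triton-backed BSR x dense matmul; numerically equivalent to dense_weight @ x.
y = bsr_dense_mm(bsr_weight, x)
```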
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113553
Approved by: https://github.com/cpuhrsch
Finding optimal meta parameters for the bsr_dense_mm and bsr_scatter_mm Triton kernels is a tedious job. This PR introduces a tool (the Python script `torch/sparse/_triton_ops_meta.py`) that finds the optimal set of meta parameters for a given set of matrix multiplication inputs and their blocksizes. Currently, such sets are provided for square BSR tensor inputs with sizes 256...16384 and square blocksizes 16...128, and dense tensor inputs with sizes 256...131072.
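For intuition, here is a minimal sketch of the kind of search such a tool performs: time the kernel under a grid of candidate meta-parameter configurations and keep the fastest. The parameter names and the timing loop below are illustrative assumptions, not the actual contents of `_triton_ops_meta.py`:

```python
import itertools
import torch

def find_best_meta(run_kernel, candidates):
    """Return the meta-parameter dict that minimizes measured kernel time.

    `run_kernel(meta)` is assumed to launch the kernel with the given meta
    parameters; `candidates` maps parameter names to lists of values to try.
    """
    best_meta, best_time = None, float("inf")
    keys = list(candidates)
    for values in itertools.product(*(candidates[k] for k in keys)):
        meta = dict(zip(keys, values))
        run_kernel(meta)  # warm-up launch
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(10):
            run_kernel(meta)
        end.record()
        torch.cuda.synchronize()
        elapsed = start.elapsed_time(end) / 10  # average milliseconds per launch
        if elapsed < best_time:
            best_meta, best_time = meta, elapsed
    return best_meta

# Illustrative candidate grid; the real script uses kernel-specific parameters.
candidates = {"SPLIT_N": [1, 2, 4], "num_warps": [2, 4, 8], "num_stages": [1, 2, 3]}
```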
As a result, bsr_dense_mm performance has increased as follows (`NVIDIA A100-SXM4-80GB`):
- for blocksize 16x16, the average/maximum speedup is about 40/60%.
- for blocksize 32x32, the average/maximum speedup is about 28/45%.
- for blocksize 64x64, the average/maximum speedup is about 26/43%.
- for blocksize 128x128, the average/maximum speedup is about 12/28%.
To enable these performance improvements through meta parameter optimization on other CUDA devices, one must run the `_triton_ops_meta.py` script, which computes the optimal meta parameters and stores the results in a dictionary object defined in `_triton_ops_meta.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112737
Approved by: https://github.com/cpuhrsch