Files
pytorch/torch/sparse
Pearu Peterson 69c4819f53 Add bsr_dense_addmm triton kernel (#114595)
As in the title.

The `bsr_dense_addmm` kernel implemented in this PR is a generalization of `bsr_dense_mm` in the following respects (in addition of having input, beta, and alpha parameters):
- it implements `SPLIT_N` kernel parameter that enables efficient kernel launches in the case of wide inputs. For instance, the timing of nn.linear with 256x256 BSR weights having 16x16 blocks and 256x131072 strided input reduced about 16x (this corresponds to the 94 % speed up value listed below).
- it supports rectangular blocks in sparse BSR tensor weights

The performance increase of nn.linear is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is  55/94 %
- with 32x32 blocks, the average/maximal speed up is  33/63 %
- with 64x64 blocks, the average/maximal speed up is  23/42 %
- with 128x128 blocks, the average/maximal speed up is  15/39 %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114595
Approved by: https://github.com/cpuhrsch
2023-11-29 05:29:25 +00:00
..