As in the title. The `bsr_dense_addmm` kernel implemented in this PR is a generalization of `bsr_dense_mm` in the following respects (in addition to having `input`, `beta`, and `alpha` parameters); a usage sketch follows the description:

- it implements a `SPLIT_N` kernel parameter that enables efficient kernel launches in the case of wide inputs. For instance, the timing of `nn.linear` with 256x256 BSR weights having 16x16 blocks and a 256x131072 strided input was reduced by about 16x (this corresponds to the 94% maximal speedup listed below, since 1 - 1/16 ≈ 94%);
- it supports rectangular blocks in sparse BSR tensor weights.

The performance increase of `nn.linear` is as follows (float16, `NVIDIA A100-SXM4-80GB`):

- with 16x16 blocks, the average/maximal speedup is 55%/94%
- with 32x32 blocks, the average/maximal speedup is 33%/63%
- with 64x64 blocks, the average/maximal speedup is 23%/42%
- with 128x128 blocks, the average/maximal speedup is 15%/39%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114595
Approved by: https://github.com/cpuhrsch
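A minimal sketch of how the new kernel might be invoked, assuming it is exposed as `torch.sparse._triton_ops.bsr_dense_addmm` with `torch.addmm`-style semantics (`out = beta * input + alpha * (bsr @ dense)`). This is an internal helper, so the module path and keyword names are assumptions rather than a stable public API; the shapes mirror the benchmark described above:

```python
import torch
# Internal helper added by this PR; exact path/signature assumed, not public API.
from torch.sparse._triton_ops import bsr_dense_addmm

M, K, N = 256, 256, 1024
dtype, device = torch.float16, "cuda"

weight = torch.randn(M, K, dtype=dtype, device=device)
# Convert the weight to BSR; rectangular blocksizes, e.g. (16, 32), are also accepted.
bsr = weight.to_sparse_bsr((16, 16))
dense = torch.randn(K, N, dtype=dtype, device=device)
inp = torch.randn(M, N, dtype=dtype, device=device)

# Computes beta * inp + alpha * (bsr @ dense), analogous to torch.addmm.
out = bsr_dense_addmm(inp, bsr, dense, beta=0.5, alpha=2.0)

# Sanity check against the dense reference (loose tolerances for float16).
expected = torch.addmm(inp, weight, dense, beta=0.5, alpha=2.0)
torch.testing.assert_close(out, expected, atol=1e-2, rtol=1e-2)
```

Note that `SPLIT_N` is a Triton kernel meta-parameter, presumably selected by the kernel launcher based on the input width rather than passed explicitly by the caller.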