### Motivation of this PR
This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300
**GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes:
* COO: the hotspot is `scatter_reduce`
* CSR: the hotspot is `spmm_reduce`
The reduce type can be choose from: "max", "mean", "max", "min".
extend `torch.sparse.mm` with an `reduce` argument, maps to `torch.sparse_mm.reduce` internally.
`sparse_mm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_sparse_mm_reduce_impl` which has dual outputs:
* `out` - the actual output
* `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated.
### Performance
Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch.
Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused.
#### before:
```
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9
aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9
aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9
aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9
aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6
aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6
aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9
aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3
```
#### after
```
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9
aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9
aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9
aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9
aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6
aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6
aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9
aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4
aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83727
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch, https://github.com/rusty1s, https://github.com/pearu
As per title.
Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places it appears that `TEST_WITH_ROCM` is _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best-guess of how `torch.version.cuda` would behave in ROCM builds, so I've added `not TEST_WITH_ROCM` in cases where ROCM wasn't previously explicitly allowed.
CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
As per title.
Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
As per title.
Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
This PR is a copy of https://github.com/pytorch/pytorch/pull/90849 that merge was reverted.
The PR adds "check sparse tensor invariants" flag to Context that when enabled will trigger sparse tensor data invariants checks in unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to UI:
`torch.sparse.check_sparse_tensor_invariants` class provides different ways to enable/disable the invariant checking.
`torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden.
The PR fixes https://github.com/pytorch/pytorch/issues/90833
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92094
Approved by: https://github.com/cpuhrsch
### Target and Background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.
The current implementation is a reference design which converts `SparseCSR` tensor back to dense tensor and then do the addmm and convert back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger `OOM`).
### Benchmarks
Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:
* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128
So if we store the **adjacency matrix** is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, **1100x** speedup:
CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!
### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!
### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90978
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
This PR adds "check sparse tensor invariants" flag to Context that when enabled will trigger sparse tensor data invariants checks in unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to UI:
- `torch.enable_check_sparse_tensor_invariants` and `torch.is_check_sparse_tensor_invariants_enabled` functions to globally enable/disable the invariant checks and to retrieve the state of the feature, respectively
- `torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden.
The PR also fixes https://github.com/pytorch/pytorch/issues/90833
# Main issue
*The following content is outdated after merging the PRs in this ghstack but kept for the record.*
The importance of this feature is that when enabling the invariants checks by default, say, via
<details>
```
$ git diff
diff --git a/torch/__init__.py b/torch/__init__.py
index c8543057c7..19a91d0482 100644
--- a/torch/__init__.py
+++ b/torch/__init__.py
@@ -1239,3 +1239,8 @@ if 'TORCH_CUDA_SANITIZER' in os.environ:
# Populate magic methods on SymInt and SymFloat
import torch.fx.experimental.symbolic_shapes
+
+# temporarily enable sparse tensor arguments validation in unsafe
+# constructors:
+
+torch._C._set_check_sparse_tensor_invariants(True)
```
</details>
a massive number of test failures/errors occur in test_sparse_csr.py tests:
```
$ pytest -sv test/test_sparse_csr.py
<snip>
==== 4293 failed, 1557 passed, 237 skipped, 2744 errors in 69.71s (0:01:09) ====
```
that means that we are silently constructing sparse compressed tensors that do not satisfy the sparse tensor invariants. In particular, the following errors are raised:
```
AssertionError: "resize_as_sparse_compressed_tensor_: self and src must have the same layout" does not match "expected values to be a strided and contiguous tensor"
RuntimeError: CUDA error: device-side assert triggered
RuntimeError: `col_indices[..., crow_indices[..., i - 1]:crow_indices[..., i]] for all i = 1, ..., nrows are sorted and distinct along the last dimension values` is not satisfied.
RuntimeError: expected col_indices to be a strided and contiguous tensor
RuntimeError: expected row_indices to be a strided and contiguous tensor
RuntimeError: expected values to be a strided and contiguous tensor
RuntimeError: for_each: failed to synchronize: cudaErrorAssert: device-side assert triggered
RuntimeError: tensor dimensionality must be sum of batch, base, and dense dimensionalities (=0 + 2 + 0) but got 3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90849
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
### Target and Background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.
The current implementation is a reference design which converts `SparseCSR` tensor back to dense tensor and then do the addmm and convert back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger `OOM`).
### Benchmarks
Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:
* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128
So if we store the **adjacency matrix** is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, **1100x** speedup:
CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!
### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!
### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90978
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
This PR extends the `Tensor.to_sparse()` method to `Tensor.to_sparse(layout=None, blocksize=None)` in a BC manner (`layout=None` means `layout=torch.sparse_coo`).
In addition, the PR adds support for the following conversions:
- non-hybrid/hybrid COO tensor to CSR or CSC or a COO tensor
- short, bool, byte, char, bfloat16, int, long, half CSR tensor to a BSR tensor
and fixes the following conversions:
- hybrid COO to COO tensor
- non-batch/batch hybrid BSR to BSR or BSC tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89502
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
As per title. This implementation is not the most optimal and could be improved albeit with native kernels (i.e. block matching need not be materialized).
Compared to existing kernels it offers:
- Half float support (In fact, any dtype that supports `matmul` will work).
- Arbitrary block sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85551
Approved by: https://github.com/amjames, https://github.com/cpuhrsch