35 Commits

bf6b40da3e fix deterministic scatter_add path for multi-d tensors (#162866)
Previously, `select` didn't work correctly for tensors with more than 2 dimensions.
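
A minimal sketch of the path this fixes (hypothetical shapes; assumes a CUDA build, since deterministic mode routes `scatter_add_` through the sort-based implementation):
```
import torch

torch.use_deterministic_algorithms(True)
x = torch.zeros(4, 3, 5, device="cuda")
index = torch.randint(0, 4, (2, 3, 5), device="cuda")
src = torch.ones(2, 3, 5, device="cuda")
x.scatter_add_(0, index, src)  # >2-d input previously hit the broken `select` handling
```
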
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162866
Approved by: https://github.com/valentinandrei
2025-09-15 06:50:00 +00:00
fc0376e8b1 [BE][2/6] fix typos in test/ (test/test_*.py) (#157636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157636
Approved by: https://github.com/yewentao256, https://github.com/mlazos
ghstack dependencies: #156311, #156609
2025-07-09 11:02:23 +00:00
beb52f5c0a use more efficient implementation for broadcasted indexing in deterministic scatter_add (#156744)

per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156744
Approved by: https://github.com/suo
2025-06-25 02:59:50 +00:00
99ae7d4069 Reland fast gather and index implementation (#151917)
This PR reapplies #151490 and #151753 together, and adds some missing checks when applying the fast path.
Previously missed checks:
1) The indexing path has the stride in the indexed dimension in bytes; the gather path has it in elements. When checking whether the fast path is applicable, I didn't take this difference into account and still multiplied the indexing stride by the element size. Fixed, and a test added.
2) We want to take the fast path only when we are copying contiguous, equally spaced slices of the input, with all the necessary alignment requirements met. The effective tensor size should be 2d (after all possible flattening is applied), the index stride in the last dimension should be 0, and, since in the kernel we are not applying non-indexing-related offsets to the src tensor, the src tensor stride in the second dimension should be 0. This happens automatically for gather with dim=0, so I didn't put in an explicit condition for it. Sometimes all conditions except the first dim's "effective" stride being 0 are satisfied for gather on a non-zero dim, when the index size in the indexing dimension is 1 and is thus collapsed (dimensions of size 1 are always collapsed), e.g.
```
        # test gather along the 1st dim that can accidentally trigger the fast path:
        # because the index dimension in the gather dim is 1,
        # an unexpected squashing in TensorIterator happens
        src = make_tensor((16, 2, 16), device=device, dtype=dtype)
        ind = torch.randint(2, (16, 1), device=device).view(16, 1, 1).expand(16, 1, 16)
        res = torch.gather(src, dim=1, index=ind)
        if res.device.type == "cuda":
            ref_cpu = torch.gather(src.cpu(), dim=1, index=ind.cpu())
            self.assertEqual(res.cpu(), ref_cpu, atol=0, rtol=0)
```
Note that if the index size here were (16, 2, 16) instead of (16, 1, 16), the middle dimension could not be collapsed and we wouldn't end up incorrectly taking the fast path.
We could update the kernel to take this stride into account when computing offsets into the src tensor, or we could explicitly disallow a non-zero stride on the first dimension. I took the second path for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151917
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/Skylion007
2025-04-23 19:13:13 +00:00
b8f4dc5a9f [ROCm] opportunistic fastatomics for ReduceAdd operations for MI300 GPUs (#146264)
In this approach, we catch lanes within a wave that are doing fastatomics to the same destination address and compute the sum on the CU. This leads to a 3x improvement in scatter_add performance and a 2x improvement in index_select.
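
A conceptual Python model of the idea (the actual change is a GPU kernel; the values here are illustrative):
```
import torch

# lanes in a wave writing to the same address first combine their values,
# so only one atomic update per unique destination is issued
addr = torch.tensor([7, 7, 7, 2])          # destination address per lane
val = torch.tensor([1.0, 2.0, 3.0, 4.0])   # value per lane
uniq, inv = torch.unique(addr, return_inverse=True)
combined = torch.zeros_like(uniq, dtype=val.dtype).scatter_add_(0, inv, val)
# one "atomic" add per unique address instead of one per lane
```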

scatter_add performance on MI300X (median times from the reproducer below; lower is better):

dtype | Baseline (before optimizations) | Opportunistic fastatomics
------|---------------------------------|--------------------------
f32   | 1.389425039                     | 0.430447996
fp16  | 2.195472956                     | 0.779729486
bf16  | 2.194051027                     | 0.784599513

Using the following reproducer:
```
import torch
import triton

def main():
    dtype = torch.float32
    dim = 1305301
    a = torch.rand(100, device="cuda", dtype=dtype)
    index = torch.randint(0, 100, (dim,), device="cuda")
    src = torch.rand(dim, device="cuda", dtype=dtype)

    print("=" * 20)
    print(
        triton.testing.do_bench(
            lambda: a.scatter_add(0, index, src),
            return_mode="median",
        )
    )
    print("=" * 20)

if __name__ == "__main__":
    main()
```

co-authored by: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146264
Approved by: https://github.com/jeffdaily, https://github.com/mxz297

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-04-22 21:55:40 +00:00
f072bf27a7 Revert "faster gather implementation (#151490)"
This reverts commit 541f8cd34cbccfcaf04a377f747390f83658d6ec.

Reverted https://github.com/pytorch/pytorch/pull/151490 on behalf of https://github.com/malfet due to Looks like it breaks demucs accuracy, though it may be bogus; let's try the revert, see c729f7dbee/3 ([comment](https://github.com/pytorch/pytorch/pull/151490#issuecomment-2821803788))
2025-04-22 16:09:14 +00:00
0ff302e8e0 Revert "reroute index to fast implementation for indexing on 0th dimension (#151753)"
This reverts commit 4d78e19365c4e2189693c7a81b665d4ec2d2cf53.

Reverted https://github.com/pytorch/pytorch/pull/151753 on behalf of https://github.com/malfet due to Looks like it breaks a bunch of distributed tests with DSA, see 4d78e19365 ([comment](https://github.com/pytorch/pytorch/pull/151753#issuecomment-2820078298))
2025-04-22 05:03:03 +00:00
4d78e19365 reroute index to fast implementation for indexing on 0th dimension (#151753)
Per title: improve `x[index]` CUDA perf for the common case of indexing along the first dim, using a vectorized gather kernel.
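
For illustration, a sketch of the access pattern this fast path targets (hypothetical sizes):
```
import torch

x = torch.randn(4096, 1024, device="cuda")
idx = torch.randint(0, 4096, (8192,), device="cuda")
out = x[idx]  # indexing along dim 0 routes to the vectorized gather kernel
```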

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151753
Approved by: https://github.com/eqy
2025-04-21 23:15:30 +00:00
541f8cd34c faster gather implementation (#151490)
So far it's only for `gather`, but we'll move `index_select` and `index` to this implementation too. Torchtitan and fbgemm have noticed that gather/index_select perf is bad; this PR brings the core implementation on par with those customized implementations. Added benefits: all dtypes are supported, and it's a bit less strict about tensor dimensions/contiguity because we pick the fast path after TensorIterator has collapsed the dimensions.

The biggest part of this PR is not even the kernel (it's dumb; vectorized loads are enough), but moving the utilities for vectorized loads and stores from SymmetricMemory to be generally accessible in MemoryAccess.cuh.
Additional tests are coming to make sure this implementation doesn't break anything.

`gather` is equivalent to `x[indices]` for 1d indices via
```
import torch

def fn_gather(x, indices):
    return torch.gather(x, dim=0, index=indices.unsqueeze(1).expand(-1, x.shape[1]))

def fn_index(x, indices):
    return x[indices]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151490
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-18 07:48:31 +00:00
6c0a2d8bbf Fix the check for can_use_expanded_index_path (#140351)
Fixes #129093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140351
Approved by: https://github.com/mingfeima, https://github.com/cpuhrsch
2024-11-15 05:52:23 +00:00
a0e2f62edd Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809)"
This reverts commit 9e24c263f998819f849bb8293323213101e9aefc.

Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2091751002))
2024-05-02 21:36:18 +00:00
9e24c263f9 Include support for the scatter gather cuda kernels to allow for comp… (#124809)
Fixes #121965

This PR hopes to add support for complex numbers in the scatter/gather-related kernels. For brevity, I will only include `complex<float>` for now, as `complex<double>`, for example, will be more complicated.

C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.
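
A small usage sketch of what this enables (hypothetical shapes):
```
import torch

src = torch.randn(10, dtype=torch.complex64, device="cuda")
index = torch.randint(0, 4, (10,), device="cuda")
out = torch.zeros(4, dtype=torch.complex64, device="cuda")
out.scatter_add_(0, index, src)  # complex<float> now handled by the CUDA scatter/gather kernels
```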

Please keep the following in mind:
1) I think this is my first time using PyTorch.
2) This is my first contribution to PyTorch.

Environment:
3080 & WSL 2. `nvcc` is at 12.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/mikaylagawarecki
2024-05-01 23:58:35 +00:00
4d410155b2 Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809)"
This reverts commit e09f98c705e4851414cd8ddf21949177af2b13aa.

Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/clee2000 due to the Windows build failure being real; https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612#step:11:11236 is the correct failure line. Ignore the statement saying the build passed; batch error codes aren't propagating again ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2088680371))
2024-05-01 16:02:02 +00:00
e09f98c705 Include support for the scatter gather cuda kernels to allow for comp… (#124809)
Fixes #121965

This PR hopes to add support for complex numbers in the scatter/gather-related kernels. For brevity, I will only include `complex<float>` for now, as `complex<double>`, for example, will be more complicated.

C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.

Please keep the following in mind:
1) I think this is my first time using PyTorch.
2) This is my first contribution to PyTorch.

Environment:
3080 & WSL 2. `nvcc` is at 12.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
2024-05-01 14:31:31 +00:00
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
053654b9cf Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)
### Description

This PR optimizes scatter_add/scatter_reduce for the BFloat16/Half data types in the CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057. The main point is creating a buffer among threads to accumulate intermediate data in fp32.
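
A Python sketch of the numeric idea only (the actual change is a C++ kernel; `scatter_add_bf16_via_fp32` below is a hypothetical helper):
```
import torch

def scatter_add_bf16_via_fp32(self_bf16, dim, index, src_bf16):
    # accumulate in an fp32 buffer, then round back to bf16 once at the end
    buf = self_bf16.float()
    buf.scatter_add_(dim, index, src_bf16.float())
    return buf.to(torch.bfloat16)
```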

Next step:

 - [x] Add benchmarks
 - [x] Extend to Half
 - [x] Simplify code

### Performance test (Updated)

Tested BFloat16 on an Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, with jemalloc and iomp.

Single socket (40C)
![image](https://github.com/pytorch/pytorch/assets/61222868/4b4342f1-8cc3-46f7-81f5-651becd9b1e3)

Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/09e5f700-2c2e-4208-979e-74b85474dea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-07-13 09:34:29 +00:00
f8aedf1efe Revert "Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)"
This reverts commit da7675621efce341c80187e404ac62cb6c22bbf8.

Reverted https://github.com/pytorch/pytorch/pull/103427 on behalf of https://github.com/clee2000 due to sorry but it looks like this pr broke test_scatter_gather_ops.py::TestScatterGatherCPU::test_scatter_expanded_index_cpu_bfloat16 on periodic parallelnative testing da7675621e https://github.com/pytorch/pytorch/actions/runs/5477783108/jobs/9977608393 ([comment](https://github.com/pytorch/pytorch/pull/103427#issuecomment-1624008753))
2023-07-06 17:02:03 +00:00
da7675621e Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)
### Description

This PR optimizes scatter_add/scatter_reduce for the BFloat16/Half data types in the CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057. The main point is creating a buffer among threads to accumulate intermediate data in fp32.

Next step:

 - [x] Add benchmarks
 - [x] Extend to Half
 - [x] Simplify code

### Performance test (Updated)

Tested BFloat16 on an Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, with jemalloc and iomp.

Single socket (40C)
![image](https://github.com/pytorch/pytorch/assets/61222868/4b4342f1-8cc3-46f7-81f5-651becd9b1e3)

Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/09e5f700-2c2e-4208-979e-74b85474dea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-07-06 01:23:56 +00:00
9bc68fcd25 [pytorch] Accelerate indexing_backward_kernel with duplicates (#99441 attempt 2) (#100505)
By knowing the stride value ahead of time, we can simplify the kernel code as follows:

If `stride == 1` we can use the whole warp to reduce the gradients.
If `stride < warp_size` we don't need the internal `while (start_feature < stride)` loop, as `blockDim.x` is always 32.

This change improves the performance of the kernel when duplicates are present and does not affect performance when few duplicates are present. The implementation is deterministic.

The proposed implementation uses `opmath_t` to accumulate the gradient values in registers, so when using FP16/BF16 it may overflow if the number of elements is large. This is different from the initial implementation, which accumulates in `scalar_t` and does not overflow. In addition, when the stride is 1, we use warp shuffles to sum the gradient, so the order of addition is slightly different from a reference implementation, which causes minor numerical differences.

TEST CODE:

```
import torch
from time import time

# The first element is the number of iterations.
# The second represents the number of unique elements. If
# set to 0, the number of unique elements is equal to the
# number of elements.
# The remaining elements are the tensor dimensions.

basic_indexing_tests = [
    [10, 0, 12345],
    [10, 4, 12345],
    [10, 16, 512, 512, 32],
    [10, 0, 4, 4],
    [10, 0, 32, 32],
    [10, 8, 32, 32],
    [10, 8, 64, 32, 16],
    [10, 0, 64, 32, 16],
    [10, 16, 512, 512, 32],
    [10, 0, 675, 999, 13],
    [10, 0, 123, 456, 31],
    [10, 0, 512, 512, 32],
    [10, 4, 512, 512, 32],
    [10, 2, 512, 512, 32],
    [10, 0, 128, 128, 16, 16],
    [10, 8, 128, 126, 16, 16],
    [10, 4, 128, 126, 16, 16],
    [10, 0, 64, 64, 16, 16, 16],
    [10, 8, 64, 64, 16, 16, 16],
    [10, 2, 64, 64, 16, 16, 16],
    [10, 1, 64, 64, 16, 16, 16],
]

def run_basic_indexing_on_device(x, index, expected, device_string, iters):
    x_dev = x.to(device_string)
    x_dev = x_dev.detach().requires_grad_()
    index_dev = index.to(device_string)

    # Run backward pass; keep gradients and measure time
    torch.cuda.synchronize()
    t_bw_s = time()
    for _ in range(iters):
        y = x_dev[index_dev]
        z = y.sum()
        z.backward()
    torch.cuda.synchronize()
    t_bw_s = (time() - t_bw_s) / iters

    return (x_dev.grad, t_bw_s)

def run_basic_indexing_test(test_input):
    niters = test_input[0]
    num_unique = test_input[1]
    tensor_size = tuple(test_input[2:])

    numel = 1
    for dim in tensor_size:
        numel *= dim
    if num_unique == 0:
        num_unique = numel

    index = torch.randint(0, num_unique, tensor_size, dtype=torch.long, device="cpu")
    x = torch.randn((numel,), dtype=torch.float32, device="cuda")

    index = index.detach()
    x = x.detach().requires_grad_()

    (cpu_grad, t_bw_cpu) = run_basic_indexing_on_device(x, index, numel / 2, "cpu", 1)
    (gpu_grad, t_bw_gpu) = run_basic_indexing_on_device(x, index, numel / 2, "cuda", 1)

    max_delta = torch.max(torch.abs(cpu_grad - gpu_grad.to("cpu")))
    missmatches = torch.nonzero(torch.abs(cpu_grad - gpu_grad.to("cpu")))

    (gpu_grad_perf, t_gpu) = run_basic_indexing_on_device(
        x, index, numel / 2, "cuda", niters
    )

    print(
        "test = {}, delta = {:.5f}, missmatches = {} duration_ms = {:.3f}".format(
            tuple(test_input), max_delta, missmatches, t_gpu * 1000.0
        )
    )

    if torch.numel(missmatches) > 0:
        print("cpu grad = {}", cpu_grad[missmatches])
        print("gpu grad = {}", gpu_grad[missmatches])
```

RESULTS:

```
Default Implementation

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.726
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.867
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.514
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.689
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.547
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.537
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.199
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.584
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.055
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.411
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.419
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.048
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.633
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 606.403
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4.099
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 76.813
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.760
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.547
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 317.583
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1204.800
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2412.133

Small Stride Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.904
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.156
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 308.878
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.566
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.540
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.550
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.868
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.656
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.856
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.624
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.837
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.274
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1127.040
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2123.942
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.282
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 288.997
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 547.267
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 12.844
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1178.934
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4262.042
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8172.318

Stride 1 Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.692
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.834
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 81.023
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.631
test = (100, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.491
test = (100, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.477
test = (50, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.561
test = (50, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.516
test = (16, 10, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 126.455
test = (10, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.238
test = (10, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.520
test = (10, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 7.854
test = (10, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 306.327
test = (10, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 610.498
test = (5, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.684
test = (5, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 75.604
test = (5, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.679
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.525
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 315.095
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1214.715
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100505
Approved by: https://github.com/ngimel
2023-05-03 23:52:58 +00:00
1114673c90 Revert "[pytorch] Accelerate indexing_backward_kernel with duplicates (#99441)"
This reverts commit 97afbcbc8007857a51c85e9c61fe6d80564ef1f9.

Reverted https://github.com/pytorch/pytorch/pull/99441 on behalf of https://github.com/ngimel due to breaks ROCM ([comment](https://github.com/pytorch/pytorch/pull/99441#issuecomment-1531804487))
2023-05-02 16:46:04 +00:00
97afbcbc80 [pytorch] Accelerate indexing_backward_kernel with duplicates (#99441)
By knowing the stride value ahead of time, we can simplify the kernel code as follows:

If `stride == 1` we can use the whole warp to reduce the gradients.
If `stride < warp_size` we don't need the internal `while (start_feature < stride)` loop, as `blockDim.x` is always 32.

This change improves the performance of the kernel when duplicates are present and does not affect performance when few duplicates are present. The implementation is deterministic.

The proposed implementation uses `opmath_t` to accumulate the gradient values in registers, so when using FP16/BF16 it may overflow if the number of elements is large. This is different from the initial implementation, which accumulates in `scalar_t` and does not overflow. In addition, when the stride is 1, we use warp shuffles to sum the gradient, so the order of addition is slightly different from a reference implementation, which causes minor numerical differences.

TEST CODE:

```
import torch
from time import time

# The first element is the number of iterations.
# The second represents the number of unique elements. If
# set to 0, the number of unique elements is equal to the
# number of elements.
# The remaining elements are the tensor dimensions.

basic_indexing_tests = [
    [10, 0, 12345],
    [10, 4, 12345],
    [10, 16, 512, 512, 32],
    [10, 0, 4, 4],
    [10, 0, 32, 32],
    [10, 8, 32, 32],
    [10, 8, 64, 32, 16],
    [10, 0, 64, 32, 16],
    [10, 16, 512, 512, 32],
    [10, 0, 675, 999, 13],
    [10, 0, 123, 456, 31],
    [10, 0, 512, 512, 32],
    [10, 4, 512, 512, 32],
    [10, 2, 512, 512, 32],
    [10, 0, 128, 128, 16, 16],
    [10, 8, 128, 126, 16, 16],
    [10, 4, 128, 126, 16, 16],
    [10, 0, 64, 64, 16, 16, 16],
    [10, 8, 64, 64, 16, 16, 16],
    [10, 2, 64, 64, 16, 16, 16],
    [10, 1, 64, 64, 16, 16, 16],
]

def run_basic_indexing_on_device(x, index, expected, device_string, iters):
    x_dev = x.to(device_string)
    x_dev = x_dev.detach().requires_grad_()
    index_dev = index.to(device_string)

    # Run backward pass; keep gradients and measure time
    torch.cuda.synchronize()
    t_bw_s = time()
    for _ in range(iters):
        y = x_dev[index_dev]
        z = y.sum()
        z.backward()
    torch.cuda.synchronize()
    t_bw_s = (time() - t_bw_s) / iters

    return (x_dev.grad, t_bw_s)

def run_basic_indexing_test(test_input):
    niters = test_input[0]
    num_unique = test_input[1]
    tensor_size = tuple(test_input[2:])

    numel = 1
    for dim in tensor_size:
        numel *= dim
    if num_unique == 0:
        num_unique = numel

    index = torch.randint(0, num_unique, tensor_size, dtype=torch.long, device="cpu")
    x = torch.randn((numel,), dtype=torch.float32, device="cuda")

    index = index.detach()
    x = x.detach().requires_grad_()

    (cpu_grad, t_bw_cpu) = run_basic_indexing_on_device(x, index, numel / 2, "cpu", 1)
    (gpu_grad, t_bw_gpu) = run_basic_indexing_on_device(x, index, numel / 2, "cuda", 1)

    max_delta = torch.max(torch.abs(cpu_grad - gpu_grad.to("cpu")))
    missmatches = torch.nonzero(torch.abs(cpu_grad - gpu_grad.to("cpu")))

    (gpu_grad_perf, t_gpu) = run_basic_indexing_on_device(
        x, index, numel / 2, "cuda", niters
    )

    print(
        "test = {}, delta = {:.5f}, missmatches = {} duration_ms = {:.3f}".format(
            tuple(test_input), max_delta, missmatches, t_gpu * 1000.0
        )
    )

    if torch.numel(missmatches) > 0:
        print("cpu grad = {}", cpu_grad[missmatches])
        print("gpu grad = {}", gpu_grad[missmatches])
```

RESULTS:

```
Default Implementation

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.726
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.867
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.514
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.689
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.547
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.537
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.199
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.584
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.055
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.411
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.419
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.048
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.633
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 606.403
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4.099
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 76.813
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.760
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.547
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 317.583
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1204.800
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2412.133

Small Stride Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.904
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.156
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 308.878
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.566
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.540
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.550
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.868
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.656
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.856
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.624
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.837
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.274
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1127.040
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2123.942
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.282
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 288.997
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 547.267
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 12.844
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1178.934
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4262.042
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8172.318

Stride 1 Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.692
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.834
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 81.023
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.631
test = (100, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.491
test = (100, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.477
test = (50, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.561
test = (50, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.516
test = (16, 10, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 126.455
test = (10, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.238
test = (10, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.520
test = (10, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 7.854
test = (10, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 306.327
test = (10, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 610.498
test = (5, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.684
test = (5, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 75.604
test = (5, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.679
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.525
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 315.095
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1214.715
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99441
Approved by: https://github.com/ngimel
2023-05-01 22:41:00 +00:00
4c9d660733 fix gather issue when index is shape of n by 1 (#99709)
Fix https://github.com/pytorch/pytorch/issues/99595

When the index has shape {N, 1}, it will also have strides {1, 0}, the same as an expanded tensor (e.g. shape {5, 5} with strides {1, 0}), leading to wrong output.
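
A minimal repro sketch based on the description above (hypothetical values):
```
import torch

x = torch.randn(5, 5)
index = torch.randint(0, 5, (5, 1))  # shape {N, 1}
out = torch.gather(x, 1, index)  # previously wrong when the index was mistaken for an expanded tensor
```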

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99709
Approved by: https://github.com/XiaobingSuper, https://github.com/ezyang
2023-04-24 20:55:46 +00:00
3654552b8c add deterministic impl for scatter and scatter_reduction sum/mean mode (#98060)
using the existing deterministic implementation via `index_put`, which is deterministic because it sorts the indices.

With the `accumulate` arg in `index_put`, this works for both scatter and scatter_reduce with sum/mean reduction modes.
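
A sketch of the 1-d case (`scatter_add_1d_via_index_put` is a hypothetical helper; the real implementation handles arbitrary dims):
```
import torch

def scatter_add_1d_via_index_put(self_tensor, index, src):
    # index_put_ with accumulate=True takes the deterministic, sort-based path
    return self_tensor.index_put_((index,), src, accumulate=True)
```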

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98060
Approved by: https://github.com/mikaylagawarecki
2023-04-03 20:38:29 +00:00
dc6916b341 optimize gather performance for gnn usage on CPU (#87586)
In the classic PyG use case for message passing, `gather` has an `index` tensor in a broadcasted shape, e.g. with shape `5000, 128` and stride `[1, 0]`. That indicates gather is done on each row of the self tensor. The current implementation tries to parallelize on the inner dimension, which performs badly on CPU and cannot be vectorized.
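
For illustration, the broadcasted-index pattern described above (hypothetical sizes):
```
import torch

x = torch.randn(10000, 128)
row = torch.randint(0, 10000, (5000,))
index = row.unsqueeze(1).expand(-1, x.size(1))  # shape (5000, 128), stride (1, 0)
out = torch.gather(x, 0, index)  # gathers whole rows, like index_select
```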

This PR addresses this use case and optimizes in a similar manner to index_select: parallelize on the outer dimension of `index` and do a vectorized copy on the inner dimension.

Performance benchmarking on a Xeon Ice Lake single socket with `GCN`: `gather` time reduced from `150.787ms` to `10.926ms`. After this optimization, `gather` is no longer the major bottleneck for training GNN models when `EdgeIndex` is in COO format.

for more details, please refer to https://github.com/pyg-team/pytorch_geometric/issues/4891#issuecomment-1288423705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87586
Approved by: https://github.com/rusty1s, https://github.com/malfet
2023-01-12 00:43:43 +00:00
8f5f15a64b optimize scatter_add performance for gnn usage on CPU (#82703)
### Motivation of this PR

This PR targets improving the performance of `scatter_add` for GNN usage scenarios on PyG. Currently only CPU optimizations are covered.

`Message Passing` is the major step in GNN learning, which means exchanging/aggregating info between nodes. From the perf point of view, if the `EdgeIndex` is stored as [2, num_edges], `scatter_reduce` is a major perf hotspot in the current PyTorch implementation.

To be more specific, in the process of message passing, `scatter_add` is used in a very similar way to `index_select`, except that the `self` tensor is written to while `index_select` only reads. Therefore, the `index` tensor passed to `scatter_add` is an expanded tensor on dim0, which means all the remaining dims end up with the same value.

### Algorithm

The current scatter impl parallelizes on the inner dims for this case, which causes bad perf: a non-contiguous memory access pattern that cannot be vectorized.

This PR sorts the `index` to solve the write conflicts that arise when parallelizing directly on dim0 (see the Python sketch after this list). The algorithm is equivalent to:
* convert memory format from `COO` to `CSR`
* do spmm reduce
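
A Python sketch of the equivalent computation (illustrative only; the actual kernel is C++ and processes rows in parallel):
```
import torch

def scatter_add_rows_sorted(out, index, src):
    # sort the COO row indices so each destination row becomes a contiguous segment
    order = torch.argsort(index)
    sorted_idx, sorted_src = index[order], src[order]
    # build CSR-style row pointers from per-row counts
    counts = torch.bincount(sorted_idx, minlength=out.size(0))
    ptr = torch.cat([counts.new_zeros(1), counts.cumsum(0)]).tolist()
    for row in range(out.size(0)):
        seg = sorted_src[ptr[row]:ptr[row + 1]]
        if seg.numel() > 0:
            out[row] += seg.sum(dim=0)
    return out
```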

### Perf improvement

The benchmark comes from https://github.com/pyg-team/pytorch_geometric/tree/master/examples, `python reddit.py` which runs model SAGE on dataset reddit.

CPU type: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz

`aten::scatter_add_` has been reduced from **37.797s** to **5.989s**:

* breakdown before
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                     aten::scatter_add_        49.00%       37.797s        49.00%       37.797s      41.445ms           912
                                     aten::index_select        19.74%       15.223s        19.74%       15.227s       6.678ms          2280
                                           aten::linear         0.01%       5.706ms        15.04%       11.602s      12.721ms           912
                                            aten::addmm         6.62%        5.108s         7.92%        6.112s      13.403ms           456
                                           aten::matmul         0.00%       2.339ms         7.10%        5.475s      12.006ms           456
```

* breakdown after
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                     aten::index_select        32.41%       14.677s        32.42%       14.681s       6.439ms          2280
                                           aten::linear         0.01%       6.665ms        26.43%       11.968s      13.123ms           912
                                            aten::addmm        11.76%        5.328s        13.76%        6.232s      13.667ms           456
                                     aten::scatter_add_        13.22%        5.989s        13.22%        5.989s       6.566ms           912
                                           aten::matmul         0.01%       2.303ms        12.63%        5.720s      12.543ms           456
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82703
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-01-11 05:55:09 +00:00
5b58140d1a Add deterministic impl of scatter_add CUDA for all input sizes (#79466)
Fixes #50469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79466
Approved by: https://github.com/ngimel
2022-09-07 03:12:49 +00:00
d7847ed23e Add integer support to scatter_reduce (#80324)
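A quick usage sketch (illustrative values) of what this enables:
```
import torch

t = torch.zeros(3, dtype=torch.int64)
index = torch.tensor([0, 1, 1])
src = torch.tensor([5, 6, 7])
t.scatter_reduce_(0, index, src, reduce="sum")  # integer dtypes now supported; t becomes [5, 13, 0]
```
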
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80324
Approved by: https://github.com/cpuhrsch
2022-06-29 21:10:26 +00:00
519347df49 fix gather sparse_grad backward crash with empty index tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78698

Approved by: https://github.com/ngimel
2022-06-04 02:25:05 +00:00
2a9779adbf Bugfix NAN and Inf handling for scatter_reduce (amin and amax)
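A hedged example of the behavior this targets (illustrative values; NaN should propagate through amax, and +/-Inf should compare correctly):
```
import torch

src = torch.tensor([1.0, float("nan"), float("inf")])
index = torch.tensor([0, 0, 1])
out = torch.zeros(2).scatter_reduce_(0, index, src, reduce="amax", include_self=False)
# out[0] is NaN (propagated), out[1] is inf
```
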
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75651

Approved by: https://github.com/cpuhrsch
2022-05-06 15:39:57 +00:00
4aa6b6b9de Revert "Bugfix NAN and Inf handling for scatter_reduce (amin and amax)"
This reverts commit 4441582f809ee94045e77ec595470bb6e68ba5f6.

Reverted https://github.com/pytorch/pytorch/pull/75651 on behalf of https://github.com/malfet
2022-05-04 21:29:07 +00:00
4441582f80 Bugfix NAN and Inf handling for scatter_reduce (amin and amax)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75651

Approved by: https://github.com/cpuhrsch
2022-05-03 20:10:30 +00:00
e9a8e6f74a Add include_self flag to scatter_reduce
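A quick usage sketch of the new flag (illustrative values):
```
import torch

t = torch.tensor([10.0, 10.0])
index = torch.tensor([0, 0])
src = torch.tensor([1.0, 2.0])
t.scatter_reduce_(0, index, src, reduce="sum", include_self=False)
# t[0] = 1 + 2 = 3.0 (the original 10.0 is excluded); t[1] keeps its original value
```
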
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74607

Approved by: https://github.com/cpuhrsch
2022-04-05 16:31:39 +00:00
f7829812b4 scatter_reduce CUDA support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74606

Approved by: https://github.com/cpuhrsch
2022-04-01 15:28:06 +00:00
2bfa018462 [BC-breaking] Use ScatterGatherKernel for scatter_reduce (CPU-only) (#74226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74226

Update the signature of `scatter_reduce_` to match `scatter_`/`scatter_add_` (a usage sketch follows the list below)

`Tensor.scatter_reduce_(int64 dim, Tensor index, Tensor src, str reduce)`

- Add new reduction options in ScatterGatherKernel.cpp and update `scatter_reduce` to call into the cpu kernel for `scatter.reduce`
- `scatter_reduce` now has the same shape constraints as `scatter_` and `scatter_add_`
- Migrate `test/test_torch.py:test_scatter_reduce` to `test/test_scatter_gather_ops.py`
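
A quick usage sketch of the updated signature (illustrative values):
```
import torch

t = torch.zeros(3)
index = torch.tensor([0, 1, 0, 1, 2])
src = torch.arange(5, dtype=torch.float)
t.scatter_reduce_(0, index, src, reduce="sum")  # t becomes [2., 4., 4.]
```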

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D35222842

Pulled By: mikaylagawarecki

fbshipit-source-id: 84930add2ad30baf872c495251373313cb7428bd
(cherry picked from commit 1b45139482e22eb0dc8b6aec2a7b25a4b58e31df)
2022-04-01 05:57:45 +00:00
e0d829a266 Kill the test_torch.py mixin and creates test_scatter_gather_ops (#71691)
Summary:
Per title.

Also annotates test_torch.py with additional cleanup tasks and adds empty sample inputs to elementwise unary and binary OpInfos.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71691

Reviewed By: ngimel

Differential Revision: D33735126

Pulled By: mruberry

fbshipit-source-id: 8cc097a7581a8b620540c95b2a5889c1165ecf23
(cherry picked from commit 5c6a245a3f9ba7c064fc77c8cd4045f903e73cfd)
2022-01-24 09:32:32 +00:00