This PR enables all PIE rules on ruff, there are already some enabled rules from this family, the new added rules are
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
This PR enables all PIE rules on ruff, there are already some enabled rules from this family, the new added rules are
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
This reverts commit 5494b2a8d38c3ddbeb2d96a5ac990e20ec4c48fd.
Need to skip `test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_zero_sized_cuda_*` again. Tests are failing now with "core dumped" error
```
python test_sparse_csr.py -v -k test_sampled_addmm_zero_sized_cuda_float64
test_sampled_addmm_zero_sized_cuda_float64 (__main__.TestSparseCSRCUDA) ... /tmp/pytorch/test/test_sparse_csr.py:2503: c = torch.empty(m, n, dtype=dtype, device=device, layout=torch.sparse_csr)
GPU core dump created: gpucore.186789
:0:rocdevice.cpp :2992: 4701819131755 us: Callback: Queue 0x760cdcd00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
Aborted (core dumped)
```
These failures are linked to `test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_*` due to incorrect test log parsing. We will be able to close these issues also:
- Fixes https://github.com/pytorch/pytorch/issues/163663
- Fixes https://github.com/pytorch/pytorch/issues/160786
- Fixes https://github.com/pytorch/pytorch/issues/160785
- Fixes https://github.com/pytorch/pytorch/issues/160784
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163848
Approved by: https://github.com/jeffdaily
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always
Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always
Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always
Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
Seems to currently fail with mismatches in the 1e-4 range presumably due to sdpa calling into the `MATH` backend here which is less fused than a triton kernel. Doing the ref computation in `float64` appears to fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146461
Approved by: https://github.com/drisspg
As in the title.
Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413
The PR assumes that the existing tuning parameters are good also when using scaling arguments. This needs to be verified as a follow-up task.
Also, this PR redefines triton-contiguous tensors: the tensor must have strides not larger than 1. This will now allow zero strides that previously triggered `contiguous` call although the underlying memory buffer was contiguous.
Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should refer to a code (torch or triton?) that implements the element/chunk-wise copy so that we could verify that allowing zero strides indeed would not trigger element-wise copies. Atm, the performance increase in ViT-H benchmarks (that involve using 0 strides) is an evidence that allowing zero strides does not lead to slow-downs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136104
Approved by: https://github.com/cpuhrsch
As in the title. In addition, the PR introduces `_int_bsr_dense_addmm` that is equivalent to `bsr_dense_addmm` except for int8 inputs the operation result is int32 tensor (similar to existing `_int_mm`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133855
Approved by: https://github.com/cpuhrsch
Update ruff to 0.4.1 .
This version fixes a lot false negatives/false positives, is 20-40% faster, and has various other bug fixes.
Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0
| Repository | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7 | 251.8 | 351.1 | 274.9 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
This PR enables `test_addmm_sizes_all_sparse_csr_k_*_n_*_m_*_cuda_complex128` for ROCm for trivial cases (m or n or k = 0)
CUSPARSE_SPMM_COMPLEX128_SUPPORTED also used for `test_addmm_all_sparse_csr` and ` test_sparse_matmul` and both of them are skipped for ROCm by `@skipIfRocm` or `@skipCUDAIf(not _check_cusparse_spgemm_available())`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120504
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
test_compressed_layout_conversions_coverage verifies torch's conversions between different memory layouts using numpy as a reference. Since numpy doesn't support BSC format it just skipped that. Instead fake it by using a transposed BSR format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117951
Approved by: https://github.com/zou3519
Remove weird logic for designating matrices as transposed if sizes match(which always true if square matrices are multiplied with each other), which resulted in `torch.addmm` returns transposed matrix compared to `torch.mm`, see below:
```
% python -c "import torch;torch.set_default_device('mps');a=torch.eye(2);b=torch.arange(4.0).reshape(2, 2);print(a@b);print(torch.addmm(torch.zeros(2, 2), a,b))"
tensor([[0., 1.],
[2., 3.]], device='mps:0')
tensor([[0., 2.],
[1., 3.]], device='mps:0')
```
Fixes introduced to `torch.mm` in https://github.com/pytorch/pytorch/pull/77462 suggests that this is not needed
Modify `sample_inputs_addmm` to test `torch.addmm` with square matrices, but skip this config for `test_autograd_dense_output_addmm`, see https://github.com/pytorch/pytorch/issues/116565
TODO: probably tweak tolerances, as `test_output_match_addmm_cpu_float16` fails with 2x2 matrices, but passes using 3x3 ones with errors slightly exceeding the tolerance
Fixes https://github.com/pytorch/pytorch/issues/116331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116547
Approved by: https://github.com/albanD, https://github.com/Skylion007
`add` when passed one sparse and one dense argument will error if the
sparse argument does not have csr layout. This PR modifies the
underlying algorithm to be generic on the compressed dimension handling
both csr and csc. The functions are renamed to use the
`sparse_compressed` qualifier rather than `sparse_csr`
Fixes: #114807
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115433
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
ghstack dependencies: #115432
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
As in the title.
In addition:
- improve the algorithm for finding a minima of operation timings: break the inner loop early when a next minima candidate is found
- add tests and fix bugs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115499
Approved by: https://github.com/cpuhrsch