For https://github.com/pytorch/pytorch/issues/114850, we will port aten unit tests to Intel GPU. This PR will work on some test case of test/test_ops.py. We could enable Intel GPU with following methods and try the best to keep the original code styles:
1. Extended XPUTestBase.get_all_devices to support multiple devices
2. Added skipXPU decorator
3. Extended onlyOn to support device list
4. Enabled 'xpu' for some test pathes
5. Added allow_xpu=True for supported test class.
6. Replaced onlyCUDA with onlyOn(['cuda', 'xpu']) for supported tests
7. Use skipIfXpu and skipXPU to disable unsupported test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159944
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
These decompositions take precedence before CIA decomps in fake tensor prop, as a result, we would hit this implementation for all where overloads which is wrong in some cases. For the overloads that can't be implemented by this decomp, we just run the default CIA impl. Previously this doesn't matter because in post-dispatch IR, aten.where would have decomposed but when user tries to preserve aten.where this issue will surface because fake tensor will start seeing aten.where.
Differential Revision: [D82604702](https://our.internmc.facebook.com/intern/diff/D82604702)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163138
Approved by: https://github.com/henryoier, https://github.com/ezyang
This PR changes the behavior for compile wrapped op tests:
- supported_but_unclaimed_forward
- supported_but_unclaimed_backward
These typically manifest when the op doesn't support inputs of certain dtypes. But under torch.compile, Dynamo/AOTAutograd will trace the graph with FakeTensors, which @ezyang and @eellison tell me need to run decomps before op dispatch. The decomp may map this test to a different op, one that does support the dtype. I suspect all of our failures here are due to decomps, and so I propose to just disable this check for compile.
~~TODO: re-enable all the failed tests.~~ jk there were no failed tests outside of compiled autograd due to this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159976
Approved by: https://github.com/ezyang
In this approach, we are catching any lane within a wave that is doing fastatomics to the same destination address and computing the sum on the CU. This is leading to 3x improvement in scatter_add performance and 2x improvement in index_select.
scatter_add performance on MI300x:
dtype|Baseline (before optimizations)|opportunistic fastatomics
-------|----------------------------------|----------------------------------
f32|1.389425039|0.430447996
fp16|2.195472956|0.779729486
bf16|2.194051027|0.784599513
Using the following reproducer
```
import torch
import triton
def main():
dtype = torch.float32
dim = 1305301
a = torch.rand(100, device="cuda", dtype=dtype)
index = torch.randint(0, 100, (dim,), device="cuda")
src = torch.rand(dim, device="cuda", dtype=dtype)
print("=" * 20)
print(
triton.testing.do_bench(
lambda: a.scatter_add(0, index, src),
return_mode="median",
)
)
print("=" * 20)
if __name__ == "__main__":
main()
```
co-authored by: @amd-hhashemi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146264
Approved by: https://github.com/jeffdaily, https://github.com/mxz297
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
We have a failing unit test on Aarch64
```
Exception: Caused by reference input at index 34: SampleInput(input=Tensor[size=(5, 5, 4), device="cpu", dtype=torch.complex64, contiguous=False], args=(), kwargs={}, broadcasts_input=False, name='')
To execute this test, run the following from the base repo dir:
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=34 python test/test_ops.py TestCommonCPU.test_python_ref__refs_square_cpu_complex64
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
After debugging it I found that `ex` variable is not being reset to None on each loop inside _ref_test_helper. Which after fixing, highlighted another expectedFailure to reenable - `nn.functional.hinge_embedding_loss` which was incorrectly being skipped due to the same problem.
4a545eb85d/test/test_ops.py (L546)
ex variable is not reset after this for next loop iteration
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146597
Approved by: https://github.com/digantdesai
Ever since #135140, this test will fail if run with CPU parameterization (e.g. test_out__refs_logical_or_cpu_float32) and CUDA available - as far as I can tell, the PyTorch CI isn't currently checking for this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137140
Approved by: https://github.com/ezyang
Tracking issue: #138399
This PR changes the `pow` C++ implementation, making its C++ meta kernel consistent with
its Python ref implementation. The following example shows the inconsistency between the
two:
```python
def run(device):
S = (5,)
a = torch.rand(S, device=device, dtype=torch.float32)
b = 2
out = torch.empty(S, device=device, dtype=torch.float64)
return torch.pow(a, b, out=out)
>>> run("cpu")
Traceback (most recent call last):
File "test.py", line 34, in run
return torch.pow(a, b, out=out)
RuntimeError: Found dtype Double but expected Float
>>> run("meta")
tensor(..., device='meta', size=(5,), dtype=torch.float64)
```
**~Update:~**
~Note that this happens only for `pow.Tensor_Scalar` overloads. Therefore, this PR needed
further 2 modifications:~
- ~Split the `pow` ref implementation, making `pow.Tensor_Scalar` error on mismatching
output dtypes~
- ~Create a dispatch for `pow` when `_refs.pow()` is called~
**Update:**
Changing the `TensorIteratorConfig` for `pow.Tensor_Scalar` was easier and,
after the discussion below, more correct. The solution was to change the
`TensorIteratorBase::build_output_borrowing_argument_owning_unary_op` function,
setting:
- `cast_common_dtype_to_outputs`; and
- `enforce_safe_casting_to_output`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140287
Approved by: https://github.com/ezyang
Tracking issue: #138399
This PR fixes a number of reference implementations (which are also used as meta
functions), making them more consistent with CPU device. More specifically, it fixes those
operations that use `_make_elementwise_unary_reference` decorator, and don't error on
mismatching out argument dtype while they error when using concrete devices (e.g. CPU).
The fixed operations are:
- `abs`
- `ceil`
- `floor`
- `frac`
- `isneginf`
- `isposinf`
- `sgn`
- `sign`
- `signbit`
- `trunc`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140288
Approved by: https://github.com/ezyang
ghstack dependencies: #140186, #140286
This PR replaces the parameter names specified in the `triangular_solve_meta`
function (specifically in its `@out_wrapper(...)` decorator) by those written in the
_native_functions.yaml_ file.
This name mismatch caused the operation to fail when using the meta device (see error
below):
```python
Traceback (most recent call last):
File "examples/test.py", line 23, in <module>
torch.triangular_solve(b.to("meta"), A.to("meta"), out=meta_out)
File "torch/_decomp/__init__.py", line 100, in _fn
return f(*args, **kwargs, out=None if is_none else out_kwargs)
File "torch/_prims_common/wrappers.py", line 289, in _fn
result = fn(*args, **kwargs)
TypeError: triangular_solve_meta() got an unexpected keyword argument 'X'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140186
Approved by: https://github.com/ezyang
Reference: https://github.com/pytorch/pytorch/issues/138399
This PR introduces an `OpInfo` test that checks whether running each `out=` operation
using meta inputs is consistent with using concrete (e.g. CPU) inputs. More specifically,
it tests the case where the output tensors are not of the expected data type. According to
the `out=` specification, some operations should error.
I have added XFAIL to the set of operations that are currently failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138515
Approved by: https://github.com/ezyang
Reference: https://github.com/pytorch/pytorch/issues/138399
This PR introduces an `OpInfo` test that checks whether running each `out=` operation
using meta inputs is consistent with using concrete (e.g. CPU) inputs. More specifically,
it tests the case where the output tensors are not of the expected data type. According to
the `out=` specification, some operations should error.
I have added XFAIL to the set of operations that are currently failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138515
Approved by: https://github.com/ezyang
See #121528 for additional context.
In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).
Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.
Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.
Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
torch.cuda.amp.autocast / torch.cpu.amp.autocast are deprecated and spew a ton of warnings when these tests run. This PR: Update to just use torch.amp.autocast(device).
Note: this uncovers a bug in the test: when `device` is CUDA, it actually shows up as "cuda:0" - so previously, this test was _always_ using `torch.cpu.amp.autocast` even for `cuda` device. This PR fixes this, and uncovers additional bugs in `pinverse` and `linalg.pinv`; `linalg.pinv` was already failing before on CPU, but now the test also catches failures on CUDA, (and this PR adds to the skipped-test list).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134291
Approved by: https://github.com/YuqingJ
Follow-up to #113118 and #124306.
Developed in coordination with the solution to https://github.com/microsoft/onnxscript/pull/1547
This PR adds the missing fake tensor implementation for `aten.unique_dim`, thus enabling tracing and compilation of `torch.unique` when `dim` is not None.
Local testing has proceeded with the following simple script (provided that one has checked out the changes in https://github.com/microsoft/onnxscript/pull/1547):
```python
import onnx
import onnxruntime as ort
import logging
import numpy as np
onnx_program = torch.onnx.dynamo_export(
lambda x: torch.unique(x,
dim=0,
return_inverse=True),
torch.arange(10),
export_options=torch.onnx.ExportOptions(
dynamic_shapes=True,
diagnostic_options=torch.onnx.DiagnosticOptions(
verbosity_level=logging.DEBUG)))
onnx_program.save("torch_unique.onnx")
onnx_inputs = onnx_program.adapt_torch_inputs_to_onnx(torch.arange(10))
onnx_outputs = onnx_program(*onnx_inputs)
loaded_onnx_program = onnx.load("torch_unique.onnx")
onnx.checker.check_model(loaded_onnx_program)
ort_session = ort.InferenceSession("torch_unique.onnx")
inputs = np.random.randint(0, 10, 10)
print(f"Inputs: {inputs}")
outputs = ort_session.run(None,
{
"l_x_": inputs
})
print(f"Outputs: {outputs}")
print("Success")
```
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126561
Approved by: https://github.com/ezyang
Update ruff to 0.4.1 .
This version fixes a lot false negatives/false positives, is 20-40% faster, and has various other bug fixes.
Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0
| Repository | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7 | 251.8 | 351.1 | 274.9 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang