Commit Graph

82797 Commits

0a94bb432e [ROCm] CK Flash Attention Backend (#143695)
Replace https://github.com/pytorch/pytorch/pull/138947 for re-import.

Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with the composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"); similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option results in aotriton being used as the backend. In the case of CK, if PyTorch deems flash attention usable, it will take the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics that select which attention scheme to use (i.e., flash attention vs. memory-efficient attention vs. math, etc.); CK only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of @tridao's hard work; he is credited as a co-author.

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.
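
For illustration, a minimal sketch of switching backends at runtime (assumes a ROCm build of PyTorch compiled with both flags noted above):

```python
import torch

# Route flash attention through the composable_kernel (CK) backend
torch.backends.cuda.preferred_rocm_fa_library("ck")

# Revert to the incumbent aotriton backend ("default" behaves the same)
torch.backends.cuda.preferred_rocm_fa_library("aotriton")
```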

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695
Approved by: https://github.com/malfet

Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-01-03 22:01:36 +00:00
3251171ae8 Make whl metadata publicly readable (#144164)
After https://github.com/pytorch/pytorch/pull/143677/files#r1902138480 landed, the new nightly wheel metadata is not publicly readable, causing pip install to fail; for example, https://github.com/pytorch/pytorch/actions/runs/12603415308/job/35128414909.

FBGEMM folks also noticed this failure on their end (cc @q10)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144164
Approved by: https://github.com/clee2000
2025-01-03 21:08:49 +00:00
9bf2a9a616 [ScaledMM] Fix NaNs in test for garbage input data (#144042)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144042
Approved by: https://github.com/janeyx99
2025-01-03 21:02:25 +00:00
b75f32b848 Update TorchDynamo-based ONNX Exporter memory usage example code. (#144139)
Addresses review comments raised earlier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144139
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-01-03 20:41:36 +00:00
64bffb3124 remove allow-untyped-defs onnx/_internal/exporter/_fx_passes.py (#144134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144134
Approved by: https://github.com/Skylion007
2025-01-03 20:18:40 +00:00
64b197b603 remove allow-untyped-defs from export/_remove_auto_functionalized_pass.py (#144135)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144135
Approved by: https://github.com/Skylion007
2025-01-03 20:08:11 +00:00
9b8a4e7141 remove allow-untyped-defs from torch/onnx/operators.py (#144133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144133
Approved by: https://github.com/Skylion007
2025-01-03 20:07:56 +00:00
6e09d32c00 remove allow-untyped-defs from torch/jit/_passes/_property_propagation.py (#144132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144132
Approved by: https://github.com/Skylion007
2025-01-03 20:07:37 +00:00
eb7a303d21 [dtensor] expose the __create_chunk_list__ in the doc (#144100)
as titled, this PR exposes this dunder method as a public API in the doc,
so that different checkpoint implementations can leverage this protocol,
instead of exposing a separate API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144100
Approved by: https://github.com/awgu
ghstack dependencies: #144099
2025-01-03 20:06:23 +00:00
45411d1fc9 Use absolute path path.resolve() -> path.absolute() (#129409)
Changes:

1. Always make `.absolute()` explicit: `Path(__file__)` -> `Path(__file__).absolute()`
2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory (see the sketch below).
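
A quick illustration of the difference (illustrative paths, not from the PR):

```python
from pathlib import Path

p = Path("pkg/../setup.py")
print(p.absolute())  # <cwd>/pkg/../setup.py -- prefixes cwd, keeps ".." as-is
print(p.resolve())   # <cwd>/setup.py -- also normalizes ".." and follows symlinks
```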

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409
Approved by: https://github.com/albanD
2025-01-03 20:03:40 +00:00
e9e18a9617 remove allow-untyped-defs from _export/db/logging.py (#144093)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144093
Approved by: https://github.com/Skylion007
2025-01-03 19:36:14 +00:00
ad09395674 [MPSInductor] Fix multi rangevar kernel invocation (#144050)
By changing the `thread_position_in_grid` type to uint{n} and passing
dimensions during the kernel call

`pytest test/inductor/test_torchinductor.py -k _mps` score is 445 failed, 309 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144050
Approved by: https://github.com/jansel
ghstack dependencies: #144055, #144051, #144122, #144105, #144156
2025-01-03 19:32:43 +00:00
52e107a7ca [MPSInductor] Add constant, isinf and isnan ops (#144156)
Per Table 6.5 of the [Metal Shading Language Specification](https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf), infinity is `HUGE_VALF`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144156
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #144055, #144051, #144122, #144105
2025-01-03 19:32:43 +00:00
383ff4011c [ez] Use strip for arg sanitization in upload_metadata_file to improve readability (#144155)
Minor thing that improves readability. I didn't realize you could specify characters for `strip` when I wrote this.
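
A hypothetical before/after of `str.strip` with an explicit character set (not the PR's actual code):

```python
s = '"quoted-arg"\n'
# strip accepts a set of characters to remove from both ends
assert s.strip().strip('"') == s.strip('"\n ') == "quoted-arg"
```
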
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144155
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-03 19:25:30 +00:00
8b3479e361 remove allow-untyped-defs from torch/distributed/fsdp/_dynamo_utils.py (#144131)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144131
Approved by: https://github.com/Skylion007
2025-01-03 19:07:21 +00:00
7b69f7b449 Clarify what we mean by decoupled weight decay in the *AdamWs (#144101)
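
For context, a rough sketch of the distinction the docs now spell out (cf. Loshchilov & Hutter, "Decoupled Weight Decay Regularization"; illustrative names, not the optimizers' actual code):

```python
import torch

lr, wd, eps = 1e-3, 1e-2, 1e-8
param, grad = torch.randn(4), torch.randn(4)
m_hat, v_hat = grad.clone(), grad.pow(2)  # stand-ins for Adam's bias-corrected moments

# Coupled (Adam + L2 penalty): decay is folded into the gradient, so the
# adaptive denominator rescales it like any other gradient term.
coupled = param - lr * (m_hat + wd * param) / (v_hat.sqrt() + eps)

# Decoupled (AdamW): decay is applied straight to the parameters and never
# passes through the adaptive scaling.
decoupled = param - lr * m_hat / (v_hat.sqrt() + eps) - lr * wd * param
```
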
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144101
Approved by: https://github.com/albanD
2025-01-03 19:06:00 +00:00
c36f94b373 [while_loop][dynamo] auto-unspecialize int input and output to unbacked symints (#143106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143106
Approved by: https://github.com/zou3519
ghstack dependencies: #143105, #143545
2025-01-03 19:01:07 +00:00
5660709856 [hop][BE] unify meta checking with check_meta_consistency (#143545)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143545
Approved by: https://github.com/zou3519
ghstack dependencies: #143105
2025-01-03 19:01:07 +00:00
6e8dca9ff3 [while_loop][aot] auto-unspecialize int input and output to unbacked symints (#143105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143105
Approved by: https://github.com/zou3519
2025-01-03 19:01:07 +00:00
56f6289f6a [mps/inductor] Add support for atanh(). (#144121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144121
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-03 18:55:05 +00:00
a7b61c5b49 [MPSInductor] Add signbit op support (#144105)
By mapping it to `metal::signbit`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144105
Approved by: https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #144055, #144051, #144122
2025-01-03 18:34:46 +00:00
8d63a4a409 Revert "Set enable_trace_contextlib_contextmanager flag to True (#140604)"
This reverts commit 1c817fe6714cec510ccc6022b2c3e66146c3ad59.

Reverted https://github.com/pytorch/pytorch/pull/140604 on behalf of https://github.com/guilhermeleobas due to breaking one of the benchmarks (moco) ([comment](https://github.com/pytorch/pytorch/pull/140604#issuecomment-2569640837))
2025-01-03 18:23:53 +00:00
c5c897c3a1 [dynamo][easy] Miscellaneous fixes (#144141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144141
Approved by: https://github.com/williamwen42
ghstack dependencies: #144129, #144130
2025-01-03 18:22:56 +00:00
732359c633 [dynamo][easy] Minor fixes in guards.cpp (#144130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144130
Approved by: https://github.com/williamwen42
ghstack dependencies: #144129
2025-01-03 18:22:56 +00:00
a450e177fd [dynamo] remove inline inbuilt tests as flag is enabled by default (#144129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144129
Approved by: https://github.com/williamwen42
2025-01-03 18:22:56 +00:00
2409b49a33 Revert "Rewrite _reparametrize_module to use contextmanager (#138203)"
This reverts commit 7bf3b7cdc5631f9991eebcdd8ec09095339a9973.

Reverted https://github.com/pytorch/pytorch/pull/138203 on behalf of https://github.com/guilhermeleobas due to breaking one of the benchmarks (moco) ([comment](https://github.com/pytorch/pytorch/pull/138203#issuecomment-2569634001))
2025-01-03 18:17:32 +00:00
60fe8a65af [Inductor] Generalize tiling algorithm to handle fused reductions (#144041)
# Issue

This PR cleans up an edge case that wasn't handled by https://github.com/pytorch/pytorch/pull/137243. The existing tiling code assumes that `node.get_ranges()` is a reliable source of pointwise and reduction numels. This is true for pointwise kernels, but the situation is more complicated with reductions. Since reductions change the number of elements in a tensor, not all ops within a reduction kernel will have the same number of iterations. For example, `var_mean` fuses pointwise division with the output of reduction sum, and the division lacks the corresponding reduction ranges.

# Fix

Instead of getting numels from `node.get_ranges()`, explicitly pass the global pointwise and reduction numels to the relevant tiling functions. In `SIMDKernel.complete_partial_tiling`, we solve for the missing numel by dividing the global numel by the partial tiling's numel. This ensures all tilings have the correct global numel.

Also, in `SIMDKernel.is_compatible`, add the global reduction numel to node ranges that are missing it. For example, `{"x": 8, "r0_": 8}` is compatible with a node of ranges `([8], [])` when we have `reduction_numel=8`.
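
A hypothetical sketch of the division step (illustrative helper, not Inductor's actual code):

```python
# Given the global numel and a partial tiling, solve for the missing
# reduction size so every tiling covers the correct global numel.
def complete_tiling(global_numel: int, partial: dict[str, int]) -> dict[str, int]:
    known = 1
    for size in partial.values():
        known *= size
    assert global_numel % known == 0, "partial tiling must divide the global numel"
    return {**partial, "r0_": global_numel // known}

assert complete_tiling(64, {"x": 8}) == {"x": 8, "r0_": 8}
```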

Finally, this PR generalizes some of the existing codegen to handle multiple reduction dims. We already had code to ignore reduction splits for pointwise kernels, but it only worked for 1D reductions. Now it can handle ND.

# Test plan

This PR parametrizes the existing CI test for `var_mean` to also run with tiled reductions. It also adds a new test checking that `var_mean` generates 2D tilings (with tiled reduction enabled). These new tests would fail on the current main branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144041
Approved by: https://github.com/jansel
2025-01-03 18:16:27 +00:00
e93f625d00 [AOTI] don't codegen autotune_at_compile_time for non-Triton kernels (#143990)
`autotune_at_compile_time` is a separate codegen file specifically for autotuning Triton kernels. We can skip it for non-Triton kernels (like CUTLASS).

This test (test_aoti_workspace_ptr) checks that `workspace_0.data_ptr()` is codegen-ed correctly in AOTI.

```
// in AOTI codegen
kernels.cuda_fused_0(
  (const half*)arg0_1.data_ptr(), (const half*)arg1_1.data_ptr(), (half*)buf0.data_ptr(),
  (int)200, (int)5216, (int)10432, (int)10432, (int)5216, (int)0, (int)5216,
  (size_t*)nullptr, (uint8_t*)workspace_0.data_ptr(), stream);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143990
Approved by: https://github.com/henrylhtsang, https://github.com/chenyang78, https://github.com/desertfire
2025-01-03 18:01:12 +00:00
f3968373c1 Migrate the rest of CUDA 12.1 jobs to 12.4 (#144118)
CUDA 12.4 is the default now and we don't build nightly 12.1 anymore, so it's time to move the rest of the CI jobs to 12.4. I also cleaned up some redundant CI jobs on periodic and inductor-periodic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144118
Approved by: https://github.com/atalman
2025-01-03 17:45:41 +00:00
cbdc70ae07 Use the build environment as sccache prefix instead of workflow name (#144112)
This is an attempt to improve cache usage for jobs in non-pull workflows like periodic, slow, or inductor, as we are seeing build timeouts there from time to time, for example https://github.com/pytorch/pytorch/actions/runs/12553928804. Build timeouts never happen in pull or trunk AFAICT because those workflows are more up to date with the cache content coming from the PR itself.

Logically, the same build should use the same cache regardless of the workflows.  We have many examples where the same build, for example [linux-focal-cuda12.4-py3.10-gcc9-sm86](https://github.com/search?q=repo%3Apytorch%2Fpytorch+linux-focal-cuda12.4-py3.10-gcc9-sm86&type=code), is split between different workflows and, thus, uses different caches.

I could gather some sccache stats from CH in the meantime to try to prove the improvement before and after this lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144112
Approved by: https://github.com/malfet
2025-01-03 17:33:03 +00:00
b9fbd65dfd AOTI fallback ops: remove ops that were never codegen'ed (#143421)
Removes 4 fallback ops that are currently not possible to codegen, which does not break ABI-compatibility.

1. `_cudnn_rnn_backward` and `_histogramdd_bin_edges` both return `Tensor[]`, which we cannot codegen with the current design.
2. `_sparse_coo_tensor_with_dims_and_tensors` only supplies a Sparse operator, which we don't support.
3. `zeros.names` requires a `Dimname` input, which we can't currently codegen.

Removing these ops from the list will improve test performance, since the fallback op generation will use the Python proxy executor instead of calling non-existent C functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143421
Approved by: https://github.com/desertfire
ghstack dependencies: #141371, #143223
2025-01-03 16:05:38 +00:00
b5b419d627 cpp_wrapper: Use runtime dispatched fallbacks for complex ops (#143223)
When calling a fallback op in cpp_wrapper mode, where any of the inputs are complex numbers, utilize the runtime dispatched fallback mode. This properly handles the Conjugate and Negative dispatch keys, if present, in exchange for a performance pessimization in complex arithmetic.
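
For background, the lazy conjugate bit that a runtime-dispatched fallback has to resolve (a generic illustration, not this PR's tests):

```python
import torch

x = torch.randn(3, dtype=torch.complex64)
y = x.conj()        # lazy: sets the Conjugate dispatch key instead of copying
print(y.is_conj())  # True -- fallback kernels must handle or materialize this
print(y.resolve_conj().is_conj())  # False -- an eagerly materialized copy
```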

This PR additionally fixes some cascading failure modes exposed in our `aot_inductor` tests by this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143223
Approved by: https://github.com/desertfire
ghstack dependencies: #141371
2025-01-03 16:05:38 +00:00
e88d06f54e ir.ExternKernel: correctly handle kwarg default arguments (#141371)
Additionally, enable torchinductor opinfo tests exercising all
previously fixed bugs in this stack.

Note: I've manually sharded the cpp_wrapper CI checks into 2 shards.
Once all OpInfo tests are enabled we should switch back to automatic
sharding, but until then the pipeline doesn't have appropriate timing
stats.  More shards would be helpful given the compilation slowdown
associated with cpp_wrapper, but 2 will do for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141371
Approved by: https://github.com/desertfire
2025-01-03 16:05:31 +00:00
f7644efa79 [MPSInductor][EZ] Fix logical_[or|and] ops (#144122)
For boolean operands it does not really matter whether `&` or `&&` is
used, but if one were ever to rely on operator precedence, note that bitwise
ops have higher precedence than logical ones

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144122
Approved by: https://github.com/huydhn
ghstack dependencies: #144055, #144051
2025-01-03 15:28:07 +00:00
b336d72dae [MPSInductor] Preserve dtype during load (#144051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144051
Approved by: https://github.com/Skylion007
ghstack dependencies: #144055
2025-01-03 15:17:33 +00:00
a1ae8fadc7 [cpu][vec] support reduce ops for add and max (#144065)
### Description

During the support of INT8 SDPA in https://github.com/pytorch/ao/pull/1372, we found that `at::vec::vec_reduce_all<int32_t>` would fall into the slow scalar path when doing sum and max. So here, we support the two reduce-related ops `reduce_add` and `reduce_max` for `vec512` and `vec256`, using the Sequence instructions.

### Details
- Support vectorized `reduce_add` and `reduce_max` for dtypes `int32` and `float32`, using the Sequence instructions;
- Implement the scalar version for the fallback path in the vec base;
- Add the operator `reduce` in the vec base, in order to simplify the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144065
Approved by: https://github.com/mingfeima
2025-01-03 13:01:52 +00:00
55dc61dd52 Dataloader distribute tasks to workers when in_order is False (#142324)
Fixes #105203 and is a follow up PR to #141833

When `in_order` is True (the default), tasks are handed out to workers in a round-robin fashion. When `in_order` is False this is no longer needed: we give up guarantees of reproducibility, and tasks should instead go to workers that are able to perform work.
In this PR I've added tracking of the number of outstanding tasks for each worker (updated when tasks are added to its queue, and when data is returned to the main thread). When finding the next queue to add a task to, if `in_order` is False it will only add the task to a worker's queue if that worker has fewer than `_prefetch_factor` tasks outstanding.
The current default behaviour is left as is.
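
A minimal usage sketch (assumes a PyTorch build where `DataLoader` accepts the `in_order` flag):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class Uneven(Dataset):
    def __len__(self):
        return 8
    def __getitem__(self, i):
        time.sleep(0.2 if i % 2 else 0.0)  # make odd items much slower
        return i

if __name__ == "__main__":
    loader = DataLoader(Uneven(), num_workers=2, prefetch_factor=2, in_order=False)
    print([int(b) for b in loader])  # sample order is no longer guaranteed
```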

Tests are also updated to assert on the worker IDs for each sample of data returned.
I've run the following to confirm they aren't flaky
```bash
for i in {1..20}; do python test/test_dataloader.py TestOutOfOrderDataLoader; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142324
Approved by: https://github.com/andrewkho
2025-01-03 12:57:04 +00:00
c09bf71bd6 [Inductor][CPU] Fix C++ compile error of torch.max on bool type (#143848)
Fix https://github.com/pytorch/pytorch/issues/143568
Before:
![image](https://github.com/user-attachments/assets/3e1e869e-7ae7-45c0-a334-8a663028e003)
After:
![image](https://github.com/user-attachments/assets/91f72920-64bd-449a-a6c6-6048409c1450)
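
A hypothetical repro along the lines of the linked issue (exact input assumed):

```python
import torch

@torch.compile  # Inductor CPU backend
def f(x):
    return torch.max(x)

print(f(torch.tensor([True, False, True])))  # previously hit a C++ compile error
```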

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143848
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-01-03 09:00:43 +00:00
d9507548d8 [dynamo][BE] move zip_longest polyfill to submodule polyfills.itertools (#144067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144067
Approved by: https://github.com/yanboliang
ghstack dependencies: #144066
2025-01-03 08:08:31 +00:00
fb1beb31d2 [dynamo][BE] move dropwhile polyfill to submodule polyfills.itertools (#144066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144066
Approved by: https://github.com/jansel
2025-01-03 08:08:31 +00:00
00df63f09f [ROCm] Fix for ld failed to convert GOTPCREL relocation in PyTorch build (#143986)
I experienced an error while doing a DEBUG build of PyTorch on ROCm:
```
additional relocation overflows omitted from the output
/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
```
Based on discussions in the similar issue #138427, I fixed it by adding `--offload-compress` to the HIP_HIPCC_FLAGS, which let the DEBUG build succeed on my node.

Further updated the PR to enable the flag for non-DEBUG builds as well due to the size reduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143986
Approved by: https://github.com/jeffdaily
2025-01-03 06:53:08 +00:00
e141cb9c34 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2025-01-03 05:41:06 +00:00
48a05ee773 [dtensor] improve doc of the DTensor class (#144099)
as titled: explicitly list all public members to make sure the public
API stays consistent, and use groupwise member ordering to make the doc
look better

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144099
Approved by: https://github.com/awgu
2025-01-03 05:35:44 +00:00
41b5c600df [ReduceOps] Add dimension checking for cummin()/cummax(). (#143920)
Summary: cum{min,max} didn't guard against 0-d tensors and allowed an arbitrary dimension to be passed.

Test Plan: torch_test.py

Fixes #71477
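
A hypothetical illustration of the new guard (exact error type assumed):

```python
import torch

x = torch.tensor(3.14)         # 0-d tensor
print(torch.cummax(x, dim=0))  # dim 0 is still valid for a 0-d tensor

try:
    torch.cummax(x, dim=2)     # previously accepted silently
except IndexError as e:
    print("now rejected:", e)
```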

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143920
Approved by: https://github.com/malfet
2025-01-03 04:14:33 +00:00
c5b75f8db1 [AOTI] Remove more AOTI_TORCH_EXPORT (#144081)
Summary: Similar to https://github.com/pytorch/pytorch/pull/142500, remove redundant AOTI_TORCH_EXPORT from several cpp files, to solve a Windows build issue.

Differential Revision: D67762069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144081
Approved by: https://github.com/yushangdi
2025-01-03 02:17:38 +00:00
c31912666e [ROCm] Print amdgpu info on bare metal for CI runners (#144038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144038
Approved by: https://github.com/jeffdaily
2025-01-03 02:00:40 +00:00
37e9da0687 [ROCm][Windows] Disable roctracer-related code (#143329)
Currently, the roctracer is not available for Windows. This PR disables any mention of its usage on Windows, and creates dummy functions for Windows that keep compatibility with existing code but warn the user that roctracer is unavailable on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143329
Approved by: https://github.com/sraikund16
2025-01-03 01:51:01 +00:00
891a86d1ad remove allow-untyped-defs from ao/quantization/experimental/fake_quantize.py (#144091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144091
Approved by: https://github.com/aorenste
2025-01-03 01:26:36 +00:00
377e29745f remove allow-untyped-defs from distributed/elastic/utils/data/cycling_iterator.py (#144090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144090
Approved by: https://github.com/aorenste
2025-01-03 01:22:50 +00:00
0d6db839a7 remove allow-untyped-defs from utils/data/datapipes/iter/streamreader.py (#144088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144088
Approved by: https://github.com/aorenste
2025-01-03 01:21:44 +00:00