Compare commits


3098 Commits

Author SHA1 Message Date
1e5f51ed2c Assorted fixes 2025-06-25 14:03:13 -07:00
2a738143b4 [DONT MERGE] Diffusion models benchmarking for compile time
ghstack-source-id: 1591c4dddcf9d828191ed7bb54ec98669e074816
Pull-Request: https://github.com/pytorch/pytorch/pull/155866
2025-06-20 22:12:55 -07:00
965f830bc9 [invoke_subgraph] Add config flag to control support of input aliasing
ghstack-source-id: 79d3bf9f22aecdaa5be6d95a2cb51d6b4d1a47a0
Pull-Request: https://github.com/pytorch/pytorch/pull/156450
2025-06-20 22:12:55 -07:00
a097a0d3b2 [dynamo] Guard eagerly on list objects to avoid guard on getitem index
ghstack-source-id: 2c5a7e61f7395361508f8f4ec1f3ab8b7449385a
Pull-Request: https://github.com/pytorch/pytorch/pull/156531
2025-06-20 22:12:54 -07:00
255c2b0d6c [compile] Release nested_compile_region API
ghstack-source-id: 913ef891c853836dfff75afb943359f6c9ad12db
Pull-Request: https://github.com/pytorch/pytorch/pull/156449
2025-06-20 22:12:54 -07:00
e98dd95446 [nativert] Move SerialGraphExecutor to PyTorch core (#156459)
Summary: `SerialGraphExecutor` inherits from `GraphExecutorBase` and executes all nodes in the graph in a serial manner

Test Plan:
CI

Rollback Plan:

Differential Revision: D76917966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156459
Approved by: https://github.com/zhxchen17, https://github.com/jingsh
2025-06-21 01:32:06 +00:00
a67eb1a0d6 [ez] remove unused functions (#156466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156466
Approved by: https://github.com/jingsh
2025-06-21 00:38:34 +00:00
2ee23175d9 [dynamo][guards] Catch exception and return false in the backend match (#156341)
It's difficult to write a test. I found this while debugging a segfault.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156341
Approved by: https://github.com/williamwen42
2025-06-21 00:13:26 +00:00
0f0c010714 [c10d] init_process_group supports index-only device id (#156214)
Before:
```
acc = torch.accelerator.current_accelerator()
if acc:
  local_idx = ...
  dist.init_process_group(
    device_id=torch.device(acc.type, local_idx)
  )
```
After:
```
dist.init_process_group(device_id=local_idx)
```

That is, `init_process_group` checks `torch.accelerator.current_accelerator()` internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156214
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-06-21 00:02:37 +00:00
fbbab794ef [ONNX] Implement Attention-23 (#156431)
Implement Attention-23 using sdpa and flexattention.

- I used copilot for this.
- Also updated the conversion logic to remove trailing None inputs.

@gramalingam @kunal-vaishnavi @titaiwangms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156431
Approved by: https://github.com/titaiwangms

Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 23:54:57 +00:00
0ad88a2224 Support environment var for autotune log (#156254)
Summary: Titled

Test Plan:
See the Sandcastle signal

Rollback Plan:

Differential Revision: D76860928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156254
Approved by: https://github.com/Mingming-Ding
2025-06-20 23:06:33 +00:00
6098209bff [BE][5/X] Phase out usage of use_max_autotune() (#156269)
These look to be the last call sites using `use_max_autotune(...)`, so remove them along with `use_max_autotune(...)` itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156269
Approved by: https://github.com/masnesral
2025-06-20 22:37:45 +00:00
5ab257c74c [invoke_subgraph] Make invoke_subgraph cacheable (#156448)
It's unclear to me what happens if the subgraph itself is not cacheable. IMO, there is nothing special about invoke_subgraph that would prevent caching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156448
Approved by: https://github.com/oulgen, https://github.com/zou3519
2025-06-20 21:20:23 +00:00
e2351f2dcf fix apparent copy-paste bug in log_softmax reduced-precision fp kernel (#156379)
This looks like a bug. Check whether trying to fix it breaks existing tests; if not, I will look into why no test coverage caught it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156379
Approved by: https://github.com/janeyx99
2025-06-20 20:54:53 +00:00
b8fc5e0c0d skip flaky test in CPython 3.13 tests (#155561)
Changed files:
* test_math.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155561
Approved by: https://github.com/zou3519
2025-06-20 20:25:35 +00:00
754c04aa06 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit 0aed855b2bde6d9bd045bb20cc24544a9f2fb72b.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/ezyang due to regresses functorch_maml_omniglot ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2992685744))
2025-06-20 20:18:24 +00:00
de1930a429 Add ONNX dynamo metadata documentation (#155816)
Describe auto-generated metadata when calling torch.onnx.export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155816
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 20:12:22 +00:00
a69e27ca5a Remove unused MultiKernelCall import from inductor codegen (#156158)
Since `MultiKernelCall` is now imported and used within `async_compile.multi_kernel`:

```
    def multi_kernel(self, *args, **kwargs) -> Any:
        from torch._inductor.codegen.multi_kernel import MultiKernelCall

        # no need to call this in parallel since the sub-kernels are already parallel tasks
        return MultiKernelCall(*args, **kwargs)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156158
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-06-20 19:55:24 +00:00
e5ea24fb27 [nativert] Move auto_functionalize_kernel (#156454)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:

fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Copied from original auto_functionalize Diff Summary D53776805:

This is a non-functional kernel implementation for auto_functionalize

In AutoFunctionalizeKernel, I directly call the underlying target without making a clone of the mutating inputs.

This mutates the input tensors in place, which is unsafe in general.

However, Sigmoid is not doing any graph optimization or node reordering at the moment, so it's OK to take this shortcut.

In the proper functional implementation, it will:

- make a clone of each mutating input tensor
- return these new tensor instances as AutoFunctionalizeKernel outputs

If the original exported program has some "bufferMutation" or "userInputMutation" fields, it will also need to honor such mutations in Sigmoid.

Test Plan: See internal for test plan

Differential Revision: D76926383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156454
Approved by: https://github.com/zhxchen17
2025-06-20 19:53:16 +00:00
eb331b59fe Add shim fallback for narrow (#156496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156496
Approved by: https://github.com/albanD
2025-06-20 19:47:00 +00:00
6ed85bfe6a Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-20 19:42:57 +00:00
ef6d2cee7a [BE][MPS] Refactor core matmul logic into matmul_core (#155969)
In preparation for adding integer addmm, move the matmul computation into a matmul_inner function.

Change the callstack from group_id, thread_id_in_group to thread_id, thread_id_in_group, which eliminates the need to calculate the index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
2025-06-20 18:54:38 +00:00
18e4c461fb Update index.md (#155143)
Related to: https://github.com/pytorch/pytorch/issues/152134
Update index.md to add language for Stable and Unstable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155143
Approved by: https://github.com/AlannaBurke, https://github.com/atalman

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-20 18:53:32 +00:00
502486d946 [PT2]Add weight and constant config path template (#156359)
Summary: As titled.

Test Plan:
N/A

Rollback Plan:

Differential Revision: D76925510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156359
Approved by: https://github.com/SherlockNoMad
2025-06-20 18:46:01 +00:00
4b6cbf528b Add C shim fallback for fill_ (#156245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156245
Approved by: https://github.com/desertfire
2025-06-20 18:45:48 +00:00
208ec60e72 Revert "[BE] Make Eigen an optional dependency (#155955)"
This reverts commit 1b50c12584909bda00009f4f0fd0d38ec792d019.

Reverted https://github.com/pytorch/pytorch/pull/155955 on behalf of https://github.com/atalman due to need to revert eigen test ([comment](https://github.com/pytorch/pytorch/pull/155955#issuecomment-2992512124))
2025-06-20 18:43:52 +00:00
d309cd1d50 Revert "[BE][MPS] Refactor core matmul logic into matmul_core (#155969)"
This reverts commit 769d754ab2469813a3b790ec58c25c466099dd3d.

Reverted https://github.com/pytorch/pytorch/pull/155969 on behalf of https://github.com/atalman due to need to revert eigen test ([comment](https://github.com/pytorch/pytorch/pull/155969#issuecomment-2992502683))
2025-06-20 18:40:38 +00:00
96d082d06b Revert "[InductorBench] Fix accuracy validation logic for MPS (#156385)"
This reverts commit 242eb19c8383b4b197963a8a564475d52c85ac66.

Reverted https://github.com/pytorch/pytorch/pull/156385 on behalf of https://github.com/malfet due to Has some bug in error handling ([comment](https://github.com/pytorch/pytorch/pull/156385#issuecomment-2992441769))
2025-06-20 18:17:18 +00:00
39270430c9 [inductor] force min num-split (off by default) (#155941)
This is a fix for the 10% QPS regression of some internal model (internal doc: [here](https://docs.google.com/document/d/19EiSZSS_SNUNfRg3jmevyrDs9nVpyvyGX_LHfiz-SbU/edit?tab=t.0#heading=h.dim0r28ztzu5) and [here](https://docs.google.com/document/d/1DjRWJPl1cgpceaj8YXTyw6FubGb43Vw-lTAETF9XXnI/edit?tab=t.0#heading=h.ld0vvn8o77sp) ).

The regression is caused by unrepresentative example inputs for compilation with dynamic shapes. While the general problem is hard to solve and requires more work, for this specific one there is a quick fix. When we compile LayerNormBackward with small xnumel and large rnumel, we do a split reduction. With unrepresentative inputs, rnumel may be in the 4K range and we pick a small num-split (9 in this specific case). Later on, when we get an input with larger rnumel (100K range; no recompile since dynamic shapes are enabled), the small num-split does not introduce enough parallelism and causes sub-optimal performance.

The quick fix is to force a minimum value for num_split. Let's say we split a reduction [xnumel, rnumel] into two in this order:
- [xnumel * num_split, rnumel / num_split]
- [xnumel, num_split]

A larger num_split always introduces more parallelism for kernel 1. It may result in more work in kernel 2, but if we set the minimum num_split to something not too large (like 256), each row of kernel 2 may still be reducible with a few or even a single warp, so there may be no slowdown for kernel 2.
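
For a concrete illustration of the shape arithmetic above, using the sizes from the benchmark below (numbers are only an example):

```python
# Illustrative only: shapes produced by the two-kernel split described above,
# using the benchmark sizes below (xnumel=512, rnumel=4096*1000).
xnumel, rnumel = 512, 4096 * 1000
for num_split in (9, 256):
    kernel1 = (xnumel * num_split, rnumel // num_split)  # [xnumel * num_split, rnumel / num_split]
    kernel2 = (xnumel, num_split)                        # [xnumel, num_split]
    print(num_split, kernel1, kernel2)
# num_split=9:   kernel1 = (4608, 455111),  kernel2 = (512, 9)
# num_split=256: kernel1 = (131072, 16000), kernel2 = (512, 256)
```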

Here are some benchmarking results.
```
import torch
from triton.testing import do_bench
import functools
from torch._inductor import config
from torch._dynamo.decorators import mark_dynamic
import os

@torch.compile(dynamic=True)
def f(x):
    return x.sum(dim=0)

N = 512
C = functools.partial(torch.randn, device="cuda")
x_small = C(4096, N)
x_large = C(4096 * 1000, N)

if os.getenv("HINT_WITH_SMALL_INPUT") == "1":
    x = x_small
else:
    x = x_large

mark_dynamic(x, 0)
f(x)

ms = do_bench(lambda: f(x_large))

# 4.03ms if hint with large input. Output code: https://gist.github.com/shunting314/0be562a0c14f8ec0852b12bbf53d7a15
# 8.32ms if hint with small input. Output code: https://gist.github.com/shunting314/79b924c266d5c562703c3bdfb48d8272
# 3.92ms if hint with small input, and force min num split: Output code: https://gist.github.com/shunting314/c82917a1849b698bf4d2be2fde2fd2ba
print(ms)
```
This test mimics what we see in the original problem.

- If we compile with large inputs and benchmark for large inputs, latency is 4.03ms
- if we compile with small input but benchmark for large inputs, we get more than 2x slowdown. latency is 8.32ms
- with the fix, even if we compile with small input and benchmark for large inputs, latency is 3.92ms. The perf is slightly better than the first case. So it's possible that the heuristic to decide num-split has room to improve

The minimum num-split restriction could be applied to the dynamic shape case only, but I found it can also help static shape cases a little bit. So I plan to apply it without checking for dynamic shapes for now, unless I see red signals in a thorough perf test.
- Outer reduction with static shape: https://gist.github.com/shunting314/6a670a818e63533479399c4dbea5b29a . The fix improve perf from 0.01 ms to 0.009 ms
- Inner reduction with static shape: https://gist.github.com/shunting314/f12f20099126130b953e55ad325c0f62  Perf is neutral (0.011 ms v.s. 0.011ms)

A thorough perf test is running here: https://github.com/pytorch/pytorch/actions/runs/15642912325

# Update for not applying the change to static shape:
from the perf test result [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2009%20Jun%202025%2020%3A57%3A15%20GMT&stopTime=Mon%2C%2016%20Jun%202025%2020%3A57%3A15%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=62b8e191e027842d402fb046a429732616f87570&rBranch=main&rCommit=5b9db4335e61c1c903cb0769282cbea588e49036), it looks like the change hurts perf for static shape case. I think one reason is the change may increase the number of kernels and lose some fusion opportunities. Check the following code for example:
```
import torch
from torch._inductor import config

aten = torch.ops.aten

def f(x):
    return aten.bernoulli(x).sum()

x = torch.randn(8000 * 3, dtype=torch.bfloat16, device="cuda")
torch.compile(f)(x)
```

With the change, the bernoulli kernel would NOT be able to fuse with the first-layer reduction because 8000 * 3 is not divisible by 256. Potentially we could improve the change to always pick a num-split that is greater than 256 and divides rnumel evenly. But I'll simply apply the change for dynamic shapes for now since that's the original issue.

Another perf test only applying min-num-split to dynamic shape [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2011%20Jun%202025%2018%3A14%3A04%20GMT&stopTime=Wed%2C%2018%20Jun%202025%2018%3A14%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=e7b2cf55f30a585acd4d907fc9127fcb30a256cc&rBranch=main&rCommit=d3d655ad14ee4cd1c135ac57bbf75d5623fc9fa6)

Differential Revision: [D76625617](https://our.internmc.facebook.com/intern/diff/D76625617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155941
Approved by: https://github.com/jansel, https://github.com/bobrenjc93
2025-06-20 18:01:28 +00:00
55dae0bf7a Add a basic shim and stable::Tensor is_contiguous API (#156228)
Add a limited is_contiguous in shim, stable::Tensor API with a test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156228
Approved by: https://github.com/desertfire
2025-06-20 17:59:52 +00:00
49ee1e7106 [CI] Reuse old whl: loosen check for deleted files, do not handle renames (#156138)
Make the check for deleted files apply only to files in the torch folder, since docs-only changes could not get through it otherwise.
Use `--no-renames` so that both the old name and the new name show up in the diff. Without it, I think only the new name shows up in git diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156138
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/cyyever
2025-06-20 17:58:04 +00:00
e31f205292 [Inductor] Adjust boundary checking of dimensions using YBLOCK (#149504)
Apply the same logic introduced in https://github.com/pytorch/pytorch/pull/139751 to triton kernels using block ptrs. Here, if ynumel / YBLOCK > max_y_grids, dimensions dependent on YBLOCK need to be boundary checked, even if the block shape in such dimensions is a multiple of an expression in YBLOCK. This is because ynumel / YBLOCK % get_max_y_grids() may not be zero, so redundant programs will be launched that will attempt to read / write OOB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149504
Approved by: https://github.com/blaine-rister

Co-authored-by: blaine-rister <145300525+blaine-rister@users.noreply.github.com>
2025-06-20 17:43:38 +00:00
d83ff89d3b Add toggle functionality for XPU profiler (#155135)
Fixes #154898 by adding the ability to toggle the XPU profiler on and off (which has already been added in pytorch/kineto#1088).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155135
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-06-20 17:27:48 +00:00
1b50c12584 [BE] Make Eigen an optional dependency (#155955)
Its version is controlled by `eigen_pin.txt`, and it will be installed only if no BLAS provider could be found.
Why this is good for CI: we don't really build with Eigen ever, and GitLab can be down while GitHub is up, which has caused spurious CI failures in the past.

Remove eigen submodule and replace it with eigen_pin.txt

Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
ghstack dependencies: #155947, #155954
2025-06-20 17:21:27 +00:00
63360e64da [BE][Easy] do not install yanked types-pkg-resources in lint environment (#156462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156462
Approved by: https://github.com/ezyang
2025-06-20 16:00:43 +00:00
1036f6d114 Revert "[ROCm] Bump AOTriton to 0.10b (#156290)"
This reverts commit 34d8e64ef64d88324092a2028884c54c13e086b3.

Reverted https://github.com/pytorch/pytorch/pull/156290 on behalf of https://github.com/atalman due to failing multiple internal tests ([comment](https://github.com/pytorch/pytorch/pull/156290#issuecomment-2992072727))
2025-06-20 15:35:25 +00:00
b4442f42a9 Revert "Upgrade to DLPack 1.0. (#145000)"
This reverts commit 6e185c53124e1b5a0fe391959060c1249178bcb6.

Reverted https://github.com/pytorch/pytorch/pull/145000 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/145000#issuecomment-2992055400))
2025-06-20 15:32:47 +00:00
edd45f3a02 Revert "[Precompile] Hook up backend="inductor" (#155387)"
This reverts commit 2c68c3e8d5e9a235f5861be6486de4959f80c840.

Reverted https://github.com/pytorch/pytorch/pull/155387 on behalf of https://github.com/atalman due to dynamo/test_precompile_context.py::PrecompileContextTests::test_basic [GH job link](https://github.com/pytorch/pytorch/actions/runs/15772892021/job/44464141039) [HUD commit link](2c68c3e8d5) ([comment](https://github.com/pytorch/pytorch/pull/155387#issuecomment-2992044073))
2025-06-20 15:30:04 +00:00
e1f28fe17b add device generalisation support for distributed tests (#152471)
### MOTIVATION
To generalize Distributed test cases for non-CUDA devices

### CHANGES

- test/distributed/optim/test_zero_redundancy_optimizer.py
- test/distributed/test_c10d_logger.py
- test/distributed/test_compute_comm_reordering.py

Replaced hard coded device names with get_devtype from torch.testing._internal.common_fsdp.
DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions.

- torch/testing/_internal/common_distributed.py

extended common utility functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471
Approved by: https://github.com/d4l3k
2025-06-20 07:35:42 +00:00
0aed855b2b [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-20 07:03:29 +00:00
24dc33b37b [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See the added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings were strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-20 07:03:29 +00:00
537b0877a8 [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-20 07:03:16 +00:00
2c372a0502 [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.
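
A minimal usage sketch based on the description above (assuming the API is exposed as `torch._dynamo.set_fullgraph` with a positional `fullgraph` argument, per #154782; details may differ):

```python
import torch
import torch._dynamo

def helper(x):
    # Temporarily allow graph breaks inside a fullgraph=True region.
    with torch._dynamo.set_fullgraph(False):
        torch._dynamo.graph_break()
    return x + 1

@torch.compile(fullgraph=True)
def fn(x):
    return helper(x) * 2

fn(torch.randn(4))
```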

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-20 07:03:07 +00:00
b46eb1ccaf [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-20 07:02:57 +00:00
2c68c3e8d5 [Precompile] Hook up backend="inductor" (#155387)
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.

One TODO to point out: in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCacheEntry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
2025-06-20 06:38:29 +00:00
d5b4a32960 [BE] fix PYPROJECT linting errors in test/ and tools/ (#156021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156021
Approved by: https://github.com/Skylion007
2025-06-20 06:19:05 +00:00
4cbbc8b458 [MPS] Implement backward pass for interpolate_trilinear (#156373)
The backward pass simply iterates over all 8 points the current point contributed to, and back-propagates them with the respective weights.

TODO: Benchmark the performance of a similar loop for the forward pass (i.e. the compiler should be able to do loop unrolling, so there is no point unrolling it by hand)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156373
Approved by: https://github.com/dcci
ghstack dependencies: #156375
2025-06-20 05:41:24 +00:00
c37ddcaefb Fix torchgen update-aoti-shim (#156323)
will remove the fill changes before landing and let Jane merge her changes!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156323
Approved by: https://github.com/janeyx99
2025-06-20 05:23:06 +00:00
f7a5ad6c29 [Inductor][CPP] Fix WOQ int4 accuracy issue when NC is larger than one (#156407)
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This accuracy issue is exposed by [PR #156174](https://github.com/pytorch/pytorch/pull/156174), which changes `block_N` from 64 to 32. This change increases the likelihood of `Nc_block` being greater than 1, making it more likely to trigger the issue. This PR will fix this accuracy issue.

**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156407
Approved by: https://github.com/CaoE
2025-06-20 03:08:02 +00:00
72c8751b61 Align meta deducing for fft_r2c with fft_r2c_mkl on XPU (#156048)
There is a memory layout mismatch between the `fft_r2c` XPU implementation and Inductor's meta deduction.
The original `fft_r2c` Inductor meta deduction for the XPU backend was aligned with the CPU (fallback) one. This PR corrects the Inductor meta deduction and updates the torch-xpu-ops commit to [intel/torch-xpu-ops@`3a9419c`](3a9419c8bb).
The XPU implementation first performs the R2C transform on the last dimension, followed by iterative C2C transforms on the remaining dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156048
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel
2025-06-20 01:41:03 +00:00
159a39ad34 Add an option for cpp_wrapper to compile entry and kernel separately (#156050)
Fixes #156037.
Compiling the entry and the kernel separately has a non-negligible impact on performance. This PR adds an option for cpp_wrapper to control whether to compile the entry and the kernel separately, and turns it off by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156050
Approved by: https://github.com/leslie-fang-intel, https://github.com/benjaminglass1, https://github.com/jansel
2025-06-20 01:11:16 +00:00
ebab279942 Forward fix inductor benchmark after #150287 (#156455)
Looks like https://github.com/pytorch/pytorch/pull/150287 stack fixed some inductor tests
HUD: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor-periodic%20%2F%20linux-jammy-cpu-py3.9-gcc11-inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156455
Approved by: https://github.com/huydhn
2025-06-20 00:04:15 +00:00
3c2324c64a [2/N] Fix cppcoreguidelines-init-variables suppression (#146237)
This PR removes all `cppcoreguidelines-init-variables` suppressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146237
Approved by: https://github.com/ezyang
2025-06-19 23:26:42 +00:00
52f873adc2 Add logging for async compile worker statistics (#155820)
Add some on-exit logging to the async compile workers. When you use `TORCH_LOGS=async_compile` (or `all`) it will now report how many workers were enqueued & dequeued (should be the same), as well as queuing time (how long workers sat on the queue before starting to run) and maximum depth (how many workers were waiting to start).

Tested manually by running a larger internal model and then lowering the number of available workers to see the time and depth get longer.
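
A minimal sketch of how to see these statistics (assuming `TORCH_LOGS` is read when torch initializes logging, so it is set before import):

```python
import os
# TORCH_LOGS is parsed when torch sets up logging, so set it before importing torch.
os.environ["TORCH_LOGS"] = "async_compile"

import torch

@torch.compile
def f(x):
    return x.sin() + x.cos()

f(torch.randn(1024))
# On exit, the async_compile log reports enqueued/dequeued worker counts,
# queuing time, and the maximum queue depth.
```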

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155820
Approved by: https://github.com/masnesral
2025-06-19 23:10:15 +00:00
c60d8188d2 [nativert] Move GraphExecutorBase to PyTorch core (#156196)
Summary:
Moves GraphExecutorBase class to PyTorch core.
GraphExecutorBase is a lightweight abstraction to execute a graph with execution frames without actually owning the graph or the weights. This is introduced to decouple the state management of the top-level runtime from the kernel executions so that subgraphs from higher-order ops can be supported.

Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
CI

Rollback Plan:

Differential Revision: D76830436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156196
Approved by: https://github.com/zhxchen17
2025-06-19 22:42:35 +00:00
34d8e64ef6 [ROCm] Bump AOTriton to 0.10b (#156290)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-06-19 21:13:58 +00:00
3644b41a7c [ONNX] Note on attention op symbolic function (#156441)
Follow up https://github.com/pytorch/pytorch/pull/156367
Explain why num_heads is provided even though the ONNX Attention op does not need it in the torch case. See the thread: https://github.com/pytorch/pytorch/pull/156367#discussion_r2155727038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156441
Approved by: https://github.com/justinchuby
2025-06-19 21:00:05 +00:00
443b5b43c3 xpu: fix AOT compilation in sycl cpp extension (#156364)
This commit fixes AOT compilation in the SYCL cpp extension, which was accidentally dropped in aca2c99a652 (causing a fallback to JIT compilation). It also fixes the override logic for default SYCL targets, allowing targets to be specified externally. Further, it extends test coverage to cover such a case and fixes an issue in the test where consecutive tests executed the same (first) compiled extension due to name conflicts.

Fixes: #156249
Fixes: aca2c99a652 ("xpu: get xpu arch flags at runtime in cpp_extensions (#152192)")

CC: @pengxin99, @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156364
Approved by: https://github.com/ezyang
2025-06-19 20:11:38 +00:00
d32deb664a [c10d] Disable NCCL NVLS when using deterministic mode (#156381)
via setting the env var `NCCL_ALGO=^NVLS`.

Note that this setting must be made before the first NCCL init. Otherwise, it won't take effect.
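
For illustration, a sketch of the ordering constraint (the PR applies the setting internally when deterministic mode is on; this is not required user code):

```python
import os
import torch
import torch.distributed as dist

# Illustrative only: the env var must be in place before the first NCCL
# communicator is created, i.e. before init_process_group / the first collective.
os.environ["NCCL_ALGO"] = "^NVLS"
torch.use_deterministic_algorithms(True)

dist.init_process_group(backend="nccl")
# Collectives issued after this point avoid the NVLS algorithm.
```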

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156381
Approved by: https://github.com/ngimel
2025-06-19 20:09:24 +00:00
69f2e09cc2 Add more shards to H100 benchmark, and also run it more frequently (#156429)
There are 32 H100 `linux.aws.h100` runners and they are still not fully utilized, with more than half staying idle, so we could add more shards to finish the whole suite within 4 hours. I add 1 more shard for `TIMM` and 3 more for `TorchBench`, using the durations from a sample run https://github.com/pytorch/pytorch/actions/runs/15753185459/job/44411825090

With this computing power, we could also run the whole suite every 4 hours now. I could run this less frequently later if I see queueing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156429
Approved by: https://github.com/atalman
2025-06-19 20:02:56 +00:00
aac0e8f0e9 [build] Create target for flash attention (#156235)
Create a target for flash attention so it can be built using `ninja flash_attention`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156235
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-06-19 20:02:38 +00:00
c2f4cc59a7 [MPS] Fix bug in 3d coords calculation (#156375)
Which was not caught by CI beforehand, as all 3D examples right now are symmetric, so add an uneven shape to `sample_inputs_interpolate`

Though it's indirectly tested by `test_upsample_nearest3d` inductor test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375
Approved by: https://github.com/atalman
2025-06-19 19:56:15 +00:00
c0ee01c2fb tools/nightly.py: only download torch via pip and install dependencies via uv (#156409)
Setup time (cpu-only): 70s -> 27.6s -> 17.4s

The tool can setup the pinned NVIDIA dependencies correctly:

```console
$ make setup-env-cuda PYTHON="${HOMEBREW_PREFIX}/bin/python3.13" && source venv/bin/activate
make setup-env PYTHON="/home/linuxbrew/.linuxbrew/bin/python3.13" NIGHTLY_TOOL_OPTS="pull --cuda"
make[1]: Entering directory '/home/PanXuehai/Projects/pytorch'
/home/linuxbrew/.linuxbrew/bin/python3.13 tools/nightly.py pull --cuda
log file: /home/PanXuehai/Projects/pytorch/nightly/log/2025-06-19_21h16m16s_94cd1471-4d0f-11f0-b120-b88584c06696/nightly.log
Creating virtual environment
Removing existing venv: /home/PanXuehai/Projects/pytorch/venv
Creating venv (Python 3.13.4): /home/PanXuehai/Projects/pytorch/venv
Installing packages
Upgrading package(s) (https://download.pytorch.org/whl/nightly/cu128):
  - uv
  - pip
  - setuptools
  - packaging
  - wheel
  - build[uv]
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://download.pytorch.org/whl/nightly/cu128
Collecting uv
  Using cached f2e96cec5e/uv-0.7.13-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.8 MB)
Requirement already satisfied: pip in ./venv/lib/python3.13/site-packages (25.1.1)
Collecting setuptools
  Using cached 17031897da/setuptools-80.9.0-py3-none-any.whl (1.2 MB)
Collecting packaging
  Using cached 38679034af/packaging-25.0-py3-none-any.whl (66 kB)
Collecting wheel
  Using cached 87f3254fd8/wheel-0.45.1-py3-none-any.whl (72 kB)
Collecting build[uv]
  Using cached 80633736cd/build-1.2.2.post1-py3-none-any.whl (22 kB)
Collecting pyproject_hooks (from build[uv])
  Using cached 12818598c3/pyproject_hooks-1.2.0-py3-none-any.whl (10 kB)
Installing collected packages: wheel, uv, setuptools, pyproject_hooks, packaging, build
Successfully installed build-1.2.2.post1 packaging-25.0 pyproject_hooks-1.2.0 setuptools-80.9.0 uv-0.7.13 wheel-0.45.1
Installing packages took 6.251 [s]
Creating virtual environment took 9.050 [s]
Downloading packages
Downloading package(s) (https://download.pytorch.org/whl/nightly/cu128): torch
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://download.pytorch.org/whl/nightly/cu128
Collecting torch
  Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250619%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB)
Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250619%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl (1040.3 MB)
Saved /tmp/pip-download-xeqmhrww/torch-2.8.0.dev20250619+cu128-cp313-cp313-manylinux_2_28_x86_64.whl
Successfully downloaded torch
Downloaded 1 file(s) to /tmp/pip-download-xeqmhrww:
  - torch-2.8.0.dev20250619+cu128-cp313-cp313-manylinux_2_28_x86_64.whl
Downloading packages took 6.284 [s]
Unpacking wheel file
Unpacking to: /tmp/wheel-kugk2os0/torch-2.8.0.dev20250619+cu128...OK
Unpacking wheel file took 15.107 [s]
Installing dependencies
Installing packages
Installing package(s) (https://download.pytorch.org/whl/nightly/cu128):
  - filelock
  - typing-extensions>=4.10.0
  - setuptools; python_version >= "3.12"
  - sympy>=1.13.3
  - networkx
  - jinja2
  - fsspec
  - nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cuda-runtime-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cuda-cupti-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cudnn-cu12==9.10.2.21; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cublas-cu12==12.8.4.1; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cufft-cu12==11.3.3.83; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-curand-cu12==10.3.9.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusolver-cu12==11.7.3.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusparse-cu12==12.5.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusparselt-cu12==0.7.1; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nccl-cu12==2.27.3; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvshmem-cu12==3.2.5; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvtx-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvjitlink-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cufile-cu12==1.13.1.3; platform_system == "Linux" and platform_machine == "x86_64"
  - pytorch-triton==3.3.1+gitc8757738; platform_system == "Linux"
  - numpy
  - cmake
  - ninja
  - packaging
  - ruff
  - mypy
  - pytest
  - hypothesis
  - ipython
  - rich
  - clang-format
  - clang-tidy
  - sphinx
Using Python 3.13.4 environment at: venv
Resolved 78 packages in 2.95s
Installed 76 packages in 93ms
 + alabaster==1.0.0
 + asttokens==3.0.0
 + attrs==24.2.0
 + babel==2.17.0
 + certifi==2024.8.30
 + charset-normalizer==3.3.2
 + clang-format==20.1.6
 + clang-tidy==20.1.0
 + cmake==3.25.0
 + decorator==5.2.1
 + docutils==0.21.2
 + executing==2.2.0
 + filelock==3.18.0
 + fsspec==2025.5.1
 + hypothesis==6.135.11
 + idna==3.10
 + imagesize==1.4.1
 + iniconfig==2.1.0
 + ipython==9.3.0
 + ipython-pygments-lexers==1.1.1
 + jedi==0.19.2
 + jinja2==3.1.6
 + markdown-it-py==3.0.0
 + markupsafe==2.1.5
 + matplotlib-inline==0.1.7
 + mdurl==0.1.2
 + mpmath==1.3.0
 + mypy==1.16.1
 + mypy-extensions==1.0.0
 + networkx==3.5
 + ninja==1.11.1.4
 + numpy==2.3.0
 + nvidia-cublas-cu12==12.8.4.1
 + nvidia-cuda-cupti-cu12==12.8.90
 + nvidia-cuda-nvrtc-cu12==12.8.93
 + nvidia-cuda-runtime-cu12==12.8.90
 + nvidia-cudnn-cu12==9.10.2.21
 + nvidia-cufft-cu12==11.3.3.83
 + nvidia-cufile-cu12==1.13.1.3
 + nvidia-curand-cu12==10.3.9.90
 + nvidia-cusolver-cu12==11.7.3.90
 + nvidia-cusparse-cu12==12.5.8.93
 + nvidia-cusparselt-cu12==0.7.1
 + nvidia-nccl-cu12==2.27.3
 + nvidia-nvjitlink-cu12==12.8.93
 + nvidia-nvshmem-cu12==3.2.5
 + nvidia-nvtx-cu12==12.8.90
 + parso==0.8.4
 + pathspec==0.12.1
 + pexpect==4.9.0
 + pluggy==1.6.0
 + prompt-toolkit==3.0.51
 + ptyprocess==0.7.0
 + pure-eval==0.2.3
 + pygments==2.19.1
 + pytest==8.4.1
 + pytorch-triton==3.3.1+gitc8757738
 + requests==2.32.3
 + rich==14.0.0
 + roman-numerals-py==3.1.0
 + ruff==0.12.0
 + snowballstemmer==3.0.1
 + sortedcontainers==2.4.0
 + sphinx==8.2.3
 + sphinxcontrib-applehelp==2.0.0
 + sphinxcontrib-devhelp==2.0.0
 + sphinxcontrib-htmlhelp==2.1.0
 + sphinxcontrib-jsmath==1.0.1
 + sphinxcontrib-qthelp==2.0.0
 + sphinxcontrib-serializinghtml==2.0.0
 + stack-data==0.6.3
 + sympy==1.14.0
 + traitlets==5.14.3
 + typing-extensions==4.14.0
 + urllib3==2.2.3
 + wcwidth==0.2.13
Installing packages took 3.080 [s]
Installing dependencies took 3.080 [s]
Pulling nightly PyTorch
Found released git version 5622038e20ddb12b9a011c9a9128190d71a21cba
Found nightly release version 2625c70aecc6eced1dbe108279feab7509733bef
Already up to date.
Pulling nightly PyTorch took 0.017 [s]
Moving nightly files into repo
Moving nightly files into repo took 4.898 [s]
Writing pytorch-nightly.pth
Writing pytorch-nightly.pth took 0.021 [s]
-------
PyTorch Development Environment set up!
Please activate to enable this environment:

  $ source /home/PanXuehai/Projects/pytorch/venv/bin/activate

make[1]: Leaving directory '/home/PanXuehai/Projects/pytorch'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156409
Approved by: https://github.com/ezyang
ghstack dependencies: #156408
2025-06-19 19:42:15 +00:00
71faa7e5b9 tools/nightly.py: use uv pip install instead of pip install (#156408)
Setup time: 70s -> 27.6s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156408
Approved by: https://github.com/ezyang
2025-06-19 19:42:15 +00:00
134dfb3fe6 [dynamo] Fix cycle reference problem caused by recursive collect_temp_source in codegen (#155791)
The recursive function collect_temp_source with a closure in PyCodegen caused a reference-cycle issue when torch.compile is used.
This issue may cause major tensors to not be freed in a timely manner even when there are no user references to these tensors.

We saw OOM issues because of this problem in many cases, including training and inference using torch.compile.
The fix is to replace the recursive implementation with an iterative one.
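
An illustrative sketch of the general pattern (not the actual PyCodegen code; the `Node` type is hypothetical): a recursive inner function references itself through a closure cell, forming a cycle that keeps captured objects alive until the garbage collector runs, while an iterative worklist avoids the cycle entirely.

```python
from dataclasses import dataclass, field

@dataclass
class Node:  # hypothetical node type, for illustration only
    value: int
    children: list = field(default_factory=list)

# Recursive version: `visit` refers to itself via a closure cell, so the
# function object, its cell, and anything it captures form a reference cycle
# that is only reclaimed by the garbage collector, not by refcounting.
def collect_recursive(root):
    seen = []

    def visit(node):
        seen.append(node.value)
        for child in node.children:
            visit(child)

    visit(root)
    return seen

# Iterative version: an explicit worklist, no closure, no cycle.
def collect_iterative(root):
    seen, stack = [], [root]
    while stack:
        node = stack.pop()
        seen.append(node.value)
        stack.extend(node.children)
    return seen
```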

Fixes #155778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155791
Approved by: https://github.com/ezyang
2025-06-19 19:37:44 +00:00
e4c9f6d9a2 [nativert] Move c10_kernel (#156208)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:

fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:c10_kernel_test
```

Differential Revision: D76825830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156208
Approved by: https://github.com/zhxchen17
2025-06-19 17:36:23 +00:00
f402eed4d9 [ROCm] Enable BF16 NCHW Mixed batchnorm on MIOpen if ROCm>=6.4 (#154611)
This PR enables MIOpen for BF16 NCHW Mixed batchnorm if MIOpen version >=3.4 (ROCm >= 6.4)

CUDAHooks::versionMIOpen() was added to detect MIOpen version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154611
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2025-06-19 17:22:37 +00:00
085f270a00 [ROCm] Enable more parallelism for multi-dimensional reductions (#155806)
Enable more parallelism for multi-dimensional reductions. In the case of multi-dimensional reductions, the grid often starts with a single active block. In such cases, we need to allow the parallelism to be extended along the y-direction of the grid to avoid having a single block running.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155806
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
2025-06-19 17:19:40 +00:00
eaf704914e [aoti] package weights to disk and dedup (#155241)
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponds to which weight name.

Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storage()`).
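
An illustrative sketch of storage-based dedup (not the actual packaging code; the file-naming scheme here is made up):

```python
import torch

def dedup_by_storage(named_weights):
    """Map each distinct underlying storage to one weight file, and each weight
    name to the file holding its storage (hypothetical file naming)."""
    storage_to_file = {}
    weights_config = {}
    for name, tensor in named_weights.items():
        key = tensor.untyped_storage().data_ptr()
        if key not in storage_to_file:
            storage_to_file[key] = f"data/weights/{name}.pt"
        weights_config[name] = storage_to_file[key]
    return storage_to_file, weights_config

# Two models sharing one weight: both names resolve to a single stored file.
shared = torch.randn(4, 4)
_, config = dedup_by_storage({"model_a.w": shared, "model_b.w": shared})
assert config["model_a.w"] == config["model_b.w"]
```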

- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.

Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```

Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7

Rollback Plan:

Differential Revision: D74747190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
2025-06-19 17:17:17 +00:00
6e185c5312 Upgrade to DLPack 1.0. (#145000)
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:

- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
  producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
    - Fallback to old implementation if no `max_version` or if version
      lower than 1.0
    - Check that the to-be-consumed capsule is of version up to 1.X

In order to accommodate these new specifications, this PR adds the
following main changes:

- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPack capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
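
A small consumer-side sketch of the behavior described above (assuming the `max_version` keyword this PR adds to `Tensor.__dlpack__`; the capsule types follow the description):

```python
import torch

x = torch.arange(4)

# Consumers that understand DLPack 1.0 request it explicitly and receive a
# DLManagedTensorVersioned capsule; older consumers omit max_version (or ask
# for a version below 1.0) and fall back to the legacy DLManagedTensor capsule.
versioned_capsule = x.__dlpack__(max_version=(1, 0))
legacy_capsule = x.__dlpack__()

# Round-tripping through the protocol works as before.
y = torch.from_dlpack(x)
assert torch.equal(x, y)
```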

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
2025-06-19 16:27:42 +00:00
6eb6f198e1 update codebase structure documentation to include mps (#156297)
📚 The doc update

Adds a description of the mps folder to the codebase structure guide.

@albanD @malfet @svekars @sekyondaMeta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156297
Approved by: https://github.com/ezyang
2025-06-19 16:16:29 +00:00
7f0cddfb55 [dynamo] Add documentation for guard_filter_fn (#156114)
Summary: Adding a section of doc for guard_filter_fn.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76756743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156114
Approved by: https://github.com/jansel
2025-06-19 16:13:12 +00:00
c9afcffed0 [AOTInductor] Call most runtime fallback ops without calling into Python (#154142)
Uses the new aoti_torch_call_dispatcher interface to call runtime fallback ops without calling back into Python.  This supports a limited subset of input and output datatypes, but a significant majority of remaining fallback ATen ops are covered.

Fixes #150988
Fixes #153478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154142
Approved by: https://github.com/desertfire
2025-06-19 15:27:15 +00:00
317af4c87b Revert "[cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)"
This reverts commit a5f59cc2eab3a5201712c52fe48c268357ba4f3c.

Reverted https://github.com/pytorch/pytorch/pull/156140 on behalf of https://github.com/atalman due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/156140#issuecomment-2988441548))
2025-06-19 15:09:29 +00:00
ab3393e923 [ROCm][CI] fix mi300 test failure after 6.4.1 update (#156368)
Fixes failures such as https://github.com/pytorch/pytorch/actions/runs/15739699156/job/44365395854: `test/test_linalg.py::TestLinalgCUDA::test_broadcast_batched_matmul_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156368
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-19 15:02:40 +00:00
0b62465b99 Revert "Refine alignment check along dynamic dimension for grouped MMs (#155466)"
This reverts commit 830a335a7da5fec00395d440ba568749cb4e2e9e.

Reverted https://github.com/pytorch/pytorch/pull/155466 on behalf of https://github.com/atalman due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/155466#issuecomment-2988285117))
2025-06-19 14:25:38 +00:00
fec8af8b98 [bugfix] [build] guard cuda version for ipc with fabric handle (#156394)
https://github.com/pytorch/pytorch/pull/156074 adds support for IPC with fabric handles, but the code cannot compile for CUDA < 12.3 (in particular, e.g. CUDA 11.8).

This PR improves the support by adding compile-time checks against the CUDA version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156394
Approved by: https://github.com/ngimel
2025-06-19 13:54:01 +00:00
769d754ab2 [BE][MPS] Refactor core matmul logic into matmul_core (#155969)
In preparation for adding integer addmm, move the matmul computation into a matmul_inner function.

Change the callstack from group_id, thread_id_in_group to thread_id, thread_id_in_group, which eliminates the need to calculate the index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
2025-06-19 13:22:41 +00:00
8cb0c4a4da [Intel GPU][AOTI] Add xpu mkldnn ops support for AOTInductor. (#154586)
This PR is closely related to the previous one in the stack (https://github.com/pytorch/pytorch/pull/150287). The previous PR enabled MKLDNN ops for XPU, which caused several test cases to fail in test_aot_inductor.py. This PR addresses those failing cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154586
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #150287
2025-06-19 13:17:22 +00:00
83259cf7a7 [Inductor][Intel GPU] Support mkldnn Conv post op fusion for XPU. (#150287)
This PR adds support for MKLDNN Conv post-op fusion in the Inductor Intel GPU backend under freezing mode.
The implementation reuses the CPU's MKLDNN pattern fusion mechanism, as well as the corresponding Inductor unit tests for CPU MKLDNN pattern fusion.

The performance improvement:

| Suite       | Inductor Speedup (Baseline) | Inductor Speedup (Compared) | Acc Failed | Perf Failed | Inductor Perf Ratio | Speedup  |
|-------------|-----------------------------|------------------------------|------------|--------------|----------------------|----------|
| Huggingface | 2.134838                    | 2.125740314                  | 0          | 0            | 1.001462504          | 100.43%  |
| Torchbench  | 1.808558                    | 1.675100479                  | 0          | 0            | 1.075722187          | 107.97%  |
| Timm        | 2.343893                    | 2.070476653                  | 0          | 0            | 1.131023832          | 113.21%  |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150287
Approved by: https://github.com/ZhiweiYan-96, https://github.com/EikanWang, https://github.com/jansel
2025-06-19 13:17:22 +00:00
0504480f37 Add CUDA 12.9 libtorch nightly (#155895)
https://github.com/pytorch/pytorch/issues/155196

with libtorch docker added, we can add the build script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155895
Approved by: https://github.com/atalman
2025-06-19 13:15:42 +00:00
ccb1f687d6 Port two dynamo test cases for Intel GPU (#156056)
For https://github.com/pytorch/pytorch/issues/114850, we will port more cases to Intel GPU. This PR is for 2 dynamo cases. We adopted "torch.accelerator.current_accelerator()" to determine the backend, added XPU support in decorators like @requires_gpu, and also enabled XPU for some test paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156056
Approved by: https://github.com/guangyey, https://github.com/jansel
2025-06-19 12:49:04 +00:00
a8fe982993 Revert "[build] Create target for flash attention (#156235)"
This reverts commit 6d02321472ee0761092166dd273eb3ec386cf0c0.

Reverted https://github.com/pytorch/pytorch/pull/156235 on behalf of https://github.com/ZainRizvi due to Weird, but seems to have broken trunk: test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/15748768079/job/44390494621) [HUD commit link](6d02321472) ([comment](https://github.com/pytorch/pytorch/pull/156235#issuecomment-2987784207))
2025-06-19 11:47:27 +00:00
4da98351b9 [SymmMem] Add NVSHMEM PUT with Signal support to Triton (#156211)
Adds NVSHMEM PUT with Signal operation support for Triton kernels:

- Added`putmem_signal_block` core.extern wrapper for nvshmemx_putmem_signal_block
- Added kernel for 2-rank PUT operation with atomic SET signaling (`test_triton_put_signal_set`)
- Added kernel for 2-rank PUT operation with atomic ADD signaling (`test_triton_put_signal_add`)

**Tests:**
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py`

`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put_signal_set`
`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put_signal_add`

```python
@skipIfRocm
@requires_triton()
def test_triton_put_signal_set(self) -> None:
    @triton.jit
    def put_signal_kernel(dst_ptr, src_ptr, numel: tl.constexpr, sig_ptr,
                         signal_val: tl.constexpr, sig_op: tl.constexpr, peer: tl.constexpr):
        nvshmem.putmem_signal_block(dst_ptr, src_ptr, numel, sig_ptr, signal_val, sig_op, peer)

    # ... setup code ...

    val = 11
    inp = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(val)
    out = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(-1)  # destination buffer

    # Signal flag buffer - starts at 0, will be set to 1 upon completion
    flag = symm_mem.empty(1, dtype=torch.int64, device=self.device).fill_(0)

    peer = 1 - rank
    NVSHMEM_SIGNAL_SET = 0  # atomic set operation
    SIGNAL_VAL = 1  # completion signal value

    if rank == 0:
        # Rank 0 atomically: (1) puts data to rank 1, (2) sets rank 1's flag to 1
        put_signal_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, sig_ptr=sig_ptr,
                                    signal_val=SIGNAL_VAL, sig_op=NVSHMEM_SIGNAL_SET,
                                    peer=peer, extern_libs=nvshmem_lib)

    dist.barrier()
    # Rank 1 can check flag to know data transfer completed!
    print(f"[Rank {rank}] inp buffer: {inp}")
    print(f"[Rank {rank}] out buffer: {out}")
    print(f"[Rank {rank}] flag buffer: {flag}")
```

```
[Rank 0] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:0', dtype=torch.int8)
[Rank 0] out buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:0', dtype=torch.int8)
[Rank 0] got data from peer 1
[Rank 0] flag buffer: tensor([0], device='cuda:0')
[Rank 1] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] out buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 1] flag buffer: tensor([1], device='cuda:1')

----------------------------------------------------------------------
Ran 2 tests in 17.046s

OK
```

Working as expected! Data is received, and flag set to 1 for completion signal!

```python
@skipIfRocm
@requires_triton()
def test_triton_put_signal_add(self) -> None:
    @triton.jit
    def put_signal_kernel(dst_ptr, src_ptr, numel: tl.constexpr, sig_ptr,
                          signal_val: tl.constexpr, sig_op: tl.constexpr, peer: tl.constexpr):
        nvshmem.putmem_signal_block(dst_ptr, src_ptr, numel, sig_ptr, signal_val, sig_op, peer)

    # ... setup code ...

    # Signal buffer (uint64 flag)
    flag = symm_mem.empty(1, dtype=torch.int64, device=self.device).fill_(0)

    peer = 1 - rank
    NVSHMEM_SIGNAL_ADD = 5  # atomic add operation
    SIGNAL_VAL = 16  # Signal value to add

    if rank == 0:
        # Rank 0 puts into Rank 1 and adds to signal
        put_signal_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, sig_ptr=sig_ptr,
                                     signal_val=SIGNAL_VAL, sig_op=NVSHMEM_SIGNAL_ADD,
                                     peer=peer, extern_libs=nvshmem_lib)

    dist.barrier()
    print(f"[Rank {rank}] inp buffer: {inp}")
    print(f"[Rank {rank}] out buffer: {out}")
    print(f"[Rank {rank}] flag buffer: {flag}")

```

```
[Rank 0] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:0', dtype=torch.int8)
[Rank 0] out buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:0', dtype=torch.int8)
[Rank 0] got data from peer 1
[Rank 0] flag buffer: tensor([0], device='cuda:0')
[Rank 1] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] out buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 1] flag buffer: tensor([16], device='cuda:1')

----------------------------------------------------------------------
Ran 1 test in 17.145s

OK
```

The flag transition from [0] → [16] confirms both data delivery and atomic signal completion in a single operation!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156211
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
2025-06-19 10:24:30 +00:00
348e2a76df s/defer_runtime_assert/guard_or_defer_runtime_assert (#156397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156397
Approved by: https://github.com/laithsakka
2025-06-19 10:18:28 +00:00
02080c2cd9 Fix num_heads inference in ONNX Attention-23 exporter (#156367)
Fixes issue in torch-onnx exporter for Attention: https://github.com/pytorch/pytorch/issues/156105

Previously, the num_heads attribute inferred by the exporter was incorrect: it should be read from input dimension -3, not dimension 3:

![image](https://github.com/user-attachments/assets/26f10e15-bc98-42ac-807a-2e089a7d996a)

But in fact, [torch sdpa](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) doesn't support combined num_heads and head_size dimensions like [ONNX](https://onnx.ai/onnx/operators/onnx__Attention.html) does, so this num_heads attribute is not needed.

Extending support to rank > 4 can be left as future work if there is a use case for that. The translation logic will look like: Reshape(Q,K,V to 4d) -> Attention -> Reshape(Y to original rank).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156367
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2025-06-19 09:40:01 +00:00
8fcda2c60d [SymmMem] Add runtime detection of NVSHMEM (#156291)
so that we can pick the default backend for SymmetricMemory without
fully relying on env var `TORCH_SYMMMEM=CUDA | NVSHMEM`

On the Python side, the following API is added:
`torch.distributed._symmetric_memory.is_nvshmem_available()`
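
A minimal sketch of how the helper might be used to pick a default backend (the backend names just mirror the env var values above; this is not the PR's selection code):

```python
import torch.distributed._symmetric_memory as symm_mem

# Prefer NVSHMEM when runtime detection succeeds, otherwise fall back to CUDA.
backend = "NVSHMEM" if symm_mem.is_nvshmem_available() else "CUDA"
print(f"SymmetricMemory default backend: {backend}")
```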

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156291
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116, #156117
2025-06-19 08:26:11 +00:00
eabf7cd3c5 [export] update docs for Dims (#156262)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156262
Approved by: https://github.com/angelayi
2025-06-19 06:25:21 +00:00
ec0276103f [PGO] fix whitelist scalar bug (#156194)
Test Plan:
test_pgo

Rollback Plan:

Differential Revision: D76830552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156194
Approved by: https://github.com/bobrenjc93
2025-06-19 05:51:21 +00:00
1c960c5638 [Makefile] lazily setup lintrunner on first make lint run (#156058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156058
Approved by: https://github.com/ezyang
2025-06-19 05:43:35 +00:00
242eb19c83 [InductorBench] Fix accuracy validation logic for MPS (#156385)
As it does not support full fp64, validate against float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156385
Approved by: https://github.com/Skylion007
2025-06-19 05:37:51 +00:00
ce8180a61d [c10d] Disable stack trace call in logging (#156362)
Summary: We noticed std::future_error: Broken promise errors in logging, so let's disable it for now; we will investigate more.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76929722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156362
Approved by: https://github.com/fegin
2025-06-19 05:11:57 +00:00
a21806f038 [ez][export] Better error message for schema check in torch.export.load (#156361)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/156354

torch.export.load() only supports files generated by torch.export.save()

Test Plan:
CI

Rollback Plan:

Differential Revision: D76928725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156361
Approved by: https://github.com/zhxchen17
2025-06-19 04:50:56 +00:00
3f69e3b3a0 Add view_simple as meta function for view, and avoid calling reshape_view_helper for unbacked (#154757)
address https://github.com/pytorch/pytorch/issues/153303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757
Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel
2025-06-19 04:50:18 +00:00
3bec588bf5 [aot][ca] save bw_module in AOTAutogradCache (#151860)
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash).

The bw_module's generated code is then serialized, and at compiled autograd runtime, it is restored via symbolic_trace. This also means that tensor constructors will be lifted as constants, something we will address separately.

Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #156120
2025-06-19 03:47:41 +00:00
6d02321472 [build] Create target for flash attention (#156235)
Create a target for flash attention so it can be built using `ninja flash_attention`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156235
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-06-19 03:35:04 +00:00
77518d1a13 [CI] fix xpu-smi hang in XPU test container (#156171)
Apply same fix #155443 for XPU test container, refer https://github.com/pytorch/pytorch/actions/runs/15589866881/job/43907973867#step:15:911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156171
Approved by: https://github.com/huydhn
2025-06-19 02:48:11 +00:00
19ffdf4ea0 [dcp] add new checkpoint staging to preserve storage sharing and support mutable state_dicts (#155192)
Summary:
This implements staging in a way that doesn't mess up checkpointing semantics. We want to stay close to torch.save/load semantics, but when async checkpointing is used it messes up shared storages and doesn't handle custom objects or tensors well. E.g., a user passes a state_dict with a CUDA tensor inside some custom datatype; this is deep-cloned, causing the staging tensor to be created on GPU, which can cause OOMs that are hard to debug.

This diff hooks into deepcopy of storages to move them to CPU using the cached storages created for async checkpoint staging. This allows reusing the storages created for staging, avoiding recreating them on each checkpoint, while also being flexible enough to handle any changes: clean up old storages or create new ones as needed.

The lifetime of staging storages is tied to the original storage object: when the original storage object is gc-ed, we delete the corresponding staging storage from the cache, possibly causing it to be gc-ed if there are no other references. I am using the storage's data_ptr to keep track of this. Please share thoughts on this.
The alternative is to use FQNs instead of storage_id and verify that the underlying storage object has the same shape/size, etc., to make the caching logic work. The current implementation is much simpler and cleaner.

The API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
with staging_context(stager):
    cpu_state_dict = copy.deepcopy(state_dict)
```

Also, adds support for pinned-memory.

One problem this implementation does not address is that we lose the original device.

The only alternative here is to pickle synchronously like torch.save, but with special handling for storages. It is valuable to keep the state_dict throughout the checkpointing process so users can manipulate and debug it as needed, so we would need to unpickle in the background process. I think this is flexible, but not performant and not very different from the current solution, and it needs more code. One idea, if we really want to address this, is to stick the original device in a variable on the storage and then use it to recover on the load side. I think we do not need this for now and can be explicit about losing the device type for async checkpointing.

Update:
Note: Due to reservations on hooking into deepcopy to customize it, the PR is now updated to use deepcopy like logic to clone the state_dict. There are some caveats to this solution:
1. Duplicated deepcopy code to hook into for tensors. There is a risk of this code getting outdated with Python version changes. This is needed to handle several different types like NamedTuples, frozen dataclasses, and nested dataclasses. The deepcopy logic relies on reduce_ex to get a function with which these can be constructed.
2. Since we are bypassing deepcopy and adding custom logic to clone a tensor, we are missing some of the functionality that exists in deepcopy for torch.Tensor, like _clear_non_serializable_cached_data() or other logic. I would like thoughts on which logic, or whether everything, should be copied.
3. If any object implements deepcopy, we will not be able to handle any tensors in its attrs with this logic, because it likely just calls copy.deepcopy on the attrs instead of this deepcopy logic. We handle subclasses of torch.Tensor to work around this.

The new API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
cpu_state_dict = stager.stage(state_dict)
```

Test Plan:
unit tests

Differential Revision: D75993324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155192
Approved by: https://github.com/mikaylagawarecki, https://github.com/pradeepfn
2025-06-19 02:04:21 +00:00
d4ad280429 Enable querying the build and runtime NCCL versions (#156305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156305
Approved by: https://github.com/wconstab, https://github.com/Skylion007, https://github.com/fegin
2025-06-19 02:00:08 +00:00
bc9bd2a766 Use linux.2xlarge runner (#156351)
The CUDA version of this job uses a linux.2xlarge runner, so match that to see whether this job really needs a 12xlarge system or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156351
Approved by: https://github.com/jeffdaily, https://github.com/cyyever
2025-06-19 01:50:56 +00:00
e5a1197191 Fix fx tracing for mark dynamic (#156346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156346
Approved by: https://github.com/tony-ivchenko
2025-06-19 01:03:09 +00:00
6959b5febe Context on torch.cuda.memory._record_memory_history max_entries (#155889)
Context on torch.cuda.memory._record_memory_history buffer behavior

## Description

Answers the following questions (a brief usage sketch follows the list):
- Can I keep _record_memory_history() always enabled with the default max_entries=sys.maxsize (9223372036854775807)? Will it consume a significant amount of CPU RAM?
- If I set max_entries to a lower value, e.g. 2000, will it keep the first 2000 entries and then stop recording or will it keep the most recent 2000 entries before each snapshot (fifo-style)?
- What is the expected size on disk of the snapshots? Some KBs, MBs?
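
A brief usage sketch, assuming the private `_record_memory_history` / `_dump_snapshot` helpers keep their documented signatures:

```python
import torch

# Bound the history buffer instead of using the default sys.maxsize.
torch.cuda.memory._record_memory_history(max_entries=2000)

x = torch.randn(1024, 1024, device="cuda")

# Write the snapshot to disk; its size depends on how many entries were recorded.
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")

# Pass enabled=None to stop recording.
torch.cuda.memory._record_memory_history(enabled=None)
```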

Fixes #129674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155889
Approved by: https://github.com/ngimel
2025-06-19 00:44:43 +00:00
6303cc41b7 [ROCm] support CUDA_KERNEL_ASSERT using abort() (#155262)
We won't have the full message that __assert_fail would provide, but at least we won't silently do nothing.

Fixes #155045.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155262
Approved by: https://github.com/hongxiayang, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-18 23:52:35 +00:00
b8c2d4c259 add a corner test case of dynamic sizes for combo kernel (#156035)
Summary:
Added a unit test case for a corner case of combo kernel where all below are true:
1. more than 1 dimensions are dynamic size
2. no_x_dim presistent reduce op

Test Plan:
```
buck2 test mode/opt caffe2/test/inductor:combo_kernels -- test_dynamic_shapes_persistent_reduction_no_x_dim_2
```

Rollback Plan:

Differential Revision: D76699002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156035
Approved by: https://github.com/mlazos
2025-06-18 22:57:09 +00:00
76d07e919f Unbreak //c10/util:base (#156216)
Missing dep.

Differential Revision: [D76840057](https://our.internmc.facebook.com/intern/diff/D76840057/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156216
Approved by: https://github.com/janeyx99, https://github.com/desertfire
2025-06-18 22:44:20 +00:00
9bfefda296 [DCP][PyTorch Staging APIs][2/x] Handle 0-elem case + ShardedTensor copy for staging (#156092)
Summary:
### Diff Context

1. Sometimes, a tensor might have non-zero size and 0 numel. In this case, pinning memory will fail
so we take a best guess at how to replicate the tensor below to maintain symmetry in the returned
state dict.

2. ShardedTensor copying was not handled originally in PyTorch state_dict copy APIs, handled in this diff.

Test Plan: CI

Differential Revision: D75553096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156092
Approved by: https://github.com/pradeepfn
2025-06-18 22:41:25 +00:00
a5b4463d60 [nativert] session state (#156190)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76827309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156190
Approved by: https://github.com/zhxchen17
2025-06-18 22:40:44 +00:00
6918758f55 [export] Update documents for ExportGraphSignature (#156244)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/156184

The current documentation for ExportGraphSignature doesn't reflect that `torch.export.export()` returns a non-functional graph by default, and users may get confused.

Test Plan:
Document change only. CI

Rollback Plan:

Differential Revision: D76849097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156244
Approved by: https://github.com/yushangdi
2025-06-18 22:37:34 +00:00
1e474cc9c8 [ONNX] Fix how shapes are computed for float4 (#156353)
Changed the way we compute shapes for unpacked float4. Previously we always added a last dimension [2] to the existing shape, but this doesn't really make sense because it prevents us from being able to represent any shape other than those with a last dim of [2]. I updated the logic to be `[*shape[:-1], shape[-1]*2]`, which doubles the last dimension. This is more in line with what we see in practice when people are using 4-bit types, and it allows us to represent any shape with an even dimension at the end, which is much more reasonable in my opinion.
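
A minimal standalone illustration of the new shape rule (not the exporter's actual code):

```python
def unpacked_float4_shape(packed_shape):
    # New rule: double the last dimension instead of appending a trailing [2].
    *leading, last = packed_shape
    return (*leading, last * 2)

assert unpacked_float4_shape((3, 5, 8)) == (3, 5, 16)
```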

Also clarified in https://github.com/pytorch/pytorch/pull/148791#discussion_r2155395647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156353
Approved by: https://github.com/titaiwangms
2025-06-18 22:28:02 +00:00
9afee0fa96 [inductor] Set num_workers to number of available cpu divided by number of available gpu (#156201)
internal: https://fb.workplace.com/groups/1075192433118967/posts/1689562705015267/?comment_id=1690284241609780&notif_id=1749770611538976&notif_t=work_group_comment&ref=notif

Right now it doesn't have the divided by 2 logic yet. Not sure how to tell if we are on a dev machine.
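
A minimal sketch of the default described in the title (not the PR's actual code; it assumes CUDA devices are what gets counted and omits the dev-machine divide-by-2 mentioned above):

```python
import os
import torch

cpu_count = os.cpu_count() or 1
gpu_count = max(torch.cuda.device_count(), 1)
num_workers = cpu_count // gpu_count
```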

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156201
Approved by: https://github.com/masnesral
2025-06-18 22:15:32 +00:00
e5a0b73ce9 [MTIA Aten Backend] Migrate logical_and.out (#156286)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logical_and.out to in-tree

Differential Revision: [D76874551](https://our.internmc.facebook.com/intern/diff/D76874551/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156286
Approved by: https://github.com/nautsimon, https://github.com/jingsh
ghstack dependencies: #155634, #156046, #156047, #156283, #156284, #156285
2025-06-18 21:57:05 +00:00
bfccfa0b31 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit cf90c9f8d1632777ec5f4b6ccaa14bc5bf259e9c.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to break internal tests ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2985785811))
2025-06-18 21:48:50 +00:00
f5eb42e4c0 [nativert] move layoutplanneralgorithm to libtorch (#156205)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76831634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156205
Approved by: https://github.com/zhxchen17
2025-06-18 21:46:38 +00:00
d1c924c68a [MTIA Aten Backend] Migrate lt.Tensor_out / lt.Scalar_out (#156285)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate lt.Tensor_out / lt.Scalar_out to in-tree.

Differential Revision: [D76873997](https://our.internmc.facebook.com/intern/diff/D76873997/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156285
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047, #156283, #156284
2025-06-18 21:40:26 +00:00
5c7e1d39ab [MTIA Aten Backend] Migrate logit (#156284)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logit to in-tree.

Differential Revision: [D76871451](https://our.internmc.facebook.com/intern/diff/D76871451/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156284
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047, #156283
2025-06-18 21:36:27 +00:00
706e236b08 [MTIA Aten Backend] Migrate logical_or.out / log.out / log2.out (#156283)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logical_or.out / log.out / log2.out to in-tree.

Differential Revision: [D76857072](https://our.internmc.facebook.com/intern/diff/D76857072/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156283
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047
2025-06-18 21:27:58 +00:00
ab81fb846c [MTIA Aten Backend] Migrate remainder.Tensor_out / reciprocal.out / neg.out (#156047)
Migrate remainder.Tensor_out / reciprocal.out / neg.out

Differential Revision: [D76696710](https://our.internmc.facebook.com/intern/diff/D76696710/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156047
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046
2025-06-18 21:17:34 +00:00
c26ce593d8 [MTIA Aten Backend] Migrate nan_to_num.out (#156046)
Migrate nan_to_num.out

Differential Revision: [D76696155](https://our.internmc.facebook.com/intern/diff/D76696155/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156046
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634
2025-06-18 21:14:13 +00:00
2f1c5c4131 [MTIA Aten Backend] Achieve CPU fallback by overriding registration (#155634)
# Context

MTIA supports CPU fallback, and people can set it using env vars. By migrating aten backend to in-tree, we also need to provide this support.

# This diff

Suggested by Alban(pytorch core), instead of skipping registration, this diff achieves CPU fallback by doing additional registration and override.

The benefits of this approach:
1. The previous solution has problem handling ops that have default dispatch key(e.g. CompositeImplicitAutograd), and can't really achieve CPU fallback.
2. The CPU fallback related logic can be aggregated in aten_mtia_cpu_fallback.cpp.

----------------

p.s. D76314740 also tried reusing the yaml parsing logic in mtia's python script, but realized that the env vars are only available in runtime but not compile/codegen time

Differential Revision: [D76376644](https://our.internmc.facebook.com/intern/diff/D76376644/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155634
Approved by: https://github.com/nautsimon, https://github.com/albanD
2025-06-18 21:10:18 +00:00
e99cc126a4 [AOTInductor] Reuse input information instead of directly applying unbacked_symint_fallback (#156133)
Summary:
When we encounter an unbacked symint during autotuning, we try to reuse existing symbols from user-provided inputs, then fall back.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_triton_dynamic_launcher_grid

Rollback Plan:

Differential Revision: D76769711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156133
Approved by: https://github.com/jingsh
2025-06-18 20:53:21 +00:00
728cf6721e Revert "[PT2]load dense delta by trimming prefixes (#155872)"
This reverts commit c74fd35050a7241f0c439501ef735aa6cdde751f.

Reverted https://github.com/pytorch/pytorch/pull/155872 on behalf of https://github.com/malfet due to Broke lint, internal has been backed out ([comment](https://github.com/pytorch/pytorch/pull/155872#issuecomment-2985542895))
2025-06-18 20:05:56 +00:00
c74fd35050 [PT2]load dense delta by trimming prefixes (#155872)
Summary:
In PT2 with GPU with AOTI, weight names are like
```merge.submod_0._run_on_acc_0.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```

but when publishing delta snapshots, lowering is skipped so weights are like
```merge.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```

so when loading delta weights in original model runner, we need to:
1. Redo tensorName -> weight idx look up, because the weight ordering may be different.
2. use trimmed tensorName to find the correct weight path.

Note that with this diff, delta snapshot loading still does NOT use xl weights. This should be fine for now as we are still publishing full model with non-xl weights.

Test Plan:
Merge only:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODULE=merge
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13

CUDA_VISIBLE_DEVICES=2,3 buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=DenseOnly --baseNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}  --moduleName=${MODULE} --predictor_hardware_type 1 --submodToDevice "" --deltaNetFile /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/delta_${DENSE_DELTA_SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}
```

Local replayer:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13

USE_SERVABLE=0 HARDWARE_TYPE=0 DENSE_DELTA_IDS=${DENSE_DELTA_SNAPSHOT_ID} ENABLE_REALTIME_UPDATE=1 CUDA_VISIBLE_DEVICES=6,7 sh ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 7455

USE_SERVABLE=0 sh sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 10 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_mtml_ctr_instagram_model_500 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} true 7455
```

Rollback Plan:

Differential Revision: D76520301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155872
Approved by: https://github.com/SherlockNoMad
2025-06-18 19:13:22 +00:00
48de3da253 fix: avoid flamegraph script setup conflicts (#156310)
Fixes #156309

Instead of any kind of locking and busy waiting, multiple script downloads are allowed to happen; only one `rename` will succeed, and the others will silently fail, removing any temporary files created during this process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156310
Approved by: https://github.com/malfet

Co-authored-by: Alexander Zhipa <azzhipa@amazon.com>
2025-06-18 19:06:22 +00:00
cbafba5794 Allow forcing FSDP2 to always use SUM reductions (#155915)
NCCL zero-copy support only works for SUM reductions. FSDP2, by default, was preferring AVG reductions or, when using `set_reduce_scatter_divide_factor`, PreMulSum reductions.

Moreover, PreMulSum reductions had a few bugs, such as #155903 and #155904.

This PR adds a flag to always use SUM reductions, potentially requiring separate pre-/post-scaling kernels, and reworks the `set_reduce_scatter_divide_factor` logic to make it safer (and renaming it to avoid confusion).

Differential Revision: [D76895058](https://our.internmc.facebook.com/intern/diff/D76895058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155915
Approved by: https://github.com/xunnanxu
2025-06-18 18:57:47 +00:00
9944cd0949 Convert to markdown: quantization-accuracy-debugging.rst, quantization-backend-configuration.rst, quantization-support.rst, random.rst (#155520)
Related to #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 18:46:04 +00:00
30d3cf62fb support CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F (#154680)
Requires CUDA >= 12.9 and sm_90.

hipBLASLt has a similar enum but is not available until ROCm 7.0. Support the new enum early using a cmake test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154680
Approved by: https://github.com/malfet, https://github.com/atalman
2025-06-18 18:39:01 +00:00
aee2bfc5ba [Intel GPU] Update xpu triton commit pin for PyTorch release 2.8. (#154194)
As title.
Thanks @anmyachev  for the work on compatibility adaptation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154194
Approved by: https://github.com/jansel
2025-06-18 18:17:07 +00:00
2620361d19 Add batching rule for torch.matrix_exp (#155202)
## Summary

Adds the missing batching rule for `torch.matrix_exp` to enable efficient `vmap` support.
Previously, using `vmap` with `matrix_exp` would trigger a performance warning and fall back to a slow loop-based implementation, even though `matrix_exp` natively supports batched inputs.

Fixes #115992

## Details

`torch.matrix_exp` is an alias for `torch.linalg.matrix_exp`. This PR adds vmap support by registering `matrix_exp` with `OP_DECOMPOSE`, which reuses the existing CompositeImplicitAutograd decomposition to automatically generate batching behavior from the operation's simpler component operations.
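
For example, a minimal sketch of the now-efficient path, relying only on the public `torch.vmap` and `torch.matrix_exp` APIs:

```python
import torch

A = torch.randn(8, 4, 4)               # a batch of 8 square matrices
out = torch.vmap(torch.matrix_exp)(A)  # no loop-based fallback after this change
# matrix_exp also accepts batched input natively, so the results should match.
torch.testing.assert_close(out, torch.matrix_exp(A))
```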

## Testing

The existing test suite for vmap and matrix_exp should cover this change. The fix enables:
- No performance warning when using `vmap(torch.matrix_exp)`
- Efficient native batched execution instead of loop-based fallback

**Edit:** Updated Details section to accurately reflect the implementation approach (decomposition rather than batch rule registration)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155202
Approved by: https://github.com/zou3519
2025-06-18 17:35:35 +00:00
a5f59cc2ea [cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)
The native kernel doesn't support batch splitting so the previous check wasn't aggressive enough in dispatching to cuDNN

https://github.com/pytorch/pytorch/issues/155225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140
Approved by: https://github.com/ngimel
2025-06-18 17:32:36 +00:00
94f8679019 Revert "[PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)"
This reverts commit 6d3a4356f61b28a14abd95f641e2615deb186365.

Reverted https://github.com/pytorch/pytorch/pull/155809 on behalf of https://github.com/laithsakka due to pr_time_benchmarks ([comment](https://github.com/pytorch/pytorch/pull/155809#issuecomment-2985022572))
2025-06-18 16:52:19 +00:00
36f7a027b5 [MPS] Implement upsample_trilinear as Metal shader (#156263)
But only forward for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156263
Approved by: https://github.com/dcci
ghstack dependencies: #156256, #156090
2025-06-18 16:10:02 +00:00
bf06190e21 Integrated AMD AWS runners into Pytorch CI (#153704)
Integrated AMD AWS runners into PyTorch CI, including the linux.24xl.amd for performance tests, the linux.8xl.amd with AVX512 support for unit and periodic tests, and the linux.12xl.amd with AVX2 support for unit and periodic tests.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153704
Approved by: https://github.com/malfet, https://github.com/jithunnair-amd

Co-authored-by: kiriti-pendyala <kiriti.pendyala@amd.com>
2025-06-18 15:58:22 +00:00
ce3406817d Revert "[dynamo] control one_graph behavior additionally through config (#154283)"
This reverts commit fe37db4f1270745d6c523623143332ddf263af55.

Reverted https://github.com/pytorch/pytorch/pull/154283 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2984795214))
2025-06-18 15:53:32 +00:00
c5d3e7a4ff Revert "[dynamo] add set_fullgraph decorator/context manager (#154289)"
This reverts commit 920f6e681ec70b664ed952255b8c1f97962f5de0.

Reverted https://github.com/pytorch/pytorch/pull/154289 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154289#issuecomment-2984774814))
2025-06-18 15:51:06 +00:00
408d9884b0 Revert "[dynamo] fix set_fullgraph for nested calls (#154782)"
This reverts commit 3c8c48f79344356c58e91b9c8588f85ff806e1c8.

Reverted https://github.com/pytorch/pytorch/pull/154782 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154782#issuecomment-2984764330))
2025-06-18 15:47:21 +00:00
6201981f48 Revert "[dynamo] handle fullgraph toggle using nested torch.compile (#155166)"
This reverts commit 614a41514545cbdd15757ef2586d433d7d34041c.

Reverted https://github.com/pytorch/pytorch/pull/155166 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/15726606697/job/44333233942) [HUD commit link](a6a3a44144) ([comment](https://github.com/pytorch/pytorch/pull/155166#issuecomment-2984751600))
2025-06-18 15:43:22 +00:00
d290fe7690 Remove legacy export testing path (#156093)
Summary: After this diff stack lands, we are pretty much done with the training IR migration. So there is no need to run extensive legacy export test.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76734378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156093
Approved by: https://github.com/desertfire
2025-06-18 15:36:44 +00:00
7531bd6491 [ROCm] upgrade to 6.4.1 patch release (#156112)
Fixes #155292.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156112
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-18 15:21:44 +00:00
830a335a7d Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-18 15:15:05 +00:00
6d3a4356f6 [PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)`, such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```

**Results:**
For instance, for the `res2net101_26w_4s` model in pytorch benchmark, when running with `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory are
* baseline: 7.73GiB
* with the chage: 6.45GiB

As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.

cc and credit to @ShatianWang for noticing this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
ghstack dependencies: #155943
2025-06-18 14:38:55 +00:00
c177abd217 Disable pinning check when loading sparse tensors (#154638)
Disables the pinning check as unnecessary, fixing https://github.com/pytorch/pytorch/issues/153143 when loading a sparse tensor from external storage with the sparse tensor invariants check enabled.

Fixes https://github.com/pytorch/pytorch/issues/153143 .

For FC, to be landed two weeks after https://github.com/pytorch/pytorch/pull/154617, see https://github.com/pytorch/pytorch/pull/154617#issuecomment-2919643612.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154638
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-06-18 14:33:36 +00:00
8f02161d10 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit a6a3a441442a96f38d0771c985f753223cea2ba0.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/15726606697/job/44333233942) [HUD commit link](a6a3a44144) ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2984409088))
2025-06-18 14:19:39 +00:00
b30e04b3c8 Make the NCCL PG Options and Config copyable and safe to init standalone (#155700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155700
Approved by: https://github.com/kwen2501
2025-06-18 13:36:27 +00:00
1bb9b1858b [CPU][Inductor] Improve A16W4 GEMM template performance by using block_n=32 (#156174)
**Summary**
We found that using `block_n=32` brings better performance for A16W4 than `block_n=64` because cache locality is better and parallelism is better if N is small and more cores are used.
For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, `block_n=32` is faster by >10% E2E for both first and next token.

**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156174
Approved by: https://github.com/leslie-fang-intel
2025-06-18 13:17:46 +00:00
d99cac2816 [Kineto][submodule] Update kineto pin for XPU toggle feature (#155488)
Part of #154898
Update kineto submodule

Summary: We add the toggleCollectionDynamic functionality to XPUPTI in Kineto, so profiler can be enabled/disabled dynamically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155488
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-06-18 12:39:58 +00:00
c11888e7a6 Skip more tests on s390x (#155210)
Make CI for s390x green before fixing and restoring tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155210
Approved by: https://github.com/seemethere
2025-06-18 12:07:17 +00:00
402ae09e41 [BE] fix typos in c10/ (#156078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156078
Approved by: https://github.com/malfet, https://github.com/cyyever
2025-06-18 10:24:44 +00:00
f45f483884 [user triton] AOT Inductor support for new host-side TMA api (#155879)
This adds support for the host-side TMA api (TensorDescriptor.from_tensor) for AOTI. Note: this should support all the same features as the old (experimental) TMA api, but not some new features of the new TMA, like mxfp4 support.

Note: one complexity with the new TMA api is that a single TMA descriptor passed to the python kernel turns into 1 + 2 * N args in the cubin function signature, for a rank-N tensor.

What this PR contains:
1) device_op_overrides.py: add a rough copy of fillTMADescriptor from https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L283. However, the fillTMADescriptor implementation in Triton is significantly modified, so that much of the computation (about swizzling and data types) is done before the time of the TMA construction. For simplicity, I've moved the computation into the cuda helper kernel (as was the previous strategy with fill2DTMADescriptor); but long term we might want to unify our implementation with the upstream implementation
2) device_op_overrides.py: introduces a struct "StableTMADescriptor" which stores some of the 1 + 2 * N args for the cubin signature (along with the global shape, which is not strictly needed, but this cleans up the call to the triton kernel)
3) plumbing through cpp_wrapper_gpu.py. The main thing to note is: the code generated by cpp_wrapper_gpu.py generally refers to the StableTMADescriptor object when it passes around a "tma descriptor" variable. At the very end (in generate_args_decl), the StableTMADescriptor is unwrapped and the individual arguments are passed into the cubin.

Tests: test_aot_inductor.py's test_triton_kernel_tma_descriptor_{N}d_dynamic_{D}_tma_version_{V}_cuda: for N in {1, 2}  and D in {True, False}, and V = {new, old}, this test passes (or is skipped, if the appropriate TMA API is not available). Tested on H100 for Triton 3.3 and Triton 3.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155879
Approved by: https://github.com/desertfire
2025-06-18 09:35:11 +00:00
577baa4116 [c10d] Add a logger for all nccl collectives with its time duration when completed (#156008)
Summary: We want to build a logging table for tracking the collective time spent on GPU for all internal workloads. Since we have a cudaEventQuery for both the start and end of a collective (We rolled out ECudaEventStart (enableTiming) fully already), we plan to add this logging table inside the watchdog of PyTorch ProcessGroupNCCL so that we get to know the duration of collectives.

Test Plan:
CI + dry run.

Rollback Plan:

Differential Revision: D76552340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156008
Approved by: https://github.com/fegin, https://github.com/eqy
2025-06-18 09:08:42 +00:00
c5a4fe9c17 [CI] fix the ci image name for public copy in ghcr (#156169)
After the PR #152209 landed, the name of ci image public copy in ghcr is not correct. For example, https://github.com/pytorch/pytorch/actions/runs/15698468716/job/44228133522#step:10:8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156169
Approved by: https://github.com/malfet
2025-06-18 08:16:56 +00:00
a6a3a44144 [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-18 07:27:20 +00:00
614a415145 [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See added test for the case that this PR handles. In particular, the semantics of nested torch.compile with toggled fullgraph settings were strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-18 07:27:20 +00:00
3c8c48f793 [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-18 07:27:09 +00:00
920f6e681e [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.
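
A minimal usage sketch; the import path `torch._dynamo.set_fullgraph` is an assumption based on the linked issue, not something this commit message confirms:

```python
import torch
from torch._dynamo import set_fullgraph  # assumption: exposed here per issue #144908

@torch.compile(fullgraph=True)
def fn(x):
    with set_fullgraph(False):       # temporarily allow graph breaks in this region
        torch._dynamo.graph_break()
    return x + 1

fn(torch.randn(4))
```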

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-18 07:27:00 +00:00
fe37db4f12 [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-18 07:26:52 +00:00
ccc6279b40 flex attention: fix dispatch order for tensor subclasses, avoid hardcoding call to faketensor impl in dynamo (#151719)
This is enough to get @XilunWu 's stack in a state where his flex_attention DTensor implementations worked E2E for me. It also required these changes on the DTensor side, to properly add a DTensor rule for flex backward: P1789852198

There are two problems:

(1) in the normal dispatcher, we have a precedence ordering between modes and subclasses. Modes are dispatched to first, but modes are allowed to return NotImplemented, giving subclasses a chance to run.

This normally happens automatically in `FakeTensorMode.__torch_dispatch__` and `FunctionalTensorMode.__torch_dispatch__`. However, since HOPs implement these two modes themselves, HOPs do not get this benefit. For now, I ended up hardcoding this `NotImplemented` logic directly into the functional/fake rules for flex attention.

Having to do this for every HOP seems a bit painful. If we could plumb every HOP through `Fake[|Functional]TensorMode.__torch_dispatch__` then we would get this support. Another option could be to just assume that most HOP <> mode implementations want the same treatment by default, and hardcode this `NotImplemented` logic into `torch/_ops.py`. I'm not sure if we'd need a way for the HOP to opt out of this though.
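
A minimal sketch of the precedence pattern described above: a dispatch mode that returns `NotImplemented` when a tensor subclass is present, so the subclass gets a chance to run (illustrative only, not the flex attention code):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode
from torch.utils._pytree import tree_flatten

class DeferToSubclassMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        flat_args, _ = tree_flatten((args, kwargs))
        # If any input is a tensor subclass, step aside and let it handle the op.
        if any(isinstance(a, torch.Tensor) and type(a) is not torch.Tensor
               for a in flat_args):
            return NotImplemented
        return func(*args, **kwargs)
```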

(2) We were hardcoding a call to flex attention's fake implementation in dynamo to run fake prop. This is technically wrong for subclasses, because it doesn't give subclasses the chance to interpose on the op and desugar it before fake prop runs. I tweaked dynamo's logic to call the op, and let the dispatcher handle invoking the fake implementation.

**Testing** Xilun is adding some DTensor tests in his PR that will end up testing this logic. If folks would prefer, though, I can try to add a test that uses another subclass instead that is maybe more basic.

This is the tlparse that his DTensor test generated for me: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/0196c1d3-a9a2-46ea-a46d-aa21618aa060/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151719
Approved by: https://github.com/ydwu4

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-06-18 07:02:04 +00:00
bdb1553b77 [inductor][cutlass] binary remote cache (#156248)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally

this is the OSS only part of the change, to facilitate integration

Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Reviewed By:
henrylhtsang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
2025-06-18 06:51:22 +00:00
96df866410 [audio hash update] update the pinned audio hash (#156259)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156259
Approved by: https://github.com/pytorchbot
2025-06-18 06:02:46 +00:00
a5df6ffbc2 Improve IPC for Expandable Segments to use fabric handle when possible (#156074)
Improving upon https://github.com/pytorch/pytorch/pull/130890, and inspired by https://github.com/pytorch/pytorch/pull/130890#issuecomment-2278882984, we can automatically use the fabric handle for IPC when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156074
Approved by: https://github.com/ngimel, https://github.com/malfet
2025-06-18 05:22:06 +00:00
29867b211a [cutlass backend] Add __init__.py to cutlass_lib_extensions (#156234)
When using docker with cutlass backend, we can get
```
No module named 'torch._inductor.codegen.cuda.cutlass_lib_extensions'
```
First reported by @nWEIdia in https://github.com/pytorch/pytorch/issues/155888

Evidence that this fixes: https://github.com/pytorch/pytorch/pull/156136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156234
Approved by: https://github.com/mlazos, https://github.com/Skylion007
2025-06-18 05:03:43 +00:00
c28e74e457 [MPS] Add nearest_3d forward and backward (#156090)
Introduce generalizable `UpsampleParams` structure in `UpSample.h`, which could be shared between CPU and MPS
Delete `upsample_nearest3d` MPS fallback and replace it with proper shader
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156090
Approved by: https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #156256
2025-06-18 04:48:15 +00:00
a82c171bb2 remove skipifrocm from composability tests (#156036)
Porting the DTensor training codebase over to ROCm at the moment; while reading through the 2D unit tests, I noticed that a couple of the unit tests already work on ROCm even though they are being skipped. Pipeline parallel tests pass too.

tested locally
<img width="561" alt="image" src="https://github.com/user-attachments/assets/7c40c0f2-2de8-4cf1-8e36-0ba2bba46baa" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156036
Approved by: https://github.com/jeffdaily
2025-06-18 04:24:42 +00:00
9ed0060225 Provide access to the cudaGraph_t underlying a CUDAGraph. (#155164)
There are a few considerations here:

1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying
https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404

2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine.

3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications.

4. There are two foot guns:

   a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()).

   b. The following sequence won't work as intended:

```
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.replay()
```

This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive.

I think these two foot guns are probably okay given that this a bit of an experts' API.

Fixes #155106

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164
Approved by: https://github.com/ngimel
2025-06-18 03:39:28 +00:00
17b38b850e [ca] Allow using compiled autograd context managers during backward runtime (#156120)
Added an invariant that nested compiled autograd context managers must exit before their parent context manager. This allows us to defer the thread check.

FIXES https://github.com/pytorch/pytorch/issues/152219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156120
Approved by: https://github.com/jansel
ghstack dependencies: #155521, #155480
2025-06-18 03:01:15 +00:00
10d41c7d20 Add SDPA patterns for T5 models (#155455)
* Add SDPA patterns for T5 models.
* Remove the stride check on the mask, and make the mask contiguous in flash attention when the stride of its last dim is neither 1 nor 0. This allows more SDPAs with complex masks to be accelerated using flash attention, such as in the T5 model, where the generated masks may not be contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155455
Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-06-18 02:09:55 +00:00
4851863e3f fix hack to check if register_buffer has been overridden (#155963)
Followup on https://github.com/pytorch/pytorch/pull/125971

`self.register_buffer` will always be a bound method on the instance (`self`) while `torch.nn.Module.register_buffer` is an unbound class method. `is`-ing these two things will never yield `True`. Instead, let's check the [original function object](https://docs.python.org/3/reference/datamodel.html#method.__func__). Note that the current logic doesn't break anything because the `else` branch will still do the "right thing" in the case `register_buffer` hasn't been overridden, but it does mean we do less work!

Example demonstration:

```python
class Base:
    def register_buffer(self, buffer):
        pass

class InheritedOk(Base):
    pass

class InheritedOverride(Base):
    def register_buffer(self, buffer):
        pass

b = Base()
ok = InheritedOk()
override = InheritedOverride()

print(f"b.register_buffer is Base.register_buffer: {b.register_buffer is Base.register_buffer}") # False
print(f"ok.register_buffer is Base.register_buffer: {ok.register_buffer is Base.register_buffer}") # False
print(f"override.register_buffer is Base.register_buffer: {override.register_buffer is Base.register_buffer}") # False

print(f"b.register_buffer.__func__ is Base.register_buffer: {b.register_buffer.__func__ is Base.register_buffer}") # True
print(f"ok.register_buffer.__func__ is Base.register_buffer: {ok.register_buffer.__func__ is Base.register_buffer}") # True
print(f"override.register_buffer.__func__ is Base.register_buffer: {override.register_buffer.__func__ is Base.register_buffer}") # False
```

(I can make an associated issue if needed, but I didn't see one required [in the contributing guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#merging-your-change).)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155963
Approved by: https://github.com/mikaylagawarecki
2025-06-18 01:50:30 +00:00
202d2ae53a Convert rst to md: rpc.rst, signal.rst, size.rst, special.rst (#155430)
Fixes #155033

- [x] [rpc.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/rpc.rst)
- [x] [signal.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/signal.rst)
- [x] [size.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/size.rst)
- [sparse.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/sparse.rst) fixed in #155438 due to large size.
- [x] [special.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/special.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155430
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 01:27:04 +00:00
68996dc183 [BE][2/X] Phase out usage of use_max_autotune() (#155848)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155848
Approved by: https://github.com/masnesral
2025-06-18 01:18:09 +00:00
e8bfce9a43 Document how to use stack-based APIs with StableIValue (#155984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155984
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-06-18 01:10:23 +00:00
541297daae [Build] Allow metal shaders to include ATen headers (#156256)
No-op change that will be used later to share structs between CPU and Metal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156256
Approved by: https://github.com/dcci
2025-06-18 01:03:25 +00:00
3dabc351bb [Break XPU] Fix XPU UT failures introduced by community. (#156091)
Fixes #15089, Fixes #156063, Fixes #155689, Fixes #155692, Fixes #156146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156091
Approved by: https://github.com/jansel
2025-06-17 23:43:37 +00:00
38e1e5d54c Add get_pipeline_order() for Gpipe and 1F1B (#155935)
The [schedule visualizer](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/_schedule_visualizer.py) relies on `self.pipeline_order` to be populated. The `_PipelineScheduleRuntime` also depends on this to run the IR.

The single-stage schedules do not implement this, so this PR adds it. It also fixes a bug in the schedule visualizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155935
Approved by: https://github.com/wconstab
2025-06-17 23:39:17 +00:00
5435e75399 [ez] rename choice_timings -> choice_timings_fn (#156099)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156099
Approved by: https://github.com/mlazos
ghstack dependencies: #155982, #155996, #156053
2025-06-17 23:30:27 +00:00
12b02137af [MPS] Add benchmark for scan operations (#156241)
Comparison of cumsum performance before and after the Metal implementation:

Previous performance (using torch==2.7.1):
```
[-------------------------------  -------------------------------]
                                              |  eager  |  compile
1 threads: -------------------------------------------------------
      cumsum-dim0-32x32 (torch.float16)       |  131.0  |   136.9
      cumsum-dim0-128x128 (torch.float16)     |  116.9  |   121.2
      cumsum-dim0-512x512 (torch.float16)     |  132.5  |   151.9
      cumsum-dim0-1024x1024 (torch.float16)   |  150.0  |   163.0
      cumsum-dim1-32x32 (torch.float16)       |  125.9  |   140.9
      cumsum-dim1-128x128 (torch.float16)     |  116.4  |   129.4
      cumsum-dim1-512x512 (torch.float16)     |  135.9  |   150.1
      cumsum-dim1-1024x1024 (torch.float16)   |  139.5  |   154.2
      cumsum-1d-100 (torch.float16)           |  119.5  |   127.1
      cumsum-1d-10000 (torch.float16)         |  128.9  |   142.5
      cumsum-1d-1000000 (torch.float16)       |  140.6  |   145.6
      cumsum-dim0-32x32 (torch.float32)       |  115.7  |   132.5
      cumsum-dim0-128x128 (torch.float32)     |  118.0  |   131.5
      cumsum-dim0-512x512 (torch.float32)     |  138.8  |   151.6
      cumsum-dim0-1024x1024 (torch.float32)   |  155.5  |   164.2
      cumsum-dim1-32x32 (torch.float32)       |  127.2  |   141.7
      cumsum-dim1-128x128 (torch.float32)     |  117.7  |   130.5
      cumsum-dim1-512x512 (torch.float32)     |  138.2  |   152.3
      cumsum-dim1-1024x1024 (torch.float32)   |  144.4  |   158.6
      cumsum-1d-100 (torch.float32)           |  118.6  |   128.0
      cumsum-1d-10000 (torch.float32)         |  125.5  |   141.5
      cumsum-1d-1000000 (torch.float32)       |  143.9  |   158.4
      cumsum-dim0-32x32 (torch.bfloat16)      |  106.6  |   137.6
      cumsum-dim0-128x128 (torch.bfloat16)    |  118.1  |   131.0
      cumsum-dim0-512x512 (torch.bfloat16)    |  140.0  |   154.3
      cumsum-dim0-1024x1024 (torch.bfloat16)  |  153.2  |   164.4
      cumsum-dim1-32x32 (torch.bfloat16)      |  127.9  |   132.6
      cumsum-dim1-128x128 (torch.bfloat16)    |  116.5  |   129.6
      cumsum-dim1-512x512 (torch.bfloat16)    |  136.5  |   151.2
      cumsum-dim1-1024x1024 (torch.bfloat16)  |  139.8  |   144.8
      cumsum-1d-100 (torch.bfloat16)          |  115.7  |   129.4
      cumsum-1d-10000 (torch.bfloat16)        |  125.0  |   143.3
      cumsum-1d-1000000 (torch.bfloat16)      |  127.8  |   143.4

Times are in microseconds (us).
```

Current performance:
```
[--------------------------------  --------------------------------]
                                              |   eager   |  compile
1 threads: ---------------------------------------------------------
      cumsum-dim0-32x32 (torch.float16)       |    107.4  |    123.8
      cumsum-dim0-128x128 (torch.float16)     |    134.2  |    145.8
      cumsum-dim0-512x512 (torch.float16)     |    207.3  |    231.6
      cumsum-dim0-1024x1024 (torch.float16)   |    318.9  |    355.3
      cumsum-dim1-32x32 (torch.float16)       |     98.0  |    114.3
      cumsum-dim1-128x128 (torch.float16)     |    110.8  |    121.6
      cumsum-dim1-512x512 (torch.float16)     |    193.0  |    209.1
      cumsum-dim1-1024x1024 (torch.float16)   |    844.7  |    870.8
      cumsum-1d-100 (torch.float16)           |    108.4  |    125.0
      cumsum-1d-10000 (torch.float16)         |    784.7  |    852.3
      cumsum-1d-1000000 (torch.float16)       |  65855.2  |  66725.9
      cumsum-dim0-32x32 (torch.float32)       |    114.7  |    115.7
      cumsum-dim0-128x128 (torch.float32)     |    139.0  |    151.6
      cumsum-dim0-512x512 (torch.float32)     |    197.3  |    208.0
      cumsum-dim0-1024x1024 (torch.float32)   |    312.7  |    332.9
      cumsum-dim1-32x32 (torch.float32)       |     92.0  |    110.8
      cumsum-dim1-128x128 (torch.float32)     |    114.2  |    125.0
      cumsum-dim1-512x512 (torch.float32)     |    186.2  |    196.1
      cumsum-dim1-1024x1024 (torch.float32)   |    752.0  |    825.0
      cumsum-1d-100 (torch.float32)           |    112.4  |    122.0
      cumsum-1d-10000 (torch.float32)         |    793.5  |    863.5
      cumsum-1d-1000000 (torch.float32)       |  66431.8  |  66040.0
      cumsum-dim0-32x32 (torch.bfloat16)      |    111.6  |    121.6
      cumsum-dim0-128x128 (torch.bfloat16)    |    139.0  |    138.4
      cumsum-dim0-512x512 (torch.bfloat16)    |    217.6  |    230.1
      cumsum-dim0-1024x1024 (torch.bfloat16)  |    305.2  |    325.6
      cumsum-dim1-32x32 (torch.bfloat16)      |    100.5  |    110.9
      cumsum-dim1-128x128 (torch.bfloat16)    |    112.8  |    125.0
      cumsum-dim1-512x512 (torch.bfloat16)    |    187.8  |    208.9
      cumsum-dim1-1024x1024 (torch.bfloat16)  |    790.9  |    864.7
      cumsum-1d-100 (torch.bfloat16)          |    111.6  |    124.6
      cumsum-1d-10000 (torch.bfloat16)        |    778.1  |    844.9
      cumsum-1d-1000000 (torch.bfloat16)      |  64654.3  |  64082.5

Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156241
Approved by: https://github.com/malfet
2025-06-17 22:30:22 +00:00
fa4f07b5b8 Revert "[Docs] Convert to markdown to fix 155032 (#155520)"
This reverts commit cd66ff80307862ef8e75520054ecd19a5eff9f7e.

Reverted https://github.com/pytorch/pytorch/pull/155520 on behalf of https://github.com/atalman due to breaks multiple test_quantization.py::TestQuantizationDocs::test_quantization_ ([comment](https://github.com/pytorch/pytorch/pull/155520#issuecomment-2981996091))
2025-06-17 22:22:50 +00:00
54998c2daa Document padding size limitations in nn.modules.padding (#134840) (#155618)
Fixes #134840

Added documentation to clarify padding size constraints for all padding modes in nn.modules.padding:

- Circular padding: size must be less than or equal to the corresponding input dimension
- Reflection padding: size must be less than the corresponding input dimension
- Replication padding: output dimensions must remain positive

These changes help prevent runtime errors when users attempt to use large padding values.
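
A minimal sketch of the documented constraints (illustrative only; sizes chosen arbitrarily):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4)  # the padded dimension has size 4

# Reflection padding must be strictly less than the input dimension.
y = nn.ReflectionPad1d(3)(x)   # 3 < 4: OK, output width 4 + 2*3 = 10
# nn.ReflectionPad1d(4)(x)     # 4 is not < 4: raises at runtime

# Circular padding may be up to (and including) the input dimension.
z = nn.CircularPad1d(4)(x)     # 4 <= 4: OK, output width 4 + 2*4 = 12

# Replication padding only requires the output dimensions to stay positive,
# so large positive pads are fine.
w = nn.ReplicationPad1d(8)(x)  # output width 4 + 2*8 = 20
```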

## PR Checklist
- [x] The PR title and message follow our [commit guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#commit-message-format)
- [x] The PR is made against the correct branch
- [x] The PR is labeled with `docathon`
- [x] The PR is labeled with `module: nn`
- [x] The PR is labeled with `documentation`
- [x] The PR description includes a reference to the issue being fixed
- [x] The PR includes tests if applicable
- [x] The PR includes documentation changes
- [x] The PR has been tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155618
Approved by: https://github.com/AlannaBurke, https://github.com/malfet
2025-06-17 22:16:48 +00:00
937529f0b3 Pass by const ref instead of by value in StableIValue from (#156126)
I realize I was passing stable::Tensors by value (thus making a copy every time) which is not what I want from the `from` function that converts Ts to StableIValues. `from` should not mutate the input and should be read-only.

I asked an LLM whether this is API BC breaking (with an intuition that it shouldn't be), and it said no, cuz:
1. "Passing by const reference is more permissive than passing by value. e.g., if T is a type that has a deleted or inaccessible copy constructor (e.g., std::unique_ptr), the original code would have been invalid, while the new code would be valid." Nice. We are good with additive.
2. We didn't modify the original input before (cuz we took a copy) and we don't now (cuz we promise const).

Update: The LLM failed to mention primitives, which we should not pass by reference, so we are only changing the signatures for std::optional<T> and stable::Tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156126
Approved by: https://github.com/swolchok
ghstack dependencies: #155367, #155977
2025-06-17 22:11:30 +00:00
4c0aa37dda Support stream capture of event record and wait nodes in cuda graphs (#155372)
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.

We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().

If external=False, the cudaEventRecord and cudaStreamWaitEvent APIs
have a different meaning described here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events

In short, they will be used to express fork and join operations in
the graph if external=False.

External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
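
A hedged sketch of the timing use case, assuming the new external=True keyword can be combined with enable_timing=True as described (requires a CUDA device):

```python
import torch

start = torch.cuda.Event(enable_timing=True, external=True)
end = torch.cuda.Event(enable_timing=True, external=True)

x = torch.randn(1024, 1024, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    start.record()   # captured as an external event record node
    y = x @ x        # the part of the graph we want to time
    end.record()     # captured as an external event record node
    y = y.relu()     # still in the graph, but outside the timed region

g.replay()
torch.cuda.synchronize()
print(start.elapsed_time(end))  # milliseconds for just the timed region
```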

Finishes #146145

I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
2025-06-17 21:44:51 +00:00
8e02cd9c5a Skip cache related configs for cache config serialization (#156195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156195
Approved by: https://github.com/masnesral
2025-06-17 21:24:07 +00:00
3106a33e41 [fr] Fix one error in analysis script when subPG world size is smaller than global size (#156156)
Summary: We ran into an interesting case where we saw many mismatches, while most of those mismatches turned out to be full matches. The reason is that we were comparing the dump ranks (which go from 0 to 79) against the local PG ranks (0 to 7), which leads to false-positive mismatches. It is sufficient to just check whether the dump ranks contain all expected ranks.
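
A minimal illustration of the check (the rank counts are just the ones from the case above):

```python
dump_ranks = set(range(80))     # global ranks present in the dump (0..79)
expected_ranks = set(range(8))  # ranks of the local sub-PG (0..7)

# Comparing the two sets for equality flags a spurious mismatch; checking that
# the dump contains every expected rank is sufficient.
assert expected_ranks.issubset(dump_ranks)
```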

Test Plan:
Test with the failed case with the script and we now see the correct behavior + new unit test case.

Rollback Plan:

Differential Revision: D76775373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156
Approved by: https://github.com/VieEeEw
2025-06-17 21:17:58 +00:00
bb462a6237 [cutlass backend] Fix prescreening non-deterministic problem (#156144)
Differential Revision: [D76642615](https://our.internmc.facebook.com/intern/diff/D76642615/)

What do we expect to see when we run two identical matmuls back to back? We expect the second one to spend no time in precompilation, autotuning, and prescreening.

However, the introduction of prescreening brings some non-determinism. Basically, we have
1. prescreening of first matmul chooses a set of kernels to advance to autotuning
2. autotuning re-does the autotuning of the winners, potentially changing their timings a bit
3. second prescreening results in a slightly different set of kernels
4. since not all timings are present, an autotune is re-done.

With this diff:
```
SingleProcess AUTOTUNE benchmarking takes 3.8633 seconds and 134.7364 seconds precompiling for 32 choices and 24.4472 seconds prescreening
SingleProcess AUTOTUNE benchmarking takes 0.0003 seconds and 0.0027 seconds precompiling for 32 choices and 0.0006 seconds prescreening
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156144
Approved by: https://github.com/mlazos
2025-06-17 20:39:06 +00:00
cd66ff8030 [Docs] Convert to markdown to fix 155032 (#155520)
Fix #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  quantization.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization.html) vs [main](https://docs.pytorch.org/docs/main/quantization.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-17 20:29:45 +00:00
50940270ae [BE][3/X] Phase out usage of use_max_autotune() (#155849)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155849
Approved by: https://github.com/masnesral
2025-06-17 20:26:29 +00:00
b020971e78 [BE] fix typos in torchgen/ (#156083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156083
Approved by: https://github.com/jingsh
ghstack dependencies: #156079, #156082
2025-06-17 19:25:50 +00:00
a69785b3ec [BE] fix typos in tools/ (#156082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156082
Approved by: https://github.com/soulitzer
ghstack dependencies: #156079
2025-06-17 19:25:50 +00:00
ccea6ddac3 [BE] fix typos in cmake/ (#156079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156079
Approved by: https://github.com/Skylion007
2025-06-17 19:25:43 +00:00
5eb5c3700b [ROCm] enable batched eigen decomposition (syevD_batched) on ROCm (#154525)
This PR implements `Batched Eigen Decomposition` (syevD_batched) on ROCm by calling rocSolver directly.
cuSolver doesn't support syevD_batched and neither does hipSolver. Direct call to rocSolver is required.

`syevD_batched` will be used on ROCm if all the following conditions are met:
- `rocSolver version >= 3.26`
- input data type is `float` or `double`
- batch size >= 2

Otherwise, non-batched `syevD` will be used on ROCm (complex data types, batch size==1,  rocSolver <3.26)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154525
Approved by: https://github.com/Mellonta
2025-06-17 19:20:36 +00:00
ec08eb8ba2 Revert "[inductor][cutlass] binary remote cache (#156106)"
This reverts commit 9a2c669425379eb264f896390b8fcd8d3f2ce959.

Reverted https://github.com/pytorch/pytorch/pull/156106 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156106#issuecomment-2981533904))
2025-06-17 19:07:49 +00:00
4a26bb8a12 [C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)
Fixes #155668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155900
Approved by: https://github.com/ngimel
2025-06-17 18:59:44 +00:00
fc177801af Enable FP8 row-wise scaled-mm for sm12x (#155991)
## Update using Cutlass 3.x (2025/06/15)

Following @alexsamardzic's advice, I tried out the Cutlass 3.x API and it's impressive (the rated spec is 419 TFLOPS)

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91

This uses the same SM100 template. The only difference is
- Cluster size is fixed to `<1,1,1>` since sm120 does not have multicast feature
- ~~Tile size is fixed to `<128,128,128>` due to default kernel schedule does not support `<64,128,128>`. I will work a bit on improve perf for small M.~~ Fixed. Use `KernelTmaWarpSpecializedPingpong` when TileShape.M == 64

Perf for small M is still bad since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.

## Original using SM89

I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).

Simple benchmark script

```python
import torch
from torch._inductor.utils import do_bench_using_profiling

N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
    A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
    scale_A = torch.ones(M, 1).cuda()
    scale_B = torch.ones(1, N).cuda()

    out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
    out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
    torch.testing.assert_close(out, out_ref)

    latency_us = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
    tflops = (2 * M * N * K) / latency_us / 1e9
    print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```

M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73 TFLOPS
64 | 4096 | 4096 | 71.84 TFLOPS
256 | 4096 | 4096 | 86.40 TFLOPS
1024 | 4096 | 4096 | 112.12 TFLOPS
4096 | 4096 | 4096 | 121.24 TFLOPS

According to the [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is 419 TFLOPS. So the result is quite bad here...

However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13 TFLOPS
64|4096|4096|84.84 TFLOPS
256|4096|4096|96.75 TFLOPS
1024|4096|4096|110.21 TFLOPS
4096|4096|4096|122.98 TFLOPS

Small M slightly improves, but large M is still bad.

If I further change `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, which is taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73

Which is much closer to hardware limit. And it also means this kernel is sufficient to get the most perf out of sm120. Only need better tuned configs.

To make sure this high perf is only obtainable with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also try using `ThreadblockSwizzleStreamK` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86

A bit better than current configs, but still very far away from hardware limit.

@alexsamardzic I noticed you chose these configs in #149978. Do you have any numbers on how the current configs perform on sm89?

Update: Using triton codegen-ed from inductor `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53

Better than default configs, but still far away from the config above for compute-bound

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
2025-06-17 18:52:44 +00:00
e323d46b61 ELU: compute ELU(0) with the cheaper definition (#155765)
Both halves of the ELU definition yield 0 when evaluated at 0. Let's choose the half that doesn't require expm1. (I have no particular evidence that the input is often 0 in any case, but this seems like a free win.)
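
A minimal sketch of the idea in plain Python (not the actual ATen kernel):

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    # Route x == 0 through the cheap linear branch; both halves of the
    # definition evaluate to 0 there, so the result is unchanged.
    if x >= 0:                    # previously the cheap branch was x > 0
        return x
    return alpha * math.expm1(x)  # alpha * (exp(x) - 1) for x < 0
```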

Differential Revision: [D76481038](https://our.internmc.facebook.com/intern/diff/D76481038/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155765
Approved by: https://github.com/ezyang
2025-06-17 18:20:22 +00:00
8b0e0e4f23 [dynamo] Support tracing of functools.lru_cached method (#156125)
Fixes https://github.com/pytorch/pytorch/issues/155841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156125
Approved by: https://github.com/williamwen42
2025-06-17 18:11:32 +00:00
fc5ae12293 Fix issue with right-nav (#156119)
Enable the on-page right nav. For autosummary, we need to set `"show_toc_level": 2` so that navigation is enabled. Example:
* Main: https://docs.pytorch.org/docs/main/special.html - right nav (under On this page) is empty.
* Preview: https://docs-preview.pytorch.org/pytorch/pytorch/156119/special.html - right nav (under On this page) has all the objects listed
<img width="1125" alt="Screenshot 2025-06-16 at 2 48 16 PM" src="https://github.com/user-attachments/assets/0790bb72-5997-4542-9847-0a89be4598c0" />
vs
<img width="1030" alt="Screenshot 2025-06-16 at 2 48 55 PM" src="https://github.com/user-attachments/assets/4897c49c-044d-4bea-a8cd-490c90cca2b0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156119
Approved by: https://github.com/albanD
2025-06-17 18:09:51 +00:00
32c1611263 [CI][run_test] Fix rerun logic for failing at exit (#155853)
Sometimes a test file reports success according to pytest, but fails afterwards, and the rerun logic doesn't handle that correctly.

The name of the last-run test is saved in order to do more efficient reruns (target the last-run test for a rerun without rerunning the entire file). This is usually correct, e.g., a test fails and pytest catches it -> lastrun = the test that failed; a test segfaults (pytest doesn't catch it) -> lastrun is the test that segfaulted. But sometimes pytest reports a success while the process has a non-zero exit code; the two cases I know of are hangs and double frees at exit. In that case it's unclear which test caused the failure, so lastrun is set to the first test that ran in that session, so that during the next session it will start from the beginning in an attempt to reproduce the error (an alternate solution would be to just fail and not rerun, which might be the better option). But then it reruns with runsingle, which prevents lastrun from being reset (not sure why; I'm pretty sure there's normally no difference between resetting and not), so lastrun becomes the last test that ran, and it's not always true that lastrun is the one that caused the failure. Then on the next run, it starts from the last test and the process exits cleanly.

Short-term solution here: ensure that lastrun is always set to the initial value if the session succeeds. This is correct even in the normal path because the initial value shouldn't change in that case.

Things that still need to be fixed:
* log says "running single test" which is not true
* no xml reports get generated here
* also no xml reports get generated on segfault
* docs for this

I think I have a PR that fixes the above but its old so I need to take another look

Testing:
This is from when I was based on a commit that had a hang for macs, and before I added the skips in inductor array ref:
cc862d2c14

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155853
Approved by: https://github.com/malfet
2025-06-17 17:51:40 +00:00
6629eaf0c6 [CMAKE] Fix torch_cpu relink logic if metal shaders are recompiled (#156193)
Beforehand, shader recompilation updated `caffe2/aten/src/ATen/metallib_dummy.cpp` but `torch_cpu` was dependent on `aten/src/ATen/metallib_dummy.cpp`

Test plan: Run `python3 ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal` and observe that torch_cpu is being relinked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156193
Approved by: https://github.com/manuelcandales
2025-06-17 17:49:33 +00:00
a4ea242edc [MPS] Implement scan metal kernels (#156100)
Implements metal kernels for scan operations:
- Migrates cumsum and cumprod from MPSGraph implementation to Metal.
- Fixes #154881
- Adds MPS backend support for cummin and cummax

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156100
Approved by: https://github.com/malfet
2025-06-17 17:44:22 +00:00
9a5c59368d Replace all RAIIATH with Tensor in libtorch_agnostic test, test some APIs (#155977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155977
Approved by: https://github.com/albanD
ghstack dependencies: #155367
2025-06-17 17:36:31 +00:00
b115a4c03a torch::stable::Tensor beginnings, mainly mem mgmt (#155367)
```
// The torch::stable::Tensor class is a highlevel C++ header-only wrapper around
// the C shim Tensor APIs. We've modeled this class after TensorBase, as custom
// op kernels only really need to interact with Tensor metadata (think sizes,
// strides, device, dtype). Other functions on Tensor (like empty_like) should
// live like the ATen op that they are and exist outside of this struct.
//
// There are several goals of this class over AtenTensorHandle and
// RAIIAtenTensorHandle:
// 1. torch::stable::Tensor is a nicer UX much closer to torch::Tensor than the
//    C APIs with AtenTensorHandle. Under the hood we still call to these C shim
//    APIs to preserve stability.
// 2. RAIIAtenTensorHandle boils down to a uniq_ptr that forces the user to pass
//    around ownership. This makes it difficult to pass one input into 2
//    different functions, e.g., doing something like c = a(t) + b(t) for
//    stable::Tensor t. Thus, we use a shared_ptr here.
```

This PR:
- exemplifies the above
- adds test cases in libtorch_agnostic to make sure the file actually works
- includes the results of a battle with template specialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155367
Approved by: https://github.com/albanD
2025-06-17 17:36:31 +00:00
2625c70aec Update CODEOWNERS (#156182)
as title says. removing me as codeowner for cpp extensions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156182
Approved by: https://github.com/albanD
2025-06-17 17:15:41 +00:00
a24afbff3f Support torch.cuda.*Tensor in Dynamo (#156107)
Summary:
This PR adds support for torch.cuda.FloatTensor and friends in Dynamo.
These are indeed legacy APIs, but that doesn't stop us from adding
support for them in torch.compile.

I add support for these in the same way that we support torch.Tensor:
these APIs can be safely put into the Dynamo graph.

Fixes #130722

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156107
Approved by: https://github.com/williamwen42
2025-06-17 16:31:10 +00:00
9a2c669425 [inductor][cutlass] binary remote cache (#156106)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

Using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. Only register the handler internally.

Test Plan:
## prove that we can upload successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Rollback Plan:

Reviewed By: henrylhtsang

Differential Revision: D76454741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156106
Approved by: https://github.com/henrylhtsang

Co-authored-by: atalman <atalman@fb.com>
2025-06-17 16:24:10 +00:00
d66b4bcc3f [inductor][triton pin] Support triton builtins after #7054 (#156031)
Triton's PR 7054 modifies the builtins to take _semantic as a kwarg instead of _builder.

To handle this, this PR checks the signature of tl.core.view (to see if it takes _builder or _semantic), and adds a wrapper converting _semantic to _builder if the new _semantic kwarg is being used.
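
A hedged sketch of the detection step (illustrative only; the real shim lives in Inductor's Triton compatibility layer):

```python
import inspect
import triton.language as tl

params = inspect.signature(tl.core.view).parameters
uses_semantic = "_semantic" in params          # newer Triton (post triton-lang/triton#7054)
builtin_kwarg = "_semantic" if uses_semantic else "_builder"
```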

(Previously-)failing test: `python test/inductor/test_cooperative_reductions.py -k test_welford_non_power_of_2_rsplit_persistent_True_x_9_r_8000_rsplit_37`

Differential Revision: [D76801240](https://our.internmc.facebook.com/intern/diff/D76801240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156031
Approved by: https://github.com/NikhilAPatel
2025-06-17 16:09:55 +00:00
d083841c0e Fix a small sphinx markup error (#156061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156061
Approved by: https://github.com/colesbury
2025-06-17 15:36:02 +00:00
0079c80b35 [CI] Do not constrain memory for ROCm testing in CI (#156115)
Fixes ROCm OOMs introduced by https://github.com/pytorch/pytorch/pull/155631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156115
Approved by: https://github.com/jeffdaily
2025-06-17 15:30:36 +00:00
7fcad0231c [Docs] Convert to markdown to fix 155025 (#155789)
Related to #155025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155789
Approved by: https://github.com/svekars
2025-06-17 15:08:14 +00:00
4886ba64dc [BE] Refactor functions from optional_submodules (#155954)
And use `pathlib.Path` instead of `os.path`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155954
Approved by: https://github.com/Skylion007
ghstack dependencies: #155947
2025-06-17 14:41:52 +00:00
cf90c9f8d1 [Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/ngimel, https://github.com/cyyever

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-06-17 14:15:49 +00:00
42015db6a9 [BE] fix typos in benchmarks/ (#156077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156077
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #156069
2025-06-17 13:12:18 +00:00
0a0023d984 Enable NCCL zero-copy (user buffer registration) for FSDP2 (#150564)
In recent versions NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with all the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers and thus accelerate communication and reduce resource footprint of NCCL's kernels (which reduces contention).

This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool to allow caching and reusing.

FSDP2 is also particularly suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it anyways needs to (de)interleave the parameters to/from these private buffers.

This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/syed-ahmed
2025-06-17 12:54:58 +00:00
11bb1ece50 [CI] Fix triton version split issue (#155670)
Fix a bug caused by #155313, refer https://github.com/pytorch/pytorch/actions/runs/15576592378/job/43862613039?pr=154194#step:7:652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155670
Approved by: https://github.com/atalman, https://github.com/EikanWang
2025-06-17 12:42:40 +00:00
1cce73b5f4 [build] Change --cmake{,-only} arguments to envvars to support modern Python build frontend (#156045)
See also:

- #156029
- #156027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156045
Approved by: https://github.com/ezyang
ghstack dependencies: #156040, #156041
2025-06-17 11:40:24 +00:00
57084ca846 [BE][setup] allow passing pytorch-specific setup.py options from envvars (#156041)
See also:

- #156029
- #156027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156041
Approved by: https://github.com/ezyang
ghstack dependencies: #156040
2025-06-17 11:40:24 +00:00
092aed1b18 [Intel GPU] Enable GQA and different head_dim of value for SDPA (#150992)
In OneDNN v3.7, SDPA doesn't support num_head_q != num_head_kv (aka GQA) and head_dim_qk != head_dim_v.
In OneDNN v3.8, SDPA supports these two scenarios, so enable them in this PR. SDPA UTs pass in local testing.
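
A hedged sketch of the newly supported shapes (assumes an XPU device; `enable_gqa` and the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.half)  # 8 query heads
k = torch.randn(2, 2, 128, 64, device="xpu", dtype=torch.half)  # 2 kv heads (GQA)
v = torch.randn(2, 2, 128, 96, device="xpu", dtype=torch.half)  # head_dim_v != head_dim_qk

out = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
print(out.shape)  # torch.Size([2, 8, 128, 96])
```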

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150992
Approved by: https://github.com/guangyey, https://github.com/drisspg, https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-17 11:09:51 +00:00
4a8f5e752b [FSDP2] explain user contract for fully_shard (#156070)
<img width="896" alt="Screenshot 2025-06-16 at 1 36 00 AM" src="https://github.com/user-attachments/assets/7cdea256-2454-49c7-8b32-24549a13134d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156070
Approved by: https://github.com/mori360
2025-06-17 10:03:19 +00:00
8d7ee0f4fb [BE] fix typos in .ci/, .circleci/, .github/ (#156069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156069
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-17 09:43:14 +00:00
2e0e08588e [BE][PYFMT] migrate PYFMT for torch/[e-n]*/ to ruff format (#144553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144553
Approved by: https://github.com/ezyang
ghstack dependencies: #144551
2025-06-17 08:18:47 +00:00
cyy
95cb42c45d Use CMAKE_COLOR_DIAGNOSTICS (#154583)
`CMAKE_COLOR_DIAGNOSTICS` was introduced in CMake 2.24. Use it to simplify CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154583
Approved by: https://github.com/ezyang
2025-06-17 04:52:31 +00:00
cyy
d43c0bdf46 [CI] Move ASAN jobs to clang-18 (#149099)
Use clang-18 for ASAN jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149099
Approved by: https://github.com/ezyang
2025-06-17 04:51:07 +00:00
7b0118884e [invoke_subgraph][inductor] Dont fallback on complex dtype (#155885)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155885
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #155828
2025-06-17 04:47:12 +00:00
ffcc6fea7b [invoke_subgraph] Ignore redundantly nested invoke_subgraph (#155828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155828
Approved by: https://github.com/zou3519
2025-06-17 04:47:12 +00:00
b1713c6655 [MPS][Testing][BE] Fix samples for full_like (#156026)
Now that the device is known, one can avoid creating tensors of `torch.double` type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026
Approved by: https://github.com/dcci
ghstack dependencies: #156121
2025-06-17 04:46:26 +00:00
82672206b7 [SymmMem] Make get_rank_to_global_rank return const ref (#156117)
Avoiding a copy, not expecting a caller to change its value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156117
Approved by: https://github.com/fegin
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116
2025-06-17 04:13:18 +00:00
eea3bcb3d1 [SymmMem] Cache rank_to_global_rank exchange (#156116)
The rank-to-global-rank exchange is a major overhead in `NVSHMEMSymmetricMemory` creation.
We should cache its result on a per-group basis.

Before this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 18
```

After this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156116
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971, #155975
2025-06-17 04:12:37 +00:00
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
45382b284d [cutlass backend] changes how gpu_kernels_o are handled for cutlass (#155875)
Currently, we do this in a somewhat hacky way: look at all the .o files we have from this session and add them all to AOTI. This doesn't work, for example, if we do multiple AOTI compilations in one session without clearing the inductor cache.

I also want to change how the cutlass .so files are compiled, hence this change.

This change is broken down since @coconutruben is trying to make a change to the same files too.

Differential Revision: [D76563003](https://our.internmc.facebook.com/intern/diff/D76563003/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155875
Approved by: https://github.com/ColinPeppler
2025-06-17 02:06:54 +00:00
cyy
64bb6317a5 [Accelerator] Fix Python typing in accelerator (#152394)
There are some changes:
1. Use keywords for arguments if possible.
2. `__exit__ ` of `device_index` is changed to return None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152394
Approved by: https://github.com/XuehaiPan, https://github.com/guangyey, https://github.com/ezyang

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-17 01:27:40 +00:00
1f0eb79e3e [dynamo] fix KeyError in LOAD_FAST_CHECK (#155763)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155763
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #155761
2025-06-17 00:54:16 +00:00
4e833c2005 [dynamo] support tracing weakref callback (#155761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155761
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-17 00:54:16 +00:00
e6252f62ef [ONNX] Implements converter for higher order ops scan (#154513)
Fixes #151327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154513
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-06-17 00:54:07 +00:00
b618817479 [PGO] include ints/floats in suggested whitelist (#155980)
Made the mistake of dropping these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155980
Approved by: https://github.com/bobrenjc93
2025-06-17 00:41:38 +00:00
4311aea5e7 [AOTInductor] Add class declarations to torch._C._aoti interface file (#155128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155128
Approved by: https://github.com/desertfire
ghstack dependencies: #155149
2025-06-17 00:10:57 +00:00
82fb904140 Add warning for incorrected grad results at world size 1 (#154928)
Add warning for the issue discussed at https://github.com/pytorch/pytorch/issues/144045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154928
Approved by: https://github.com/weifengpy
2025-06-17 00:08:04 +00:00
eb4cf59ecd Add FSDP2 logging (#155826)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155826
Approved by: https://github.com/weifengpy
2025-06-16 23:49:58 +00:00
6e2992a998 Remove unused Azure pipeline trigger script (#156134)
## Summary
- delete `.circleci/scripts/trigger_azure_pipeline.py`

## Testing
- `python3 -m pip install flake8`
- `python3 -m flake8 .circleci/scripts`

------
https://chatgpt.com/codex/tasks/task_e_6850a55f530c83279036800308fb6871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156134
Approved by: https://github.com/izaitsevfb
2025-06-16 23:42:52 +00:00
4781b0ee60 [SymmMem] Add NVSHMEM GET support to Triton (#155890)
Adds NVSHMEM GET operation support for Triton kernels:

- Add `getmem_block` core.extern wrapper for nvshmemx_getmem_block
- Add basic `test_triton_get` for 2-rank GET operation
- Add `test_triton_get_ring` for ring topology GET across arbitrary ranks

**Tests:**
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py`

`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_get`

```python
@skipIfRocm
@requires_triton()
def test_triton_get(self) -> None:
   @triton.jit
   def get_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
       nvshmem.getmem_block(dst_ptr, src_ptr, numel, peer)

   # ... setup code ...

   val = 7
   inp = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(
       val if rank == 0 else -1
   )
   out = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(-1)

   peer = 1 - rank
   if rank == 1:
       # Rank 1 gets data from rank 0
       get_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] got data from peer {peer}")
```

```

[Rank 0] inp buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:0', dtype=torch.int8)
[Rank 1] inp buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:1', dtype=torch.int8)
...
[Rank 1] out buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:1', dtype=torch.int8)
...
[Rank 1] got data from peer 0

----------------------------------------------------------------------
Ran 2 tests in 17.046s

OK
```

```python
@skipIfRocm
@requires_triton()
def test_triton_get_ring(self) -> None:
   @triton.jit
   def get_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
       nvshmem.getmem_block(dst_ptr, src_ptr, numel, peer)

   # ... setup code ...

   # Ring topology: each rank gets data from the rank to its left
   peer = (rank - 1) % world_size

   # All ranks execute the get operation
   get_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] got data from peer {peer}")

```

```
Output (8 GPUs):

[Rank 0] inp buffer: tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0', dtype=torch.int8)
[Rank 2] inp buffer: tensor([2, 2, 2, 2, 2, 2, 2, 2], device='cuda:2', dtype=torch.int8)
[Rank 5] inp buffer: tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:5', dtype=torch.int8)
[Rank 6] inp buffer: tensor([6, 6, 6, 6, 6, 6, 6, 6], device='cuda:6', dtype=torch.int8)
[Rank 3] inp buffer: tensor([3, 3, 3, 3, 3, 3, 3, 3], device='cuda:3', dtype=torch.int8)
[Rank 1] inp buffer: tensor([1, 1, 1, 1, 1, 1, 1, 1], device='cuda:1', dtype=torch.int8)
[Rank 2] out buffer: tensor([1, 1, 1, 1, 1, 1, 1, 1], device='cuda:2', dtype=torch.int8)
[Rank 5] out buffer: tensor([4, 4, 4, 4, 4, 4, 4, 4], device='cuda:5', dtype=torch.int8)
[Rank 0] out buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:0', dtype=torch.int8)
[Rank 2] got data from peer 1
[Rank 5] got data from peer 4
[Rank 0] got data from peer 7
[Rank 7] inp buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:7', dtype=torch.int8)
[Rank 6] out buffer: tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:6', dtype=torch.int8)
[Rank 3] out buffer: tensor([2, 2, 2, 2, 2, 2, 2, 2], device='cuda:3', dtype=torch.int8)
[Rank 6] got data from peer 5
[Rank 3] got data from peer 2
[Rank 1] out buffer: tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 4] inp buffer: tensor([4, 4, 4, 4, 4, 4, 4, 4], device='cuda:4', dtype=torch.int8)
[Rank 7] out buffer: tensor([6, 6, 6, 6, 6, 6, 6, 6], device='cuda:7', dtype=torch.int8)
[Rank 7] got data from peer 6
[Rank 4] out buffer: tensor([3, 3, 3, 3, 3, 3, 3, 3], device='cuda:4', dtype=torch.int8)
[Rank 4] got data from peer 3

----------------------------------------------------------------------
Ran 1 test in 41.045s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155890
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
2025-06-16 23:18:15 +00:00
bb1f3d1a55 [MPSInductor] Improve _default dtype inference (#156121)
By just adding 'mps' as one of the backend options and fixing the reduction op to actually return a tuple of CSEVariables rather than a tuple of strings

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156121
Approved by: https://github.com/dcci
2025-06-16 23:11:53 +00:00
508cdc4fc9 [BE][4/X] Phase out usage of use_max_autotune() (#155850)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155850
Approved by: https://github.com/masnesral
2025-06-16 23:10:26 +00:00
f2d70898c6 [nativert] Move OpKernel to PyTorch core (#156011)
Summary:
Moves OpKernel base class to PyTorch core. It is an abstract interface representing a kernel, which is responsible for executing a single Node in the graph.

Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
buck2 run mode/dev-nosan caffe2/test/cpp/nativert:op_kernel_test

Rollback Plan:

Differential Revision: D76525939

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156011
Approved by: https://github.com/zhxchen17
2025-06-16 22:53:10 +00:00
35ecd7c2d4 Revert "[Cutlass] Fix buffer missing issues (#155897)"
This reverts commit 9bd42c15707a4b410ee005d5916e882a7db432bb.

Reverted https://github.com/pytorch/pytorch/pull/155897 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/155897#issuecomment-2978391416))
2025-06-16 22:44:39 +00:00
190f76fa31 Revert "Implement guard collectives (#155558)"
This reverts commit 5a5a05a6a3be376130848e235df73b752eef0230.

Reverted https://github.com/pytorch/pytorch/pull/155558 on behalf of https://github.com/malfet due to Hmm, may be I'm looking at the wrong metric, but c92f1075aa/1 shows that test started to pass after PR were reverted ([comment](https://github.com/pytorch/pytorch/pull/155558#issuecomment-2978337152))
2025-06-16 22:26:52 +00:00
c92f1075aa Fix if condition for CUDA 12.9 Win build (#156108)
follow-up for https://github.com/pytorch/pytorch/pull/155799/files
Currently the last if condition will be executed for CUDA 12.9, overriding previous CUDA_ARCH_LIST. We should exclude 12.9 from the last if condition to fix this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156108
Approved by: https://github.com/atalman
2025-06-16 21:57:34 +00:00
cce4d213a6 Remove non-header-only dep from c10_headers target (#155858)
It depends on /c10/util:base which is not header-only.

Differential Revision: [D76552750](https://our.internmc.facebook.com/intern/diff/D76552750/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D76552750/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155858
Approved by: https://github.com/ezyang
2025-06-16 21:41:25 +00:00
a24ce67dee [ez] fix grammar error in comment (#156053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156053
Approved by: https://github.com/jingsh
ghstack dependencies: #155982, #155996
2025-06-16 20:53:07 +00:00
247113e03e Add size_hint_or_throw (#155615)
## Summary
`TypeError("Cannot convert symbols to int")` is coming up more recently since more unbacked symints are making their way into Inductor. See https://github.com/pytorch/pytorch/issues/155484
- One way to deal with this is to add `size_hint_or_throw` to throw if we try to pull a hint from an unbacked expr.
- Then, repurpose `size_hint` to accommodate unbacked symints by setting a default fallback or adding an appropriate fallback for each callsite.

This PR adds `size_hint_or_throw`, which will throw if unbacked symints exist (see the sketch after the bullet below)
- use `size_hint_or_throw` -- usually when the callee can try/catch the exception or guards against unbacked symints
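
A hedged, self-contained sketch of the contract (the real helper lives in Inductor's size-var machinery; the argument names here are illustrative):

```python
import sympy

def size_hint_or_throw(expr: sympy.Expr, hints: dict, unbacked: set) -> int:
    if expr.free_symbols & unbacked:
        # Callers either try/except this or guard against unbacked symints.
        raise TypeError("Cannot convert symbols to int")
    return int(expr.xreplace(hints))

s0, u0 = sympy.symbols("s0 u0")
print(size_hint_or_throw(2 * s0, {s0: 8}, {u0}))  # 16
# size_hint_or_throw(2 * u0, {s0: 8}, {u0})       # raises TypeError
```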

------
with Codex
https://chatgpt.com/codex/tasks/task_e_684869dfc740832882c88d05534cc8f9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155615
Approved by: https://github.com/ezyang, https://github.com/laithsakka, https://github.com/jingsh

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-16 20:46:51 +00:00
008345be9d Fix #155018 (convert distributed rst to markdown) (#155528)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155018

Docs comparison (check out the 'new' whenever docs build)

1. distributed.checkpoint ([old](https://docs.pytorch.org/docs/main/distributed.checkpoint.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.checkpoint.html))
2. distributed.elastic ([old](https://docs.pytorch.org/docs/main/distributed.elastic.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.elastic.html))
3. distributed.fsdp.fully_shard ([old](https://docs.pytorch.org/docs/main/distributed.fsdp.fully_shard.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.fsdp.fully_shard.html))
4. distributed.optim ([old](https://docs.pytorch.org/docs/main/distributed.optim.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.optim.html))
5. distributed.pipelining ([old](https://docs.pytorch.org/docs/main/distributed.pipelining.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.pipelining.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155528
Approved by: https://github.com/wz337, https://github.com/svekars
2025-06-16 20:46:09 +00:00
eb2af14f8e [PT2][partitioners] Add aten.split to view_ops list [relanding #155424] (#155943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155943
Approved by: https://github.com/ShatianWang
2025-06-16 20:42:54 +00:00
03488d820c Revert "[MPS][Testing][BE] Fix samples for full_like (#156026)"
This reverts commit 2d832c9587fd99db295b62d0c9b459d509c19d06.

Reverted https://github.com/pytorch/pytorch/pull/156026 on behalf of https://github.com/atalman due to Sorry breaks MPS tests: test_ops.py::TestMathBitsCPU::test_neg_view_full_like_cpu_float64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15683608879/job/44182730620) [HUD commit link](2d832c9587) ([comment](https://github.com/pytorch/pytorch/pull/156026#issuecomment-2977903074))
2025-06-16 19:50:26 +00:00
6d2155db49 [PGO] no code state update on dynamic=False (#155961)
Summary:
When tensor size changes are detected on `dynamic=False`, overwrites the PGO state with the newest static shapes to reflect the latest frame state, instead of updating automatic dynamic.

A longer term solution, if we move to shared PGO state between multiple jobs, would be to update automatic dynamic, but avoid suggesting/logging the whitelist (compiling with `dynamic=False` should already override any dynamic PGO that's read, so we're fine there). This way if any particular job runs with `dynamic=False`, it won't statically overwrite the entire PGO state if it's shared with many other jobs.

Test Plan:
test/dynamo/test_pgo.py

Rollback Plan:

Differential Revision: D76630499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155961
Approved by: https://github.com/bobrenjc93
2025-06-16 19:47:55 +00:00
5a5a05a6a3 Implement guard collectives (#155558)
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:

1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile

One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.

I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.
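
A minimal sketch of the rendezvous step in the list above (names are illustrative, not the actual implementation; it assumes a boolean from the guard-evaluation step and an initialized process group):

```python
import torch
import torch.distributed as dist


def everyone_found_compiled_code(found_locally: bool, group=None) -> bool:
    flag = torch.tensor([1 if found_locally else 0])
    # MIN all-reduce: the result drops to 0 if any rank failed its guard lookup.
    dist.all_reduce(flag, op=dist.ReduceOp.MIN, group=group)
    return bool(flag.item())


# If this returns False, every rank falls through to recompilation together,
# so the compiler collectives issued during the recompile stay in sync.
```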

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
2025-06-16 19:46:16 +00:00
61b271e0f3 Revert "Implement guard collectives (#155558)"
This reverts commit 38e5e81e55fc5d85d6cf8a83c96c88578995e3fe.

Reverted https://github.com/pytorch/pytorch/pull/155558 on behalf of https://github.com/atalman due to Breaks CI, sorry: [GH job link](https://github.com/pytorch/pytorch/actions/runs/15683161593/job/44181274826) [HUD commit link](38e5e81e55) ([comment](https://github.com/pytorch/pytorch/pull/155558#issuecomment-2977871178))
2025-06-16 19:40:46 +00:00
7cf38d2a05 Make benchmark by op for TS model work with sample inputs (#155988)
Summary: Add pickle input type to allow for running ptvsc2_predictor_bench to get individual node benchmarks for SR

Test Plan:
```
buck2 run mode/opt caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- --scripted_model=/data/users/georgiaphillips/models/742055223/1/742055223_1.predictor.local --pt_inputs=/data/users/georgiaphillips/models/742055223/0/mix.pt --pt_enable_static_runtime=1 --compare_results=0 --iters=1000 --warmup_iters=100 --num_threads=1 --do_profile=1 --method_name=${MODULE_NAME}.forward --set_compatibility --do_benchmark=1 --pytorch_predictor_default_model_id=${MODEL_ENTITY_ID}_${SNAPSHOT_ID} --input_type=pickle
```

Rollback Plan:

Reviewed By: dolpm

Differential Revision: D76554920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155988
Approved by: https://github.com/dolpm
2025-06-16 19:15:07 +00:00
2dc1627451 [doc] Add documentation for division by zero behavior in autograd (#155987)
Fixes #128796

This PR adds documentation about the behavior of division by zero operations in PyTorch's autograd system. The documentation explains:

1. How division by zero produces `inf` values following IEEE-754 floating point arithmetic
2. How autograd handles these cases and why masking after division can lead to `nan` gradients
3. Provides concrete examples showing the issue
4. Recommends two solutions:
   - Masking before division
   - Using MaskedTensor (experimental API)

The documentation is added to the autograd notes section, making it easily discoverable for users who encounter this common issue.

This addresses the original issue #128796 which requested better documentation of this behavior to help users avoid common pitfalls when dealing with division by zero in their models.
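
A minimal example of the pitfall and the mask-before-division fix described above (illustrative only):

```python
import torch

# Masking after division: the division by zero still produces inf, and the
# backward pass multiplies 0 * inf, yielding nan gradients.
x = torch.zeros(3, requires_grad=True)
y = torch.where(x > 0, 1.0 / x, torch.zeros_like(x))
y.sum().backward()
print(x.grad)  # tensor([nan, nan, nan])

# Masking before division: zero denominators never enter the division, so the
# gradients stay finite.
x = torch.zeros(3, requires_grad=True)
safe = torch.where(x > 0, x, torch.ones_like(x))
y = torch.where(x > 0, 1.0 / safe, torch.zeros_like(x))
y.sum().backward()
print(x.grad)  # tensor([0., 0., 0.])
```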

Additional changes:
- Fixed formatting consistency by replacing curly apostrophes with straight apostrophes in the existing documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155987
Approved by: https://github.com/soulitzer

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
2025-06-16 19:02:12 +00:00
907d0931cc [ca] default on in CI, with fallback for tests in test/compiled_autograd_skips/ (#155480)
For every test that is run with PYTORCH_TEST_WITH_DYNAMO=1, turn on compiled autograd via config if it is not skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155480
Approved by: https://github.com/jansel
ghstack dependencies: #155521
2025-06-16 18:45:03 +00:00
9ff9c28fe8 [ca] Functionalize AccumulateGrad (#155521)
This PR changes compiled autograd's handling of gradient accumulation by proxying it as a `call_accumulate_grad`, which does the .grad mutation in Python bytecode for Dynamo to see. For eager, the only change is that the leaf invariant check was moved up.

Before:
- Compiled Autograd Engine: proxies call to inductor accumulate_grad op
- Dynamo: polyfills the inductor accumulate_grad op (not respecting all of the accumulateGrad implementation e.g. sparse, gradient layout contract)
```python
        new_grad_strided: "f32[s21]" = torch.empty_like(getitem_1);  getitem_1 = None
        copy_: "f32[s21]" = new_grad_strided.copy_(aot3_tangents_1);  copy_ = None
```
- AOTAutograd: functionalizes the copy_

After:
- Compiled Autograd Engine: proxies call to `call_accumulate_grad`, which calls `torch._dynamo.compiled_autograd.ops.AccumulateGrad`/`AccumulateGrad_apply_functional_no_hooks_ivalue`, similar to other functional autograd implementations, but also sets .grad from python. Hooks are still handled separately from this call.
- Dynamo: `torch._dynamo.compiled_autograd.ops.AccumulateGrad` was allow_in_graph'd
- AOTAutograd: traces into the op, with FunctionalTensors.

While functionalizing the tensors, we insert an autograd Error node to ensure that we don't use the autograd meta from tracing. This clashes with the "leaf variable has been moved into the graph interior" error check, I could not find a way to identify a FunctionalTensor subclass from C++, so I bypass that for Error nodes in the compiled case.

In the CI PR, this fixes 19 tests relating to sparse tensors, and more are hidden by an earlier failure in dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155521
Approved by: https://github.com/jansel
2025-06-16 18:45:02 +00:00
42ff6a4a5c [Inductor] Delay codegen for fallback arguments and improve typing (#154371)
Delays code generation for arguments to fallback ops.  This is inspired by #155642, and likely fixes similar memory leaks.

Additionally, prepare for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-06-16 18:00:04 +00:00
4162c0f702 [BE][setup] gracefully handle envvars representing a boolean in setup.py (#156040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156040
Approved by: https://github.com/malfet
2025-06-16 17:56:31 +00:00
f48a157660 [aoti] Add more to error message (#155974)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155974
Approved by: https://github.com/yushangdi
2025-06-16 17:49:52 +00:00
fbd88ae2b5 Convert to markdown: checkpoint.rst (#156009)
Related to #155014

Split into two commits as a trial:
```bash
 1800  git mv docs/source/checkpoint.rst docs/source/checkpoint.md
 1802  git commit -m "[Docs] Rename checkpoint.rst"
 1803  git push origin ckpoint

# update the markdown file
 1805  git add .
 1806  git commit -m "modify checkpoint.md"
 1807  git push origin ckpoint
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156009
Approved by: https://github.com/svekars
2025-06-16 17:48:23 +00:00
a10024d7de Convert complex_numbers.rst to markdown (#156039)
Related to #155014

Tried following https://github.com/pytorch/pytorch/pull/155899#issuecomment-2974715750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156039
Approved by: https://github.com/svekars
2025-06-16 17:24:37 +00:00
e9fdaf8701 Revert "[Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)"
This reverts commit e375d21bb9b0ef6fefe7a8af5a054a17de8c63c9.

Reverted https://github.com/pytorch/pytorch/pull/155109 on behalf of https://github.com/malfet due to Looks like it broke ROCM tests ([comment](https://github.com/pytorch/pytorch/pull/155109#issuecomment-2977428354))
2025-06-16 17:22:55 +00:00
45596ec58f Delete tools/onnx/update_default_opset_version.py (#156055)
The tool is no longer relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156055
Approved by: https://github.com/titaiwangms
2025-06-16 17:21:36 +00:00
365ce465f3 Revert "[C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)"
This reverts commit 8142a0286016e63a0e91b5667e1fb1a5e868ffd7.

Reverted https://github.com/pytorch/pytorch/pull/155900 on behalf of https://github.com/clee2000 due to causing some sort of hang? in test_distributed_spawn [GH job link](https://github.com/pytorch/pytorch/actions/runs/15678895788/job/44168117193) [HUD commit link](8142a02860) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/155900#issuecomment-2977365699))
2025-06-16 16:59:25 +00:00
2a4e357192 Fix compilation warning with gcc14 (#155934)
Note that nccl still doesn't work, so you have to build with `USE_NCCL=0`. @eqy is that something being tracked there?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155934
Approved by: https://github.com/malfet, https://github.com/janeyx99
2025-06-16 16:43:15 +00:00
503362d019 Revert "Unify dynamic shapes APIs naming 2 (expect_true and check) (#155776)"
This reverts commit 603a54a9b33e1aabe1407721d7935b881a160968.

Reverted https://github.com/pytorch/pytorch/pull/155776 on behalf of https://github.com/atalman due to failing internal build ([comment](https://github.com/pytorch/pytorch/pull/155776#issuecomment-2977041192))
2025-06-16 15:13:53 +00:00
b8d96c3f78 Revert "[cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)"
This reverts commit 47c8810b5275179833d6b33ca3d70922f485272c.

Reverted https://github.com/pytorch/pytorch/pull/154170 on behalf of https://github.com/atalman due to failing torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/154170#issuecomment-2976990461))
2025-06-16 14:59:01 +00:00
013dfeabb4 [BE] fix typos in top-level files (#156067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156067
Approved by: https://github.com/malfet
ghstack dependencies: #156066
2025-06-16 14:56:07 +00:00
6c493e2b14 [BE] add codespell linter (#156066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156066
Approved by: https://github.com/malfet
2025-06-16 14:56:07 +00:00
2d832c9587 [MPS][Testing][BE] Fix samples for full_like (#156026)
Now that the device is known, one can avoid creating tensors of `torch.double` type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026
Approved by: https://github.com/dcci
2025-06-16 14:27:42 +00:00
831c9010c7 [BE] Remove non-existing operator from unimplemented list (#156025)
Never heard of torch.login :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156025
Approved by: https://github.com/dcci
2025-06-16 14:14:58 +00:00
38e5e81e55 Implement guard collectives (#155558)
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:

1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile

One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.

I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
2025-06-16 14:09:14 +00:00
05faba4028 Bump requests from 2.32.2 to 2.32.4 in /.github (#155491)
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.2...v2.32.4)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.32.4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-16 06:48:08 -07:00
d6ee5144ca [xla hash update] update the pinned xla hash (#156064)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156064
Approved by: https://github.com/pytorchbot
2025-06-16 11:11:10 +00:00
8142a02860 [C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)
Fixes #155668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155900
Approved by: https://github.com/ngimel
2025-06-16 10:55:47 +00:00
bf7e290854 Add __main__ guards to jit tests (#154725)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

In jit tests:

- Add and use a common raise_on_run_directly method for when a user directly runs a test file that should not be run that way; it prints the file the user should have run instead (a sketch follows below).
- Raise a RuntimeError on tests which have been disabled (not run)
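
A hypothetical sketch of the shared helper (the name comes from the PR description; the body is illustrative):

```python
def raise_on_run_directly(file_to_run: str) -> None:
    raise RuntimeError(
        "This test file is not meant to be run directly; "
        f"run {file_to_run} instead."
    )


# Typical usage at the bottom of a jit test file:
if __name__ == "__main__":
    raise_on_run_directly("test/test_jit.py")
```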

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725
Approved by: https://github.com/clee2000
2025-06-16 10:28:45 +00:00
f810e98143 [ONNX] Update default opset to 18 (#156023)
Update the default opset for the torchscript exporter to 18 to match the dynamo exporter, because support was actually added and tested in https://github.com/pytorch/pytorch/pull/118828. In the next version we should plan to update to opset 21 or higher. This change also removes the hard limit on the torchscript exporter for more flexibility.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156023
Approved by: https://github.com/Skylion007
2025-06-16 08:40:49 +00:00
39c605e8b3 remove allow-untyped-defs from context.py (#155622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155622
Approved by: https://github.com/Skylion007
2025-06-16 07:38:34 +00:00
d9799a2ee7 Support boolean tensor for torch.fused_moving_avg_obs_fake_quant on CUDA (#153699)
Fixes #153310

As the title

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fused_obs_fake_quant_moving_avg
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153699
Approved by: https://github.com/mingfeima, https://github.com/jerryzh168
2025-06-16 07:10:06 +00:00
156b28e62a [audio hash update] update the pinned audio hash (#155648)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155648
Approved by: https://github.com/pytorchbot
2025-06-16 03:57:28 +00:00
c620d0b5c7 convert: rst to myst pr2/2 (#155911)
Fixes #155038
parent [PR](https://github.com/pytorch/pytorch/pull/155375) (made two PRs to pass sanity check)
this PR converts the following three .rst files, along with the references mentioned in each file

- [torch.compiler_faq](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_faq.rst)
  - torch.compiler_troubleshooting
  - nonsupported_numpy_feats
  - torchdynamo_fine_grain_tracing

- [torch.compiler_fine_grain_apis](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_fine_grain_apis.rst)
  - None

- [torch.compiler_get_started](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_get_started.rst)
  - torch.compiler_overview
  - torch.compiler_api
  - torchdynamo_fine_grain_tracing

I made the edits suggested by the maintainers in the parent PR
(used git mv on all files, yet it still appeared as a delete-create action)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155911
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-16 00:44:44 +00:00
c83041cac2 [test][triton pin] add device-side TMA tests (AOTI + test_triton_kernels) (#155827)
Tests added:
```
python test/inductor/test_triton_kernels.py -k test_on_device_tma
python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma
python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma
```

These pass on Triton 3.3 but not yet on Triton 3.4 (note: to support both Triton versions, there are two triton kernels - one for the old API and one for the new API - and a given version of the test will only run if that version of the API is available).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827
Approved by: https://github.com/FindHao
ghstack dependencies: #155777, #155814
2025-06-15 20:24:19 +00:00
bc9b8ea230 [user triton] JIT inductor support for new host-side TMA api (#155814)
This PR adds JIT inductor support for user-defined triton kernels using the new host-side TMA api.

* handle TensorDescriptor.from_tensor in ir.py
* codegen TensorDescriptor.from_tensor in wrapper.py
* generate the right signature for functions that take TensorDescriptor arguments (i.e. in the @triton_heuristics.user_autotune decorator)

AOTI support is not implemented yet.

Tests: ran test_triton_kernels.py w/ both Triton 3.3 and 3.4 and there were no failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155814
Approved by: https://github.com/aakhundov
ghstack dependencies: #155777
2025-06-15 20:24:19 +00:00
b7c95acc6c [user triton] triton_kernel_wrap support for new host-side TMA API (#155777)
This adds support for user-defined triton kernels using TensorDescriptor.from_tensor into triton_kernel_wrap: i.e. storing metadata about the TMA descriptors and doing mutation analysis.

Major changes:
* TMADescriptorMetadata has changed: previously it was a dict[str, tuple[list[int], list[int], int]]. But now there are two metadata formats: one for experimental API and one for stable API. Now the metadata format is dict[str, tuple[str, tuple[...]]], where tuple[...] is tuple[list[int], list[int], int] for experimental and tuple[list[int],] for stable API. And then most handling of the metadata has to be branched based on whether the metadata represents a stable or experimental TMA descriptor
* mutation analysis: unlike experimental TMA (where the mutation analysis / ttir analysis pretends that the TMA descriptor is actually just a tensor), we need to construct an actual TMA descriptor before getting the Triton frontend to create the TTIR (otherwise assertions fail). A TensorDescriptor (i.e. stable TMA API descriptor) passed into a python triton kernel actually turns into 1 + 2*N parameters in the TTIR (for a rank-N tensor), so the arg list also needs to be patched for this reason (in generate_ttir)
* mutation analysis: now we also need to pass tma_descriptor_metadata into the mutation analysis, in order to create the TMA descriptors that are passed into the frontend code (ie. the previous point). This is why all the mutation tests are modified with an extra return value (the tma_descriptor_metadata)

Inductor is not modified (Inductor just errors out if you use a stable API tma descriptor). This will be the next PR.
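
A hedged sketch of the two metadata layouts above written as type aliases (the field meanings are paraphrased from the description, not taken from the source):

```python
from typing import Union

# experimental TMA API metadata: tuple[list[int], list[int], int]
ExperimentalTMAMeta = tuple[list[int], list[int], int]
# stable TMA API metadata: tuple[list[int],]
StableTMAMeta = tuple[list[int]]

# Keyed by the kernel argument name; the str tag records which API produced the
# descriptor, and downstream handling branches on that tag.
TMADescriptorMetadata = dict[str, tuple[str, Union[ExperimentalTMAMeta, StableTMAMeta]]]
```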

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155777
Approved by: https://github.com/aakhundov
2025-06-15 20:24:19 +00:00
54976bca10 [dynamo] Provide helper functions for guard filter hook (#155083)
Collection of ready-made guard filters. One issue is that they are not composable, e.g. `filter1(filter2(guard))` does not work out of the box. On the other hand, they are easy to use.
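
Since the ready-made filters are not composable out of the box, a hypothetical composition helper might look like this (assuming each filter maps the list of guard entries to a parallel list of keep/drop booleans, which is the shape the guard filter hook expects; names are illustrative):

```python
def compose_guard_filters(*filters):
    """Combine guard filters: keep a guard only if every filter keeps it."""

    def combined(guard_entries):
        keep = [True] * len(guard_entries)
        for f in filters:
            keep = [k and fk for k, fk in zip(keep, f(guard_entries))]
        return keep

    return combined
```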

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155083
Approved by: https://github.com/zhxchen17, https://github.com/jansel
2025-06-15 17:49:36 +00:00
0935a97d95 [Dynamo] Add torch.accelerator API to trace_rules (#155884)
# Motivation
- Add the binding and non-binding APIs in torch.accelerator to the trace rules.
- Add some torch.accelerator functions to the constant-fold function list for Dynamo capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155884
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #155787, #155788
2025-06-15 17:09:57 +00:00
b51d803785 [Dynamo] Add XPU API to trace_rules (#155788)
# Motivation
- Add the binding and non-binding APIs to the trace rules for XPU;
- Add some XPU APIs to the constant-fold function list for Dynamo capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155788
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #155787
2025-06-15 17:09:57 +00:00
69acba2b19 [Dynamo] Add generic and XPU-specific Stream&Event in UserDefineClass (#155787)
# Motivation
- Add XPU-specific Stream and Event to the in-graph class list for Dynamo capture.
- Add generic Stream and Event to the in-graph class list for Dynamo capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155787
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-06-15 17:09:57 +00:00
53cd18f6b3 Update gradient behavior note in torch.amin and torch.amax (#155071)
Fixes #155048

The behavior of `min` and `max` was changed in #43519. The note about gradient behavior in the torch.amin and torch.amax docs is updated to reflect this change:

New note:
`amax, amin, max(dim), min(dim) evenly distributes gradient between equal values when there are multiple input elements with the same minimum or maximum value.`

cc - @spzala @svekars @soulitzer @sekyondaMeta @AlannaBurke @ezyang @gqchen @nikitaved @Varal7 @xmfan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155071
Approved by: https://github.com/soulitzer
2025-06-15 16:09:31 +00:00
655b3b14ff [executorch hash update] update the pinned executorch hash (#156007)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156007
Approved by: https://github.com/pytorchbot
2025-06-15 04:51:37 +00:00
517d2995e0 Add __int__ and __float__ methods to _sympy.functions.Identity (#155873)
Fixes #155688

Root Cause:
in [`torch/_inductor/index_propagation.py`](f151b20123/torch/_inductor/index_propagation.py (L57-L68))
When creating a `TypedExpr` from an `Identity` (a `torch.utils._sympy.functions.Identity`, not a `sympy.matrices.expressions.Identity`) and the inner value of the identity, `Identity.args[0]`, is any torch int type, the `TypedExpr.__post_init__` method tries to cast the Identity object to a python `int`. This is where the `TypeError` from the issue was raised, because Identity does not know how to cast to an `int`.

Fix:
Define an `__int__` method for `torch.utils._sympy.functions.Identity`; likewise for `__float__`.
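
A minimal sketch of the fix, assuming `Identity` keeps the wrapped value as its first sympy argument (`args[0]`), as described above (the class here is an illustrative stand-in, not the actual source):

```python
import sympy


class Identity(sympy.Function):
    """Illustrative stand-in for torch.utils._sympy.functions.Identity."""

    def __int__(self):
        return int(self.args[0])

    def __float__(self):
        return float(self.args[0])


assert int(Identity(5)) == 5
assert float(Identity(2.5)) == 2.5
```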

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155873
Approved by: https://github.com/williamwen42
2025-06-15 04:24:40 +00:00
6ebe9a4f47 [BE][Ez]: Optimize nvshmem alloc with missing move (#156000)
Saw this in another PR where there was a missing move on this potentially very hot path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156000
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-06-15 03:04:08 +00:00
32eee8ed22 [SymmMem] Add nvshmem_free (#155975)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Calling `nvshmem_free` when an `NVSHMEMAllocation` is being destructed.

Use `is_finalizing()` as a guard, as done in `CUDASymmetricMemory.cu`, to avoid the "driver shutting down" error (destruction fiasco).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155975
Approved by: https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971
2025-06-15 01:23:49 +00:00
b8aee84fb9 [c10d][fr] Shrink the range of mutex lock to avoid deadlock (#155949)
While looking into a case where an FR dump (the actual dump, not the monitoring thread) takes 30 minutes, I realized that our global write lock is grabbed too early, so the second attempt to dump FR without stack traces fails with a deadlock because the global write lock is still held. We should only grab the lock when we are ready to write, so that we are less likely to hold it forever. I also audited the other locks within FR and found one more place where the critical section can be shrunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155949
Approved by: https://github.com/Skylion007
2025-06-15 00:37:42 +00:00
3159ee2ad3 Update test_schedule_multiproc to use world_size=2 (#155921)
The multiproc schedule tests previously ran with world_size=2, and the PP tests became flakier due to the longer pipeline execution; this restores the previous behavior. This will fix the tests (https://github.com/pytorch/pytorch/issues/154373, https://github.com/pytorch/pytorch/issues/154391, https://github.com/pytorch/pytorch/issues/154408, https://github.com/pytorch/pytorch/issues/154443, https://github.com/pytorch/pytorch/issues/154481).

In follow-up PRs I will refactor the tests and move some of them to use larger world sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155921
Approved by: https://github.com/fduwjj, https://github.com/Skylion007
ghstack dependencies: #155920
2025-06-15 00:24:18 +00:00
8e1471bdc9 Allow MultiProcContinuousTest to set world_size (#155920)
`MultiProcContinuousTest` will automatically set world_size to the number of devices. This change allows the attribute to be overridden by the derived test class.
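
A hypothetical sketch of what this enables (the class name and override mechanism shown here are illustrative and may differ from the actual test class):

```python
from torch.testing._internal.common_distributed import MultiProcContinuousTest


class TwoRankScheduleTest(MultiProcContinuousTest):
    # Pin the world size instead of inheriting the number of visible devices.
    world_size = 2
```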

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155920
Approved by: https://github.com/fduwjj
2025-06-15 00:24:17 +00:00
9bd42c1570 [Cutlass] Fix buffer missing issues (#155897)
Handles constants and constant folding with aoti.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155897
Approved by: https://github.com/henrylhtsang
2025-06-15 00:08:50 +00:00
a35b3a9b95 [cutlass backend][forward fix] use _cuda_compiler path to check if nvcc exists (#155939)
Differential Revision: D76571828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155939
Approved by: https://github.com/Skylion007, https://github.com/masnesral
2025-06-15 00:01:57 +00:00
eqy
47c8810b52 [cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)
Fixes https://github.com/pytorch/pytorch/issues/153590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154170
Approved by: https://github.com/malfet
2025-06-14 23:34:31 +00:00
0fa361e429 [ez] fix typo in _inductor/scheduler.py (#155996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155996
Approved by: https://github.com/Skylion007
ghstack dependencies: #155982
2025-06-14 21:21:35 +00:00
77ac3a0965 [SymmMem] Remove wrappers around nvshmem APIs (#155971)
`NVSHMEMSymmetricMemory.cu` and `nvshmem_extension.cu` are under the same compilation condition now (i.e. only when `USE_NVSHMEM=True`), see https://github.com/pytorch/pytorch/blob/main/caffe2/CMakeLists.txt#L1013-L1018.

Therefore there is no need to build an extra layer to hide dependency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155971
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968
2025-06-14 19:58:09 +00:00
2c0d94a7de [SymmMem] Remove unused ptr_to_symm_mem_ (#155968)
No code enqueues entries to `ptr_to_symm_mem_`, thus it is always empty.
This PR removes it and supports the functionality that relied on it via the `allocations_` map.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155968
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835
2025-06-14 19:57:06 +00:00
a317c63d1b [BE]: Update NCCL to 2.27.3 (#155233)
Fixes: https://github.com/pytorch/pytorch/issues/155052 and https://github.com/pytorch/pytorch/issues/153517

This upgrade is needed to effectively use those symmetric memory kernels anyway. Also fixes some nasty NCCL bugs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155233
Approved by: https://github.com/nWEIdia, https://github.com/kwen2501, https://github.com/atalman, https://github.com/eqy
2025-06-14 19:20:31 +00:00
794ef6c9b8 Enable manywheel build and smoke test on main branch for ROCm (#153287)
Fixes issue of not discovering breakage of ROCm wheel builds until the nightly job runs e.g. https://github.com/pytorch/pytorch/pull/153253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153287
Approved by: https://github.com/jeffdaily
2025-06-14 19:14:31 +00:00
5285d10243 remove duplicated pybind flag in mps code (#155936)
gcc14 (at least) warns that this is already defined
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155936
Approved by: https://github.com/Skylion007
2025-06-14 18:41:12 +00:00
e95e8eed0a mypy 1.16.0 (#155821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155821
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-06-14 18:18:43 +00:00
ce79056471 Custom FX pass for inductor's backend registration (#154841)
This PR is related to RFC #153532. It is an extension to Inductor's backend registration interface that allows the backend to register custom FX passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-14 17:29:54 +00:00
603a54a9b3 Unify dynamic shapes APIs naming 2 (expect_true and check) (#155776)
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.

This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically,
-  it introduces size_vars.expect_true and size_vars.check.
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals

I am also seeing a couple of wrong usages that I will fix in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155776
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154774
2025-06-14 17:13:53 +00:00
c219dbd2fc avoid gso in has_internal_overlap (#155870)
existing comment already explains it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155870
Approved by: https://github.com/bobrenjc93
2025-06-14 17:13:20 +00:00
279cae52e7 [BE][PYFMT] migrate PYFMT for torch/ao/ to ruff format (#148185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148185
Approved by: https://github.com/ezyang
2025-06-14 16:47:04 +00:00
cyy
c2beeadeb4 [Reland] Use 3.27 as the minimum CMake version (#154783)
Reland of #153153, which was incidentally closed.
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as CUDA::nvperf_host, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
This also facilitates future third-party updates such as FBGEMM (its currently shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
2025-06-14 16:37:51 +00:00
370fc49dde Handle aten.to at submodule boundaries (#153972)
Summary: #buildall

Test Plan: CI

Differential Revision: D74582970

When we decompose to inference IR, aten.to can sometimes disappear. As a result, the export module call graph tree starts containing dead nodes because the previous provenance tracking is insufficient. This PR fixes that. The caveat is that this won't work in general for tensor subclass inputs to a submodule whose signature the user wants to preserve, because we always desugar the tensor subclass into its constituent tensors in inference IR, making it impossible to preserve the original calling convention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153972
Approved by: https://github.com/avikchaudhuri
2025-06-14 16:13:29 +00:00
d42c11819f [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-06-14 16:09:41 +00:00
70b68caf58 Fix logging of failed tensorified ops (#155982)
Tested via

```
TORCH_LOGS="torch.fx.passes._tensorify_python_scalars" tlp python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unspecialized_float_fallback_symint_specialization
I0613 21:50:38.247000 4163366 torch/fx/passes/_tensorify_python_scalars.py:314] [0/1] Failed to tensorify <built-in function pow>
I0613 21:50:38.247000 4163366 torch/fx/passes/_tensorify_python_scalars.py:314] [0/1] Failed to tensorify <built-in function floor>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155982
Approved by: https://github.com/flaviotruzzi
2025-06-14 14:23:54 +00:00
e375d21bb9 [Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)
Fixes #154328

**Summary**
Failure reason:
The input value is infinity in float, and converting it to int64_t is undefined behavior. On x86, it is converted to the minimum value of int64_t, which is not expected.

Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
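
A minimal scalar sketch of the clamp-before-cast fix described above (illustrative only, not the actual kernel):

```python
def fake_quant_per_tensor(x: float, scale: float, zero_point: int,
                          quant_min: int, quant_max: int) -> float:
    val = x / scale + zero_point               # may be +/-inf for inf inputs
    val = min(max(val, quant_min), quant_max)  # clamp while still a float
    q = round(val)                             # now safe to treat as an integer
    return (q - zero_point) * scale
```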

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-06-14 14:12:38 +00:00
1a568f4e5d [BE][Easy] bump isort to 6.0.1 (#155919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155919
Approved by: https://github.com/Skylion007
ghstack dependencies: #155909, #155914
2025-06-14 12:29:01 +00:00
5467765990 [BE][Easy] bump ruff to 0.11.13 (#155914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155914
Approved by: https://github.com/Skylion007
ghstack dependencies: #155909
2025-06-14 12:29:01 +00:00
736a15a81a [torchgen] Fix ruff format for # fmt: skip comment for function signature (#155909)
See also:

- astral-sh/ruff#18658

This fix follows the suggestion from:

- https://github.com/astral-sh/ruff/issues/18658#issuecomment-2970130276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155909
Approved by: https://github.com/ezyang
2025-06-14 12:28:55 +00:00
d859e65826 [DCP][Ez]: Fix broadcast_object bug in DCP utils (#155912)
Fixes #152310. Broadcast_object is now symmetric with gather_object and scatter_object. It was likely a typo that wasn't fixed in https://github.com/pytorch/pytorch/pull/147675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155912
Approved by: https://github.com/ezyang
2025-06-14 12:14:14 +00:00
596b418391 [BE][PYFMT] migrate PYFMT for {torch,test}/{nn,optim}/** to ruff format (#144548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144548
Approved by: https://github.com/ezyang
2025-06-14 11:27:04 +00:00
3e38feb05f [inductor] Add configuration control for CUTLASS operation selection. (#155770)
Added a new configuration option `cutlass_enabled_ops` that allows users to control which operations use CUTLASS lowerings. By default, CUTLASS is enabled for all operations (maintaining backward compatibility), but users can now selectively enable it only for specific operations to optimize compilation time.

**Fixes #155718**

## Usage Examples

```bash
# Enable CUTLASS for all operations (default behavior)
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="ALL"

# Enable CUTLASS only for matrix multiplication operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="mm,addmm"

# Enable CUTLASS only for batch operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="bmm,baddbmm"

# Disable CUTLASS for all operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS=""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155770
Approved by: https://github.com/henrylhtsang
2025-06-14 08:19:54 +00:00
1982ec2d22 Add api info for torch._C._nn.pyi (#148405)
The APIs involved are as follows:

- adaptive_avg_pool2d
- adaptive_avg_pool3d
- binary_cross_entropy
- col2im

ISSUE Related:
https://github.com/pytorch/pytorch/issues/148404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148405
Approved by: https://github.com/ezyang
2025-06-14 07:57:07 +00:00
7070ab3180 use guard_or_false in checkInBoundsForStorage (#155874)
This was added in https://github.com/pytorch/pytorch/pull/147354; the existing comment already justifies guard_or_false
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155874
Approved by: https://github.com/bobrenjc93
2025-06-14 07:21:26 +00:00
d79651571f assume sparse tensor not coalesced_ gsv -> guard_or_false. (#155869)
Preserve current behavior. Generalize it so that torch._check_is_size is no longer needed to opt into this, and make it work for more complex unbacked sizes with ranges [-inf, inf].
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155869
Approved by: https://github.com/bobrenjc93
2025-06-14 07:19:56 +00:00
e7da21806f [Easy][BE] update recommended VS Code settings (#152760)
Changes:

- Remove old invalid settings and replace with new settings.
- Add commonly used VS Code extensions to support `cmake`, `ruff`, `mypy`, `flake8`, `editorconfig`, and spell checker. Also, add corresponding settings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152760
Approved by: https://github.com/drisspg
2025-06-14 07:11:10 +00:00
cyy
1393f71e07 Use CUDA language in generated CMakeLists.txt from cpp_builder.py (#155979)
The CMake CUDA module has been deprecated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155979
Approved by: https://github.com/ezyang
2025-06-14 06:52:51 +00:00
c843909d9e [flex attention][triton pin] use new TMA API (#155771)
Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488. Ahead of this, we are **replacing the experimental TMA API usage with the stable TMA API** in flex attention. This means that **flex attention TMA will stop working with Triton 3.2 or Triton 3.3/3.3.1** for now (but it should work for Triton 3.4 in the PyTorch 2.8 release, and Meta-internal triton 3.3.1fb, which have the new TMA API).

This PR does the following:
* replace the experimental TMA APIs with the stable TMA APIs
* remove the workspace args.

Testing: I ran test/inductor/test_flex_attention.py on a H100 with @mandroid6's PR #153662 patched in to turn on TMA [TODO: confirm results once all the local tests pass, but from the first 100 tests I ran locally, all the failing tests were also failing on #153662 alone]

Note: When #153662 lands, turning on TMA support by default, it should be checking specifically for stable TMA API support (commented on PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155771
Approved by: https://github.com/mandroid6, https://github.com/nmacchioni
2025-06-14 06:34:16 +00:00
92b7ed6d07 Add Helion softmax test (#155976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155976
Approved by: https://github.com/jansel
2025-06-14 05:53:21 +00:00
9338d85d45 [ProcessGroupNCCL] Added log when fr dump triggered from pipe (#155754)
Summary:
TSIA

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
eyes

Sandcastle run

Differential Revision: D76472617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155754
Approved by: https://github.com/fduwjj, https://github.com/Skylion007
2025-06-14 04:34:29 +00:00
bf897b4cea [ONNX] Support 0/1 on dynamic dimension (#155717)
Prior to this PR, the exporter did not support dynamic dims with traced inputs containing 0/1. But after https://github.com/pytorch/pytorch/pull/148696, this is supported by torch.export.export. This PR adds the patch to torch.onnx.export.

However, a known pitfall still exists because of the difference between eager and export: the compiler needs to decide the exported shape ahead of time, and whether "hidden broadcasting" is applied results in a different export.

For example,

```python
import torch

class Model(torch.nn.Module):
    def forward(self, x, y, z):
        return torch.cat((x, y), axis=1) + z

model = Model()
x = torch.randn(2, 3)
y = torch.randn(2, 5)
z = torch.randn(1, 8)
model(x, y, z)

DYN = torch.export.Dim.DYNAMIC
ds = {0: DYN, 1: DYN}

with torch.fx.experimental._config.patch(backed_size_oblivious=True):
    ep = torch.export.export(model, (x, y, z), dynamic_shapes=(ds, ds, ds))

print(ep)
"""
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[s7, s16]", y: "f32[s7, s43]", z: "f32[s7, s16 + s43]"):
             #
            sym_size_int: "Sym(s7)" = torch.ops.aten.sym_size.int(x, 0)
            sym_size_int_1: "Sym(s16)" = torch.ops.aten.sym_size.int(x, 1)
            sym_size_int_2: "Sym(s7)" = torch.ops.aten.sym_size.int(y, 0)
            sym_size_int_3: "Sym(s43)" = torch.ops.aten.sym_size.int(y, 1)
            sym_size_int_4: "Sym(s7)" = torch.ops.aten.sym_size.int(z, 0)
            sym_size_int_5: "Sym(s16 + s43)" = torch.ops.aten.sym_size.int(z, 1)

             # File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
            cat: "f32[s7, s16 + s43]" = torch.ops.aten.cat.default([x, y], 1);  x = y = None

             #
            eq: "Sym(True)" = sym_size_int_2 == sym_size_int;  sym_size_int_2 = None
            _assert_scalar_default = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s58, s35) on node 'eq'");  eq = _assert_scalar_default = None
            add_1: "Sym(s16 + s43)" = sym_size_int_1 + sym_size_int_3;  sym_size_int_1 = sym_size_int_3 = None
            eq_1: "Sym(True)" = add_1 == sym_size_int_5;  add_1 = sym_size_int_5 = None
            _assert_scalar_default_1 = torch.ops.aten._assert_scalar.default(eq_1, "Runtime assertion failed for expression Eq(s16 + s43, s23) on node 'eq_1'");  eq_1 = _assert_scalar_default_1 = None
            eq_2: "Sym(True)" = sym_size_int == sym_size_int_4;  sym_size_int = sym_size_int_4 = None
            _assert_scalar_default_2 = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(s35, s7) on node 'eq_2'");  eq_2 = _assert_scalar_default_2 = None

             # File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
            add: "f32[s7, s16 + s43]" = torch.ops.aten.add.Tensor(cat, z);  cat = z = None
            return (add,)

Graph signature:
    # inputs
    x: USER_INPUT
    y: USER_INPUT
    z: USER_INPUT

    # outputs
    add: USER_OUTPUT

Range constraints: {s7: VR[0, int_oo], s16: VR[0, int_oo], s43: VR[0, int_oo], s16 + s43: VR[0, int_oo]}
"""
ep.module()(x, y, z)
"""
Traceback (most recent call last):
  File "/home/titaiwang/pytorch/test_export.py", line 20, in <module>
    ep.module()(x, y, z)
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 840, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 416, in __call__
    raise e
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 403, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1873, in _call_impl
    return inner()
           ^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1800, in inner
    args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/_dynamo/eval_frame.py", line 895, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/export/_unlift.py", line 83, in _check_input_constraints_pre_hook
    _check_input_constraints_for_graph(
  File "/home/titaiwang/pytorch/torch/_export/utils.py", line 426, in _check_input_constraints_for_graph
    _check_symint(
  File "/home/titaiwang/pytorch/torch/_export/utils.py", line 338, in _check_symint
    raise RuntimeError(
RuntimeError: Expected input at *args[2].shape[0] to be equal to 2, but got 1
"""
```

The explanation (from @pianpwk):

In the model we have `return torch.cat((x, y), axis=1) + z`.

Before this add is executed, the LHS has shape `[s7, s16 + s43]`, while the z has shape, say `[s8, s16 + s43]` (we don't know `s7 == s8` yet). When we execute this add, the compiler is making a decision: does broadcasting apply or not? The choices are:

1) Yes -> then we must specialize `s8` to 1
2) No -> then this element-wise op is only valid if the shapes match up, and we assume `s7 == s8`.

Unfortunately export can only follow one of these options, and in avoiding 0/1 specialization (because a dynamic dimension was requested), it assumed case 2).

For an operation like a + b, in eager semantics it's possible to have all options (either a == 1 OR b == 1 OR a == b), but with export we need to make a decision on what the output shape of this operation is, and keeping all branches alive requires expressing the output shape with a conditional (e.g. output shape = `a if b == 1 else b`), which is pretty hard for the compiler to reason about.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155717
Approved by: https://github.com/justinchuby
2025-06-14 04:04:47 +00:00
187828dcb4 [OpenReg][5/N] add set_.source_Storage for openreg (#155191)
**Changes**:
- add set_.source_Storage for openreg to support torch.load & torch.serialization
- uncomment some related tests in the test_openreg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155191
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106, #154181, #155101
2025-06-14 03:44:32 +00:00
e4fd0bf771 [OpenReg][4/N] Migrate cpp_extensions_open_device_registration to OpenReg (#155101)
As the title stated.

**Involved testcases**:
- test_open_device_storage_pin_memory
- test_open_device_serialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155101
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106, #154181
2025-06-14 03:44:32 +00:00
1e7989cad5 [OpenReg][3/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154181)
As the title stated.

**Involved testcases**:
- test_open_device_quantized
- test_open_device_random
- test_open_device_tensor
- test_open_device_packed_sequence
- test_open_device_storage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154181
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106
2025-06-14 03:44:32 +00:00
7e5f29b2de [OpenReg][2/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154106)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154106
Approved by: https://github.com/nareshrajkumar866, https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019
2025-06-14 03:44:32 +00:00
676abded4b [OpenReg][1/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154019)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154019
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018
2025-06-14 03:44:32 +00:00
d3d469092f [Openreg] Split TestOpenReg into two parts (#154018)
----

- TestPrivateUse1: testing the 3rd-party accelerator integration mechanism itself
- TestOpenReg: testing openreg itself
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154018
Approved by: https://github.com/albanD
ghstack dependencies: #153947
2025-06-14 03:44:31 +00:00
cafd2344d6 [OpenReg] add manual_seed related capabilities (#153947)
**Changes**:
- Add manual_seed, manual_seed_all, initial_seed, and so on
- Delay the execution of self._lazy_init even further
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153947
Approved by: https://github.com/albanD
2025-06-14 03:44:31 +00:00
297805fd8f Typo fixes for "overridden" in comments and function names (#155944)
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
2025-06-14 03:37:38 +00:00
ca3cabd24a Convert to markdown: named_tensor.rst, nested.rst, nn.attention.bias.rst, nn.attention.experimental.rst, nn.attention.flex_attention.rst #155028 (#155696)
Fixes #155028

This pull request updates the documentation by transitioning from .rst to .md format. It introduces new Markdown files for the documentation of named_tensor, nested, nn.attention.bias, nn.attention.experimental, and nn.attention.flex_attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155696
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-14 03:32:00 +00:00
cdfa33a328 [nativert] move execution frame to torch (#155830)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76369008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155830
Approved by: https://github.com/zhxchen17
2025-06-14 03:28:55 +00:00
a6084b71ed [BE][1/X] Phase out usage of use_max_autotune() (#155847)
`use_max_autotune()` is likely not what people expect it to be;

Originally, `use_max_autotune()` was set up to decide when we should include Triton templates as choices in GEMM autotuning. As expected, `use_max_autotune()=True` if `max_autotune=True` or `max_autotune_gemm=True`. However, with the addition of the offline GEMM autotuning cache two years back, `use_max_autotune()=True` also holds when `search_autotune_cache=True`; in this case though, `search_autotune_cache=True` should never trigger autotuning.

Over time, people have used `use_max_autotune()` likely without realizing that this gives unexpected behavior if `search_autotune_cache=True`. We could rename the method to be more clear, but prefer to phase it out entirely for maximal clarity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155847
Approved by: https://github.com/jingsh, https://github.com/masnesral
2025-06-14 03:16:20 +00:00
7982b8c703 [BE][AOTI] Remove duplicate schema for ExternKernelNode (#155867)
Summary: The definition of `ExternKernelNode` and `ExternKernelNodes` schema in `torch/_export/serde/aoti_schema.py` is a complete duplicate of the ones in `torch/_export/serde/schema.py`.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76558294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155867
Approved by: https://github.com/jingsh
2025-06-14 02:03:27 +00:00
8f5f01bf19 [BE][AOTI] Combine DynamicArgType in Proxy Executors (#155871)
Summary:
As title.

Move the duplicate definition to the base class header `proxy_executor.h`

Test Plan:
CI

Rollback Plan:

Differential Revision: D76559180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155871
Approved by: https://github.com/yushangdi
2025-06-14 01:52:43 +00:00
4574b39aa4 Revert "[BE]: Sync cusparselt 12.9 with static build and other cuda 12 (#155709)"
This reverts commit bbbced94a43cf764ddfe719e7d4c161a3992830c.

Reverted https://github.com/pytorch/pytorch/pull/155709 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/15645591737/job/44082402642) [HUD commit link](bbbced94a4) landrace with 155819? easy forward fix but its the end of the week so idk when id get a review ([comment](https://github.com/pytorch/pytorch/pull/155709#issuecomment-2972094849))
2025-06-14 01:43:16 +00:00
c10339559d [BE] Better uv detection in pip init (#155972)
If one has both uv and non-uv environments locally, one should call `uv pip install` only in the uv-managed ones, which can be detected by checking whether the `uv/python` path is present in `sys.base_prefix`.
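
A minimal sketch of the detection heuristic described above (the helper name is illustrative):

```python
import sys


def is_uv_managed_python() -> bool:
    # uv-managed interpreters live under a ".../uv/python/..." prefix
    return "uv/python" in sys.base_prefix
```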

Fixes https://github.com/pytorch/pytorch/issues/152999
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155972
Approved by: https://github.com/janeyx99
2025-06-14 01:35:50 +00:00
d7e3c9ce82 Revert "Enable manywheel build and smoke test on main branch for ROCm (#153287)"
This reverts commit 3b6569b1ef4b9ff25f5b75fe0a216d6d084d573f.

Reverted https://github.com/pytorch/pytorch/pull/153287 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/15646152483/job/44083912145) [HUD commit link](3b6569b1ef) ([comment](https://github.com/pytorch/pytorch/pull/153287#issuecomment-2972088294))
2025-06-14 01:32:27 +00:00
c165b36a31 [MTIA Aten Backend] Migrate relu / relu_ (#155927)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate relu / relu_.

Note: the PyTorch in-tree implementation delegates relu to clamp_min, so there is no more need to launch a relu kernel.
https://www.internalfb.com/code/fbsource/[0c9eedb2fc8f99bcca00cb67a5738cfe07e39349]/fbcode/caffe2/aten/src/ATen/native/Activation.cpp?lines=512-520

Let me know if there is any concern about this

Differential Revision: [D75803582](https://our.internmc.facebook.com/intern/diff/D75803582/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155927
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659, #155925, #155926
2025-06-14 01:24:48 +00:00
50f6431e0a [MTIA Aten Backend] Migrate sqrt.out / rsqrt.out / sin.out / silu.out (#155926)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate sqrt.out / rsqrt.out / sin.out / silu.out

Differential Revision: [D75801847](https://our.internmc.facebook.com/intern/diff/D75801847/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155926
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659, #155925
2025-06-14 01:24:48 +00:00
7b11cb8c12 [MTIA Aten Backend] Migrate tanh.out and tanh_backward.grad_input (#155925)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate tanh.out and tanh_backward.grad_input

Differential Revision: [D75769242](https://our.internmc.facebook.com/intern/diff/D75769242/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155925
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659
2025-06-14 01:24:41 +00:00
0185d3a5ed [MTIA Aten Backend] Migrate bitwise_or.Tensor_out (#154659)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate bitwise_or.Tensor_out from out-of-tree to in-tree.

Differential Revision: [D75629937](https://our.internmc.facebook.com/intern/diff/D75629937/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154659
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632
2025-06-14 01:24:34 +00:00
163cdaaa3a [MTIA Aten Backend] Migrate bitwise_not.out (#154632)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate bitwise_not.out from out-of-tree to in-tree.

Differential Revision: [D75610643](https://our.internmc.facebook.com/intern/diff/D75610643/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154632
Approved by: https://github.com/egienvalue
2025-06-14 01:24:27 +00:00
04cf2c9d24 fix tensor print behavior for MAIA (#155609)
This pull request fixes the tensor print behavior for `MAIA` to account for the absence of double-precision support in its backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155609
Approved by: https://github.com/soulitzer
2025-06-14 01:04:12 +00:00
dabb55baff Add resolve in add decomp to enable view (#153945)
Fixes #148950.

During graph construction, when the add node runs under the [interpreter](/github.com/pytorch/pytorch/blob/d68d4d31f4824f1d1e0d1d6899e9879ad19b0754/torch/fx/interpreter.py#L301), the functional argument of a conjugated complex tensor gets cloned. This results in `.is_conj()` always evaluating to false in the decomposition function.

The proposed fix calls resolve_conj() in the decomposition of complex tensor add.
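A small snippet illustrating the symptom (this is not the decomposition code itself):

```python
import torch

z = torch.randn(3, dtype=torch.complex64).conj()
print(z.is_conj())                 # True: conjugation is still lazy
print(z.clone().is_conj())         # False: cloning materializes the values
print(z.resolve_conj().is_conj())  # False: explicit materialization, as in the fix
```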

Test as below
`python test/dynamo/test_repros.py ReproTests.test_add_complex_conj`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153945
Approved by: https://github.com/jansel
2025-06-14 00:41:50 +00:00
fec571cfd4 [BE][CI] Remove hardshrink integer exclusions (#155965)
As they are not called anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155965
Approved by: https://github.com/dcci
2025-06-14 00:32:57 +00:00
38410cf9b5 Fix DDPOptimizer issue on static tensor index (#155746)
We rely on `_try_get_metadata_from_dynamo()` to get static input indices. When the meta info is missing, it just returns an empty list of static input indices. This wrong list of static input indices leads to repeated cudagraph re-recording, which looks like a hang from the user's perspective. bc3972b80a/torch/_functorch/aot_autograd.py (L1025-L1031)

The root cause is that `split_module` in the DDP optimizer loses meta info and gm attributes. This PR fixes the issue by propagating this metadata from the original module to the submodules.
bc3972b80a/torch/_dynamo/backends/distributed.py (L515-L517)
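A hedged sketch of the propagation idea; the attribute list below is a placeholder, and the actual PR copies the specific fields that `_try_get_metadata_from_dynamo()` reads:

```python
# Sketch only: after fx split_module, copy dynamo-provided metadata from the
# parent GraphModule onto each submodule so AOTAutograd can find it again.
def propagate_dynamo_metadata(parent_gm, submodule_gms, attrs=("meta",)):
    for sub_gm in submodule_gms:
        for name in attrs:
            if hasattr(parent_gm, name):
                setattr(sub_gm, name, getattr(parent_gm, name))
```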

Fixes #140395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155746
Approved by: https://github.com/xmfan, https://github.com/bdhirsh
2025-06-14 00:15:58 +00:00
3b6569b1ef Enable manywheel build and smoke test on main branch for ROCm (#153287)
Fixes issue of not discovering breakage of ROCm wheel builds until the nightly job runs e.g. https://github.com/pytorch/pytorch/pull/153253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153287
Approved by: https://github.com/jeffdaily
2025-06-14 00:05:57 +00:00
bbbced94a4 [BE]: Sync cusparselt 12.9 with static build and other cuda 12 (#155709)
followup for https://github.com/pytorch/pytorch/pull/154980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155709
Approved by: https://github.com/tinglvv, https://github.com/atalman, https://github.com/nWEIdia, https://github.com/cyyever
2025-06-13 23:10:01 +00:00
d512584718 [BE] Refactor clamp dtypes check (#155930)
By introducing `check_for_unsupported_clamp_dtypes` similar to `check_for_unsupported_isin_dtypes`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155930
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/clee2000
ghstack dependencies: #155470
2025-06-13 23:05:02 +00:00
0cb85c188f [BE] Move optional submodules checkout to its own module (#155947)
To expand it to optional eigen checkout later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155947
Approved by: https://github.com/Skylion007
2025-06-13 23:02:38 +00:00
3003c681ef Converting .rst files to .md files (#155377)
Fixes #155036
This pull request updates the documentation for several modules by transitioning from .rst to .md format, improving readability and usability. It introduces new Markdown files for the documentation of torch.ao.ns._numeric_suite, torch.ao.ns._numeric_suite_fx, AOTInductor, AOTInductor Minifier, and the torch.compiler API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155377
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 22:54:27 +00:00
799443605b Convert to markdown: distributed.tensor.parallel.rst, distributed.tensor.rst, distributions.rst, dlpack.rst (#155297)
Fixes #155019

## Description
Convert to markdown: distributed.tensor.parallel.rst, distributed.tensor.rst, distributions.rst, dlpack.rst

## Checklist
- [X] dlpack.rst converted to dlpack.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/dlpack.html)
- [X] distributions.rst converted to distributions.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributions.html)
- [X] distributed.tensor.rst converted to distributed.tensor.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributed.tensor.html)
- [X] distributed.tensor.parallel.rst converted to distributed.tensor.parallel.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributed.tensor.parallel.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155297
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 22:08:37 +00:00
764c02b78b [BE] Raise NotImplementedError (#155470)
When an op is unimplemented for a specific dtype, which makes more sense than a RuntimeError.

Example
```python
>>> import torch
>>> torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
NotImplementedError: "hardshrink_cpu" not implemented for 'Long'
```

release notes bc-breaking: After this release, a `NotImplementedError` exception will be raised when an ATen operation is called on a combination of input tensor dtypes for which it has not been implemented.

Mark a few more unary ops as unimplemented to keep foreach-test error reporting consistent between CPU and CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155470
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-06-13 22:07:03 +00:00
d59ed21d0f [CI] Reuse old whl: track why failed to use the old whl (#155860)
As in title
Any other things I should track?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155860
Approved by: https://github.com/malfet
2025-06-13 22:01:31 +00:00
3596c0c77f Fix test after revert (#155946)
e.g.
test_dynamic_shapes.py::TestUbackedOps::test_unbacked_reshape2 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15642199583/job/44073674212) [HUD commit link](06408dae49)

started after 06408dae49d06b6146fdd9d7a37eb5dde4f5e78d

I don't know what the test does, so maybe there's a better way to fix this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155946
Approved by: https://github.com/yangw-dev, https://github.com/huydhn, https://github.com/malfet
2025-06-13 21:52:07 +00:00
eef253d9f6 [CI] Keep going display on HUD: upload log when test fails (#155371)
I guess this is more of an RFC

Goal:
Enable keep going so that we can get information immediately for failures.  We want to be aware of failures as soon as possible, especially on the main branch, so that reverts can happen quickly.

Proposal:
A job with `keep-going` will continue through errors in `python run_test.py`.  If a test fails, then before running the next test it will upload a fake log that has enough information for viewing the log to tell you what failed (with any stack traces/error output), and that the log classifier can parse to extract a line.

I am getting the log by concatenating the test logs in test/test-reports, which is all the text output by pytest (unless someone runs with the `ci-verbose-test-logs` label).  There are obviously many things this won't catch, e.g. output outside of run_test.py and some output inside of run_test.py, but it should be enough.

After a job finishes, its raw log is eventually uploaded to the ossci-raw-job-status S3 bucket, and the log classifier will read it to do classification.  This means we will have to change the log classifier to read from this bucket as well.
I'm thinking of just adding an input parameter to the log classifier, like https://github.com/pytorch/test-infra/pull/6723/files
Also upload the temp results to a temp attribute instead of the real one

To overwrite the conclusion on HUD, I'm thinking of a Lambda triggered by the S3 put of the fake log. It would do something similar to the log classifier, mutating the entry 13a990b678/aws/lambda/log-classifier/src/network.rs (L85) to add a new field like "will_fail": true, and also triggering the log classifier to run.

Then we change HUD/ClickHouse to point the raw log URL to the alternate place, use the new "will_fail" field as the conclusion, and use the temp log classifier result if needed.

Why always write to a temp attribute/column? I am unsure about overwriting the real results with fake ones.

Pros:
Not many changes outside of HUD/UI

Cons:
Lots of moving parts; lots of temp fields that will require adjustments to queries and that never really get deleted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155371
Approved by: https://github.com/malfet
2025-06-13 21:21:55 +00:00
e5ed267f83 Update h100-distributed image (#155861)
Move non-inductor workflows from CUDA 12.6 to CUDA 12.8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155861
Approved by: https://github.com/seemethere
2025-06-13 21:17:05 +00:00
20a74c370b Add error message with assert to topK if ndims() - dim > 4 (#155475)
Addressing #154890

Not really a proper fix but at least it's more informative than the current crash.

For a longer-term solution, I'm testing whether we can use the TopK API released in macOS 14, as it does not have the same MPSScan op issue that Sort and ArgSort are hitting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155475
Approved by: https://github.com/kulinseth
2025-06-13 21:10:06 +00:00
049dc48d1e fix code chunk indentation for jit_language_reference_v2.md (#155937)
Fixes https://github.com/pytorch/pytorch/issues/155023
Related PR: #155781

Description:
As discussed, this PR is a follow-up update for `jit_language_reference_v2.md` by deleting the code chunk indentation.

Checklist:

- [x]  The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x]  Only one issue is addressed in this pull request
- [x]  Labels from the issue that this PR is fixing are added to this pull request
- [x]  No unnecessary issues are included into this pull request.

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label "module: docs"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155937
Approved by: https://github.com/jingsh, https://github.com/svekars
2025-06-13 21:05:23 +00:00
731351bb4a Convert rst to markdown - optim.rst #155031 (#155813)
Fixes #155031
![image](https://github.com/user-attachments/assets/36507ca1-eb1e-4358-9e66-ce25ec8a2be1)

@pytorchbot label "docathon-h1-2025" "module: docs" "topic: not user facing" "topic: docs"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155813
Approved by: https://github.com/AlannaBurke
2025-06-13 21:03:39 +00:00
92388bb2ab [export] Remove broken check for multiple cpp files in PT2 package (#155149)
This check was recently added, but (when fixed to refer to CPP rather than library files) fails with the separate kernel and wrapper build of AOTInductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155149
Approved by: https://github.com/angelayi
2025-06-13 21:02:31 +00:00
7d1b3f599d [Docs] Convert to markdown cond.rst, config_mod.rst (#155653)
Related to #155014

Only included 2 files in this PR:

- cond.rst
- config_mod.rst

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155653
Approved by: https://github.com/svekars
2025-06-13 20:58:57 +00:00
fdf5d97fa8 [cutlass backend][ez] Log timings from prescreening (#155757)
Differential Revision: [D76474669](https://our.internmc.facebook.com/intern/diff/D76474669/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155757
Approved by: https://github.com/ColinPeppler
2025-06-13 20:44:04 +00:00
f3e6c8e834 Fix #155016 for Docathon - convert rst to markdown (#155198)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

One note is that the "Created On" and "Last Updated On" banners don't show in the markdown files... I'm not sure if that's just an artifact of my local build though.

Fixes #155016

Docs comparison (check out the 'new' whenever docs build)

1. cuda ([old](https://docs.pytorch.org/docs/main/cuda.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/cuda.html))
2. cuda.tunable ([old](https://docs.pytorch.org/docs/main/cuda.tunable.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/cuda.tunable.html))
3. leave cudnn_persistent_rnn.rst as is because it's reused in docstrings
4. leave cudnn_rnn_determinism.rst as is because it's reused in docstrings.
5. data ([old](https://docs.pytorch.org/docs/main/data.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/data.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155198
Approved by: https://github.com/albanD, https://github.com/svekars
2025-06-13 20:24:34 +00:00
bf798a2f01 Change _hfstorage to hfstorage (#155837)
Summary: Change the HF classes to not have a leading underscore, thereby making them public; we will add documentation for them following this

Test Plan:
ensure existing tests pass

Rollback Plan:

Differential Revision: D76364024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155837
Approved by: https://github.com/saumishr
2025-06-13 20:19:51 +00:00
77f884c2ec Optimize Tensor.backward type hints (#155656)
Fixes #81963

## Test Result

![image](https://github.com/user-attachments/assets/67380fdc-73c4-43d8-b2a5-5e16d63f4fd3)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155656
Approved by: https://github.com/soulitzer
2025-06-13 19:16:48 +00:00
06408dae49 Revert "Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757)"
This reverts commit 0029259bdfeee627181df2b9f5ff6979f65090ec.

Reverted https://github.com/pytorch/pytorch/pull/154757 on behalf of https://github.com/laithsakka due to post land issue ([comment](https://github.com/pytorch/pytorch/pull/154757#issuecomment-2971385787))
2025-06-13 19:11:43 +00:00
4628f1b7a9 [Hierarchical-Compile] Track mutations for setitem (#155880)
This fixes a bug in the tensor variable where we would not do things like setting the example value on setitem nodes (but these don't typically have users, so it usually doesn't matter)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155880
Approved by: https://github.com/anijain2305
2025-06-13 18:59:31 +00:00
344731fb25 Add CUDA 12.9.1 sbsa nightly binaries (#155819)
https://github.com/pytorch/pytorch/issues/155196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155819
Approved by: https://github.com/atalman
2025-06-13 18:52:41 +00:00
ce44877961 [c10d][PGNCCL] Make watchdog thread a class (#155831)
Extracting both the monitor thread and the watchdog thread into a separate class will help us learn what dependencies we have for each thread, and it will simplify the consolidation work for each thread (consolidating from one thread per PG instance to one per PG class).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155831
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-06-13 18:05:22 +00:00
c5d00e150a convert: rst to myst pr 1/2 (#155840)
Fixes #155038
parent [PR](https://github.com/pytorch/pytorch/pull/155375) (made two PRs to pass sanity check)
this PR converts the following two .rst files
- [torch.compiler_dynamo_overview](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_dynamo_overview.rst)
- [torch.compiler_fake_tensor](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_fake_tensor.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155840
Approved by: https://github.com/sekyondaMeta
2025-06-13 18:02:28 +00:00
36bf81e363 [BE] Fix minifier when one has multiple Python runtimes (#155918)
By using `sys.executable` instead of `"python"`

Otherwise, it fails on Ubuntu with a `python not found` error.
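For illustration, the general fix pattern (not the minifier code itself):

```python
# Launch the current interpreter explicitly rather than whatever "python"
# happens to resolve to on PATH.
import subprocess
import sys

subprocess.check_call([sys.executable, "-c", "print('minifier subprocess')"])
```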

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155918
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/zou3519
2025-06-13 17:55:04 +00:00
093aaccae2 convert jit_language_reference_v2.rst to jit_language_reference_v2.md (#155781)
Fixes https://github.com/pytorch/pytorch/issues/155023

Description:
converted `jit_language_reference_v2.rst` to `jit_language_reference_v2.md`
**I indented the code blocks to minimize the file difference to pass the sanity check for no more than 2000 lines of change. I will submit another PR to fix the indentation after this PR is merged.**

Checklist:

- [x]  The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x]  Only one issue is addressed in this pull request
- [x]  Labels from the issue that this PR is fixing are added to this pull request
- [x]  No unnecessary issues are included into this pull request.

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155781
Approved by: https://github.com/svekars
2025-06-13 17:33:10 +00:00
f0bee87eea [xla hash update] update the pinned xla hash (#155779)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155779
Approved by: https://github.com/pytorchbot
2025-06-13 17:13:37 +00:00
1f3cc4875c [ATen][CUDA][cuSOLVER] Add cusolverDnXsyevBatched for torch.linalg.eigh (#155695)
This PR add a new API for SYEV operation of cuSOLVER [`cusolverDnXsyevBatched`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdnxsyevbatched) which is a new alternative to [`cusolverDn<t>syevjBatched`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdn-t-syevjbatched). This API was introduced in cuSOLVER as part of 64-bit API in CUDA Tool Kit 12.6.2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155695
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-06-13 17:12:26 +00:00
b6add8c8ba [MPSInductor] Fix remainder implementation for int types (#155891)
Introduce `c10::metal::remainder` and call it from both the inductor and eager implementations, with an integer specialization, which should make it much faster than before while remaining compliant with the Python way of rounding negative numbers.

This allows one to remove the complex type detection logic from the MPS codegen and rely on the Metal (C++) type system to figure out input and output types.

This fixes compilation of something like
```python
@torch.compile
def f(x, y):
    return x[y % 5]
```

which beforehand failed to compile with
```
torch._inductor.exc.InductorError: SyntaxError: failed to compile
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant long* in_ptr0,
        constant float* in_ptr1,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp0 = in_ptr0[x0];
        auto tmp1 = 12;
        auto tmp2 = static_cast<float>(tmp0) - static_cast<float>(tmp1) * metal::floor(static_cast<float>(tmp0) / static_cast<float>(tmp1));
        auto tmp3 = 1024;
        auto tmp4 = static_cast<long>(tmp3);
        auto tmp5 = tmp2 + tmp4;
        auto tmp6 = tmp2 < 0;
        auto tmp7 = tmp6 ? tmp5 : tmp2;
        if ((tmp7 < 0) && (tmp7 > 1024)) return;
        auto tmp9 = in_ptr1[tmp7];
        out_ptr0[x0] = static_cast<float>(tmp9);
    }
 with program_source:372:28: error: array subscript is not an integer
        auto tmp9 = in_ptr1[tmp7];
                           ^~~~~
```

This fixes fail_to_compile for GPT2ForSequenceClassification Huggingface model using `transformers==4.44.2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155891
Approved by: https://github.com/manuelcandales
2025-06-13 16:42:56 +00:00
9462106b7e [nativert] Move graph_passes to nativert (#155411)
Summary: Move graph_passes to nativert

Test Plan:
CI

Rollback Plan:

Differential Revision: D76205048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155411
Approved by: https://github.com/zhxchen17
2025-06-13 16:41:01 +00:00
338a8c7853 fix slice w/ dynamic shapes (#153131)
Summary: guard_size_oblivious has side effects that result in invalid strides when slice nodes take a negative index on dynamic input shapes, causing an overflow error with a huge number, "9223372036854776048".
Test Plan: CIs should pass.

Differential Revision: D74354663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153131
Approved by: https://github.com/laithsakka
2025-06-13 15:53:17 +00:00
a5938ff431 [BE][c10d/Store]add check in pyi (#155855) (#155865)
Summary:

"check" is already binded https://fburl.com/code/9lx1zf9o
which is also documented in https://docs.pytorch.org/docs/stable/distributed.html
add it to pyi for type checking
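A hedged sketch of the kind of stub being added (the exact annotation in the PR may differ):

```python
# Sketch of a .pyi entry; "..." is the conventional stub body.
class Store:
    def check(self, keys: list[str]) -> bool: ...
```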

Test Plan:
skip

Rollback Plan:

Differential Revision: D76547457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155865
Approved by: https://github.com/fduwjj
2025-06-13 15:39:27 +00:00
bee93f9f0d Move glslc to cas to enable remote execution (#155832)
Meta:
`fbsource//xplat/caffe2:gen_torch_vulkan_spv_cpp` takes on average 2 min to build and is one of the slowest targets in fbandroid.
See: https://fb.workplace.com/groups/2840058936242210/posts/4067730240141734

This target had to run locally because it uses the Manifold backend for dotslash. This diff moves `glslc` to the CAS backend so that it can run on RE.

Here are commands executed:
```
% manifold get dotslash_glslc/flat/glslc-linux-x86_64.tar.gz
% manifold get dotslash_glslc/flat/glslc-macos-v2024_4.tar.gz
% manifold get dotslash_glslc/flat/glslc-windows-v2024_3.tar

% ls
-rw-r--r--  1 navidq  staff   2.0M Jun 12 10:02 glslc-linux-x86_64.tar.gz
-rw-r--r--  1 navidq  staff   4.7M Jun 12 10:03 glslc-macos-v2024_4.tar.gz
-rw-r--r--  1 navidq  staff   4.4M Jun 12 10:03 glslc-windows-v2024_3.tar

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-linux-x86_64.tar.gz
ea5d674e0e7e9782be3f5c309e3484732e5b3a331cbe3258f3e929002811627b:2072937

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-macos-v2024_4.tar.gz
1331dc691835e4676832b7c21ef669083a3acc8856981583d0698192f466c51a:4898649

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-windows-v2024_3.tar
76181fbb1ce5c62d0c905db26df3a64e999d0baff2e93270775921daa91e3a1a:4585984
```

Differential Revision: [D76513735](https://our.internmc.facebook.com/intern/diff/D76513735/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155832
Approved by: https://github.com/GregoryComer
2025-06-13 14:38:51 +00:00
ce6e0523f9 Revert "[BE] Raise NotImplementedError (#155470)"
This reverts commit 5ab6a3fb6fd37c542060c606edd4b95c7e3cae82.

Reverted https://github.com/pytorch/pytorch/pull/155470 on behalf of https://github.com/malfet due to foreach tests are failing on ROCm because we are not running the same on CUDA ([comment](https://github.com/pytorch/pytorch/pull/155470#issuecomment-2970592124))
2025-06-13 14:32:50 +00:00
3819584f12 [precompile] Implement PrecompileContext for recording precompile artifacts, integrate with CompilePackage (#154415)
This PR implements a basic interface and test for PrecompileContext, a special CacheArtifactManager specifically designed for precompile. The job of a PrecompileContext is to record things precompile needs as torch is compiling,  dump it all into bytes, and then stitch it back together into a cache of callables.

## Why use CacheArtifactManager?
Precompile needs a way to record various serializable data as torch is compiling. CacheArtifactManager already does this today pretty well, handling a lot of serialization and cache information. So we're reusing a bunch of that infrastructure directly.

## How is it different from CacheArtifactManager?
Unlike regular CacheArtifactManager, PrecompileContext needs to be able to take the recorded artifacts and stitch them together after deserialization, to create a single working callable.
Since PrecompileContext doesn't need the cache keys, the "key" field of PrecompileArtifacts can be used for metadata relating to how to stitch the individual functions being compiled together into a full callable. For example, on a given dynamo compile, if there are multiple functions (via graph breaks or recompiles) being compiled, MegaCache would organize it like so:

![image](https://github.com/user-attachments/assets/49a0a75b-1e7f-4d96-8d81-6769fe5a53ca)

Whereas we'd visualize PrecompileContext's result like so:

![image](https://github.com/user-attachments/assets/fcc0dd4e-dfbf-4b13-9c08-2e99b373180b)

For now, we just handle eager mode; in the diff above, I'll hook up the other backend artifacts from PrecompileContext.

After this PR, precompile consists of three main interfaces:

### CompilePackage
- Everything needed to run one torch.compile'd function (including graph breaks)
- `__init__(fn, cache_entry)` Initializes with a DynamoCacheEntry
- `install(backends)` load precompile artifacts into function's dynamo state with a dictionary of backends
- `cache_entry()` return a serializable cache entry to save

### DynamoStore
- Responsible for tracking CompilePackages on disk (and/or in memory)
- `load_package(path)`: load a package given a torch compiled function and a path to the cache artifact
- `save_package(package, path)`: Save a CompilePackage to a path. Calls PrecompileContext to grab backend data
- `record_package(package)`: Record a package to PrecompileContext (for global serialization/deserialization)

### PrecompileContext
- Overarching context for serializing and deserializing precompile artifacts. Supports **global** and **local** setups.
- `serialize()`: (Global) serializes all artifacts in PrecompileContext into bytes
- `populate_caches(bytes)`: (Global) takes serialized bytes and puts them into DynamoStore (TODO)
- `serialize_artifact_by_key(key)`: (Local) serialize a single artifact by its cache key

<img width="1455" alt="image" src="https://github.com/user-attachments/assets/99b61330-7607-4763-bdbc-85b366e82cdd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154415
Approved by: https://github.com/zhxchen17
ghstack dependencies: #155118
2025-06-13 14:11:24 +00:00
b2fc9cfea1 [precompile] Add CompilePackage to serialize dynamo states. (#155118)
Adding a per torch.compile() object CompilePackage which tracks dynamo artifact. CompilePackage is considered a low level component and should not be directly exposed to end users. It has the following interface:

1. `CompilePackage.__init__()` which optionally takes previously serialized dynamo states.
     a. when the `dynamo` argument is None, it will construct a brand new CompilePackage object.
     b. when the `dynamo` argument is not None, it will load a pre-compiled dynamo state.
2. `package.save()` which dumps the dynamo states into _DynamoCacheEntry.
3. `package.install(backends)` which will handle all the side-effectful global scope updates with compiled functions and resume functions.

This diff focuses on the low-level mechanism for precompile. It is left to higher-level interfaces to use these APIs to build a more user-facing frontend.

Differential Revision: [D75956538](https://our.internmc.facebook.com/intern/diff/D75956538/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155118
Approved by: https://github.com/jamesjwu

Co-authored-by: James Wu <jjwu@meta.com>
2025-06-13 13:54:10 +00:00
670dab6c63 [AOTI] Enable OP test__weight_int4pack_mm_with_scales_and_zeros in AOTI. (#155780)
The op test__weight_int4pack_mm_with_scales_and_zeros is for Intel GPU. It is functionally equivalent to the CUDA/CPU op test__weight_int4pack_mm (with the constraint that oneDNN only supports integer zero points, which is why we need this API). Since test__weight_int4pack_mm is already included in AOTI's fallback list, this PR adds support for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155780
Approved by: https://github.com/jansel
2025-06-13 11:12:13 +00:00
463fe36532 fix error message on specialization with Dim.DYNAMIC (#155738)
Previously specialization error messages would render sources that were pretty far from source-code names. E.g., given args named `x, y, zs`, the source for `y.size()[0]` would be rendered as `args[0][1].size()[0]`.

This is because we created artificial local names following `(args, kwargs)` structure instead of reusing signatures. This PR fixes that situation.

Basically we map prefixes of key paths that correspond to original arg names to root sources corresponding to those names; the rest of the key paths hang from these root sources.

Differential Revision: [D76461391](https://our.internmc.facebook.com/intern/diff/D76461391/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155738
Approved by: https://github.com/bobrenjc93
2025-06-13 10:33:46 +00:00
6abe450a6f [pytorch Aten] Delete unused duplicate clamp_stub, to avoid compile error (#154631)
I found the `clamp_stub` in `UnaryOps.h` is not used. And it's a duplicate of the `clamp_stub` in `TensorCompare.cpp`:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/TensorCompare.cpp#L313-L314

This diff/PR deletes it as this duplicate caused build failure for me:
```
ATen/native/UnaryOps.h:109:1: error: redefinition of 'clamp_stub_DECLARE_DISPATCH_type'
```

Differential Revision: [D75612521](https://our.internmc.facebook.com/intern/diff/D75612521/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154631
Approved by: https://github.com/Skylion007, https://github.com/cyyever, https://github.com/nautsimon
ghstack dependencies: #154589, #154591
2025-06-13 10:01:51 +00:00
1cc31b213d [MTIA Aten Backend] Migrate bitwise_and.Tensor_out (#154591)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

- Migrate bitwise_and.Tensor_out
- Add tests for dtype casting and shape broadcasting

Differential Revision: [D75578498](https://our.internmc.facebook.com/intern/diff/D75578498/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154591
Approved by: https://github.com/malfet
ghstack dependencies: #154589
2025-06-13 10:01:51 +00:00
65b9c13cce [Intel GPU] Enable safe softmax for XPU SDPA (#151999)
Fix https://github.com/intel/torch-xpu-ops/issues/1432#event-16899653975

When one row of the Q*K attention score is masked with `-inf`, `softmax(score)` would output `NaN` for the whole row, which would corrupt the model output.

With this new flag, it outputs `0` for the whole row, which is aligned with PyTorch's CPU/CUDA behavior.
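For illustration, the failure mode on plain PyTorch (not the XPU SDPA path):

```python
import torch

# A fully masked (-inf) row: exp(-inf) == 0 everywhere, so softmax divides 0/0.
scores = torch.full((1, 4), float("-inf"))
print(torch.softmax(scores, dim=-1))  # tensor([[nan, nan, nan, nan]])
```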

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151999
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-13 08:53:47 +00:00
56b03df6ac [MTIA Aten Backend] Migrate where.self and where.self_out (#154589)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

- Migrate where.self and where.self_out
- Add tests for dtype casting and shape broadcasting

Differential Revision: [D75577304](https://our.internmc.facebook.com/intern/diff/D75577304/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154589
Approved by: https://github.com/malfet
2025-06-13 08:25:13 +00:00
3d595fd559 update get start xpu (#151886)
Update the link and product name.
Add a print of the ```torch.xpu.is_available()``` result to the code snippet, for users not running it from the interactive python command line.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151886
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-13 07:46:13 +00:00
53d06e18d9 [dynamo] add missing algorithm header (#154754)
Needed for `std::max(<initializer-list>)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154754
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2025-06-13 06:56:11 +00:00
6020440683 remove allow-untyped-defs from adaround_fake_quantize.py (#155621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155621
Approved by: https://github.com/Skylion007
2025-06-13 06:14:22 +00:00
99e99d5bfe [a2av] Test must allocate tensors symmetrically (#155835)
This is a requirement of most SHMEM backends. Otherwise, allocations may misalign across ranks.

In this PR, we make the (total) input size and output size constant, even though the split sizes are created randomly. (Previously we summed the splits up as the input size, which created misalignment in the SHMEM heap across ranks.)
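A hedged sketch of the testing idea, with illustrative names (not the test's actual code):

```python
import torch

def random_splits(total: int, world_size: int) -> list[int]:
    # Draw world_size - 1 cut points in [0, total] and use the gaps as splits,
    # so the splits are random but always sum to the same fixed total.
    cuts, _ = torch.sort(torch.randint(0, total + 1, (world_size - 1,)))
    edges = torch.cat([torch.tensor([0]), cuts, torch.tensor([total])])
    return (edges[1:] - edges[:-1]).tolist()

splits = random_splits(total=1 << 20, world_size=8)
assert sum(splits) == 1 << 20  # every rank allocates the same total size
```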

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155835
Approved by: https://github.com/fduwjj, https://github.com/fegin, https://github.com/Skylion007
ghstack dependencies: #155506
2025-06-13 06:05:38 +00:00
0860606729 [export] Add meta[val] to getattr nodes (#154934)
Fixes [P1830293318](https://www.internalfb.com/intern/paste/P1830293318/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154934
Approved by: https://github.com/yushangdi, https://github.com/muchulee8
2025-06-13 05:48:21 +00:00
25717da8c8 [BE] Don't run the same tests on 2xlarge and 4xlarge (#155859)
Also, speedup builds by moving them to 4xlarge instances

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155859
Approved by: https://github.com/ZainRizvi, https://github.com/wdvr
2025-06-13 05:40:20 +00:00
a87dfc7480 [symm_mem] Update CMakeList to reflect code moving a dedicated folder (#155823)
We moved all symm_mem code into a folder ([CudaDMAConnectivity](https://github.com/pytorch/pytorch/pull/155573)) but forgot to update the CMakeList entry for CudaDMAConnectivity.

Users see the error `RuntimeError: DMA connectivity detector for cuda over nvlink is not available` when calling torch.distributed.init_process_group(backend=backend), so this PR should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155823
Approved by: https://github.com/Skylion007
2025-06-13 05:27:59 +00:00
70bb34929a Convert to .md: draft_export.rst, export.ir_spec.rst, fft.rst (#155567)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155020. This PR is split into 3 to pass sanity check.

Docs comparison (check out the 'new' whenever docs build)

1. draft_export ([old](https://docs.pytorch.org/docs/main/draft_export.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/draft_export.html))
2. export.ir_spec ([old](https://docs.pytorch.org/docs/main/export.ir_spec.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/export.ir_spec.html))
3. fft ([old](https://docs.pytorch.org/docs/main/fft.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/fft.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155567
Approved by: https://github.com/svekars
2025-06-13 05:19:43 +00:00
b878ca0c91 [cutlass backend] add fp8 to cutlass benchmark script (#155507)
Summary:
Add fp8.

Right now FP8 only allows fast_accum.

Test Plan:
```
Experiment group: _scaled_mm (8192x8192, 8192x8192) torch.float8_e4m3fn
+-----------------------+--------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | teraflops (TFLOPS) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
|         aten          | 967.1226739883423  | 1136.8895149998868 |  1.219131228979677   |         NA         |
|        triton         | 1764.6185159683228 |  623.08743664783   |  20.373826419003308  | 82.46067054670186  |
| triton_persistent_tma | 1769.0335512161255 | 621.5323768280928  |  20.48663099599071   | 82.91718297956578  |
|  cutlass_lvl_default  | 790.5075550079346  | 1390.8932568835019 |  13.788519630907103  | -18.26191482535096 |
|   cutlass_lvl_3332    | 803.7384748458862  | 1367.996757884245  |  226.81587297911756  | -16.89384434227684 |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
```

Rollback Plan:

Differential Revision: D76310809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155507
Approved by: https://github.com/ColinPeppler
2025-06-13 05:11:15 +00:00
2ba930d4ce Convert rst to markdown - profiler.rst #155031 (#155559)
Fixes https://github.com/pytorch/pytorch/issues/155031

* [profiler.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/profiler.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155559
Approved by: https://github.com/svekars
2025-06-13 05:02:54 +00:00
e8b3dfa7c0 convert jit_language_reference.rst to jit_language_reference.md (#155633)
Part of changes https://github.com/pytorch/pytorch/issues/155023 (parent PR https://github.com/pytorch/pytorch/pull/155429)

- converted jit_language_reference.rst to jit_language_reference.md

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155633
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 04:58:28 +00:00
3f65e38b73 Convert hub.rst to hub.md (#155483)
Part of changes https://github.com/pytorch/pytorch/issues/155023 (parent PR https://github.com/pytorch/pytorch/pull/155429)

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155483
Approved by: https://github.com/svekars
2025-06-13 04:39:55 +00:00
0a6b66c881 Inductor comms reorder logs to tlparse (#155737)
Hacked test_inductor_collectives test to demonstrate this works:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/whc/de50ff33-f460-406b-bfa9-457e6e17395b/custom/-_0_0_0/reorder_communication_preserving_peak_memory_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Follow up: it would be nice to move the logging out of this pass and
into the broader comms pass loop, where the before/after each pass
visualization could be logged into the same tlparse file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155737
Approved by: https://github.com/bdhirsh
2025-06-13 02:59:42 +00:00
f151b20123 [AOTI] Remove the emit_current_arch_binary option (#155768)
Summary: Remove the option, since generating a fatbin with PTX only doesn't work on H100; switch to always including one PTX and one SASS in the fatbin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155768
Approved by: https://github.com/angelayi
2025-06-13 02:06:07 +00:00
020da74437 [Easy] Remove empty file (#155796)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155796
Approved by: https://github.com/malfet
ghstack dependencies: #155772
2025-06-13 01:42:11 +00:00
905b194a2e Replace device check of TORCH_INTERNAL_ASSERT with TORCH_CHECK (#155318)
Fixes #136849

## Test Result

```python
>>> import torch
>>> device = torch.cuda.device_count() + 1
>>> torch.cuda.current_stream(device) #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1083, in current_stream
    streamdata = torch._C._cuda_getCurrentStream(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)

>>> torch.cuda.default_stream(device) #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1101, in default_stream
    streamdata = torch._C._cuda_getDefaultStream(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)

>>> torch.cuda.set_per_process_memory_fraction(0.5, device)  #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/memory.py", line 193, in set_per_process_memory_fraction
    torch._C._cuda_setMemoryFraction(fraction, device)
RuntimeError: Allocator not initialized for device : did you call init?

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155318
Approved by: https://github.com/albanD
2025-06-13 01:20:19 +00:00
d7e657da35 pyfmt lint more torch/utils files (#155812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155812
Approved by: https://github.com/Skylion007
ghstack dependencies: #155782, #155783
2025-06-12 23:51:42 +00:00
4d3ecefda5 [aoti][mps] Use cpp sym-expr printer (#155646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155646
Approved by: https://github.com/desertfire
ghstack dependencies: #155752, #154287, #155582, #155583
2025-06-12 23:33:28 +00:00
2e65d72e1e [aoti][mps] Fix int/symint kernel args (#155583)
Integer arguments to MPS kernels need to go through a different function, since `aoti_torch_mps_set_arg` only takes a Tensor; I therefore added `aoti_torch_mps_set_arg_int`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155583
Approved by: https://github.com/desertfire
ghstack dependencies: #155752, #154287, #155582
2025-06-12 23:33:28 +00:00
ffbda61fbe [aoti][mps] Fix dynamic dispatch size (#155582)
In the case where we pass in a symint to the `dispatch` call, the compiler errors, so we need to cast the input to int64_t.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155582
Approved by: https://github.com/malfet
ghstack dependencies: #155752, #154287
2025-06-12 23:33:15 +00:00
a4ab392251 [aoti][mps] mps constants support (#154287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154287
Approved by: https://github.com/malfet
ghstack dependencies: #155752
2025-06-12 23:33:07 +00:00
8821a9dc4e [BE][aoti][mps] Fix tests to use common function (#155752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155752
Approved by: https://github.com/desertfire, https://github.com/malfet
2025-06-12 23:32:59 +00:00
5ab6a3fb6f [BE] Raise NotImplementedError (#155470)
When an op is unimplemented for a specific dtype, which makes more sense than a RuntimeError.

Example
```python
>>> import torch
>>> torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
NotImplementedError: "hardshrink_cpu" not implemented for 'Long'
```

release notes bc-breaking: After this release, a `NotImplementedError` exception will be raised when an ATen operation is called on a combination of input tensor dtypes for which it has not been implemented.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155470
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-06-12 23:19:12 +00:00
d9b8369f39 fix warning spam for list indexing (#155815)
Per title, #154806 incorrectly placed a warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155815
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-06-12 23:07:24 +00:00
2903e5ad3c pyfmt lint more export files (#155783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155783
Approved by: https://github.com/Skylion007
ghstack dependencies: #155782
2025-06-12 23:04:11 +00:00
86b1116f22 pyfmt lint torch/_custom_op/* (#155782)
The file torch/_custom_op/functional.py does not exist.
The file torch/_custom_op/__init__.py is empty.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155782
Approved by: https://github.com/Skylion007
2025-06-12 23:04:11 +00:00
4cdbdcdbcf Switch to miniconda for ROCm CI (#155239)
Related to https://github.com/pytorch/pytorch/issues/148335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155239
Approved by: https://github.com/jeffdaily
2025-06-12 22:55:47 +00:00
f04fd4dc4e typing: allow integer in bitwise operations (#155704)
Fixes #155701 (false positives)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155704
Approved by: https://github.com/Skylion007, https://github.com/aorenste
2025-06-12 22:40:17 +00:00
938515fa75 [aoti] Update cshim for all backends (#155604)
Fixes https://github.com/pytorch/pytorch/issues/155349
`python torchgen/gen.py --update-aoti-c-shim` will now update all cpu/cuda/mps/xpu shims -- I verified this using `aten._print.default`, but didn't commit the changes since I'm not sure if we actually want to add this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155604
Approved by: https://github.com/desertfire, https://github.com/janeyx99
2025-06-12 22:10:58 +00:00
38bfd462b8 Use swap_tensors path in nn.Module.to for FakeTensor (#152539)
Fixes https://github.com/pytorch/pytorch/issues/148977

Differential Revision: [D76458023](https://our.internmc.facebook.com/intern/diff/D76458023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152539
Approved by: https://github.com/albanD
2025-06-12 22:08:21 +00:00
db01f1032f Support XPU in memory tracker (#150703)
This PR adds support for XPU devices to the distributed MemoryTracker tool, including unit test for XPU.

Specifically, this code adds tracking for a few alloc-related statistics for XPUCachingAllocator. It also adapts the existing memory tracker tool to be device agnostic, by getting the device module and recording the necessary memory stats. (I get the device module instead of using `torch.accelerator` methods, as that API is still in-progress.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150703
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/d4l3k
2025-06-12 21:33:52 +00:00
154a39bfbd basic compile support for grouped_mm (#153384)
grouped_mm is used in torchtitan, this adds just enough support in compile to allow inductor to lower it as a fallback kernel. I imagine that at some point in the future it may be valuable to get inductor to support templating grouped_mm, although this PR just provides basic support. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ngimel @eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153384
Approved by: https://github.com/eellison
2025-06-12 21:24:51 +00:00
f2b44424a1 [ROCm] Skip *_stress_cuda and test_ddp_apply_optim_in_backward* (#155724)
These tests are flaky on ROCm and have been skipped via Github issues, but the bot keeps closing the issues after not observing the failures for these tests in the rerun_disabled_tests runs (not sure why they don't fail there), and we have to keep reopening them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155724
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-06-12 21:18:04 +00:00
590fe4d2d7 Skip updating the default device distributed backend if already registered (#155320)
Motivation:

PyTorch maintains a `default_device_backend_map` https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L269 , which indicates the default distributed backend when no backend name is specified in the user frontend (e.g. `init_process_group`).

Currently, `"xpu": XCCL` is also in this `default_device_backend_map`. However, if another process group name is registered as the XPU distributed backend, it immediately replaces XCCL in this default map, which is not what we want.

Therefore, we would like to skip updating the default distributed backend if one is already registered in the map.
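A minimal sketch of the guard; the map contents are reproduced here only to illustrate the check, not as the authoritative list:

```python
# Mirrors the idea in torch.distributed.distributed_c10d (names illustrative).
default_device_backend_map = {"cpu": "gloo", "cuda": "nccl", "xpu": "xccl"}

def register_backend_for_device(device: str, backend_name: str) -> None:
    # Only the first registration for a device becomes its default backend.
    if device not in default_device_backend_map:
        default_device_backend_map[device] = backend_name
```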

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155320
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-06-12 21:17:06 +00:00
29391c7cf9 [ez] Mark linalg svd memory allocation test as serial b/c OOMing on cu128 (#155811)
9df2e8020f/1

8e8d4b13b0 (43980565863-box)

started OOMing after switching to cuda 12.8

Maybe because I made some changes to fix the per-process memory fraction, so each process has less memory
```
2025-06-12T15:29:50.4998758Z FAILED [0.0124s] test_linalg.py::TestLinalgCUDA::test_svd_memory_allocation_cuda_complex128 - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.10 GiB. GPU 0 has a total capacity of 7.43 GiB of which 6.85 GiB is free. Process 80272 has 68.75 MiB memory in use. Process 83346 has 68.75 MiB memory in use. Process 83365 has 374.75 MiB memory in use. Process 83384 has 70.75 MiB memory in use. 2.90 GiB allowed; Of the allocated memory 240.00 MiB is allocated by PyTorch, and 2.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155811
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman, https://github.com/eqy
2025-06-12 21:05:32 +00:00
093fd47dbe Add a Additional Example that showcases the usage of torch.autograd.functional.jacobian (#155683)
Fixes #132140

As described in the issue, I've added an example that showcases higher-dimensional inputs and outputs, batched inputs, and the vectorize=True option with `torch.autograd.functional.jacobian`.

Could you please review?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155683
Approved by: https://github.com/soulitzer
2025-06-12 19:46:55 +00:00
e6d71f3789 Support re-sharding for safetensors checkpoints (#154519)
This change will add the ability to support re-sharding for hf safetensors checkpoints.
This is done by adding more metadata when saving each file. This metadata captures the size and offset of the saved shard. This can be used to re-shard on load by using this information to create the chunks belonging to TensorStorageMetadata class.

Differential Revision: [D75226344](https://our.internmc.facebook.com/intern/diff/D75226344/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154519
Approved by: https://github.com/saumishr
2025-06-12 19:38:29 +00:00
d3da03d6fa [2/n]passing event log handler to record function calls (#155457)
Summary: This diff modifies the elastic agent's API to pass the event log handler to the record function calls. This change enables the elastic agent to log events to a specific destination, improving the monitoring and debugging capabilities of the distributed training process.

Test Plan:
unit tests

ran an e2e training job.

Differential Revision: D75194115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155457
Approved by: https://github.com/d4l3k
2025-06-12 19:35:08 +00:00
e085012335 Fix #155020 - rst2markdown for export.rst (split PR) (#155753)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155020. This PR is split into 3 to pass sanity check. This is the 3rd one.

Docs comparison (check out the 'new' whenever docs build)

1. export ([old](https://docs.pytorch.org/docs/main/export.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/export.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155753
Approved by: https://github.com/sekyondaMeta
2025-06-12 19:30:52 +00:00
4bb936d8b7 refresh expected results (#155817)
Some changes landed while the test was recently unstable, without updating the expected results.
<img width="564" alt="Screenshot 2025-06-12 at 9 26 32 AM" src="https://github.com/user-attachments/assets/9a83f18b-f2a8-485d-a58e-67d8c161eb18" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155817
Approved by: https://github.com/yushangdi
2025-06-12 19:14:21 +00:00
7986c0dba6 rename distributed.rst to md (#155767)
Fixes #155019

For sanity checks, split PR to have this one only include distributed.rst -> distributed.md

Preview -> [distributed.md](https://docs-preview.pytorch.org/pytorch/pytorch/155767/distributed.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155767
Approved by: https://github.com/sekyondaMeta
2025-06-12 18:42:15 +00:00
bcad962550 [BE][Testing] Delete some unused code (#155760)
- Fix typo in class name `OpenRgistration`->`OpenRegistration`
- Use existing `common` alias of `torch.testing._internal.common_utils`, i.e. `s/torch.testing._internal.common_utils.markDynamoStrictTest/common.markDynamoStrictTest/`
- Remove `TEST_CUDA` and `TEST_ROCM`, which are unused in that file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155760
Approved by: https://github.com/albanD
2025-06-12 18:41:53 +00:00
fac0cc16ef [scan] fix doc of scan and list the restrictions. (#155577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155577
Approved by: https://github.com/zou3519
2025-06-12 18:22:28 +00:00
a1257446f8 [AOTInductor] Memory leak fix for Fallback Kernels (#155642)
Summary:
We generate AtenTensorHandles for fallback kernels regardless of the arg
type. If we indeed fall back, we will regenerate the AtenTensorHandles,
which causes the first handle that was generated to never be recycled, so a
memory leak occurs.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_fallback_mem_leak

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155642
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-06-12 17:42:56 +00:00
0d3d84d866 [CD] Windows Magma build 12.9 and cuda scripts (#155799)
Scripts needed to build Magma and CUDA on windows
Same as https://github.com/pytorch/pytorch/pull/146653
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155799
Approved by: https://github.com/jeanschmidt
2025-06-12 17:41:24 +00:00
430cc1c636 Run tests on Amazon EC2 M8g Instances (#153940)
Requires machines configured here: https://github.com/pytorch/test-infra/pull/6642

This adds additional test runs against AWS Graviton4 processors, alongside existing runs against AWS Graviton3 and AWS Graviton2 processors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153940
Approved by: https://github.com/fadara01, https://github.com/malfet
2025-06-12 17:33:08 +00:00
522a18bd6c Fix provenance unit test (#155747)
Summary: Fix the test to adapt to the provenance tracking added in D75837494

Test Plan:
```
 buck2 run @//mode/dev-nosan  fbcode//caffe2/test:fx -- -r test_graph_provenance
```

Rollback Plan:

Differential Revision: D76466778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155747
Approved by: https://github.com/YUNQIUGUO
2025-06-12 17:26:43 +00:00
50d8168c8b [DTensor] Support in gradient placement for local_map() (#155181)
Support the `in_grad_placements` argument in torch.distributed.tensor.experimental.local_map(). The argument helps enforce the placement of the gradient of the input DTensor.
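
A rough sketch of how the new argument plugs in. This is a hedged illustration, not code from this PR: the mesh setup, placements, and the exact calling convention used here are assumptions.

```
# Illustrative sketch; intended to be launched with torchrun --nproc_per_node=2.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard
from torch.distributed.tensor.experimental import local_map

mesh = init_device_mesh("cuda", (2,))

def scaled(x):  # operates on the local shard
    return x * 2

fn = local_map(
    scaled,
    out_placements=[Shard(0)],
    in_placements=([Shard(0)],),
    in_grad_placements=([Shard(0)],),  # enforce the placement of the input's gradient
    device_mesh=mesh,
)
```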

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155181
Approved by: https://github.com/wanchaol
2025-06-12 17:07:04 +00:00
6c0b42fd2f [inductor][cutlass backend] Log prescreening elapsed time (#155508)
Differential Revision: [D76311352](https://our.internmc.facebook.com/intern/diff/D76311352/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155508
Approved by: https://github.com/jingsh
2025-06-12 16:48:52 +00:00
c1ae768baa Basic MTIA ATen CMake (#155477)
Summary: Basic ATen CMake

Differential Revision: D75203592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155477
Approved by: https://github.com/andyanwang, https://github.com/cyyever
2025-06-12 16:29:32 +00:00
f4376cac54 unify symbolic_shapes and sizevars dynamic shapes APIs naming 1 (#154774)
Inductor has a set of APIs that allow performing symbolic evaluations similar to those of symbolic shapes,
but they operate on sympy expressions instead of symnodes. The namings are not consistent; this stack
makes them consistent.

Step 1: unify the statically_known_true naming for a consistent experience.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154774
Approved by: https://github.com/drisspg, https://github.com/bobrenjc93, https://github.com/eellison
2025-06-12 16:11:55 +00:00
9df2e8020f fix code indentation for fx.md (#155764)
Fixes https://github.com/pytorch/pytorch/issues/155023
Related PR: #155482

Description:
As discussed here https://github.com/pytorch/pytorch/pull/155482#pullrequestreview-2918032289, I removed indentation for python code blocks as a follow-up modification for fx.md

Checklist:

- [x] The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155764
Approved by: https://github.com/svekars
2025-06-12 16:02:33 +00:00
75824035d3 [dynamic shapes] skip fused linear path if not definitely contiguous (#155051)
Falls back to non-fused linear -> add bias path for non-contiguous tensors with unbacked sizes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155051
Approved by: https://github.com/laithsakka
2025-06-12 15:55:21 +00:00
51560797ce [CI] Reuse old whl: switch default to always (#155572)
Switch default to always reuse old whl

I have a few worries about API rate limits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155572
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman
2025-06-12 15:43:29 +00:00
62fa3f5aeb Support tuning of _grouped_mm (#153953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153953
Approved by: https://github.com/ngimel
2025-06-12 15:39:35 +00:00
6b3eef6d31 [cutlass backend] Only consider to use re worker if nvcc doesn't exist (#155745)
Differential Revision: D76463340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155745
Approved by: https://github.com/masnesral
2025-06-12 15:23:52 +00:00
851a6fa82d [MPS] Migrate softshrink (forward and backward) to Metal kernel (#155586)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155586
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462, #155479, #155571
2025-06-12 15:02:43 +00:00
2a3b41cbd0 Revert "[CI] Use setup-python from for Mac tests (#155698)"
This reverts commit 2b9d638e3333e6e9ae324e1486774e83292e1883.

Reverted https://github.com/pytorch/pytorch/pull/155698 on behalf of https://github.com/malfet due to It causes weird flaky failures in MPS and do not upload usage logs anymore ([comment](https://github.com/pytorch/pytorch/pull/155698#issuecomment-2967120676))
2025-06-12 14:42:32 +00:00
0fd711df19 [export] Allow user frame to be None when symbolic shape tries to get stacktrace. (#155744)
Summary: Fixing https://github.com/pytorch/pytorch/issues/155605

Test Plan:
CI

Rollback Plan:

Differential Revision: D76463358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155744
Approved by: https://github.com/angelayi
2025-06-12 14:36:29 +00:00
dd1b6621bc Remove C10_DEPRECATED references in c10 (#151058)
Summary:
Revive https://github.com/pytorch/pytorch/pull/138406.  Only limit the scope to files in c10.

Summary from the original PR,
```
Looking in the code I see

// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
But looking at the MSVC C++ support table I see that the [[deprecated]] attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 or later.

Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support [[deprecated]].

Therefore, since we are finished deprecating old MSVCs we can deprecate C10_DEPRECATED.
```

Test Plan: CI

Differential Revision: D72762767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151058
Approved by: https://github.com/r-barnes
2025-06-12 13:38:03 +00:00
d632cf2cc9 [Easy][Code Clean] Remove the unused and undefined function in pickler (#155772)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155772
Approved by: https://github.com/malfet
2025-06-12 13:03:36 +00:00
8e8d4b13b0 [XPU] Simplify XPU make triton by install from PyTorch source (#155675)
Remove install from source code build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155675
Approved by: https://github.com/atalman
2025-06-12 13:02:23 +00:00
132babe7e0 [user triton] dynamo support for new host-side TMA API (#155662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155662
Approved by: https://github.com/aakhundov
ghstack dependencies: #155510
2025-06-12 12:56:23 +00:00
9cced33c7c [BE]: Update cudnn to 9.10.2.21 (#155576)
Update to CUDNN 9.10.2.21
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155576
Approved by: https://github.com/eqy, https://github.com/atalman
2025-06-12 12:50:36 +00:00
c199a4d0fd Move non inductor workflows cuda 12.6->cuda 12.8 (#155234)
Move non inductor workflows cuda 12.6->cuda 12.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155234
Approved by: https://github.com/Skylion007, https://github.com/zxiiro, https://github.com/cyyever, https://github.com/malfet
2025-06-12 12:42:34 +00:00
eecaa0bbc6 [Multiprocesing] Fix _release_ipc_counter missing in rebuilding cuda ipc tensor with UntypedStorage (#155312)
Fixes https://github.com/pytorch/pytorch/issues/155311

To avoid `torch.multiprocessing.reductions::rebuild_cuda_tensor` failing on untyped storage, this fix adds `_release_ipc_counter` to UntypedStorage, like the previous legacy typed storage had.

e2d141dbde/torch/storage.py (L1466-L1469)
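
A minimal sketch of the CUDA IPC path the fix touches; whether it reproduces the exact failure from the linked issue depends on the setup there, so treat it as an assumption-laden illustration:

```
# Sharing a CUDA tensor across processes exercises
# torch.multiprocessing.reductions.rebuild_cuda_tensor in the receiver,
# which needs _release_ipc_counter on the (now untyped) storage class.
import torch
import torch.multiprocessing as mp

def consumer(q):
    t = q.get()  # rebuild_cuda_tensor runs here
    print(t.sum())

if __name__ == "__main__":
    mp.set_start_method("spawn")
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    q.put(torch.ones(8, device="cuda"))
    p.join()
```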

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155312
Approved by: https://github.com/mikaylagawarecki
2025-06-12 10:41:58 +00:00
0029259bdf Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757)
address https://github.com/pytorch/pytorch/issues/153303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757
Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel
2025-06-12 09:58:15 +00:00
d3d655ad14 [Hierarchical-Compile] Hash int args in addition to input shapes (#155655)
Fixes Swsl_resnext101_32x16d in TIMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155655
Approved by: https://github.com/anijain2305
2025-06-12 06:35:12 +00:00
c3ecabf059 [inductor][triton pin] add support for new TMA API for mm.py templates (#155723)
Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488

For mm.py templates, this PR adds support for using the new APIs when they are available (and otherwise falls back to the experimental APIs).

For flex_attention, we'll remove TMA support for Triton 3.2 and 3.3 (versions of triton that don't have the new API).

For mm_scaled_grouped.py, https://github.com/pytorch/pytorch/pull/150944 will remove TMA support for Triton 3.2.

Note: we attempted this earlier with https://github.com/pytorch/pytorch/pull/154858, but this broke TMA usage in Triton 3.2.

Differential Revision: [D76444471](https://our.internmc.facebook.com/intern/diff/D76444471)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155723
Approved by: https://github.com/NikhilAPatel
2025-06-12 06:25:47 +00:00
2b9d638e33 [CI] Use setup-python from for Mac tests (#155698)
Instead of `setup-miniconda`
- Remove `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure what other actions are messing with the PATH environment variable)
- Skip `TestMultiprocessing.test_fs_sharing` as even though it completes, it hangs on the shutdown both in CI and in all local setups I have
- Skip `TestCppExtensionOpenRgistration.test_base_device_registration` as it hangs on the shutdown as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
ghstack dependencies: #155476, #155493, #155601, #155515, #155697
2025-06-12 04:58:00 +00:00
57e4d7b5cc [nativert] Move DelegateExecutor to PyTorch core (#155581)
Summary:
Moves DelegateExecutor base class to PyTorch core. It provides the extension point of backend delegation for NativeRT.
Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
This is only a virtual base class. So relying on internal CI is sufficient.

Rollback Plan:

Differential Revision: D76351984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155581
Approved by: https://github.com/zhxchen17
2025-06-12 04:33:31 +00:00
a9d5157e25 [dynamo] Use BINARY_SUBSCR for pre-graph bytecode for regular dict accesses (#155727)
The vLLM profiler sets with_stack=True, which shows dict_getitem in the profile, both inflating the numbers and confusing compile users. This PR keeps BINARY_SUBSCR for regular dicts, while using `dict.__getitem__` only for dict subclasses.

Using binary_subscr is a little bit faster, but not enough to make any major latency improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155727
Approved by: https://github.com/zou3519, https://github.com/StrongerXi, https://github.com/jansel
2025-06-12 04:02:29 +00:00
c9e9a0c823 [inductor][invoke_subgraph] Mark invoke_subgraph outputs as user_visible to constrain output strides (#155395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155395
Approved by: https://github.com/zou3519
2025-06-12 03:58:16 +00:00
9f5153b1a4 Preserve GraphModule node stack trace after torch package deserialization re-tracing (#155638)
Summary:
Currently node.meta["stack_trace"] is not preserved when we torch package/load a GraphModule, which means the original stack trace is lost. When we re-trace the packaged graph module, we just get a stack trace like fx-generated._0......

This adds node.meta["stack_trace"] to the torch-packaged graph module.

Test Plan:
```
buck2 run @//mode/dev-nosan fbcode//caffe2/test:package -- -r  TestPackageFX
```

Rollback Plan:

Differential Revision: D76379692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155638
Approved by: https://github.com/angelayi
2025-06-12 03:48:27 +00:00
ce9ba071fd [BE] Fix warning in open_registration_extension.cpp (#155755)
Namely
```
/Users/nshulga/git/pytorch/pytorch/test/cpp_extensions/open_registration_extension.cpp:306:33: warning: left operand of comma operator has no effect [-Wunused-value]
  306 |   at::Tensor first = at::empty((2,3)).to(at::DeviceType::PrivateUse1);

```

In other words, switching between Python and C++ is hard: in Python `(2, 3)` creates a tuple, while in C/C++ the comma operator makes it just the integer literal 3.

P.S. I could have vibe-coded the fix with Claude: https://claude.ai/share/82479e88-84cb-4299-aa2f-dafd28ee2d55

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155755
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-06-12 03:01:30 +00:00
d96dec8415 [export] Fix serialization for call_torchbind hop with as_none argument (#155647)
Summary:
As title.

D75251816 broke one internal test. This diff fixes it.

Test Plan: Internal CI

Differential Revision: D76383202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155647
Approved by: https://github.com/ydwu4
2025-06-12 02:59:03 +00:00
b00b641ff1 [Docs] Convert to markdown: accelerator.rst, amp.rst, autograd.rst, backends.rst, benchmark_utils.rst (#155762)
Fixes #155013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155762
Approved by: https://github.com/svekars
2025-06-12 02:55:06 +00:00
b6f84b3b0f [Inductor][CPU] Use AMX-based microkernels when M > 4 for GEMM template for INT4 weight (#155444)
**Summary**
GEMM templates for INT4 weights are used for lowering `aten._weight_int4pack_mm_for_cpu` with Inductor when max-autotune is on. Currently, AMX-based microkernels are used only when M >= 16 if the input tensor has shape [M, K]. However, we find that the AMX kernel brings a performance benefit when 4 < M < 16. For example, on a 6th-gen Intel(R) Xeon(R) platform, E2E latency can be improved by up to >20% when running Llama-3.1-8B on 32 cores with M = 8. So, this PR changes the threshold so that AMX is used when M > 4.

**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155444
Approved by: https://github.com/sanchitintel, https://github.com/leslie-fang-intel
2025-06-12 02:28:48 +00:00
212575f994 [ca] Annotate AccumulateGrad branching and add polyfill tests (#155289)
Annotates AccumulateGrad and tracks the semantics of AccumulateGrad's polyfill, except for Scenario 1.4 (Cloning MKLDNN new_grad) and Scenario 2.2 (Vmap-incompatible).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155289
Approved by: https://github.com/jansel, https://github.com/albanD
2025-06-12 02:10:52 +00:00
d84efde3f0 Move _storage_Use_Count to be generic (#155451)
# Motivation
`torch._C._storage_Use_Count` should be a generic API that is not aware of device type. It is also used in 337cd7c53d/torchtune/training/_activation_offloading.py (L323) to do some memory optimization.
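
A hedged sketch of the call pattern, mirroring the torchtune usage linked above (`_storage_Use_Count` is a private API, so the `_cdata` handle plumbing here is an assumption based on that call site):

```
import torch

t = torch.randn(4)
alias = t.view(2, 2)                  # shares the same StorageImpl
handle = t.untyped_storage()._cdata   # raw handle passed to the private API
print(torch._C._storage_Use_Count(handle))
```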

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155451
Approved by: https://github.com/albanD
2025-06-12 01:39:04 +00:00
8372d0986a Revert "[PT2][partitioners] Add aten.split to view_ops list (#155424)"
This reverts commit e1db10e05aa720aef1989773adcf48f311bcf920.

Reverted https://github.com/pytorch/pytorch/pull/155424 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cpu_repro.py::CPUReproTests::test_transpose_with_norm [GH job link](https://github.com/pytorch/pytorch/actions/runs/15596830833/job/43931044625) [HUD commit link](e1db10e05a) but idk how, reverting to see if it fixes the problem ([comment](https://github.com/pytorch/pytorch/pull/155424#issuecomment-2964717706))
2025-06-12 01:38:34 +00:00
9b122aab5d Fix set per proc memory fraction when running tests (#155631)
env setting needs to happen before pool creation for it to take effect

In theory this should fix some OOMs and also cause some OOMs, but this PR is green so idk

alt options: use initializer?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155631
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman
2025-06-12 01:28:08 +00:00
8ad6197b46 [draft export] avoid storing intermediate real tensors in proxies (#154630)
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.

While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict there are 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.

By avoiding storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation.

Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-06-12 01:18:57 +00:00
4e19477196 [nativert] Move Pytree (#155136)
Summary: fbcode/sigmoid/core/common -> fbcode/caffe2/torch/nativert/common

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:pytree_test
```

OSS CI

Rollback Plan:

Differential Revision: D75965059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155136
Approved by: https://github.com/zhxchen17, https://github.com/XuehaiPan, https://github.com/zou3519
2025-06-12 01:10:34 +00:00
ee5c2908cb [dtensor] refactor PlacementStrategy -> OpSpec, move utils to OpSchema (#155592)
As titled. It's sometimes confusing to use PlacementStrategy as a name: we also have OpStrategy and TupleStrategy, and the latter two contain the former, so it is better to make the naming clearer.

Renaming PlacementStrategy -> OpSpec as it is an operator spec that
contains output_spec + input_specs.

Also found some utils that can be merged into OpSchema, so they are included in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155592
Approved by: https://github.com/awgu
2025-06-12 00:51:36 +00:00
7485ef078f Run torch.compile benchmark more frequently on H100 (#155719)
We have more capacity now with 20+ `linux.aws.h100` runners, and half of them are idle. Running the benchmark more frequently would utilize these runners better and provide early signals multiple times per day. Running every 8 hours to start with. The workflow usually finishes within 5 hours: https://github.com/pytorch/pytorch/actions/runs/15578331612/job/43878878434
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155719
Approved by: https://github.com/atalman
2025-06-12 00:24:21 +00:00
9e9484d022 [SymmMem] Enable NVSHMEM for Triton (#155506)
(This is an **Experimental** feature)
Allow Triton kernels to invoke NVSHMEM device functions.

### Example Triton program
Key parts:
- Call `nvshmem.enable_triton()` to initialize;
- Call `nvshmem.putmem_block` in Triton kernel;
- Add `extern_libs` kwarg at kernel invocation.

```
import torch.distributed._symmetric_memory._nvshmem_triton as nvshmem

@triton.jit
def put_kernel(
    dst_ptr,
    src_ptr,
    numel: tl.constexpr,
    peer: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
):
    nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)

if __name__ == "__main__":
    # Enable NVSHMEM for Triton
    nvshmem_lib = nvshmem.enable_triton()

    # Use torch Symmetric Memory to allocate Symmetric tensors
    ...

    peer = 1 - rank
    if rank == 0:
        kernel = put_kernel[(1, 1, 1)](
            dst_ptr,
            src_ptr,
            numel=numel,
            peer=peer,
            BLOCK_SIZE=BLOCK_SIZE,
            extern_libs=nvshmem_lib,
        )

    dist.barrier()
    if rank == 1:
        print(f"Rank {rank}: received {out=}")
```

### Test output:
```
$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put
Rank 0: writing value 5 to Peer 1
Rank 1: received out=tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:1', dtype=torch.int8)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155506
Approved by: https://github.com/ngimel, https://github.com/fegin, https://github.com/fduwjj
2025-06-12 00:22:49 +00:00
cf9878d7a2 Fix #155022 rst to markdown conversion (#155540)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155022

Docs comparison (check out the 'new' whenever docs build)

1. func.ux_limitations ([old](https://docs.pytorch.org/docs/main/func.ux_limitations.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/func.ux_limitations.html))
2. func.whirlwind_tour ([old](https://docs.pytorch.org/docs/main/func.whirlwind_tour.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/func.whirlwind_tour.html))
3. future_mod ([old](https://docs.pytorch.org/docs/main/future_mod.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/future_mod.html))
4. futures ([old](https://docs.pytorch.org/docs/main/futures.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/futures.html))
5. fx.experimental ([old](https://docs.pytorch.org/docs/main/fx.experimental.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/fx.experimental.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155540
Approved by: https://github.com/AlannaBurke, https://github.com/svekars
2025-06-12 00:21:22 +00:00
7918978653 [dynamo] uploaded full json file of all unimplemented_v2() calls currently in repository (#155758)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155758
Approved by: https://github.com/williamwen42
2025-06-12 00:17:28 +00:00
a6210fd07b [c10d] Enhance get_process_group_ranks() to accept group=None (#154902)
Summary: This diff enhances the `get_process_group_ranks()` function to accept `group=None` as an optional argument. This allows the function to return all ranks associated with the default process group if no group is specified.
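
A quick sketch of the new calling convention (assumes the default process group has already been initialized):

```
import torch.distributed as dist

# dist.init_process_group(...) must have been called already
ranks = dist.get_process_group_ranks(group=None)  # ranks of the default process group
assert ranks == dist.get_process_group_ranks(dist.group.WORLD)
```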

Test Plan:
contbuild & OSS CI

Rollback Plan:

Differential Revision: D75817800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154902
Approved by: https://github.com/wz337
2025-06-11 23:41:03 +00:00
bd3c32916c [cuDNN] Enabled dilation for deterministic convolutions in cuDNN (#154292)
Provides order-of-magnitude speedup over fallback impl.

https://github.com/pytorch/pytorch/issues/28777
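
A small sketch of the case affected: a dilated convolution under deterministic mode, which per this PR can now take the cuDNN path instead of the fallback (shapes are illustrative):

```
import torch

torch.use_deterministic_algorithms(True)
conv = torch.nn.Conv2d(3, 8, kernel_size=3, dilation=2, device="cuda")
x = torch.randn(1, 3, 64, 64, device="cuda", requires_grad=True)
conv(x).sum().backward()
```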

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154292
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-11 23:35:52 +00:00
c13e725edd Updates to HFStorageReader to use TensorStorageMetadata instead of BytesStorageMetadata (#154518)
As we prepare to support re-sharding, the current approach of using BytesStorageMetadata to read safetensors won't work anymore. Before, we didn't need to read the metadata of the safetensors file from its header because we were just loading the contents of the file directly into tensors with safetensor.load(), which would handle the metadata and deserialization. But now, in preparation for handling re-sharding, we need to read the metadata directly from the header of the safetensors file and store it in TensorStorageMetadata objects so that we can perform re-sharding. Re-sharding won't currently work, as we need extra metadata to be stored on each save; that will be added in a subsequent PR.
This PR also adds an integration test in addition to the unit tests.
It also removes the HfFileSystem import because that's only needed if users are using HfFileSystem, but we want to support any backend.

Differential Revision: [D74891998](https://our.internmc.facebook.com/intern/diff/D74891998/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154518
Approved by: https://github.com/saumishr
2025-06-11 23:35:05 +00:00
1b032384b1 Convert rst files to md (#155369)
Fixes #155021
Fixes #155158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155369
Approved by: https://github.com/svekars, https://github.com/malfet
2025-06-11 23:00:52 +00:00
48921721d8 [MPS] Fix binary builds (#155733)
Introduced by https://github.com/pytorch/pytorch/pull/155611

All functions in those headers must be static and inline
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155733
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-06-11 22:55:33 +00:00
c1446e1e9d [easy] revert unintended changes from #152579 (#155614)
Summary:
I accidentally removed a test and a small change in my pr:
https://github.com/pytorch/pytorch/pull/152579
- `test_load_package_multiple_gpus` from https://github.com/pytorch/pytorch/pull/152093

Rollback Plan:

Differential Revision: D76370555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155614
Approved by: https://github.com/jingsh
2025-06-11 22:54:58 +00:00
4a954fc185 [refactor] make do_auto_functionalize_v2 take HopInstance (#154192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154192
Approved by: https://github.com/zou3519
ghstack dependencies: #155261, #154072, #154191
2025-06-11 22:52:37 +00:00
d6be87648f [hop schema] add schema.tree_spec to support pytree inputs (#154191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154191
Approved by: https://github.com/zou3519
ghstack dependencies: #155261, #154072
2025-06-11 22:52:37 +00:00
6ded656aee [hop] auto functionalize invoke_subgraph (#154072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154072
Approved by: https://github.com/zou3519
ghstack dependencies: #155261
2025-06-11 22:52:28 +00:00
20fb8f5d1f [refactor] make check input alias and mutation easier to use (#155261)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155261
Approved by: https://github.com/zou3519
2025-06-11 22:52:21 +00:00
61e13782dd [inductor] handle -1 for pointless view pairs (#155295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155295
Approved by: https://github.com/laithsakka, https://github.com/jansel
2025-06-11 22:20:36 +00:00
458cc7213b DOC: Convert to markdown: mobile_optimizer.rst, model_zoo.rst, module_tracker.rst, monitor.rst, mps_environment_variables.rst (#155702)
Fixes #155026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155702
Approved by: https://github.com/sekyondaMeta, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-11 22:16:04 +00:00
e1db10e05a [PT2][partitioners] Add aten.split to view_ops list (#155424)
Summary: Add `aten.split` to view_ops list in partitioners.py

Test Plan:
na

Rollback Plan:

Differential Revision: D76011951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155424
Approved by: https://github.com/xuanzhang816
2025-06-11 22:12:13 +00:00
f59c76b549 Revert "[BE]: Update cudnn to 9.10.2.21 (#155576)"
This reverts commit 2d3615f577894c7a117a55e85bb8371bb598ec50.

Reverted https://github.com/pytorch/pytorch/pull/155576 on behalf of https://github.com/malfet due to breaks the same test again (I remember there were a version that adjusted tolerances), see bc3972b80a/1 ([comment](https://github.com/pytorch/pytorch/pull/155576#issuecomment-2964404710))
2025-06-11 22:03:45 +00:00
bc3972b80a [reland] Add stack_trace on make_fx (#155486)
Summary:
Previously, we only added stack traces in class _ModuleStackTracer(PythonKeyTracer) for non-strict export. I moved this stack trace logic to the parent class PythonKeyTracer; this way the graph traced from a Module using make_fx will have stack_trace as well.

Motivation: we've observed some use cases where users first use make_fx on the Module, and then run export on the resulting graph. If the result of make_fx doesn't have a stack trace, the stack trace information is lost.

**User needs to turn this on by passing in `stack_trace=True` to make_fx. We don't make this the default option since this might increase inductor compilation time (`make_fx` is used in inductor to trace graph patterns for pattern matching). It's also turned on if `_inductor.config.trace.enabled` is True.**

**preserving stack trace is on by default for ModuleStackTracer, which is used for non-strict export.**

Test Plan:
```
buck run test:test_export -- -r  test_stack_trace
buck run fbcode//caffe2/test/dynamo:test_dynamo -- -k test_autocast_ordering
```

Rollback Plan:

Differential Revision: D76298692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155486
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-06-11 21:27:43 +00:00
9bd0830ed8 [dynamic shapes] guard_or_false for cat, repeat (#155290)
Summary:
assumes:
- specified repeats are non-negative
- 1d cat arguments like [u0] aren't non-zero sized (replaces existing size-oblivious)

Test Plan:
test_export

Rollback Plan:

Differential Revision: D76092011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155290
Approved by: https://github.com/laithsakka
2025-06-11 21:03:32 +00:00
4609699bfd [MPS] Migrate leaky_relu (forward and backward) to Metal kernel (#155571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155571
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462, #155479
2025-06-11 20:58:46 +00:00
f8d93b3783 [MPS] Migrate hardswish (forward and backward) to Metal kernel (#155479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155479
Approved by: https://github.com/kulinseth, https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462
2025-06-11 20:58:46 +00:00
db5970c1a6 [coreml-backend-tool] fix pytorch-backended issue on new coremltools (#155543)
Summary:
The new coremltools exports an mlpackage instead of an mlmodel by default. When we use the new coremltools 8.0 to convert to the backend, the error is

```
Exception: MLModel of type mlProgram cannot be loaded just from the model spec object. It also needs the path to the weights file. Please provide that as well, using the 'weights_dir' argument.
```

Test Plan:
tested with internal workflow

Rollback Plan:

Differential Revision: D76325462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155543
Approved by: https://github.com/shoumikhin
2025-06-11 20:52:26 +00:00
cec264c8c6 remove single remaining gso from compute_stride (#155635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155635
Approved by: https://github.com/ColinPeppler
2025-06-11 20:36:21 +00:00
cc09d3a5ba remove float args benchmark (#155674)
This benchmark is very sensitive; removing it for now until we make it better.

<img width="755" alt="Screenshot 2025-06-11 at 12 01 25 AM" src="https://github.com/user-attachments/assets/01a45ae5-2028-42a2-b819-c30d4db3b5d4" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155674
Approved by: https://github.com/bdhirsh, https://github.com/bobrenjc93
2025-06-11 20:34:58 +00:00
2d3615f577 [BE]: Update cudnn to 9.10.2.21 (#155576)
Update to CUDNN 9.10.2.21
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155576
Approved by: https://github.com/eqy, https://github.com/atalman
2025-06-11 20:32:07 +00:00
94ae615337 [trymerge] Error on ghstack commit with multiple PRs (#154941)
see https://github.com/pytorch/pytorch/issues/154427#issuecomment-2932941343 for context

Errors if we do not find exactly 1 match in the ghstack commit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154941
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman
2025-06-11 20:26:50 +00:00
b7a73a2cdb Convert to markdown: export.programming_model.rst (#155659)
Converts only export.programming_model.rst to markdown

Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155020, but split into a second PR to pass sanity check

Docs comparison (check out the 'new' whenever docs build)

1. export.programming_model ([old](https://docs.pytorch.org/docs/main/export.programming_model.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155659/export.programming_model.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155659
Approved by: https://github.com/sekyondaMeta
2025-06-11 20:23:46 +00:00
1b6772a90f A small fix in do_bench_using_profiling (#155500)
Summary: Results: https://docs.google.com/document/d/1B_4rtiDFPH_jV3VpnqLPnInwDMpF7yX29G82UoJTcu8/edit?tab=t.0

Test Plan:
```
buck2 run mode/opt  -c fbcode.enable_gpu_sections=true ai_acceleration/float8/benchmarks/bench:bench_fp8_shapes_eval 2>&1 | tee output44.txt
```

Rollback Plan:

Differential Revision: D76298690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155500
Approved by: https://github.com/yoyoyocmu, https://github.com/nmacchioni
2025-06-11 20:06:19 +00:00
1dd0b1d12b Unbreak torch.is_vulkan_available() on Mac (re-send of #154675, please stamp) (#155595)
This is a new PR duplicating #154675 due to merge issues with that PR coming from my old (now updated) version of ghstack.

I am a Vulkan noob, but this extension and flag seem to be necessary. See "Encountered VK_ERROR_INCOMPATIBLE_DRIVER" at https://vulkan-tutorial.com/Drawing_a_triangle/Setup/Instance .

(For anyone trying to repro at home, I have the following homebrew packages installed, not all of which may be necessary: molten-vk, vulkan-headers, vulkan-loader, vulkan-tools, vulkan-utility-libraries. I also have VK_ICD_FILENAMES set to /opt/homebrew/etc/vulkan/icd.d/MoltenVK_icd.json, and I built PyTorch with USE_VULKAN=1. Making sure vkcube works helped me debug this setup.)
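
For reference, the check exercised here (trivial sketch; expects the Vulkan build and MoltenVK setup described above):

```
import torch

print(torch.is_vulkan_available())
```
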
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155595
Approved by: https://github.com/malfet
2025-06-11 19:51:35 +00:00
d1947a8707 Migrate from lru_cache to cache (#155613)
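For context, the two spellings are equivalent since Python 3.9: `functools.cache` is `lru_cache(maxsize=None)`. A tiny illustration:

```
from functools import cache, lru_cache

@cache                     # new spelling: unbounded memoization
def square(x):
    return x * x

@lru_cache(maxsize=None)   # old spelling, exactly equivalent
def square_old(x):
    return x * x

assert square(3) == square_old(3) == 9
```
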
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613
Approved by: https://github.com/ezyang
ghstack dependencies: #155612
2025-06-11 19:44:18 +00:00
f80a61adf5 Revert "[dynamo] added github_cli to detect unimplemented_v2 calls (#155610)"
This reverts commit 5dd07c70e53a86b73f49711b8186d86dc4f1b32a.

Reverted https://github.com/pytorch/pytorch/pull/155610 on behalf of https://github.com/malfet due to Looks like it fails on every pull request, based on https://github.com/pytorch/pytorch/actions/workflows/check-unimplemented-calls.yml, but it does not run on trunk ([comment](https://github.com/pytorch/pytorch/pull/155610#issuecomment-2963929765))
2025-06-11 19:31:55 +00:00
1e373d02d5 [ONNX] Change deprecation message from 2.8 to 2.9 (#155580)
~~The PR: https://github.com/pytorch/pytorch/pull/152478 did not respect the release policy that the deprecation should happen after the deprecation message has been set for 2 releases. This PR postpone 2.8 to the rightful version 2.10.~~

~~NOTE: "as early as" 2.10 shall give ONNX users more time to adapt and provide feedback.~~

To follow the upcoming TorchScript deprecation, `torch.onnx.export` is expected to switch to dynamo=True (also turning on fallback=True for BC) in torch 2.9.
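
A hedged sketch of opting in today to what the sentence above describes as the future default; the keyword names follow that sentence and are assumed to match the current `torch.onnx.export` signature:

```
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
torch.onnx.export(model, args, "model.onnx", dynamo=True, fallback=True)
```
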
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155580
Approved by: https://github.com/justinchuby, https://github.com/tugsbayasgalan
2025-06-11 19:31:29 +00:00
3f29642ecf Update XLA pin (#155471)
Update pin after XLA PR https://github.com/pytorch/xla/pull/9312 landed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155471
Approved by: https://github.com/laithsakka
2025-06-11 19:16:52 +00:00
f8baec8984 Update auto-tuning support for _scaled_grouped_mm (#150944)
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-06-11 19:12:52 +00:00
6dfada220e [ca] better error message for subclasses not supported by FakeTensor (#155481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155481
Approved by: https://github.com/jansel
ghstack dependencies: #155473, #155570
2025-06-11 19:09:29 +00:00
5dcc718a77 [dynamo][ci] update PYTORCH_TEST_WITH_DYNAMO xfail/skips script for 3.13 (#155570)
No more 311 runners, tested by generating the files for the next PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155570
Approved by: https://github.com/zou3519
ghstack dependencies: #155473
2025-06-11 19:09:29 +00:00
87b002b6fb [ca] make torch.compile API respect ambient disable contexts (#155473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155473
Approved by: https://github.com/jansel
2025-06-11 19:09:29 +00:00
be124a61a4 [MPS] Migrate hardsigmoid (forward and backward) to Metal kernel (#155462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155462
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316
2025-06-11 19:09:23 +00:00
c04a4e7094 Add types to torch/utils/_triton.py (#155612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155612
Approved by: https://github.com/jamesjwu
2025-06-11 19:04:10 +00:00
2002e3a311 [Docs] Convert to markdown: torch.compiler_transformations.rst, torch.compiler.config.rst (#155347)
Part of changes #155040 (parent PR #155120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155347
Approved by: https://github.com/svekars
2025-06-11 18:55:30 +00:00
925fbfca27 Convert fx.rst to fx.md (#155482)
Part of changes #155023 (parent PR #155429)

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155482
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-11 18:46:35 +00:00
4d9d884c3f [NCCL] Expose new ncclConfig_t flags in 2.27 (#155379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155379
Approved by: https://github.com/Skylion007
2025-06-11 18:26:55 +00:00
247f83e0a4 [dynamic shapes] guard individual terms in sym_and; user-code-friendly sym_and/sym_or (#154737)
Previously, when processing `sym_and(a, b, c)`, symbolic shapes wouldn't individually process a, b, and c and store their implications. This would lead us to a data-dependent error on individual checks, e.g. we stored `u0 >= 0 & u0 <= 10`, but then couldn't figure out `u0 <= 10`.

This handles that, and also makes `sym_and/or` user-code friendly, for testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154737
Approved by: https://github.com/laithsakka
2025-06-11 18:08:06 +00:00
c1cbaca7fd [CI] Move setuptools requirements from conda to pip (#155697)
Needed for `import z5` to work without warning, otherwise
`LoggingTests.test_logs_out` will fail
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155697
Approved by: https://github.com/atalman
ghstack dependencies: #155476, #155493, #155601, #155515
2025-06-11 18:03:18 +00:00
3a43dba21f Revert "[cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)"
This reverts commit dc5e8f7999cccb51efcf0f5fe197a740a918c73d.

Reverted https://github.com/pytorch/pytorch/pull/154170 on behalf of https://github.com/malfet due to It broke ROCM, see c75c732481/1 ([comment](https://github.com/pytorch/pytorch/pull/154170#issuecomment-2963708109))
2025-06-11 18:01:08 +00:00
c75c732481 [CI] Disable ET tests (#155708)
I'm tired of seeing red on PRs and it has been consistently broken since May 30th per 59eb61b2d1/10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155708
Approved by: https://github.com/clee2000, https://github.com/atalman
2025-06-11 17:56:52 +00:00
59eb61b2d1 [inductor] Improve GEMM logging to display batch size for batched operations (#155544)
Improves the GEMM overview logging in PyTorch Inductor to properly display batch size information for batched matrix operations like `torch.bmm` and `torch.baddbmm`.

**Fixes #155307**

## Problem

The current GEMM logging for `torch.bmm` shows:
```python
# Repro
import os
os.environ["TORCH_LOGS"] = "inductor"
import torch

M, N, K = 1024, 1024, 1024
dtype = torch.bfloat16
A = torch.randn(10, M, K, device="cuda", dtype=dtype)
B = torch.randn(10, K, N, device="cuda", dtype=dtype)

compiled_model = torch.compile(torch.bmm, fullgraph=True)
_ = compiled_model(A, B)
```

**Before:**
```
Name                 | M                    | N                    | K                    | Count
----------------------------------------------------------------------------------------------------
aten.bmm             | 1024                 | 1024                 | 1024                 | 1
----------------------------------------------------------------------------------------------------
```

The batch size (10) is missing from the logs, making it unclear what the actual operation dimensions were.

## Solution

**After:**
```
Name                           | B                    | M                    | N                    | K                    | Count
----------------------------------------------------------------------------------------------------------------------------------
aten.bmm                      | 10                   | 1024                 | 1024                 | 1024                 | 1
aten.mm                       | -                    | 1024                 | 1024                 | 1024                 | 2
----------------------------------------------------------------------------------------------------------------------------------
```

## Changes Made

### 1. Enhanced Parsing Logic in compile_fx.py
- Detects batched operations by checking if operation name ends with `'bmm'` or `'baddbmm'`
- For batched operations: takes last 4 parts as `batch, m, n, k`
- For non-batched operations: takes last 3 parts as `m, n, k`
- **Dedicated "B" column**: Added separate column for batch size instead of embedding in operation name
- Shows batch size for batched operations, shows "-" for non-batched operations

### 2. Updated All MM Operations for Consistency
- **bmm.py**:
  - Extract batch size from `mat1.get_size()[0]` for both `tuned_bmm` and `tuned_baddbmm`
  - Use positional counter keys: `aten.bmm_{batch_size}_{m}_{n}_{k}`
  - Enhanced log messages to include batch size information

- **mm.py**: Updated counter keys for consistency:
  - `aten.mm_{m}_{n}_{k}` (no batch dimension)
  - `aten.addmm_{m}_{n}_{k}` (no batch dimension)
  - `aten._int_mm_{m}_{n}_{k}` (no batch dimension)
  - `aten._scaled_mm.default_{m}_{n}_{k}` (no batch dimension)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155544
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
2025-06-11 16:57:40 +00:00
7b7cd56f5e [export] support linear & layer_norm unbacked (#155260)
## What
- use `definitely_contiguous_for_memory_format` instead of `is_contiguous` when the non-contiguous case is fine if we encounter a DDE.
- use ref's contiguous over Aten's contiguous because Aten's version will DDE and stop tracing. ref's version will use `definitely_contiguous_for_memory_format` and clone if there's a DDE.

## Example DDEs

- Fixed with `definitely_contiguous_for_memory_format` in `fast_binary_impl`
```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq((u0//387), 0) (unhinted: Eq((u0//387), 0)).  (Size-like symbols: u0)

Caused by: layer_norm = self.layer_norm(linear)  # caffe2/test/export/test_export.py:4566 in forward (_subclasses/fake_impls.py:1022 in fast_binary_impl)
```

- Fixed with `refs.contiguous` instead of calling aten's contiguous (that'd require a bigger re-write in Aten)
```
  File "c10/core/TensorImpl.h", line 825, in torch::autograd::THPVariable_contiguous(_object*, _object*, _object*)
  File "c10/core/SymbolicShapeMeta.h", line 87, in c10::TensorImpl::is_contiguous_default(c10::MemoryFormat) const
  File "c10/core/SymbolicShapeMeta.cpp", line 250, in c10::SymbolicShapeMeta::init_is_contiguous() const

torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(128*((u0//387)), 0) (unhinted: Eq(128*((u0//387)), 0)).  (Size-like symbols: u0)

Caused by: (_refs/__init__.py:3302 in native_layer_norm)
```

- Fixed with `definitely_contiguous_for_memory_format` in ref's contiguous
```
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression 387*((u0//387)) < 2 (unhinted: 387*((u0//387)) < 2).  (Size-like symbols: u0)

Caused by: (_prims_common/__init__.py:279 in is_contiguous)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155260
Approved by: https://github.com/laithsakka
ghstack dependencies: #155499
2025-06-11 16:47:34 +00:00
b49edc0d6c [Export] Fix some typos in docstring (#155485)
Summary: nit change, fix the doc string

Test Plan:
CI

Rollback Plan:

Differential Revision: D76297740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155485
Approved by: https://github.com/ColinPeppler
2025-06-11 16:44:38 +00:00
18bf6addc4 set_grad_enabled add str and repr for prints (#155681)
Fixes #86718

## Test Result

```python
>>> import torch
>>> torch.set_grad_enabled(False)
torch.autograd.grad_mode.set_grad_enabled(mode=False)
>>> print(torch.set_grad_enabled(False))
torch.autograd.grad_mode.set_grad_enabled(mode=False)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155681
Approved by: https://github.com/soulitzer
2025-06-11 16:01:03 +00:00
dc5e8f7999 [cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)
Fixes https://github.com/pytorch/pytorch/issues/153590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154170
Approved by: https://github.com/malfet
2025-06-11 15:20:48 +00:00
45c5a23237 Revert "Add Intel GPU info collection to the collect env script (#137846)"
This reverts commit 5264f8cd8d08272003298cdefe6bd60b1b8c80b4.

Reverted https://github.com/pytorch/pytorch/pull/137846 on behalf of https://github.com/malfet due to Just testing if it will fix PR time benchmarks signal ([comment](https://github.com/pytorch/pytorch/pull/137846#issuecomment-2963232606))
2025-06-11 15:18:47 +00:00
359e8f5d69 [CI] Use setup-python from test-infra to do MacOS builds (#155515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155515
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/atalman
ghstack dependencies: #155476, #155493, #155601
2025-06-11 15:11:38 +00:00
9328a7fb58 [triton pin][tests] refactor test_triton_kernel.py tests to test new & old API (#155510)
This splits out the tests so we can independently test both the new and old API.

Note: the new API doesn't work yet - we still need to fix those tests.

Differential Revision: [D76318840](https://our.internmc.facebook.com/intern/diff/D76318840)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155510
Approved by: https://github.com/oulgen
2025-06-11 13:52:15 +00:00
4c3da611c2 Add CUDA 12.9.1 x86 nightly binaries (#154980)
Adding CUDA 12.9.1 to nightly binaries matrix for linux (x86) builds.
Add sbsa and libtorch build docker images, builds addition will be follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154980
Approved by: https://github.com/eqy, https://github.com/atalman
2025-06-11 13:43:17 +00:00
013cf1e330 [MPS] Move expm1 op to Metal (#155611)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155611
Approved by: https://github.com/malfet
2025-06-11 13:06:14 +00:00
44df7cf28d [AOTI] Fix embed_kernel_binary error when max_autotune is ON (#155569)
Summary: Stop removing cubin files so that they won't be missing when max_autotune is ON.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155569
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-06-11 12:27:36 +00:00
f34ab1628b [Graph Partition] move cpu scalar tensor to gpu (#154464)
cudagraph does not support cpu tensors. In this PR, we update the graph by explicitly moving cpu tensors to gpu when profitable, relying on graph partition to split off this data copy, and cudagraphifying the remaining gpu ops.

This PR unblocked the graph partition + cudagraph on speech_transformer, leading to 39.5% speedup on inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Close: #119241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154464
Approved by: https://github.com/eellison, https://github.com/mlazos
2025-06-11 10:22:45 +00:00
eaceb243df [BE] Update the XPU support package to 2025.1.3 (#154346)
Fixes #153632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154346
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-06-11 09:46:18 +00:00
2585960b47 remove redundant type_id (#155539)
Those were added in https://github.com/pytorch/pytorch/pull/92229 to prevent confusion of overloads,
but the variants that accept SymBool were all removed in https://github.com/pytorch/pytorch/pull/112890
with the introduction of SymbolicShapeMeta.
Hence that dummy arg is not needed anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155539
Approved by: https://github.com/ezyang
2025-06-11 08:46:56 +00:00
717a099d42 Revert "[flex attention][triton pin] triton_helpers shim for TMA apis (#154858)" (#155640)
This reverts commit ea7b233015ff00098df687884be4e2efbf7a55fa.

It fails internal tests in fbcode, but they weren't running so we didn't get signal

Reverting w/ a PR/diff because the PR has been landed for ~1 week - too old to revert directly from internal.

Differential Revision: [D76380887](https://our.internmc.facebook.com/intern/diff/D76380887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155640
Approved by: https://github.com/nmacchioni, https://github.com/danzimm
2025-06-11 07:37:47 +00:00
0e2013a12d Add helion x pt2 test (#155513)
This kinda just worked out of the box, shocking. PT2 traced into helion and emitted it as a user defined triton kernel: P1836496774

In the long run, we do not actually want this, but rather to create a helion HOP so we can do fusions etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155513
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-06-11 07:08:06 +00:00
5b9db4335e Include c++ stack traces when we hit constraint violation (#155603)
Example new error message

```
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.

Framework stack:
  File "??", line 0, in _start
  File "", line 0, in __libc_start_main_alias_2
  File "??", line 0, in __libc_start_call_main
  File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 1094, in Py_BytesMain
  File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 357, in pymain_run_file_obj
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 90, in _PyRun_AnyFileObject
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 456, in _PyRun_SimpleFileObject
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1208, in pyrun_file
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1312, in run_mod
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1291, in run_eval_code_obj
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 1134, in PyEval_EvalCode
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/scratch/repro.py", line 9, in <module>
    foo(x)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/eval_frame.py", line 699, in compile_wrapper
    return fn(*args, **kwargs)
  File "offloadstuff.c", line 0, in dynamo__custom_eval_frame
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 305, in _PyObject_Call
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1469, in __call__
    return self._torchdynamo_orig_callable(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1248, in __call__
    result = self._inner_convert(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 625, in __call__
    return _compile(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1092, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_utils_internal.py", line 97, in wrapper_function
    return function(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 779, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 818, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 265, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 743, in transform
    tracer.run()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 3531, in run
    super().run()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1359, in run
    while self.step():
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1263, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 422, in impl
    self.push(fn_var.call_function(self, self.popn(nargs), {}))
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
    return handler(tx, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 792, in <lambda>
    return lambda tx, args, kwargs: obj.call_function(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
    return handler(tx, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1120, in _handle_insert_op_in_graph
    return wrap_fx_proxy(tx, proxy)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2500, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2566, in wrap_fx_proxy_cls
    return _wrap_fx_proxy(
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2664, in _wrap_fx_proxy
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3205, in get_fake_value
    ret_val = wrap_fake_exception(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 2705, in wrap_fake_exception
    return fn()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3206, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3373, in run_node
    return node.target(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5917, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 430, in cfunction_vectorcall_FASTCALL
  File "/usr/local/src/conda/python-3.10.16/Objects/abstract.c", line 891, in binary_op1
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7284, in slot_nb_multiply
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/descrobject.c", line 344, in method_vectorcall_VARARGS_KEYWORDS
  File "python_variable_methods.cpp", line 0, in _object* torch::autograd::TypeError_to_NotImplemented_<&torch::autograd::THPVariable_mul>(_object*, _object*, _object*)
  File "python_variable_methods.cpp", line 0, in torch::autograd::THPVariable_mul(_object*, _object*, _object*)
  File "??", line 0, in at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&)
  File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
  File "PythonFallbackKernel.cpp", line 0, in void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::pythonTLSSnapshotFallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
  File "VariableType_0.cpp", line 0, in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mul_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
  File "VariableType_0.cpp", line 0, in torch::autograd::VariableType::(anonymous namespace)::mul_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "??", line 0, in at::_ops::mul_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
  File "PythonFallbackKernel.cpp", line 0, in (anonymous namespace)::pythonFallback(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::dispatch(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "??", line 0, in torch::handle_torch_function_no_python_arg_parser(c10::ArrayRef<_object*>, _object*, _object*, char const*, _object*, char const*, torch::TorchFunctionName)
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 577, in PyObject_CallMethod
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1346, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2029, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1442, in _cached_dispatch_impl
    return self._dispatch_impl(func, types, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2552, in _dispatch_impl
    return maybe_propagate_real_tensors(fast_impl(self, *args, **kwargs))
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 956, in fast_binary_impl
    final_shape = infer_size(final_shape, shape)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 916, in infer_size
    torch._check(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1669, in _check
    _check_with(RuntimeError, cond, message)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1632, in _check_with
    if expect_true(cond):
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1686, in expect_true
    return a.node.expect_true(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 552, in expect_true
    return self.guard_bool(file, line)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 536, in guard_bool
    r = self.evaluate()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 510, in evaluate
    return self.shape_env.evaluate_sym_node(self, size_oblivious)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7113, in evaluate_sym_node
    return self.evaluate_expr(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7215, in evaluate_expr
    return self._inner_evaluate_expr(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7238, in _inner_evaluate_expr
    return self._evaluate_expr(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7505, in _evaluate_expr
    self._maybe_guard_rel(g)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6758, in _maybe_guard_rel
    self._refine_ranges(expr)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7709, in _refine_ranges
    self._set_replacement(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6667, in _set_replacement
    self.framework_specialization_stacks[source] = CapturedTraceback.extract(cpp=True)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/utils/_traceback.py", line 207, in extract
    torch._C._profiler.gather_traceback(python=True, script=script, cpp=cpp),
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 543, in cfunction_call
  File "offloadstuff.c", line 0, in pybind11::cpp_function::dispatcher(_object*, _object*, _object*)
  File "offloadstuff.c", line 0, in pybind11::cpp_function::initialize<std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback>, bool, bool, bool, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback> (*)(bool, bool, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&)
  File "??", line 0, in torch::CapturedTraceback::gather(bool, bool, bool)
  File "??", line 0, in torch::unwind::unwind()

User stack:
  File "/home/bobren/local/a/pytorch/scratch/repro.py", line 5, in foo
    return torch.randn(5) * x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155603
Approved by: https://github.com/zou3519, https://github.com/cyyever
ghstack dependencies: #155133
2025-06-11 05:00:36 +00:00
84c14361c2 [ez][AOTI] Add test for std::nullopt return in custom op (#155636)
Summary: As title. Follow up of https://github.com/pytorch/pytorch/pull/154286

Test Plan:
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_nullopt_output

Rollback Plan:

Differential Revision: D76378892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155636
Approved by: https://github.com/zou3519, https://github.com/cyyever
2025-06-11 03:52:31 +00:00
1e690b6c41 Replace TORCH_INTERNAL_ASSERT with TORCH_CHECK in set_history (#155453)
Fixes #154357

## Test Result

```bash
>>> import torch
>>>
>>> x = torch.tensor(1, device=torch.device('cpu'))
>>> y = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
>>> z0 = (x.abs() * y).prod(dtype=torch.int16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Autograd not support dtype: Short
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155453
Approved by: https://github.com/albanD, https://github.com/soulitzer
2025-06-11 03:46:48 +00:00
110ae0f433 Custom Op handle 1-element tuples (#155447)
Fixes #150472

Modification of [PR 151408](https://github.com/pytorch/pytorch/pull/151408). This PR modifies the return parsing in `infer_schema` to handle the case of a Tuple with a single element.
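
A minimal sketch of the case this enables, assuming the fix is in place; the op name and implementation below are hypothetical, not taken from the PR:
```python
import torch
from typing import Tuple
from torch import Tensor

# Hypothetical custom op whose return annotation is a single-element Tuple,
# the case the return parsing in infer_schema now handles.
@torch.library.custom_op("mylib::clone_one", mutates_args=())
def clone_one(x: Tensor) -> Tuple[Tensor]:
    return (x.clone(),)

@clone_one.register_fake
def _(x: Tensor) -> Tuple[Tensor]:
    return (torch.empty_like(x),)

out = clone_one(torch.randn(3))
```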
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155447
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-06-11 03:43:40 +00:00
a2b0b2698d inductor codecache: include private inductor configs in cache key (#153672)
Fixes https://github.com/pytorch/torchtitan/issues/1185

It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.

I'm not sure how worried we should be about the blast radius of this change. On the one hand:

(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)

(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.
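
For intuition, a rough sketch of the behavior change, not Inductor's actual code: private ("_"-prefixed) entries now participate in the cache key, and only the explicitly ignored names are skipped. The config names below are illustrative, not real Inductor configs.
```python
# Rough sketch only; not the real Inductor implementation.
def config_cache_key(config: dict, save_config_ignore: frozenset) -> tuple:
    return tuple(
        sorted(
            (name, repr(value))
            for name, value in config.items()
            if name not in save_config_ignore  # no blanket skip of "_" names
        )
    )

# A private flag now affects the key unless it is explicitly ignored.
key = config_cache_key(
    {"max_autotune": True, "_private_pass_enabled": True},
    save_config_ignore=frozenset({"some_ignored_config"}),
)
```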

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
2025-06-11 01:33:24 +00:00
5264f8cd8d Add Intel GPU info collection to the collect env script (#137846)
As title, add Intel GPU info collection to the collect env script

Output examples:
1. CPU on Windows
```
C:\Users\user\miniforge3\envs\py310\lib\site-packages\torch\_subclasses\functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise (10.0.22631 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Name: 12th Gen Intel(R) Core(TM) i7-1270P
Manufacturer: GenuineIntel
Family: 198
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 1711
MaxClockSpeed: 2200
L2CacheSize: 9216
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] torch                     2.8.0.dev20250528+cpu          pypi_0    pypi
```

2. XPU on Windows
```
Collecting environment information...
PyTorch version: 2.8.0a0+gitef6306e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro (10.0.19045 64-bit)
GCC version: (GCC) 13.1.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* 32.0.101.6795 (20250520000000.******+***)
Intel GPU models onboard:
* Intel(R) Arc(TM) A770 Graphics
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.33184', total_memory=15915MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2401
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU1
CurrentClockSpeed: 2200
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.8.10+gitb3ea3a1
[pip3] numpy==2.1.2
[pip3] optree==0.13.1
[pip3] pytorch-triton-xpu==3.3.1+gitb0e26b73
[pip3] torch==2.8.0a0+gitef6306e
[conda] intel-extension-for-pytorch 2.8.10+gitb3ea3a1          pypi_0    pypi
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] mkl-dpcpp                 2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-datafitting   2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-stats         2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-vm            2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.1+gitb0e26b73          pypi_0    pypi
[conda] torch                     2.8.0a0+gitef6306e          pypi_0    pypi
```

3. CPU on Linux
```
/opt/python/cp312-cp312/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: AlmaLinux 8.10 (Cerulean Leopard) (x86_64)
GCC version: (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.28

Python version: 3.12.10 (main, Apr 19 2025, 05:03:56) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              88
On-line CPU(s) list: 0-87
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz
Stepping:            7
CPU MHz:             1000.000
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-21,44-65
NUMA node1 CPU(s):   22-43,66-87
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] Could not collect
```

5. XPU on Linux
```
Collecting environment information...
PyTorch version: 2.8.0.dev20250516+xpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.50-051550-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* intel_opencl: 24.39.31294.21-1032~22.04
* level_zero:   1.17.44.0-1022~22.04
Intel GPU models onboard:
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [1] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [2] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [3] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [4] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [5] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [6] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [7] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          224
On-line CPU(s) list:             0-223
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8480+
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              56
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       5.3 MiB (112 instances)
L1i cache:                       3.5 MiB (112 instances)
L2 cache:                        224 MiB (112 instances)
L3 cache:                        210 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-55,112-167
NUMA node1 CPU(s):               56-111,168-223
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] pytorch-triton-xpu==3.3.0+git0bcc8265
[pip3] torch==2.8.0.dev20250516+xpu
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] numpy                     2.2.5                    pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.0+git0bcc8265          pypi_0    pypi
[conda] torch                     2.8.0.dev20250516+xpu          pypi_0    pypi
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137846
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-11 01:22:06 +00:00
3040ca6d0f [Cutlass] Include fp8 headers in aoti cpp wrapper (#155173)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155173
Approved by: https://github.com/desertfire
ghstack dependencies: #154829, #154835, #155195
2025-06-11 01:21:16 +00:00
1ed243f01c Add missing attr access check for legacy autograd.Function (#155055)
Fixes https://github.com/pytorch/pytorch/issues/154981
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155055
Approved by: https://github.com/albanD
ghstack dependencies: #154509, #154852
2025-06-11 01:03:49 +00:00
5dd07c70e5 [dynamo] added github_cli to detect unimplemented_v2 calls (#155610)
This PR adds a workflow: whenever a dev opens a PR that touches files under torch/_dynamo, we check for unimplemented_v2() call sites. If any of them have been modified, the workflow fails, lists exactly which call sites changed, and tells the dev which command lines to run to update the registry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155610
Approved by: https://github.com/williamwen42
2025-06-11 00:40:56 +00:00
3580b8dde4 [BE] Mention debug=True in AC error messages (#155593)
See https://github.com/pytorch/pytorch/issues/155171#issuecomment-2949415407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155593
Approved by: https://github.com/janeyx99
2025-06-11 00:32:41 +00:00
dbec08bc1c Changes to HFStorageWriter to support saving shards of tensors (#154742) (#155566)
Summary:

As we move towards supporting saving partial tensors natively with HFStorageWriter, there are some simple changes that need to be made to make this happen.
- The current approach for distributed writes is that every rank has full tensors, but we split up the writing of these full tensors across all available ranks. We're removing this logic that was in the HFSavePlanner and instead assuming that every rank has a shard and saving every rank's local state
    -  as a result we can probably remove the HFSavePlanner, but keeping it as a placeholder for now

- the current naming of files doesn't support shards, as it's in the format "model-00001-of-00004.safetensors"; if every rank writes the same file names, they will overwrite each other, so this adds a shard-00001 prefix so that the rank files don't overwrite each other (a filename sketch follows this list)
- don't save the metadata file models.safetensors.index.json if sharding is enabled. This file expects a 1-to-1 mapping between tensor and filename, which doesn't make sense in the sharded saving approach, so we can just get rid of this file
- make the "fqn_to_file_index" map optional. It describes which file each tensor should be saved in; if users don't want to provide it, we can just save all the tensors to one file. If they run into issues, they can choose how to split up their tensors to stay friendly with the 5GB HF remote storage file size soft limit.

Test Plan: test_hf_storage.py

Reviewed By: saumishr

Differential Revision: D75099862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155566
Approved by: https://github.com/saumishr
2025-06-10 23:37:47 +00:00
2161be8497 Move unary trig ops to metal kernels (#154465)
Move inverse trig unary ops, sinh, & cosh to metal kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154465
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-10 22:56:59 +00:00
c4b93e6579 Replace frame_traced_fn hook with get_traced_code() util (#155249)
#153622 introduced a hook for getting the relevant code objects after frame tracing. The idea is to have vLLM use this instead of monkey-patching `inline_call_()` to determine the source code files to hash. Unfortunately, the hook runs too late; the vLLM backend needs access to the set of source code filenames while it's running.

This PR replaces the newly-added hook with a utility function that a backend can call to get this information. I've made the change in vLLM and can verify that this allows the information to be queried at the right time.
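
A hedged sketch of how a backend might consume this; the import path of `get_traced_code` is an assumption here and should be checked against the PR.
```python
import torch
# Assumed import location; verify against the PR before relying on it.
from torch._dynamo.utils import get_traced_code

def my_backend(gm, example_inputs):
    # Collect the source files of the code objects traced for this frame.
    code_objects = get_traced_code()
    source_files = {code.co_filename for code in code_objects}
    print(sorted(source_files))
    return gm.forward  # run the captured graph eagerly

@torch.compile(backend=my_backend)
def fn(x):
    return x + 1
```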

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155249
Approved by: https://github.com/zou3519
2025-06-10 22:40:58 +00:00
8892b782a8 [nativert] move execution planner to torch (#155374)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76167093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155374
Approved by: https://github.com/zhxchen17
2025-06-10 22:36:06 +00:00
ffc6cbfaf7 [symm_mem] Move all symm mem code into a dedicated folder (#155573)
We've reached a point where many files are related to symmetric memory and they are scattered around on the C++ side. Let's first put all symmetric-memory-related code into a separate folder. We can do further refactoring later if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155573
Approved by: https://github.com/fegin, https://github.com/d4l3k
2025-06-10 22:30:11 +00:00
3e131f7779 [CI] Move tlparse to requirements files (#155601)
Not sure why we had it that way to begin with
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155601
Approved by: https://github.com/seemethere
ghstack dependencies: #155476, #155493
2025-06-10 22:25:47 +00:00
94da4523ec Disable foreach tests that depend on profiler for CUDA 12.6 (#155596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155596
Approved by: https://github.com/clee2000, https://github.com/malfet
2025-06-10 22:21:06 +00:00
672ac2ec86 Reapply "Cleanup VS 2019 refs in pytorch (#145863)" (#152613) (#155478)
This reverts commit e4f2282.
I believe the fix PR (https://github.com/pytorch/pytorch/pull/153480) for the issue that triggered the revert has landed.
Hence this is a reland.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155478
Approved by: https://github.com/malfet
2025-06-10 22:20:14 +00:00
3b7c5e6fa5 Revert "[inductor][triton pin] TMA shim refactor & mm, mm_scaled_grouped support (#155182)"
This reverts commit b07725a9516028a485153c4b5356b3e33b990f81.

Reverted https://github.com/pytorch/pytorch/pull/155182 on behalf of https://github.com/davidberard98 due to fails on triton 3.2 (internally) ([comment](https://github.com/pytorch/pytorch/pull/155182#issuecomment-2960664845))
2025-06-10 21:53:01 +00:00
d2f06d2b06 [logs] Change autotune data into separate items (#155525)
Summary: Split the autotune data into multiple keys and items: this is better for storing the data and makes querying easier.

Test Plan:
```
 TORCHINDUCTOR_MAX_AUTOTUNE=1 tlp buck run (sample)
```

Rollback Plan:

Differential Revision: D76303514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155525
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-06-10 21:47:07 +00:00
14f3639e09 Convert to .md: onnx_verification.rst, onnx.rst, package.rst, (#155556)
Fixes https://github.com/pytorch/pytorch/issues/155031

* [onnx_verification.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_verification.rst)
* [onnx.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx.rst)

* [package.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/package.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155556
Approved by: https://github.com/AlannaBurke, https://github.com/sekyondaMeta
2025-06-10 21:40:40 +00:00
ae0f1f8984 Convert to markdown onnx rst (#155228)
Fixes #155030

Converts the following files to MyST markdown and ensure that the doc tests are green:

- [x] [onnx_dynamo_onnxruntime_backend.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_dynamo_onnxruntime_backend.rst)
- [x] [onnx_dynamo.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_dynamo.rst)
- [x] [onnx_ops.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_ops.rst)
- [onnx_torchscript_supported_aten_ops.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_torchscript_supported_aten_ops.rst) - not changed as it is autogenerated
- [onnx_torchscript.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_torchscript.rst) - fixed in #155390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155228
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-10 21:33:07 +00:00
7a03b0d2ca [BE] Remove CUDA 11 artifacts. Fix Check Binary workflow (#155555)
Please see: https://github.com/pytorch/pytorch/issues/147383

1. Remove CUDA 11 build and test artifacts; move the one remaining place to CUDA 12.4
2. Fix the Check Binary workflow to use the stable CUDA version variable rather than a hardcoded one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155555
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-06-10 21:32:08 +00:00
40fefe2871 Revert "[BE] Update cudnn to 9.10.1.4 (#155122)"
This reverts commit 73220d52fd67b5f4f5b15e0e0433e09733c93f31.

Reverted https://github.com/pytorch/pytorch/pull/155122 on behalf of https://github.com/atalman due to wrong pr description ([comment](https://github.com/pytorch/pytorch/pull/155122#issuecomment-2960592004))
2025-06-10 21:13:18 +00:00
8a396c5635 DOC: Convert to markdown: torch.compiler_best_practices_for_backends.rst, torch.compiler_cudagraph_trees.rst, torch.compiler_custom_backends.rst, torch.compiler_dynamic_shapes.rst, torch.compiler_dynamo_deepdive.rst (#155137)
Fixes #155037

[torch.compiler_best_practices_for_backends.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/torch.compiler_best_practices_for_backends.rst) shows error 404

cc  @svekars @sekyondaMeta @AlannaBurke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155137
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-10 20:51:05 +00:00
01b8f5e685 Convert to markdown: testing.rst, threading_environment_variables.rst, torch_cuda_memory.rst, torch_environment_variables.rst, torch_nccl_environment_variables.rst (#155523)
Fixes #155035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155523
Approved by: https://github.com/AlannaBurke, https://github.com/svekars
2025-06-10 20:38:36 +00:00
545fbd58dc [export] inline jit.scripted function in export (#155180)
When we export a scripted function, we inline the original callable stored in "_torchdynamo_inline"; this is the same strategy as the torch.compile path.

We do the same thing for script methods, where a "\_\_wrapped\_\_" attribute points to the original callable in most cases. There is a corner case we identified: methods of a top-level jit.scripted module don't have a \_\_wrapped\_\_ attribute. In this case, we fall back to the original scripted approach. There may be more such cases that need verification.
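
As a rough illustration of the attribute mentioned above (a sketch of the lookup, not an exhaustive description of export's behavior):
```python
import torch

def f(x):
    return x + 1

scripted = torch.jit.script(f)
# The commit says the original Python callable is stashed on the scripted
# object under "_torchdynamo_inline"; the guarded lookup below mirrors that
# description and falls back to None if the attribute is absent.
original = getattr(scripted, "_torchdynamo_inline", None)
# Per the description above, `original` is expected to be `f`.
```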

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155180
Approved by: https://github.com/zou3519
2025-06-10 20:34:12 +00:00
a666cf3b38 [xla hash update] update the pinned xla hash (#154348)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154348
Approved by: https://github.com/pytorchbot
2025-06-10 20:33:31 +00:00
c9404faacb [refactor] is_known_channels_last_contiguous* -> definitely_channels_last_contiguous* (#155499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155499
Approved by: https://github.com/laithsakka
2025-06-10 20:29:46 +00:00
94763f5ca7 [ROCm][Inductor][CK] add kBatch as runtime parameter to CK-tile gemms (#155389)
Similar to old-CK gemms

### Testing

Rely on existing coverage in `test/inductor/test_ck_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155389
Approved by: https://github.com/chenyang78
2025-06-10 20:25:02 +00:00
ab51a93737 [CI] Set PATH during build to include location of sccache wrapped nvcc (#155464)
Sccache wasn't working for nvcc on jammy, so manually set the PATH to include the location of nvcc.

I had problems with always making nvcc a wrapper in some inductor tests where I got
```
sccache: encountered fatal error
sccache: error: PCH not supported by nvcc
sccache: caused by: PCH not supported by nvcc
```
and I also got an error (only on clang) when trying to set CMAKE_CUDA_COMPILER_LAUNCHER to /opt/cache/bin/sccache or sccache
```
ccache: error: failed to execute compile
    sccache: caused by: Compiler not supported: "nvcc warning : Support for offline compilation for architectures prior to \'<compute/sm/lto>_75\' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).\nnvcc fatal   : Failed to preprocess host compiler properties.\n"
```

Non-jammy CUDA jobs' docker images used a different Dockerfile, which set CMAKE_CUDA_COMPILER_LAUNCHER: see e895e9689c/.ci/docker/ubuntu-cuda/Dockerfile (L110)

Alt solution:
Given that I only get the error on clang, I could set CMAKE_CUDA_COMPILER_LAUNCHER=sccache only when not using clang

Setting CUDA_NVCC_EXECUTABLE doesn't fail but also doesn't result in cache hits/misses

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155464
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-06-10 20:23:33 +00:00
35e8f2593c [CUDA] Fix missing bounds check in Softmax.cu (#154778)
Uncovered by @ngimel, same as change in #144009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154778
Approved by: https://github.com/ngimel, https://github.com/cyyever, https://github.com/malfet
2025-06-10 20:03:54 +00:00
0ca2a79f5b [TEST] Modernize test_sort_large (#155546)
Since its introduction ~4 years ago, the test `test_sort_large` has always been deselected because it requires 200GB of CUDA memory. Now, as we do have GPUs this big, it gets selected, but fails because `var_mean` is not a member of `torch.Tensor` and `var_mean` accepts only floating point tensors.
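
For context, a small snippet illustrating the two constraints mentioned above (free-function call and floating-point input):
```python
import torch

t = torch.randint(0, 10, (1000,))
# var_mean is a torch-level function rather than a Tensor method, and it
# only accepts floating-point inputs, hence the explicit cast.
var, mean = torch.var_mean(t.float())
print(var.item(), mean.item())
```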
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155546
Approved by: https://github.com/eqy
2025-06-10 19:59:12 +00:00
ea23eb4b98 [ez][CI] Reuse old whl: turn off on releases, add docs files to ok list (#155346)
Add docs/**/*.md and docs/**/*.rst to the list of files for which it is ok to reuse old whls

Prevent using old whls on release branches

Move check for changed files earlier to reduce api usage?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155346
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-06-10 19:57:40 +00:00
8a22551300 Fixes OpInfo gradient checks for ctc_loss (#154590)
Fixes #67462

Re-enables `OpInfo` gradient checks for the restricted scenarios where the current `ctc_loss` implementation is accurate and consistent.

The desired `ctc_loss` gradient behavior appears to be an ongoing discussion, see
https://github.com/pytorch/pytorch/issues/52241. The `OpInfo` gradient checks can be updated if/as the underlying implementation advances.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154590
Approved by: https://github.com/soulitzer
2025-06-10 19:56:39 +00:00
954ce94950 Add __main__ guards to quantization tests (#154728)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

In quantization tests:

- Add and use a common raise_on_run_directly method for when a user directly runs a test file that should not be run this way; print the file the user should have run instead (a minimal sketch follows this list).
- Raise a RuntimeError for tests which have been disabled (not run)
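
An illustrative sketch of the helper described in the first bullet; the real helper lives in the common test utilities and its exact signature may differ.
```python
# Hypothetical sketch; the actual helper in PyTorch's test utilities may differ.
def raise_on_run_directly(correct_file: str) -> None:
    raise RuntimeError(
        "This test file is not meant to be run directly; "
        f"please run {correct_file} instead."
    )

if __name__ == "__main__":
    raise_on_run_directly("test/test_quantization.py")
```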

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154728
Approved by: https://github.com/ezyang
2025-06-10 19:46:07 +00:00
07eb374e7e [dynamo] Avoid unnecessary caching source codegen (#155376)
We only need to cache a source (e.g., `x.y.z`) into a temporary local if
it's used multiple times in the codegen, otherwise we'd just be creating
redundant `DUP` and `STORE_FAST tmp_...` instructions, which might
degrade perf and definitely makes generated bytecode harder to read.

Example:
```python
import torch

@torch.compile(backend="eager")
def fn(x, y):
    return x + y

fn(torch.ones(2), torch.ones(1))
```

Original bytecode:
```verbatim
[0/0] [__bytecode]   3           0 RESUME                   0
[0/0] [__bytecode]
[0/0] [__bytecode]   5           2 LOAD_FAST                0 (x)
[0/0] [__bytecode]               4 LOAD_FAST                1 (y)
[0/0] [__bytecode]               6 BINARY_OP                0 (+)
[0/0] [__bytecode]              10 RETURN_VALUE
```

Modified bytecode (before this patch):
```verbatim
[__bytecode]   3           0 RESUME                   0
[__bytecode]               2 LOAD_GLOBAL              1 (NULL + __compiled_fn_1_578c8d9a_2a9b_4d15_bac7_267591cdee32)
[__bytecode]              14 LOAD_FAST                0 (x)
[__bytecode]              16 COPY                     1
[__bytecode]              18 STORE_FAST               3 (tmp_1)
[__bytecode]              20 LOAD_FAST                1 (y)
[__bytecode]              22 COPY                     1
[__bytecode]              24 STORE_FAST               4 (tmp_2)
[__bytecode]              26 PRECALL                  2
[__bytecode]              30 CALL                     2
[__bytecode]              40 STORE_FAST               2 (graph_out_0)
[__bytecode]              42 LOAD_FAST                2 (graph_out_0)
[__bytecode]              44 LOAD_CONST               1 (0)
[__bytecode]              46 BINARY_SUBSCR
[__bytecode]              56 DELETE_FAST              2 (graph_out_0)
[__bytecode]              58 RETURN_VALUE
```

Modified bytecode (after this patch):
```verbatim
[__bytecode]   3           0 RESUME                   0
[__bytecode]               2 LOAD_GLOBAL              1 (NULL + __compiled_fn_1_2c498af2_ce5c_49cb_abba_a0c7489b09ce)
[__bytecode]              14 LOAD_FAST                0 (x)
[__bytecode]              16 LOAD_FAST                1 (y)
[__bytecode]              18 PRECALL                  2
[__bytecode]              22 CALL                     2
[__bytecode]              32 STORE_FAST               2 (graph_out_0)
[__bytecode]              34 LOAD_FAST                2 (graph_out_0)
[__bytecode]              36 LOAD_CONST               1 (0)
[__bytecode]              38 BINARY_SUBSCR
[__bytecode]              48 DELETE_FAST              2 (graph_out_0)
[__bytecode]              50 RETURN_VALUE
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155376
Approved by: https://github.com/williamwen42
2025-06-10 19:38:15 +00:00
91ee9ee82d [MPS][BE] Refactor round_decimals shader code to leverage new macro (#155316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155316
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #155304
2025-06-10 19:29:57 +00:00
b1b8e57cda Add __main__ guards to ao tests (#154612)
This is the first PR of a series in an attempt to get the content of #134592 merged as smaller PRs (Given that the original one was closed due to a lack of reviewers).

This specific PR contains:
- Add and use a common raise_on_run_directly method for when a user directly runs a test file that should not be run this way; it prints the file the user should have run instead.
- Update ao tests.

There will be follow-up PRs to update the other test suites, but I don't have permissions to create branches directly on pytorch/pytorch, so I can't create a stack and therefore will have to create them one at a time.

Cc @jerryzh168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154612
Approved by: https://github.com/jcaip
2025-06-10 18:33:09 +00:00
0b677560e6 [inductor] use int64 for large index (#154575)
Split reduction may need to add an extra mask to avoid invalid indices. Previously we always used the torch.int32 dtype, which causes problems when the tensor numel exceeds 2^31.
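
A small numeric illustration of why the index needs int64 once numel crosses 2^31 (illustrative only; the actual fix lives in the Inductor codegen):
```python
import torch

numel = 2**31 + 10                                        # larger than int32 can represent
last_index = torch.tensor(numel - 1, dtype=torch.int64)   # representable in int64
print(torch.iinfo(torch.int32).max < numel)               # True: int32 cannot hold this index
```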

Fix https://github.com/pytorch/pytorch/issues/154168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154575
Approved by: https://github.com/ngimel, https://github.com/jansel
2025-06-10 18:30:43 +00:00
0f47e76937 [MPS] Implement hardshrink metal kernel (#155304)
Implements the forward and backward hardshrink operators as Metal kernels.
In order to support the lambda parameter, we extend the `exec_unary_kernel` and `exec_binary_kernel` methods. Now they take an optional Scalar and an optional ScalarType argument. When the optional ScalarType is provided, it overrides the type of the Scalar.
We add a new `REGISTER_UNARY_ALPHA_OP` macro, and modify the existing `REGISTER_BINARY_ALPHA_OP` to support the new feature.
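
For reference, the hardshrink semantics the kernels implement, expressed with public PyTorch ops (this is just the mathematical definition, not the Metal code):
```python
import torch
import torch.nn.functional as F

x = torch.tensor([-0.6, -0.3, 0.0, 0.3, 0.6])
lambd = 0.5
# hardshrink keeps values whose magnitude exceeds lambda and zeroes the rest
expected = torch.where(x.abs() > lambd, x, torch.zeros_like(x))
assert torch.equal(F.hardshrink(x, lambd), expected)
```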

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155304
Approved by: https://github.com/malfet
2025-06-10 18:20:27 +00:00
8347268edc Revert "Make open device registration tests standalone (#153855)"
This reverts commit 8823138e47a3200c313f6bf2d21eb689d8150f39.

Reverted https://github.com/pytorch/pytorch/pull/153855 on behalf of https://github.com/clee2000 due to causing some linux aarch64 tests to fail [GH job link](https://github.com/pytorch/pytorch/actions/runs/15566289293/job/43832373302) [HUD commit link](8823138e47), should be easy fix, rename in places where its mentioned, there might be more than just aarch64 though ([comment](https://github.com/pytorch/pytorch/pull/153855#issuecomment-2960191503))
2025-06-10 18:11:24 +00:00
cb9b479f4f XPU enable XCCL by default (#154963)
Enable USE_XCCL=ON by default when building PyTorch XPU binary, which is on par with NCCL for PyTorch CUDA binary build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154963
Approved by: https://github.com/cyyever, https://github.com/guangyey, https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-10 17:56:13 +00:00
0b6c0898e6 [dynamo] added additional_info functionality (#155526)
Developers can now pass an --additional-info arg to the add and update dev terminal commands, to include any additional information they might want to record about the graph break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155526
Approved by: https://github.com/williamwen42
2025-06-10 17:40:50 +00:00
8823138e47 Make open device registration tests standalone (#153855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153855
Approved by: https://github.com/janeyx99
2025-06-10 17:33:26 +00:00
c88e87f355 [Inductor] Set Triton Allocator Function For Use with New TMA API in Inductor (#155373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155373
Approved by: https://github.com/davidberard98
2025-06-10 17:09:04 +00:00
73220d52fd [BE] Update cudnn to 9.10.1.4 (#155122)
Follow up to #152782
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155122
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/eqy
2025-06-10 16:59:00 +00:00
38c4d05535 [precompile] Ensure @disable()-ed function won't trigger recompile from precompile bytecode. (#155363)
A precompiled bytecode looks like the following:
```
pre-graph bytecode
...
compiled graph code
...
post-graph bytecode
```

In pre-graph bytecode we have calls into helper functions like torch._dynamo.utils.call_size which will invoke @disable inside the bytecode.

Normally torch.compile() will handle these frames fine, but for precompile we will load bytecode from a clean state of dynamo and we want a way to assert that recompile never happens. The current way to ensure this is by doing set_stance("fail_on_recompile") (open to any other ideas for testing this, but IMO this is the closest thing we have today).

This approach doesn't work when util functions like call_size() are involved, and this PR fixes a bunch of places to make sure "fail_on_recompile" can skip through the functions meant to be skipped during compilation.
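
A minimal sketch of the "fail_on_recompile" testing pattern described above, using the public torch.compiler.set_stance API (the trivial function and shapes are illustrative):
```python
import torch


@torch.compile(backend="eager")
def fn(x):
    return x + 1


fn(torch.ones(3))  # first call compiles; stands in for the precompiled state

with torch.compiler.set_stance("fail_on_recompile"):
    fn(torch.ones(3))  # same input shape: must hit the existing entry, else this raises
```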

Differential Revision: [D76156867](https://our.internmc.facebook.com/intern/diff/D76156867/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155363
Approved by: https://github.com/jamesjwu, https://github.com/jansel
ghstack dependencies: #155329
2025-06-10 16:13:38 +00:00
ddee927f31 [precompile] Add low level C API to load precompiled dynamo code on functions. (#155329)
While loading deserialized dynamo states back from disk, precompile will need a direct way to access ExtraState and populate guarded bytecode as cache entries.

This diff adds two API at code level to load precompiled guard + bytecode entries.
1. _load_precompile_entry() will append an entry to a precompile entry list per code object. This precompile entry will be looked up before normal compiled entries.
2. _reset_precompile_entries() will clean up all the installed existing entries. This is useful to prevent a case where the user calls loading multiple times and explodes the number of entries in the list.

Differential Revision: [D76083247](https://our.internmc.facebook.com/intern/diff/D76083247/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155329
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-06-10 16:13:38 +00:00
e8d29c45e0 [ROCm][TunableOp] Unit test to verify that there is only one kernel launch per PyTorch API invocation. (#155077)
TunableOp UT covers breakage that was fixed in this PR: https://github.com/pytorch/pytorch/pull/153764

After tuning is complete, verify that there is only one kernel launch for each PyTorch API invocation.
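
A hedged sketch of one way to observe this with torch.profiler; the actual unit test may count launches differently, and the GEMM shape here is arbitrary:
```python
import torch
from torch.profiler import ProfilerActivity, profile

a = torch.randn(512, 512, device="cuda")
b = torch.randn(512, 512, device="cuda")
torch.mm(a, b)  # assume TunableOp tuning has already completed for this shape

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.mm(a, b)
    torch.cuda.synchronize()

# Expect a single GEMM kernel row with Count == 1 in the CUDA columns.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```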

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155077
Approved by: https://github.com/jeffdaily
2025-06-10 16:11:43 +00:00
08d15d3ec1 [Docs] Convert to markdown: torch.compiler_troubleshooting.rst (#155514)
Part of changes #155040 (parent PR #155120)

Follow-up of #155351. I split the changes to `torch.compiler_troubleshooting.rst` into #155351 and this PR due to the 2000-line limit per PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155514
Approved by: https://github.com/svekars
2025-06-10 15:41:31 +00:00
eb152ab1dd Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit 060838c2312ad207c7afe2c86f8a484afea5f328.

Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/clee2000 due to broke a bunch of tests internally D76299454, probably also broke rocm inductor/test_analysis.py::TestAnalysisCUDA::test_augment_trace_against_flop_counter_maxat0_cuda_float16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15545277599/job/43766911025) [HUD commit link](060838c231) ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-2959747153))
2025-06-10 15:38:40 +00:00
b44306d368 Add dont constant fold flag (#154945)
For support https://github.com/pytorch/ao/issues/2228
> What we want to do now is to enable FP8 quantization in PyTorch. And similar as INT8 quantization, we need to insert quantize and dequantize ops into the graph.
>
> However we met problems with these q/dq ops both in the PyTorch core and Torchao.
>
> PyTorch core:
>
> The quantize_per_tensor op does not support FP8. We want to fix it via https://github.com/pytorch/pytorch/pull/153601. And as you commented, the op is deprecated.
> Torchao:
>
> In the fusion pass in Inductor, we want to match the pattern fp8_weight -> torchao.dequantize_affine_float8 -> fp32_op and fuse it as fp8_weight -> weight_pack -> fp8_op. We have done so for INT8 PT2E quantization. However, the pattern matching pass is applied after a constant folding pass in Inductor:
> 100ec0b34a/torch/_inductor/fx_passes/freezing_patterns.py (L69C1-L74C1)
> After constant_fold(gm), the pattern will be folded as fp32_weight -> fp32_op. Then the original pattern cannot be found any more and the FP8 semantics is lost since the pattern is entirely in fp32 now.
> For INT8, the int8_weight -> quantized_decomposed.dequantize_per_channel -> fp32_op pattern won't be folded because we mark quantized_decomposed.dequantize_per_channel impure so that it won't be folded: 100ec0b34a/torch/_inductor/constant_folding.py (L139C1-L149C1) . But for the torchao.dequantize_affine_float8, we cannot do this because
> It is an op from Torchao, which is unknown to the constant folder
> It is decomposed to smaller ops, so we cannot put it in the list as a single op.
> So, we think an easy and short-term solution is to modify the ops in PyTorch core via https://github.com/pytorch/pytorch/pull/153601.
> However, if we want to resolve the issue with Torchao, we need to
> Add a method in the constant folder in Inductor to allow registration of impure ops

Based on [Jansel's reply](https://github.com/pytorch/ao/issues/2228#issuecomment-2914560340), this patch adds a "don't constant fold" flag.
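
A conceptual illustration (not the Inductor pass itself) of why folding the dequantize step loses the FP8 pattern; the dtypes are real PyTorch dtypes, but the function is purely illustrative:
```python
import torch

w_fp8 = torch.randn(8, 8).to(torch.float8_e4m3fn)  # frozen fp8 weight
scale = torch.tensor(0.1)


def fwd(x):
    w = w_fp8.to(torch.float32) * scale  # dequantize step the fusion pass wants to match
    return x @ w                         # downstream fp32 op

# If constant folding evaluates `w` ahead of time, the traced graph only sees an
# fp32 constant, so the "dequantize -> op" pattern (and the fp8 origin) is lost.
print(fwd(torch.randn(2, 8)).shape)
```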

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154945
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-10 14:52:26 +00:00
68f36683f0 [Testing] Add more models to MPSInductor tests (#155494)
Enable all 46 HuggingFace models; only GPT2ForSequenceClassification fails to compile, with a rather strange `array subscript is not an integer` error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155494
Approved by: https://github.com/dcci
ghstack dependencies: #155476, #155493
2025-06-10 14:40:59 +00:00
c8d39a1045 [docs] Add docstring indicating UB for converting inf to int (#154781)
Fixes #154724
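
A quick illustration of the behavior the new docstring calls out; the printed integers are unspecified and may differ across devices and builds:
```python
import torch

x = torch.tensor([float("inf"), float("-inf"), float("nan")])
print(x.to(torch.int64))  # undefined behavior: the resulting values are unspecified
```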

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154781
Approved by: https://github.com/malfet
2025-06-10 14:04:50 +00:00
805297981a Revert "[Testing] Add more models to MPSInductor tests (#155494)"
This reverts commit f154f9b3040369a7979d5de7acb6fe21433eda83.

Reverted https://github.com/pytorch/pytorch/pull/155494 on behalf of https://github.com/malfet due to I'm blind ([comment](https://github.com/pytorch/pytorch/pull/155494#issuecomment-2959319787))
2025-06-10 13:45:32 +00:00
e53ddaf1f6 Adapt dtensor tests to be device agnostic (#154840)
## MOTIVATION
This PR includes minor changes to skip some unsupported tests on Intel Gaudi devices as well as to make some of the tests more device agnostic.
Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66

## CHANGES
- test_dtensor_compile.py: Make some of the tests device agnostic (replace "cuda" hard-coding with self.device_type)
- test_dtensor.py and test_comm_mode_features.py: Skip some tests which are unsupported on Intel Gaudi devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154840
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD
2025-06-10 12:43:16 +00:00
f154f9b304 [Testing] Add more models to MPSInductor tests (#155494)
Enable all hugging face models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155494
Approved by: https://github.com/dcci
ghstack dependencies: #155476, #155493
2025-06-10 12:30:38 +00:00
75f258dd1f Fix spelling mistake (#155495)
Summary: Change "primtivies" to "primitives".

Test Plan:
n/a

Rollback Plan:

Differential Revision: D76229938

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155495
Approved by: https://github.com/angelayi, https://github.com/cyyever
2025-06-10 09:06:58 +00:00
a205e8fd73 Apply all replacements on backward graph args during inductor codegen. (#155469)
Summary: temporary mitigation for https://github.com/pytorch/pytorch/issues/155468

Test Plan:
NA

Rollback Plan:

Differential Revision: D76096355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155469
Approved by: https://github.com/bobrenjc93
2025-06-10 08:56:18 +00:00
5116293f7e [XPU] Split triton version as 2 files to decouple triton version bump (#155313)
Triton XPU shares its version file with the community one. When the community updates the Triton version, it temporarily breaks the XPU CI/CD because they use different repositories and commits. To decouple Triton version bumps between the community and XPU, we propose splitting the version into two separate files.

Refer to the latest community Triton version bump PR #153117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155313
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/atalman
2025-06-10 08:49:03 +00:00
2cdcd16e83 [Easy] update pip sources for CUDA in nightly pull tool (#149143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149143
Approved by: https://github.com/ezyang, https://github.com/cyyever
ghstack dependencies: #145685
2025-06-10 08:07:30 +00:00
0319044e92 [Easy] update pip sources for ROCm in nightly pull tool (#145685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145685
Approved by: https://github.com/ezyang
2025-06-10 08:07:30 +00:00
9d2d227003 [CI] Fix XPU runner setup status issue (#155443)
Following PR #155194, fix the timeout exit code issue; see https://github.com/pytorch/pytorch/actions/runs/15526078422/job/43706927778?pr=154962#step:3:74
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155443
Approved by: https://github.com/etaf, https://github.com/atalman, https://github.com/EikanWang
2025-06-10 08:06:37 +00:00
5dfe1787b5 [Inductor] Limit fusions to a node distance of 64 (#154688)
fix for https://github.com/pytorch/pytorch/issues/154652 and https://fb.workplace.com/groups/1075192433118967/permalink/1484799079148049/

[window 128 dashboard run here w/ no regressions](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sun%2C%2001%20Jun%202025%2006%3A38%3A41%20GMT&stopTime=Sun%2C%2008%20Jun%202025%2006%3A38%3A41%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=mlazos/fuse-window&lCommit=8576f00ebfa53567d7bddc89d9882df9eb990561&rBranch=main&rCommit=9d59b516e9b3026948918e3ff8c2ef55a33d13ad)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154688
Approved by: https://github.com/eellison, https://github.com/Raymo111
2025-06-10 07:32:23 +00:00
8b8684466a Add a stub AGENTS.md for Codex (#155459)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155459
Approved by: https://github.com/albanD, https://github.com/malfet
2025-06-10 07:20:21 +00:00
b07725a951 [inductor][triton pin] TMA shim refactor & mm, mm_scaled_grouped support (#155182)
Follow-up to #154858.

Triton 3.4 will provide a different API for TMA compared to Triton 3.3; the TMA shim in triton_helpers dispatches to the correct API.

First, this refactors the TMA shim to drop args that aren't supported from Triton 3.2 to Triton 3.4: in particular, strides (Triton 3.2 version doesn't accept non-contiguous inputs, so we just infer contiguous strides in Triton 3.4) and element_ty (Triton 3.4 doesn't support this arg, so in Triton 3.2 we just infer it from base_ptr).

Second, this updates mm.py & mm_scaled_grouped.py to use the TMA shim.

Differential Revision: [D76318784](https://our.internmc.facebook.com/intern/diff/D76318784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155182
Approved by: https://github.com/drisspg
2025-06-10 06:48:42 +00:00
8153340d10 [CI/CD] Remove CUDA 11.8 builds (#155509)
This removes CUDA 11.8 from CI/CD
Please see: https://github.com/pytorch/pytorch/issues/147383

TODO: Follow up by cleaning the CUDA 11.8 config from scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155509
Approved by: https://github.com/cyyever, https://github.com/huydhn, https://github.com/malfet
2025-06-10 05:16:41 +00:00
671a9d175b Add warning for module full backward hook when no input requires gradient (#155339)
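A minimal reproduction of the situation the new warning targets (the module and shapes are arbitrary; the example only sets up a full backward hook on a module whose inputs do not require gradients):
```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 4)
lin.register_full_backward_hook(lambda mod, grad_in, grad_out: None)

x = torch.randn(2, 4)    # requires_grad is False, so no module input needs grad
lin(x).sum().backward()  # the case this PR now warns about
```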
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155339
Approved by: https://github.com/Skylion007
2025-06-10 04:42:06 +00:00
e25ce0f928 [invoke_subgraph] Use eager input vals to constrain input strides (#155291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155291
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-06-10 04:06:09 +00:00
660695f11d Revert "Move non inductor workflows cuda 12.6->cuda 12.8 (#155234)"
This reverts commit ede6ead8cd8e925cb093f2b3016342e645bd728d.

Reverted https://github.com/pytorch/pytorch/pull/155234 on behalf of https://github.com/clee2000 due to causing a bunch of tests to fail?  ex test_nn.py::TestNNDeviceTypeCUDA::test_variable_sequence_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15545607752/job/43773157441) [HUD commit link](ede6ead8cd), some of the failures attributed to broken trunk on friday seem real? ([comment](https://github.com/pytorch/pytorch/pull/155234#issuecomment-2957578769))
2025-06-10 03:34:36 +00:00
76644c9ff5 Make require_contiguous require exact strides instead of stride order (#148424)
Make `require_contiguous` require exact strides instead of stride order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148424
Approved by: https://github.com/eellison

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-06-10 02:36:18 +00:00
b916d8a583 [Testing] Shard MacOS inductor perf tests (#155493)
One dtype per shard, as the current job takes 2+ hours to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155493
Approved by: https://github.com/dcci
ghstack dependencies: #155476
2025-06-10 02:26:22 +00:00
52edfb2cbc Updates to README about CUDA install dir and conda not required (#155458)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155458
Approved by: https://github.com/malfet
2025-06-10 01:30:34 +00:00
f34335bf33 Convert compiler rst files to markdown (#155335)
Convert the following compiler rst files to md files.
torch.compiler_inductor_profiling.rst
torch.compiler_ir.rst
torch.compiler_nn_module.rst
torch.compiler_performance_dashboard.rst
torch.compiler_profiling_torch_compile.rst

Fixes #155039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155335
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-10 01:12:11 +00:00
1851f50866 [AOTI] Add int return type support for custom op in proxy executor (#155465)
Summary:
When a custom op has an int return type in its schema, the returned value will be specialized; this behaviour is different from a symint return type. This diff **only adds support for the int return type**.

As the returned int will be specialized and fused into downstream kernels (if used), we can simply skip the int return type in the proxy executor.

Note that in an eager run, the returned int will be specialized to the value defined in the real impl of the custom op. In an exported program or in AOTI, the returned int will be specialized to the value defined in the fake impl of the custom op. So the definitions of the return value should be consistent across the real and fake impls of the custom op; otherwise the eager run and the AOTI run will produce different results.
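
A hedged sketch of the consistency requirement described above; the op name, dispatch key, and returned value are illustrative, and the exact registration style may differ from what the test uses:
```python
import torch

torch.library.define("mylib::block_size", "(Tensor x) -> int")


@torch.library.impl("mylib::block_size", "CompositeExplicitAutograd")
def block_size_impl(x):
    return 128  # value seen by eager execution


@torch.library.register_fake("mylib::block_size")
def block_size_fake(x):
    return 128  # should match the real impl, since export/AOTI specializes on this
```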

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_int_output
```

Rollback Plan:

Differential Revision: D76159406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155465
Approved by: https://github.com/angelayi
2025-06-10 01:07:15 +00:00
da50835bde [aoti] Support c10 calls (#155256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155256
Approved by: https://github.com/malfet
2025-06-10 00:45:59 +00:00
07e340e29c Build magma-cuda 129 (#155496)
followup for https://github.com/pytorch/pytorch/pull/155340
https://github.com/pytorch/pytorch/issues/155196
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155496
Approved by: https://github.com/atalman
2025-06-10 00:32:24 +00:00
e7698ff5cf [MPS] Move abs op to Metal (#155474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155474
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-10 00:23:59 +00:00
7a48cc6990 Revert "[cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)"
This reverts commit b8bc2c2660e84034ff15232e2161e3ef9a6656d0.

Reverted https://github.com/pytorch/pytorch/pull/154170 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it starts failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/154170#issuecomment-2957346976))
2025-06-10 00:18:23 +00:00
a9a0501ec4 [user triton] mutation analysis for on-device TMA (#155380)
Previously, the user-defined triton kernel mutation analysis would not detect mutation caused by TMA store, if the TMA descriptor was created via on-device TMA creation. This PR adds partial support for mutation analysis on programs that do stores via on-device TMA.

On-device TMA works like this:

```
@triton.jit
def kernel(A_ptr, workspace_ptr, ...):
    tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, A_ptr, ...)
    tl._experimental_descriptor_store(workspace_ptr, data, ...)
```

The first call (tensormap_create2d) mutates the contents of workspace_ptr to hold the descriptor data (including the fact that this TMA descriptor points to A_ptr). The second call (experimental_descriptor_store) writes to the location specified by the data in workspace_ptr: A_ptr, in this case.

The approach here is to do a first pass to identify all the experimental_descriptor_stores (and collect the associated descriptor values); and then during mutation analysis, any tma creation on a mutated descriptor value (e.g. on `workspace_ptr` in the above example) will actually register as a mutation to the associated data pointer (e.g. `data` in the above example).

Consider this example, which I'll use to describe the pros/cons of this approach.

```
@triton.jit
def create_tma(global_ptr, workspace_ptr):
    tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, global_ptr, ...)

@triton.jit
def kernel(A, B, workspace_ptr):
    create_tma(A, workspace_ptr)
    workspace_B = workspace_ptr + 128
    create_tma(B, workspace_B)
    data = tl._experimental_descriptor_load(workspace_ptr, ...)
    tl._experimental_descriptor_store(workspace_B, data, ...)
```

An alternative approach could be to simply modify the `tl.extra.cuda.experimental_device_tensormap_create2d` so that it returns a descriptor, and to use that descriptor in subsequent uses (i.e. to "functionalize" the uses of the tma creation API). However, this would (a) require "functionalization" through any function calls (e.g. to `create_tma`), and (b) would lead to both `A` and `B` being marked as mutated (i.e. mutation to `workspace_B` -> mutation to `workspace_ptr` -> mutation to `A`).

A downside of the current approach is that it doesn't understand offsets into workspaces. e.g. if one were to recompute workspace_B instead of reusing the variable, the analysis pass would not understand that these values point to the same descriptor.

Differential Revision: [D76175117](https://our.internmc.facebook.com/intern/diff/D76175117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155380
Approved by: https://github.com/oulgen
2025-06-10 00:07:18 +00:00
2578796e23 Fix sqlite3 in x86 Docker container. (#155211)
Some core modules for versions of python installed in /opt depend on libraries in /usr/local but those libraries are not copied over from the base container.

For example: /opt/python/cp312-cp312/bin/python3 -c "import sqlite3"
ImportError: libsqlite3.so: cannot open shared object file: No such file or directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155211
Approved by: https://github.com/huydhn
2025-06-09 23:42:02 +00:00
5df3bf13ec [Docs] Convert to markdown: torch.compiler_troubleshooting.rst (#155351)
Part of changes #155040 (parent PR #155120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155351
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-09 23:18:31 +00:00
e12597090c Revert "Update auto-tuning support for _scaled_grouped_mm (#150944)"
This reverts commit 09328eb02f5412d2211b5fd638ce82d0e03b9c1f.

Reverted https://github.com/pytorch/pytorch/pull/150944 on behalf of https://github.com/davidberard98 due to breaks internal usage & complicates triton pin update - more details in https://github.com/pytorch/pytorch/pull/150944#issuecomment-2957246463 ([comment](https://github.com/pytorch/pytorch/pull/150944#issuecomment-2957248841))
2025-06-09 23:12:56 +00:00
40d02eb481 [Cutlass] Allow filtering by fast_accum for scaled_mm (#155195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155195
Approved by: https://github.com/drisspg
ghstack dependencies: #154829, #154835
2025-06-09 22:46:18 +00:00
2c1a93a0ae Revert "[Graph Partition] move cpu scalar tensor to gpu (#154464)"
This reverts commit c1f531f0b0e6faf443d90f8de2936e866c8c27c2.

Reverted https://github.com/pytorch/pytorch/pull/154464 on behalf of https://github.com/clee2000 due to some of the newly added tests are failing internally, along with some other tests, D75913292 ([comment](https://github.com/pytorch/pytorch/pull/154464#issuecomment-2957201054))
2025-06-09 22:43:20 +00:00
82e6475d92 Add doc for missing functions for torch.special module (#155074)
Fixes #132178

Added all the missing functions that had a docstring but were not present in the documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155074
Approved by: https://github.com/albanD
2025-06-09 22:28:26 +00:00
bdbf2792a8 Fix docs build (#155129)
Not sure why the online doc build passes but it fails locally with these broken strings...

~Also pinning numpy version even though it is technically optional to ensure users have the right version as most users have numpy in their environment anyways.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155129
Approved by: https://github.com/janeyx99, https://github.com/svekars
2025-06-09 22:25:20 +00:00
034a7f6437 [BE] Raise better exception in torch.[con]cat[enate] (#155460)
By replacing `TORCH_CHECK` with `TORCH_CHECK_VALUE`

Also make redispatching from aliases even simpler, by just calling the
respective original class

Addresses feedback raised in https://github.com/pytorch/pytorch/pull/155383/files#r2133952368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155460
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-06-09 22:18:00 +00:00
398fca9dcf Add almalinux CUDA 12.9 docker build, required for magma build (#155340)
https://github.com/pytorch/pytorch/issues/155196
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155340
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-06-09 22:10:24 +00:00
ede6ead8cd Move non inductor workflows cuda 12.6->cuda 12.8 (#155234)
Move non inductor workflows cuda 12.6->cuda 12.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155234
Approved by: https://github.com/Skylion007, https://github.com/zxiiro, https://github.com/cyyever, https://github.com/malfet
2025-06-09 22:04:19 +00:00
060838c231 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-06-09 21:43:21 +00:00
b95dadd717 [MPS] Enable RProp test for non-contiguous (#155439)
I believe this issue has already been fixed, but I don't know the hero PR. I'm relying on ci signals to verify it's fixed across macOS versions.

Fixes #118117

xref #115350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155439
Approved by: https://github.com/Skylion007
2025-06-09 21:29:09 +00:00
3490a4f906 [MPS] Enable optimizer tests affected by addcdiv (#155437)
Tracked in #118115. Fixed in #124442. This PR unskips the tests.

xref #115350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155437
Approved by: https://github.com/Skylion007
2025-06-09 21:27:37 +00:00
b8bc2c2660 [cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)
Fixes https://github.com/pytorch/pytorch/issues/153590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154170
Approved by: https://github.com/malfet
2025-06-09 21:23:32 +00:00
eba5fc91ac [nativert] Move serialization to PyTorch core (#155229)
Summary:
Serialization contains utilities to deserialize a graph saved on disk in json format as defined in `torch/csrc/utils/generated_serialization_types.h` to the in-memory representation as defined in `torch/nativert/graph/Graph.h`

Test Plan:
buck2 run @mode/dev-nosan caffe2/test/cpp/nativert:serialization_test

Rollback Plan:

Differential Revision: D76012641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155229
Approved by: https://github.com/zhxchen17
2025-06-09 21:12:30 +00:00
1e6a653234 [ROCm][Inductor][CK] Split ck and ck-tile inductor backend(s) (#155294)
... and fix ck-tile instances not being generated due to incorrect caching

### Testing

Added test cases for CKTILE instances

```
pytest test/inductor/test_ck_backend.py -k gemm_backends_CKTILE
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155294
Approved by: https://github.com/coconutruben
2025-06-09 20:40:26 +00:00
620415e018 Revert "Add stack_trace on make_fx (#155155)"
This reverts commit d4d0ede6bacb4b3b33c0e4aa4cb0e79d34e697ec.

Reverted https://github.com/pytorch/pytorch/pull/155155 on behalf of https://github.com/malfet due to Not sure why it was merged, it indeed breaks those tests in CI ([comment](https://github.com/pytorch/pytorch/pull/155155#issuecomment-2956973633))
2025-06-09 20:40:13 +00:00
abbdf9f363 [BE][Testing] Unskip ones_like/zeros_like testing on MPS (#155476)
But skip `double` dtype form OpInfo variants for this test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155476
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-06-09 20:37:44 +00:00
ea37f72099 enable test (#155342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155342
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #154768
2025-06-09 19:26:05 +00:00
d4d0ede6ba Add stack_trace on make_fx (#155155)
Summary:
Previously, we only added the stack trace in `class _ModuleStackTracer(PythonKeyTracer)` for non-strict export. I moved this stack trace logic to the parent class `PythonKeyTracer`, so the graph traced from a Module using make_fx will have stack_trace as well.

Motivation: we've observed some use cases where users first use `make_fx` on the Module, and then run `export` on the resulting graph. If the result of `make_fx` doesn't have the stack trace, the stack trace information is lost.
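
A minimal sketch of the workflow described above: trace with make_fx, then inspect the stack_trace metadata on the resulting graph nodes (the traced function is arbitrary).
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx


def f(x):
    return torch.sin(x) + 1


gm = make_fx(f)(torch.randn(3))
for node in gm.graph.nodes:
    print(node.op, node.target, node.meta.get("stack_trace", "<none>"))
```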

Test Plan:
```
buck run test:test_export -- -r  test_stack_trace
```

Rollback Plan:

Differential Revision: D75985427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155155
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-06-09 18:31:57 +00:00
2aade5ee9f Fix weight tensor documentation #134896 (#155093)
Fixes #134896

## Description

Remove line about 'weight' tensor needing to be of floating point type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155093
Approved by: https://github.com/AlannaBurke
2025-06-09 18:07:21 +00:00
3863bbb55b [BE]: Update cusparselt to 0.7.1 (#155232)
Needed to support sparse operations on Blackwell; it also implements new features for the library and optimizes library sizes vs 0.7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155232
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2025-06-09 18:01:23 +00:00
79bdafe5b6 Revert "Custom FX pass for inductor's backend registration (#154841)"
This reverts commit e694280d1215caf70f41575f2611bfa26c69ebdb.

Reverted https://github.com/pytorch/pytorch/pull/154841 on behalf of https://github.com/clee2000 due to failing some tests internally D76135706 ([comment](https://github.com/pytorch/pytorch/pull/154841#issuecomment-2956357711))
2025-06-09 16:56:45 +00:00
0083032e75 [aotd] Support mutations in reordering_to_mimic_autograd_engine (#155353)
Original issue: https://github.com/pytorch/pytorch/issues/154820

Dedicated sub-issue: https://github.com/pytorch/pytorch/issues/155242

The backward graph is reordered by partitioners.py: reordering_to_mimic_autograd_engine, which only records in the backward graph the compute that starts from tangents.

Mutation of primals (inputs) in backward can therefore be disconnected from the backward graph.

We handle this copy_ specifically, as we add this mutation in the framework and it is the only mutation that exists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155353
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-06-09 16:39:47 +00:00
6c05f2fca0 [test] use JK to force graph break on slow aliasing/mutation/dynamic_shape behavior (#155257)
Summary: test to unblock shampoo, needs cleanup

Test Plan:
CI

Rollback Plan:
steps:
  - jk.update:
      jk: pytorch/compiler:aliased_inputs_with_mutation_and_dyn_shapes_killswitch
      constant_bool: null
      consistent_pass_rate: null
      fractional_host_rollout: null
      sampling_rate: null
  - manual.note:
      content: Set it to false.

Reviewed By: c00w

Differential Revision: D76051868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155257
Approved by: https://github.com/c00w
2025-06-09 16:21:59 +00:00
4a4cac0cef Update torch-xpu-ops commit pin (#154962)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@`a3a196`](a3a196ccdb), which includes:

- Enhanced Adaptive Average Pooling 2D Backward Kernel for performance and code simplification
- Group Norm Backward Optimization with vectorization and parallel reduction
- Support CL path for MaxUnpooling2d and MaxUnpooling3d
- Rename USE_ONEMKL as USE_ONEMKL_XPU and set it as default ON
- Refactor USE_XCCL & USE_C10D_XCCL option
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154962
Approved by: https://github.com/EikanWang
2025-06-09 15:54:13 +00:00
b9b84d8011 Generate unique id for tensor storage object by observing the weak pointer of tensor storage object (#154859)
Summary:
PyTorch execution trace records tensor storage data in the trace. The tensor storage data includes the storage id, offset, number of elements, and number of bytes per element. PARAM et-replay uses this information to allocate/free the tensors.
However, the current implementation of generating the tensor storage id does not guarantee that it is unique. ExecutionTraceObserver maintains a lookup table that maps the memory address of the tensor storage object to a unique id. If a new memory address is found, it is put into that hash table and associated with a new id.
This implementation does not guarantee that the storage object is unique, since the memory that the address points to may be released and then re-allocated to a different tensor storage object.
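
To make the failure mode concrete, here is a minimal Python sketch (not the actual C++ ExecutionTraceObserver code) of why keying ids on a raw storage address stops being unique once the allocator reuses addresses:

```python
import torch

address_to_id: dict[int, int] = {}
next_id = 0

def storage_id_by_address(t: torch.Tensor) -> int:
    """Old scheme: map the storage's data pointer to an id."""
    global next_id
    addr = t.untyped_storage().data_ptr()
    if addr not in address_to_id:
        address_to_id[addr] = next_id
        next_id += 1
    return address_to_id[addr]

a = torch.randn(1024)
id_a = storage_id_by_address(a)
del a                  # storage freed; the allocator may reuse this address
b = torch.randn(1024)  # may land at the same data_ptr()
# If it does, b silently inherits a's id even though it is a different storage,
# which is exactly the non-uniqueness this PR fixes by also observing a weak
# pointer to the storage object.
print(storage_id_by_address(b))
```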

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Differential Revision: D75749065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154859
Approved by: https://github.com/eellison, https://github.com/ngimel
2025-06-09 15:46:27 +00:00
79aef14169 [ONNX] Set the name of the producing node using the value name (#155413)
When comparing two graphs exported using different opset versions, even though the value names are the same in both graphs, the node names did not match, causing model-explorer to not be able to sync the two graphs. This change updates the names of the nodes that directly produce the output values, for better correspondence across exported graphs.

![image](https://github.com/user-attachments/assets/3c00ca18-221f-4add-8429-4bcf12069036)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155413
Approved by: https://github.com/cyyever, https://github.com/xadupre
2025-06-09 13:03:58 +00:00
e15848669f [1/n]adding torch.distributed.run option to provide destination for event logging (#154644) (#155268)
Summary:

**Problem Statement**
Currently, torch distributed elastic does not provide an option to specify the destination for event logging from torch.distributed.run.
*recording events to default destination:* https://fburl.com/code/7f9b0993
The default destination is "null".

***Solution***
Add an option in torch.distributed.run to specify event_logging_destination. The default value will be "null", which is the current default, so it won't affect users unless they specify it via the command line.

Test Plan:

https://www.internalfb.com/mlhub/pipelines/runs/mast/f738408681-TrainingApplication_torch_distributed_run_3?job_attempt=0&version=0&tab=execution_details&env=PRODUCTION

Rollback Plan:

Reviewed By: kiukchung

Differential Revision: D75183591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155268
Approved by: https://github.com/d4l3k
2025-06-09 10:43:52 +00:00
9968c854b6 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/tensor.py (#153146)
Part of #147913

Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/tensor.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153146
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-06-09 06:27:50 +00:00
9b4a748e29 [nativert] Move Weights to PyTorch core (#155156)
Summary:
Moves Weights class to PyTorch core
Torch Native Runtime RFC: pytorch/rfcs#72
 README: https://github.com/pytorch/pytorch/blob/main/torch/nativert/OVERVIEW.md

Test Plan: buck2 run mode/dev-nosan caffe2/test/cpp/nativert:weights_test

Differential Revision: D75973156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155156
Approved by: https://github.com/zhxchen17
2025-06-09 05:49:32 +00:00
6fb6293159 Revert "Add Intel GPU info collection to the collect env script (#137846)"
This reverts commit c6b4f98625bb6b22bb9a60112a6d58e684a97e1b.

Reverted https://github.com/pytorch/pytorch/pull/137846 on behalf of https://github.com/etaf due to This is breaking tests on xpu, detail log: https://hud.pytorch.org/pr/pytorch/pytorch/154962#43700962849 ([comment](https://github.com/pytorch/pytorch/pull/137846#issuecomment-2954517883))
2025-06-09 03:13:27 +00:00
be2ad70cfa Fix dynamo tracing into AOTAutogradCache results in cpu tensors (#155251)
On this line, we see that the bw_compiler that dynamo uses for AotAutograd automatically disables the backward runnable:
05dd638ee9/torch/_dynamo/backends/common.py (L76)
This disables dynamo in the bw_compiler but also disables the runnable the compiler returns.

On an AOTAutogradCache hit, however, we never call the bw_compiler! So we don't disable dynamo properly. This only affects certain cases of CPU tensors' backwards, where the backward is being done in Python land and dynamo unnecessarily tries to trace through the inductor-generated code. It also only matters if the backward is accessed outside of dynamo itself (say, in a graph break in eager mode), since dynamo already disables the forward function properly.

```
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] TorchDynamo attempted to trace the following frames: [
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] ]

```

This PR fixes the issue and adds a unit test showing that with or without cache hit, the frames dynamo is tracing is identical.

Fixes https://github.com/pytorch/pytorch/issues/154536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155251
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2025-06-09 02:06:16 +00:00
2908c10259 Document the default garbage_collection_threshold value and improve the organization of cuda docs (#155341)
Fixes #150917

As mentioned in the issue, I've updated the documentation of `garbage_collection_threshold` and improved the organization.
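
Since this entry documents an allocator knob, here is a hedged usage sketch. The `PYTORCH_CUDA_ALLOC_CONF` mechanism is the existing way to set this option; the specific threshold value below is only an example, not the documented default:

```python
import os

# Must be set before the CUDA caching allocator is initialized,
# i.e. before the first CUDA allocation in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "garbage_collection_threshold:0.8"

import torch

if torch.cuda.is_available():
    x = torch.randn(1024, device="cuda")  # allocator reads the setting lazily here
    print(torch.cuda.memory_stats().get("num_alloc_retries"))
```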

Could you please review?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155341
Approved by: https://github.com/AlannaBurke, https://github.com/ngimel
2025-06-08 22:09:35 +00:00
d41f62b7a0 Fix/issue #155027 (#155252)
Fixes #155027
Converted RST files to Markdown

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155252
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-08 21:17:31 +00:00
3d82a1dfb5 Add checks for empty tensor list (#155383)
Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b

Even though a check for an empty tensor list exists in `at::cat`, a crash might happen while resolving a named dimension to a position, by calling `dimname_to_position(tensors[0], dim)`; see the backtrace below
```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
   556 	  bool has_names() const {
   557 	    // If a user is using unnamed tensors, then we can short-circuit right here.
   558 	    // Otherwise, impl::has_names attempts to retrieve names.
-> 559 	    if (!impl_->has_named_tensor_meta()) {
   560 	      return false;
   561 	    }
   562 	    return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
   20  	int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
   21  	  TORCH_CHECK(dim.type() != NameType::WILDCARD,
   22  	      "Please look up dimensions by name, got: name = None.");
-> 23  	  TORCH_CHECK(tensor.has_names(),
   24  	      "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
   25  	  const auto names = tensor.names();
   26
```

TODOs:
 - Maybe move the test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
 - Replace `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests

Fixes https://github.com/pytorch/pytorch/issues/155306
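
A hedged repro sketch distilled from the backtrace above (the exact trigger lives in the linked issue); after this PR the call should raise a readable error instead of crashing:

```python
import torch

try:
    # An empty TensorList plus a named dim forces dimname_to_position(tensors[0], dim).
    torch.cat([], dim="N")
except (RuntimeError, ValueError, IndexError) as e:
    print("caught:", e)
```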
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155383
Approved by: https://github.com/cyyever, https://github.com/ezyang
ghstack dependencies: #155382
2025-06-08 18:53:19 +00:00
95448b2ce6 Revert "[Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)"
This reverts commit 65b1aedd09e98fcafcdd893ca4924f4fa598fd18.

Reverted https://github.com/pytorch/pytorch/pull/154371 on behalf of https://github.com/clee2000 due to see henry's comment above.  This was reverted internally because it causes a memory leak and OOMs on AMD? ([comment](https://github.com/pytorch/pytorch/pull/154371#issuecomment-2954192879))
2025-06-08 17:37:29 +00:00
30293b8b5e Preserve Enum types during torch.export serialization and deserialization (#154821)
Fixes #154674

Addresses an issue where `torch.export` does not correctly preserve Python `Enum` types during the save/load round-trip. Previously, Enum inputs were serialized by value only, causing their type to be lost after deserialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154821
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007, https://github.com/yushangdi, https://github.com/angelayi
2025-06-08 17:30:31 +00:00
27df0c56b7 Revert "[inductor] use int64 for large index (#154575)"
This reverts commit 2596e3d0617852469241be8777cf46db5c83928c.

Reverted https://github.com/pytorch/pytorch/pull/154575 on behalf of https://github.com/clee2000 due to broke inductor/test_op_dtype_prop.py::TestCaseCUDA::test_op_dtype_propagation_add_cuda_int32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15510656657/job/43673763835) [HUD commit link](2596e3d061), note for self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/154575#issuecomment-2954175761))
2025-06-08 16:58:59 +00:00
49888e6be0 [BE] Polish Makefile (#155425)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155425
Approved by: https://github.com/ezyang
2025-06-08 16:37:12 +00:00
b981fb6744 Add docblock to torch/_dynamo/variables/builtin.py (#155402)
Add comprehensive module docstring explaining built-in function and type
variable tracking, including handling of Python built-ins, type constructors,
operators, and special constructs during symbolic execution.

Originally generated by claude but reviewed and edited by me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155402
Approved by: https://github.com/Skylion007
ghstack dependencies: #155403
2025-06-08 15:24:29 +00:00
09328eb02f Update auto-tuning support for _scaled_grouped_mm (#150944)
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel
2025-06-08 10:18:13 +00:00
1339e88105 Add docblock to torch/_dynamo/side_effects.py (#155403)
Add comprehensive module docstring explaining side effect tracking and
management, including mutation tracking, context changes, aliasing,
and state preservation during symbolic execution.

Originally generated by claude but reviewed and edited by me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155403
Approved by: https://github.com/williamwen42
2025-06-08 07:02:30 +00:00
0756ebcd48 Add docblock to torch/_dynamo/trace_rules.py (#155401)
Add comprehensive module docstring explaining the tracing rules and policies
that govern TorchDynamo's compilation decisions, including skip rules,
inlining policies, and library-specific handling.

Originally generated by claude but reviewed and edited by me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155401
Approved by: https://github.com/williamwen42
2025-06-08 04:30:03 +00:00
abf4da0d24 [Profiler] Induce Inductor Import before Profiling (#155243)
Fixes #151829
Summary:
Currently, inductor has a lazy init which causes certain aten ops to run during a profiling run. This ends up cluttering the function events, especially for smaller traces. One attempt to fix this was to simply remove that import from the profiler entirely, but the import happens somewhere downstream anyway and the events still flood our profile.

To fix this, we induce the inductor import during prepare-trace if inductor is present. This way, regardless of how the workload imports inductor, the actual init process is done before tracing starts, resulting in more accurate tracing.
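
A hedged illustration of the scenario (profiling the first post-compile run); with this change, inductor's lazy init is pulled into prepare-trace, so it should no longer show up as stray aten events in the table:

```python
import torch
from torch.profiler import profile, ProfilerActivity

@torch.compile
def f(x):
    return x.mul_(2)

x = torch.ones(4)
f(x)  # compile outside the profiled region

with profile(activities=[ProfilerActivity.CPU]) as prof:
    f(x)  # first profiled run; previously lazy-init ops could appear here
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```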

Test Plan:
Added test, also ran N7316820 manually and went from getting many events on the first run to the following output (the only difference is the Runtime Triggered Module Loading entries, which are CUPTI overhead events):

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
                                             aten::mul_         1.40%     340.638us        99.92%      24.390ms      24.390ms       1.535us       100.00%       4.605us       4.605us             1
                                       cudaLaunchKernel         0.60%     146.533us        98.52%      24.049ms      24.049ms       0.000us         0.00%       3.070us       3.070us             1
                       Runtime Triggered Module Loading         6.14%       1.500ms         6.14%       1.500ms       1.500ms       1.535us       100.00%       1.535us       1.535us             1
                       Runtime Triggered Module Loading        91.78%      22.403ms        91.78%      22.403ms      22.403ms       1.535us       100.00%       1.535us       1.535us             1
                       void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.535us       100.00%       1.535us       1.535us             1
                        cudaDeviceSynchronize         0.08%      20.031us         0.08%      20.031us      20.031us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
                                   aten::mul_        82.81%     484.396us        94.26%     551.378us     551.378us       1.440us       100.00%       1.440us       1.440us             1
                                   cudaLaunchKernel        11.45%      66.982us        11.45%      66.982us      66.982us       0.000us         0.00%       0.000us       0.000us             1
                                  void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.440us       100.00%       1.440us       1.440us             1
                                  cudaDeviceSynchronize         5.74%      33.581us         5.74%      33.581us      33.581us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Rollback Plan:

Differential Revision: D76056511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155243
Approved by: https://github.com/ngimel
2025-06-07 23:58:50 +00:00
f1f49e56b0 [CI] remove xfail sm89 job (#155244)
No need to collect more data
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155244
Approved by: https://github.com/janeyx99, https://github.com/huydhn, https://github.com/Skylion007
2025-06-07 21:04:57 +00:00
11bc29856d Fix some incorrect reST markups in the document (#154831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154831
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-07 19:09:46 +00:00
2596e3d061 [inductor] use int64 for large index (#154575)
Split reduction may need to add an extra mask to avoid invalid indices. Previously we always used the torch.int32 dtype, which causes problems when the tensor numel exceeds 2^31.

Fix https://github.com/pytorch/pytorch/issues/154168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154575
Approved by: https://github.com/ngimel, https://github.com/jansel
2025-06-07 18:41:46 +00:00
f6e18bc105 Fix CUDA 12.8 docker tag (#155087)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155087
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2025-06-07 16:39:42 +00:00
783a4c1f50 [ROCm] fix nightly wheel, second attempt (#155388)
Fixes #155207. The hipsparselt logic was still broken, but the smoke test didn't catch it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155388
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-07 15:57:55 +00:00
ab56e5add9 [CUDA][BUILD] Add back the capability to use env TORCH_CUDA_ARCH_LIST (#155314)
Add back the capability to use env TORCH_CUDA_ARCH_LIST to control how downstream projects (which uses find_package (torch)) build.

Follow up to: https://github.com/pytorch/pytorch/pull/152715

Before this PR,
On a CPU-only machine, building a downstream project would ignore the TORCH_CUDA_ARCH_LIST setting (if set) and go straight to the auto GPU-detection mode, in which case no GPU would be detected and an excessive list of CUDA architectures may be used. This also means that there is no way to build a binary targeting a different SM than that of the machine a developer is currently using.

After this PR,
TORCH_CUDA_ARCH_LIST is effective for developers to control explicitly which SM architectures to build.

p.s. I think this PR might have been the original intent of https://github.com/pytorch/pytorch/pull/152715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155314
Approved by: https://github.com/janeyx99, https://github.com/eqy, https://github.com/atalman
2025-06-07 15:52:39 +00:00
456f40cb09 Add docblock for autotune_cache.py (#155133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155133
Approved by: https://github.com/aorenste
2025-06-07 14:50:09 +00:00
29e6033ff3 [Break XPU] Fix failed test cases which are introduced by community for XPU. (#155317)
Fixes #155186, Fixes #154701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155317
Approved by: https://github.com/jansel
2025-06-07 14:46:30 +00:00
694028f502 update get_default_device to also respect torch.device ctx manager (#148621)
Fixes https://github.com/pytorch/pytorch/issues/131328
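
A short sketch of the aligned behavior (device names here are illustrative; "meta" is used only to avoid needing a GPU):

```python
import torch

print(torch.get_default_device())      # typically cpu

with torch.device("meta"):
    # tensors created here go to "meta"; get_default_device now agrees
    print(torch.get_default_device())  # meta (after this PR)
    print(torch.empty(2).device)       # meta

torch.set_default_device("cpu")
print(torch.get_default_device())      # cpu
```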
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148621
Approved by: https://github.com/ezyang
2025-06-07 14:26:17 +00:00
db491825e0 [invoke_subgraph] Add logging (#155284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155284
Approved by: https://github.com/zou3519
ghstack dependencies: #155270
2025-06-07 11:31:53 +00:00
0f3f59784d [invoke_subgraph] Throw assertion on uncaptured speculate_subgraph (#155270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155270
Approved by: https://github.com/zou3519
2025-06-07 11:31:53 +00:00
c1f531f0b0 [Graph Partition] move cpu scalar tensor to gpu (#154464)
cudagraph does not support cpu tensors. In this PR, we update the graph by explicitly moving cpu tensors to gpu when profitable, relying on graph partition to split off this data copy, and cudagraphifying the remaining gpu ops.

This PR unblocked the graph partition + cudagraph on speech_transformer, leading to 39.5% speedup on inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Close: #119241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154464
Approved by: https://github.com/eellison
2025-06-07 06:59:39 +00:00
386aa72003 [BE] Cleanup old ExecuTorch codegen and runtime code (#154165)
Summary: These files are added to pytorch/pytorch before ExecuTorch is
opensourced. Now is a good time to remove it from pytorch/pytorch, since
the code is moved to pytorch/executorch already.

Test Plan: Rely on CI jobs.

Differential Revision: [D75985423](https://our.internmc.facebook.com/intern/diff/D75985423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154165
Approved by: https://github.com/kimishpatel, https://github.com/Skylion007, https://github.com/cyyever
2025-06-07 06:54:12 +00:00
da1f8980df [nativert] move function schema to torch (#154948)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D75826905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154948
Approved by: https://github.com/zhxchen17
2025-06-07 05:45:30 +00:00
5fbaa041e7 SDPA support gfx950 (#155103)
Summary: Seems to run, just not with optimal performance; e.g. ck_tile doesn't seem to have those gfx942 optimizations: https://github.com/ROCm/composable_kernel/blob/develop/include/ck_tile/ops/fmha/block/variants.hpp#L27

Test Plan:
```
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           179.737 |             182.874 |          3106.6  |          359.662 |              205.506 |        382.334 |          375.776 |       22.1205 |       191.067 |           334.392 |                17.2841 |                 16.9877 |               8.63754 |                   15.1169 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |           617.271 |             623.38  |          7169.73 |          998.961 |              654.534 |        445.312 |          440.947 |       38.3387 |       275.164 |           419.96  |                11.6152 |                 11.5014 |               7.17719 |                   10.9539 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |           667.032 |             670.118 |         13031.8  |         1383.42  |              768.452 |        412.091 |          410.193 |       21.0928 |       198.694 |           357.703 |                19.5371 |                 19.4471 |               9.42    |                   16.9586 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          2074.64  |            2214.81  |         29186.9  |         3916.35  |             2404.29  |        529.978 |          496.437 |       37.6714 |       280.749 |           457.313 |                14.0684 |                 13.1781 |               7.45257 |                   12.1395 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          2456.6   |            2472.38  |         51095.8  |         5647.01  |             3008.09  |        447.574 |          444.718 |       21.5186 |       194.707 |           365.518 |                20.7994 |                 20.6666 |               9.0483  |                   16.9861 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |          8048.8   |            8070.96  |        113478    |        15580.8   |             9768.71  |        546.423 |          544.922 |       38.7569 |       282.274 |           450.218 |                14.0987 |                 14.06   |               7.2832  |                   11.6165 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           692.323 |             697.649 |          4241.81 |          1562.26 |              906.441 |        248.148 |          246.254 |       40.5012 |      109.968  |           189.531 |                6.12693 |                 6.08015 |               2.71518 |                   4.67963 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |          2263.22  |            2267.38  |          9482.64 |          7003.8  |             2765.5   |        303.636 |          303.079 |       72.4687 |       98.1174 |           248.489 |                4.1899  |                 4.18221 |               1.35393 |                   3.42891 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |          2553.94  |            2572.68  |         15909.8  |          5697.16 |             3284.77  |        269.073 |          267.112 |       43.193  |      120.621  |           209.206 |                6.22953 |                 6.18415 |               2.79259 |                   4.84352 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          8187.67  |            8201.71  |         35449.2  |         26424.3  |            10364.5   |        335.722 |          335.147 |       77.5413 |      104.025  |           265.21  |                4.32959 |                 4.32218 |               1.34154 |                   3.42025 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          9948.15  |            9815.47  |         62815.1  |         23741.9  |            12710     |        276.31  |          280.046 |       43.7598 |      115.778  |           216.269 |                6.31425 |                 6.39961 |               2.64575 |                   4.94217 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |         32187.6   |           32035.6   |        137832    |        102075    |            40623.4   |        341.595 |          343.216 |       79.7716 |      107.716  |           270.66  |                4.28216 |                 4.30248 |               1.35031 |                   3.39293 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+

```

Rollback Plan:

Differential Revision: D75934358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155103
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2025-06-07 03:47:29 +00:00
30387ab2e4 [ROCm] Adds initialization support for PyTorch when built from ROCm wheels. (#155285)
AMD is beginning to roll out ROCm distribution via Python wheels. This patch adds the `__init__.py` hook that is necessary to bootstrap ROCm correctly on Linux and Windows when built from these wheels.

See draft, developer documentation describing the mechanism here: https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md

This operates to similar effect as how Torch can depend on CUDA wheels, with some differences:

* ROCm libraries and checks are delegated to helpers in the `rocm_sdk` module, which knows how to find and configure access to the installed libraries. This limits the amount of plumbing and path machinations that must match up between the framework and ROCm.
* When building torch against ROCm, no ROCm system install is needed: instead the proper SDK development wheel is installed and the `CMAKE_PREFIX_PATH` is obtained via `rocm-sdk path --cmake`.
* It is expected that whoever produces such a build will also place a generated `_rocm_init.py` in the `torch` module with initialization logic to preload libraries, check versions, verify GPU compatibility, etc.
* See [build_prod_wheels.py](https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py) for an example build script that is being used to generate nightlies in this configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155285
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-07 02:59:03 +00:00
f140fac8dc [MPS] Implement erfc (#155382)
And migrate `erf` to Metal kernel

Use `erf` approximations from https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/erf.h as previous approximation did not match the CPU implementation

After that, `erfc(x) := 1.0 - erf(x)`
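
A quick numerical check of that identity (shown on CPU; the Metal kernel on MPS relies on the same relation):

```python
import torch

x = torch.linspace(-3, 3, steps=7)
# erfc(x) should match 1 - erf(x) up to floating-point tolerance
assert torch.allclose(torch.special.erfc(x), 1.0 - torch.erf(x), atol=1e-6)
```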

Fixes https://github.com/pytorch/pytorch/issues/155337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155382
Approved by: https://github.com/manuelcandales, https://github.com/dcci
2025-06-07 02:35:12 +00:00
400f439670 [pt][easy] Rename metadata column (#155365)
Summary: Fixing typo: our logging requires autotuning_data instead of autotune_data, making it consistent

Test Plan:
Run benchmark, observe in perfetto trace proper name

Rollback Plan:

Differential Revision: D76159393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155365
Approved by: https://github.com/masnesral, https://github.com/Skylion007
2025-06-07 02:25:55 +00:00
81b0b308ca [dynamo] constant fold torch.cuda.is_initialized (#155300)
Fixes https://github.com/pytorch/pytorch/issues/129659
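
A hedged sketch of what this enables: since the value is constant-folded at trace time, the branch below compiles without a graph break (fullgraph=True is only there to make that visible):

```python
import torch

@torch.compile(fullgraph=True)  # would raise if this still graph-broke
def f(x):
    if torch.cuda.is_initialized():  # folded to a Python constant during tracing
        return x + 1
    return x - 1

print(f(torch.ones(2)))
```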

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155300
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-07 02:21:11 +00:00
10cd1de518 [ROCm] Make optional features in LoadHIP better conditioned. (#155305)
* The `rocm-core` CMake package only started appearing in ROCm 6.4, so rework the version probing to work if it is not present. Also collapses the unneeded operating system conditioning in favor of feature probing.
* Make `hipsparselt` optional: it only started appearing in ROCm 6.4 and it is not in all recent distribution channels yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155305
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-07 02:20:55 +00:00
5596cefba6 Fix segfault during NumPy string tensor conversion (#155364)
By checking the dtype first, and adding an element_size check as well

Fixes https://github.com/pytorch/pytorch/issues/155328
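
A hedged repro sketch (the precise trigger is in the linked issue); the point is that converting a NumPy string array should raise a clear error rather than segfault:

```python
import numpy as np
import torch

try:
    torch.from_numpy(np.array(["a", "bc"]))  # string dtype is not convertible
except (TypeError, ValueError, RuntimeError) as e:
    print("caught:", e)
```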

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155364
Approved by: https://github.com/Skylion007
2025-06-07 01:55:00 +00:00
be2e43264d [CI]Update windows runner to windows-2022 (#154368)
As per the info in actions/runner-images#12045, we need to change the Windows runner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154368
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-06-07 01:39:19 +00:00
83d22256f8 [BE][Ez]: Improve typing in torch._logging (#155345)
Add a few missing return type annotations in torch._logging and use ruff to infer the obvious ones.
LazyStr now properly checks the return type of the Callable and the args and kwargs passed to it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155345
Approved by: https://github.com/ezyang
2025-06-07 00:04:39 +00:00
9b4db093cb Add C shim for at::pad and fix some typos (#155226)
As stated, we would like a pad shim to support custom ops wanting to build in an ABI stable manner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155226
Approved by: https://github.com/desertfire
2025-06-06 23:08:39 +00:00
cd82096973 DOC: Convert to markdown: ddp_comm_hooks.rst, debugging_environment_variables.rst, deploy.rst, deterministic.rst, distributed.algorithms.join.rst (#155298)
Fixes #155017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155298
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-06 22:44:50 +00:00
457dd79927 [BE][Ez]: Remove unnecessary accesses of dim vector (#155334)
It's better because you return less data, encapsulate more, and no longer need special handling of the symbolic vs. non-symbolic dim vector. It also removes a few casts and fixes a few potential edge cases relating to unsigned comparisons.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155334
Approved by: https://github.com/ezyang
2025-06-06 21:28:25 +00:00
c95705dac2 [Docs] Convert to markdown: torch.compiler_troubleshooting_old.rst, torch.compiler.rst (#155348)
Part of changes #155040 (parent PR #155120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155348
Approved by: https://github.com/svekars
2025-06-06 21:26:24 +00:00
d2a2bfcb58 Turn on new tiling by default (#154768)
Turning it on in fbcode is to come. This also updates `max_tiles` to have a default value of None. The existing tiling logic doesn't really handle max_tiles=3 well, but the new tiling logic does, so we default to 3 in the new logic and 2 elsewhere unless max_tiles has been explicitly set.

TB runners have been very unstable recently (do we need to bump batch size ?) but e.g. for a [recent torchbench](https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Tue,%2027%20May%202025%2015:38:26%20GMT&stopTime=Tue,%2003%20Jun%202025%2015:38:26%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/eellison/803/head&lCommit=8480c220db4eb3c9e2b58d85a698d0a7113a6e37&rBranch=main&rCommit=0cd18ba1ca35d87916723d445c06664615dcae12) inference run we had 15 models with a lower execution time (i.g. green) and 2 models with higher (i.e.. red)

I am doing another run and will update here.

Dynamic shapes are not yet turned on because there are a lot of fixes to be done in splitting that don't work yet. See:
```
(Pdb) p expr
((s25*s85)//32)
(Pdb) p FloorDiv(expr, expr)
((s25*s85)//(32*(((s25*s85)//32))))
```

and also, an unbacked shape is not recognized as a multiple of itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154768
Approved by: https://github.com/jansel
2025-06-06 21:19:35 +00:00
bc5a11b581 [easy][invoke_subgraph] Remove skip from already fixed test (#155286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155286
Approved by: https://github.com/zou3519
2025-06-06 21:16:22 +00:00
0d8c029584 [FSDP2] keep root unsharded when not specifying reshard_after_forward (#155319)
For `fully_shard(model)` without explicitly setting `reshard_after_forward=True/False`, we keep the root unsharded. When the user explicitly sets `reshard_after_forward`, we respect it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155319
Approved by: https://github.com/mori360
2025-06-06 20:29:31 +00:00
4f5b34427b DOC: Convert to markdown: torch.overrides.rst, type_info.rst, utils.rst, xpu.rst (#155088)
Fixes #155041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155088
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-06 20:16:13 +00:00
067fd0b3ab [dynamo][cleanup] Simplify disabling of the helper functions on tensor properties (#155259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155259
Approved by: https://github.com/zhxchen17
2025-06-06 19:44:40 +00:00
749757ac1b [a2av] Align length of major dimension in output of 2D a2av (#155172)
Downstream consumer of the 2D all-to-all-v is often a group GEMM.
Today the GEMM often has an alignment requirement on the chunk sizes within the grouped sequence, where each chunk carries the tokens headed for an expert. For example, `torch._group_mm` requires an alignment of 8.

This PR adds that alignment capability, when user passes in a `major_align` argument, so that no extra padding step is needed.

The key to supporting that is making the output offsets aligned to that value. (Output offsets are returned to users in the 3rd row of `in_out_splits`, on device. The 2nd row, the output splits, is unaffected by this alignment value -- i.e. it still reflects the true number of tokens for each expert.)

The algorithm is as follows.

![502413288_678786854922438_530852083153996358_n](https://github.com/user-attachments/assets/557624a3-150e-4ab6-ba8b-1dbaa5ac01ac)

In the detailed implementation, we use a warp scan to calculate the prefix sum over the "block" illustrated above. As a result, the "block" size, i.e. `npes`, is currently limited to the warp size of 32.
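
An illustrative Python model of that offset alignment (the real kernel computes this with a warp-level prefix scan on device; the helper below is purely expository and its name is not the PR's API):

```python
def aligned_output_offsets(output_splits, major_align):
    """Output offsets are rounded up so each expert's chunk starts on a
    major_align boundary; the splits themselves stay the true token counts."""
    offsets, cursor = [], 0
    for split in output_splits:
        offsets.append(cursor)
        padded = -(-split // major_align) * major_align  # ceil to a multiple
        cursor += padded
    return offsets

print(aligned_output_offsets([5, 12, 3], major_align=8))  # [0, 8, 24]
```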

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155172
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677, #155058
2025-06-06 19:39:44 +00:00
1ccc57e428 Log backward no-op to tlparse and pt2 compile events. (#154544)
Summary: Log backward no-op to tlparse and pt2 compile events.

Test Plan:
$ rm -rf /tmp/r && TORCH_TRACE=/tmp/r buck2 run //scripts/jovian:backward_noop_repro_compile

Used print statements to verify we enter the logging code region.

Differential Revision: D75231665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154544
Approved by: https://github.com/c00w
2025-06-06 18:08:19 +00:00
2e2ea7290a [Inductor] Support autotuning in the FX backend. (#155049)
# Feature
If `config.triton.autotune_at_compile_time` is set to `True`, autotune Triton kernels during FX conversion. Else, stick with the existing behavior of using the first precompiled config.
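
A hedged sketch of toggling the flag named above (whether the FX backend is actually engaged depends on the rest of the configuration; this only shows where the switch lives):

```python
import torch
import torch._inductor.config as inductor_config

# Autotune Triton kernels at compile time instead of on first run.
inductor_config.triton.autotune_at_compile_time = True

@torch.compile
def f(x):
    return (x @ x).relu()

f(torch.randn(64, 64))
```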

# Test plan
Added CI tests verifying that the tuner is called iff this flag is set, with and without dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155049
Approved by: https://github.com/jansel
2025-06-06 17:44:14 +00:00
453bc9fbdf [a2av] 2D all-to-all-vdev (#155058)
A 2D AllToAllv shuffle is illustrated below:
(`world_size` = 2, `ne` = 2, where `ne` is number of experts per rank)
```
        Source: |       Rank 0      |       Rank 1      |
                | c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 |

        Dest  : |       Rank 0      |       Rank 1      |
                | c0 | d0 | c1 | d1 | c2 | d2 | c3 | d3 |
```
where each `c_i` / `d_i` is a slice of the `input` tensor targeting expert `i`, with its length indicated by the input splits (in `in_out_splits[0]`).

That is, the 2D AllToAllv shuffle achieves a transpose from rank-major order at input to expert-major order at output.
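
A tiny single-process model of that transpose (indexing only, no communication; it simply reproduces the diagram above):

```python
def shuffle_2d(inputs, world_size, ne):
    """inputs[r][e] is the chunk rank r sends to global expert e."""
    out = []
    for dst_rank in range(world_size):
        row = []
        for e_local in range(ne):
            expert = dst_rank * ne + e_local
            for src_rank in range(world_size):
                row.append(inputs[src_rank][expert])  # expert-major at the destination
        out.append(row)
    return out

src = [["c0", "c1", "c2", "c3"], ["d0", "d1", "d2", "d3"]]
print(shuffle_2d(src, world_size=2, ne=2))
# [['c0', 'd0', 'c1', 'd1'], ['c2', 'd2', 'c3', 'd3']]  -- matches the diagram
```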

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155058
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677
2025-06-06 17:35:39 +00:00
64436c38c9 [logs] Add autotuning data (#154771)
Summary: Add autotuning logging data to scuba/chrome trace.

Test Plan:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 tlp buck run //scripts/sashko:compilation_sample
```

Open https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/viewer?local_cache_key=00000000-0000-0000-92db-f23383ebf5b5, search for template_autotuning, see in metadata strides (see screenshot)

Differential Revision: D75457770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154771
Approved by: https://github.com/masnesral, https://github.com/PaulZhang12
2025-06-06 17:12:55 +00:00
706bc41c4c pass mempool arg through emptyCache (#155315)
Fixing typo in a previous PR #154746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155315
Approved by: https://github.com/Skylion007
2025-06-06 16:14:26 +00:00
7ae7c14143 Reduce scope of s390x CI (#155208)
The purpose of this change is to reduce the scope of s390x CI so that it does not potentially block the usual workflows for other users,
while still keeping nightly builds and tests for me to look at.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155208
Approved by: https://github.com/malfet
2025-06-06 16:07:34 +00:00
fc77269262 Add randint_like tensor overload for high (#154899)
Fixes #135664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154899
Approved by: https://github.com/StrongerXi
2025-06-06 15:48:00 +00:00
7e4c097b07 Revert "[inductor] Add typing to _inductor/ir.py (#149958)"
This reverts commit 529e0357c6c4e74f8cd32c29198c5f1c9f6e329d.

Reverted https://github.com/pytorch/pytorch/pull/149958 on behalf of https://github.com/malfet due to Looks like it broke inductor_torchbind tests, due to more graphbreaks, see b0fbbef136/1 ([comment](https://github.com/pytorch/pytorch/pull/149958#issuecomment-2949583209))
2025-06-06 15:19:16 +00:00
b0fbbef136 Revert "Turn on new tiling by default (#154768)"
This reverts commit 7dcc77e422dcf97ce35991a138ab635a5cb88731.

Reverted https://github.com/pytorch/pytorch/pull/154768 on behalf of https://github.com/malfet due to Looks like it broke inductor CPU, see 231eb9902b/1 ([comment](https://github.com/pytorch/pytorch/pull/154768#issuecomment-2949468396))
2025-06-06 14:40:03 +00:00
231eb9902b [MPS][BE] Extend ndim_and_dtypes to 4 elements (#155272)
Metal arguments must be 8-byte aligned (or possibly 16-byte), so running
any strided (or typecast) binary op with MTL_DEBUG_LAYER leads to an
exception
```
% MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
2025-06-05 15:41:34.201 Python[86653:16826825] Metal API Validation Enabled
test_output_match_add_mps_bfloat16 (__main__.TestConsistencyMPS.test_output_match_add_mps_bfloat16) ...
validateComputeFunctionArguments:1083: failed assertion `Compute Function(add_strided_bfloat_bfloat): argument ndim[0] from buffer(7) with offset(0) and length(12) has space for 12 bytes, but argument has a length(16).'
zsh: abort      MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
```

Extend it to 4 elements and pass output dtype, which will be used by
binary_op later on anyway

Test plan: Run abovementioned command with `MTL_DEBUG_LAYER=1` and make
sure everything passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155272
Approved by: https://github.com/angelayi, https://github.com/dcci, https://github.com/cyyever
2025-06-06 14:20:21 +00:00
529e0357c6 [inductor] Add typing to _inductor/ir.py (#149958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958
Approved by: https://github.com/Skylion007
2025-06-06 14:15:01 +00:00
348fd45065 Support detached checkout in tools/nightly.py (#154314)
Prompt for Sonnet 3.7 in Claude Code: Only inspect tools/nightly.py, all
other files are irrelevant to your task. Do not use any shell commands.
Task: Add a --detach argument to this script which instead of making a
new branch just directly checks out the correct commit in detached mode.

With two interventions:
- Branch and detach are mutually exclusive. So you should consolidate
  them into a single argument. Why don't we take over the 'None' option?
- Do you know that nightly_version is guaranteed to be a commit hash? It
  seems it would be safer to explicitly pass --detach

I tested by running `python tools/nightly.py checkout` and observing
that my worktree was detached at this point.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154314
Approved by: https://github.com/XuehaiPan, https://github.com/malfet
2025-06-06 13:28:29 +00:00
907aea032d Add claude local md files (#155299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155299
Approved by: https://github.com/ezyang
2025-06-06 13:28:26 +00:00
6b1211df29 [BE]: Backport runtime_checkable perf improvements/behavior from 3.12 (#155130)
Backports some runtime_checkable behavior changes and performance improvements from 3.12 to older versions of Python. This should be a free performance improvement for type-checking protocols, since everything already works on Python 3.12.

The difference between the two versions of runtime_checkable is [these lines](40e22ebb2c/src/typing_extensions.py (L800-L823)).
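
For context, a small refresher on what the decorator does (plain stdlib typing here; the backport only changes how typing_extensions implements it on older Pythons):

```python
from typing import Protocol, runtime_checkable

import torch


@runtime_checkable
class SupportsDetach(Protocol):
    def detach(self): ...


print(isinstance(torch.ones(2), SupportsDetach))  # True: structural isinstance check
print(isinstance(object(), SupportsDetach))       # False
```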

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155130
Approved by: https://github.com/rec, https://github.com/aorenste
2025-06-06 13:28:05 +00:00
10cef1e25d Remove torch XPU ABI=0 build logic for old compiler (#150095)
# Motivation
Following https://github.com/pytorch/pytorch/pull/149888, this PR removes the ABI=0 build logic for the PyTorch XPU build with old compilers (< 2025.0). For newer compilers (>= 2025.0), the ABI is neutral by default without requiring additional compilation options (`-fpreview-breaking-changes`).

# Additional Context
This PR depends on XPU CI pass, which will be fixed by  https://github.com/pytorch/pytorch/pull/149843 and https://github.com/intel/torch-xpu-ops/pull/1515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150095
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-06-06 13:13:19 +00:00
58e5d20c57 [BE] Delete IS_SPMM_AVAILABLE() logic (#155296)
As it's been available on all currently supported platforms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155296
Approved by: https://github.com/clee2000
2025-06-06 13:12:35 +00:00
271ca679a8 [reland][dynamo] Record the pre-graph bytecode using fast record function event (#154974)
reland of https://github.com/pytorch/pytorch/pull/154769

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154974
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
2025-06-06 13:11:03 +00:00
9656251bb1 Revert "[BE] Update cudnn to 9.10.1.4 (#155122)"
This reverts commit a14f427db68e54500ef4cd9ed34cb9537263bb74.

Reverted https://github.com/pytorch/pytorch/pull/155122 on behalf of https://github.com/malfet due to Looks like it breaks a bunch of tests, see 36a722e20d/1 ([comment](https://github.com/pytorch/pytorch/pull/155122#issuecomment-2949209801))
2025-06-06 13:03:49 +00:00
36a722e20d [typo] Fix 'intialize' -> 'initialize' in proxy_tensor.py (#155301)
## Description
Fixes a typo in the comment of `torch/fx/experimental/proxy_tensor.py`, changing "intialize" to "initialize".

## Issue
None

## Type of change
- [x] Typo fix

## Checklist
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] My changes generate no new warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155301
Approved by: https://github.com/jingsh, https://github.com/ezyang, https://github.com/cyyever
2025-06-06 10:43:44 +00:00
9d59b516e9 Make device check throw specific error (#155085)
Fixes #122757

The fix is lost after revert and rebase previous PR https://github.com/pytorch/pytorch/pull/150750 (only change of tests are merged).

## Test Result

```python
>>> import torch
>>>
>>> model_output = torch.randn(10, 5).cuda()
>>> labels = torch.randint(0, 5, (10,)).cuda()
>>> weights = torch.randn(5)
>>>
>>> loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
>>> loss = loss_fn(input=model_output, target=labels)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
    return F.cross_entropy(
           ^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3476, in cross_entropy
    return torch._C._nn.cross_entropy_loss(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155085
Approved by: https://github.com/mikaylagawarecki
2025-06-06 07:00:04 +00:00
07da8a469b [CI] fix xpu-smi hang issue on some xpu runners (#155194)
Works around the xpu-smi hang issue on some XPU runners; refer to https://github.com/pytorch/pytorch/actions/runs/15431583674/job/43431289026?pr=154962
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155194
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-06-06 06:51:26 +00:00
e694280d12 Custom FX pass for inductor's backend registration (#154841)
This PR is related to RFC #153532. It is an extension of Inductor's backend registration interface that allows the backend to register custom FX passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-06 06:49:44 +00:00
c6b4f98625 Add Intel GPU info collection to the collect env script (#137846)
As title, add Intel GPU info collection to the collect env script

Output examples:
1. CPU on Windows
```
C:\Users\user\miniforge3\envs\py310\lib\site-packages\torch\_subclasses\functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise (10.0.22631 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Name: 12th Gen Intel(R) Core(TM) i7-1270P
Manufacturer: GenuineIntel
Family: 198
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 1711
MaxClockSpeed: 2200
L2CacheSize: 9216
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] torch                     2.8.0.dev20250528+cpu          pypi_0    pypi
```

2. XPU on Windows
```
Collecting environment information...
PyTorch version: 2.8.0a0+gitef6306e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro (10.0.19045 64-bit)
GCC version: (GCC) 13.1.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* 32.0.101.6795 (20250520000000.******+***)
Intel GPU models onboard:
* Intel(R) Arc(TM) A770 Graphics
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.33184', total_memory=15915MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2401
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU1
CurrentClockSpeed: 2200
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.8.10+gitb3ea3a1
[pip3] numpy==2.1.2
[pip3] optree==0.13.1
[pip3] pytorch-triton-xpu==3.3.1+gitb0e26b73
[pip3] torch==2.8.0a0+gitef6306e
[conda] intel-extension-for-pytorch 2.8.10+gitb3ea3a1          pypi_0    pypi
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] mkl-dpcpp                 2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-datafitting   2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-stats         2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-vm            2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.1+gitb0e26b73          pypi_0    pypi
[conda] torch                     2.8.0a0+gitef6306e          pypi_0    pypi
```

3. CPU on Linux
```
/opt/python/cp312-cp312/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: AlmaLinux 8.10 (Cerulean Leopard) (x86_64)
GCC version: (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.28

Python version: 3.12.10 (main, Apr 19 2025, 05:03:56) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              88
On-line CPU(s) list: 0-87
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz
Stepping:            7
CPU MHz:             1000.000
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-21,44-65
NUMA node1 CPU(s):   22-43,66-87
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] Could not collect
```

4. XPU on Linux
```
Collecting environment information...
PyTorch version: 2.8.0.dev20250516+xpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.50-051550-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* intel_opencl: 24.39.31294.21-1032~22.04
* level_zero:   1.17.44.0-1022~22.04
Intel GPU models onboard:
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [1] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [2] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [3] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [4] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [5] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [6] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [7] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          224
On-line CPU(s) list:             0-223
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8480+
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              56
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       5.3 MiB (112 instances)
L1i cache:                       3.5 MiB (112 instances)
L2 cache:                        224 MiB (112 instances)
L3 cache:                        210 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-55,112-167
NUMA node1 CPU(s):               56-111,168-223
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] pytorch-triton-xpu==3.3.0+git0bcc8265
[pip3] torch==2.8.0.dev20250516+xpu
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] numpy                     2.2.5                    pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.0+git0bcc8265          pypi_0    pypi
[conda] torch                     2.8.0.dev20250516+xpu          pypi_0    pypi
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137846
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-06 05:53:24 +00:00
d3d64c6db0 Revert "Add pinned numpy and fix build (#155129)"
This reverts commit a3098a74d494020dbb906c05ef047013e1921662.

Reverted https://github.com/pytorch/pytorch/pull/155129 on behalf of https://github.com/malfet due to Broke test_spectral_op, looks like missing xfail, see 0db3e0cf29/1 ([comment](https://github.com/pytorch/pytorch/pull/155129#issuecomment-2947951632))
2025-06-06 03:14:47 +00:00
0db3e0cf29 Revert "Add Intel GPU info collection to the collect env script (#137846)"
This reverts commit e1180c7228ba8c8b16cabf78706d4a67ca189a6b.

Reverted https://github.com/pytorch/pytorch/pull/137846 on behalf of https://github.com/malfet due to Breaks doc test, but should be easily fixable ([comment](https://github.com/pytorch/pytorch/pull/137846#issuecomment-2947935940))
2025-06-06 03:08:48 +00:00
28796f71d0 Redo D75092426: [internal] Expose additional metadata to compilation callbacks (#155063)
Originally https://github.com/pytorch/pytorch/pull/153596
---------------

Summary:
via reverting D75708685

gate the ROCm failure

Test Plan:
Unit tests in OSS, sandcastle

Rollback Plan:

Differential Revision: D75894349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155063
Approved by: https://github.com/masnesral
2025-06-05 23:40:31 +00:00
72453a6676 [PT2][comms] put visualize_overlap in a try-except block (#155222)
Summary:
For simple FSDP, this `visualize_overlap` function is throwing errors.

Seems to be a mistake, since `visualize_overlap` is called twice here with one call in a try-except and one not, so this does the same for both places.

Test Plan:
:)

Rollback Plan:

Reviewed By: Microve

Differential Revision: D75985733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155222
Approved by: https://github.com/yf225
2025-06-05 23:39:48 +00:00
9bae2fcf99 [profiler] Enable all configured activities in CUPTI Range profiler mode (#154749)
Summary: Updates the PyTorch range profiler mode (metrics mode) to support all trace activity types.
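
For context, a minimal sketch of configuring activity types through the public profiler API (the CUPTI range-profiler internals touched by this diff are not shown; this is only an assumed usage pattern):

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
# Request both CPU and CUDA activities; this diff makes the range/metrics
# mode honor all configured activity types instead of a subset.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    (x @ x).sum().backward()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```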

Reviewed By: sraikund16

Differential Revision: D75568693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154749
Approved by: https://github.com/sraikund16
2025-06-05 23:38:54 +00:00
26f066bb61 Add AOTI model name config (#154129)
Summary: If a model name is specified in the AOTI config, the generated files will use that model name as the file stem.
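
A hedged usage sketch; the config key below is an assumption inferred from the test name, not verified against this diff:

```python
import torch
from torch.export import export
from torch._inductor import aoti_compile_and_package

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu() + 1

ep = export(M(), (torch.randn(4, 8),))
# "aot_inductor.model_name_for_generated_files" is a placeholder spelling;
# the intent is that generated artifacts use "my_model" as their file stem.
pkg_path = aoti_compile_and_package(
    ep,
    inductor_configs={"aot_inductor.model_name_for_generated_files": "my_model"},
)
```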

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_using_model_name_for_files
```

Differential Revision: D75102034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154129
Approved by: https://github.com/desertfire
2025-06-05 23:38:11 +00:00
fa705f7912 [BE] minor refactor + some comments on behavior (#154695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154695
Approved by: https://github.com/masnesral, https://github.com/eellison
2025-06-05 23:00:46 +00:00
9e88d6c857 [ROCm] manywheel missing hipsparselt deps (#155254)
Bundle libhipsparselt.so and auxiliary files into wheel.

Dependency added by hipsparselt integration #150578.

Fixes #155207.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155254
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-05 22:45:36 +00:00
e1180c7228 Add Intel GPU info collection to the collect env script (#137846)
As title, add Intel GPU info collection to the collect env script

Output examples:
1. CPU on Windows
```
C:\Users\user\miniforge3\envs\py310\lib\site-packages\torch\_subclasses\functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise (10.0.22631 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Name: 12th Gen Intel(R) Core(TM) i7-1270P
Manufacturer: GenuineIntel
Family: 198
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 1711
MaxClockSpeed: 2200
L2CacheSize: 9216
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] torch                     2.8.0.dev20250528+cpu          pypi_0    pypi
```

2. XPU on Windows
```
Collecting environment information...
PyTorch version: 2.8.0a0+gitef6306e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro (10.0.19045 64-bit)
GCC version: (GCC) 13.1.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* 32.0.101.6795 (20250520000000.******+***)
Intel GPU models onboard:
* Intel(R) Arc(TM) A770 Graphics
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.33184', total_memory=15915MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2401
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU1
CurrentClockSpeed: 2200
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.8.10+gitb3ea3a1
[pip3] numpy==2.1.2
[pip3] optree==0.13.1
[pip3] pytorch-triton-xpu==3.3.1+gitb0e26b73
[pip3] torch==2.8.0a0+gitef6306e
[conda] intel-extension-for-pytorch 2.8.10+gitb3ea3a1          pypi_0    pypi
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] mkl-dpcpp                 2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-datafitting   2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-stats         2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-vm            2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.1+gitb0e26b73          pypi_0    pypi
[conda] torch                     2.8.0a0+gitef6306e          pypi_0    pypi
```

3. CPU on Linux
```
/opt/python/cp312-cp312/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: AlmaLinux 8.10 (Cerulean Leopard) (x86_64)
GCC version: (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.28

Python version: 3.12.10 (main, Apr 19 2025, 05:03:56) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              88
On-line CPU(s) list: 0-87
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz
Stepping:            7
CPU MHz:             1000.000
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-21,44-65
NUMA node1 CPU(s):   22-43,66-87
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] Could not collect
```

4. XPU on Linux
```
Collecting environment information...
PyTorch version: 2.8.0.dev20250516+xpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.50-051550-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* intel_opencl: 24.39.31294.21-1032~22.04
* level_zero:   1.17.44.0-1022~22.04
Intel GPU models onboard:
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [1] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [2] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [3] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [4] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [5] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [6] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [7] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          224
On-line CPU(s) list:             0-223
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8480+
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              56
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       5.3 MiB (112 instances)
L1i cache:                       3.5 MiB (112 instances)
L2 cache:                        224 MiB (112 instances)
L3 cache:                        210 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-55,112-167
NUMA node1 CPU(s):               56-111,168-223
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] pytorch-triton-xpu==3.3.0+git0bcc8265
[pip3] torch==2.8.0.dev20250516+xpu
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] numpy                     2.2.5                    pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.0+git0bcc8265          pypi_0    pypi
[conda] torch                     2.8.0.dev20250516+xpu          pypi_0    pypi
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137846
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-05 22:35:04 +00:00
0a092c7de6 Enable CPP Extension Open Registration tests on Arm (#144774)
Enables most tests under CPP Extension Open Registration as they pass on Arm now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144774
Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet
2025-06-05 22:32:28 +00:00
0827464002 Replace runtime type parameterization (#155221)
See:

```
>>> import timeit; print(f"OrderedSet[str](): {timeit.timeit('OrderedSet[str]()', setup='from torch.utils._ordered_set import OrderedSet', number=1000000):.6f}s, OrderedSet(): {timeit.timeit('OrderedSet()', setup='from torch.utils._ordered_set import OrderedSet', number=1000000):.6f}s")
```
> `OrderedSet[str]()`: 0.354622s, OrderedSet(): 0.095376s

Type parameterization should live in the type hint, not at runtime.
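
In other words, keep the subscript in annotations and drop it from constructor calls; a quick sketch:

```python
from torch.utils._ordered_set import OrderedSet

# Slower: subscripting at runtime builds a typing proxy on every call.
names_slow = OrderedSet[str]()

# Preferred: parameterize only in the annotation; construct without a subscript.
names: OrderedSet[str] = OrderedSet()
names.add("conv1")
```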

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155221
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-06-05 21:43:54 +00:00
7dcc77e422 Turn on new tiling by default (#154768)
Turning this on in fbcode is still to come. Also updates `max_tiles` to have a default value of None. The existing tiling logic doesn't really handle max_tiles=3 well, but the new tiling logic does, so we default to 3 in the new logic and 2 elsewhere unless max_tiles has been explicitly set.
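
For reference, a sketch of overriding the tiling cap explicitly; `triton.max_tiles` is assumed here to be the relevant inductor config knob:

```python
import torch
import torch._inductor.config as inductor_config

# Leaving max_tiles unset (None) lets the new tiling heuristic choose up to 3
# tiles; setting it pins the old behavior. The knob name is an assumption.
inductor_config.triton.max_tiles = 2

@torch.compile
def f(x):
    return x.transpose(0, 1).contiguous() + 1

f(torch.randn(64, 128))
```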

TB runners have been very unstable recently (do we need to bump the batch size?), but e.g. for a [recent torchbench](https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Tue,%2027%20May%202025%2015:38:26%20GMT&stopTime=Tue,%2003%20Jun%202025%2015:38:26%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/eellison/803/head&lCommit=8480c220db4eb3c9e2b58d85a698d0a7113a6e37&rBranch=main&rCommit=0cd18ba1ca35d87916723d445c06664615dcae12) inference run we had 15 models with a lower execution time (i.e. green) and 2 models with a higher one (i.e. red).

I am doing another run and will update here.

Dynamic shapes is not yet turned on because there are a lot of fixes to be done in splitting that don't work yet.. See:
```
(Pdb) p expr
((s25*s85)//32)
(Pdb) p FloorDiv(expr, expr)
((s25*s85)//(32*(((s25*s85)//32))))
```

and also: an unbacked shape is not a multiple of itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154768
Approved by: https://github.com/jansel
2025-06-05 21:34:09 +00:00
a85ad55525 [ROCm][Windows] Fix offload gpu arch list in tests (#155212)
Added fix to get ROCM_PROPERTY_ARCH_LIST value in set_target_properties in c10/cuda and caffe2 tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155212
Approved by: https://github.com/malfet
2025-06-05 20:30:28 +00:00
9a42f01586 [Cutlass] EVT dynamic shapes support (#154835)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154835
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #154829
2025-06-05 20:17:01 +00:00
5911f870c0 [Cutlass] fp8 dynamic shapes test (#154829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154829
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-06-05 20:17:01 +00:00
606d73bde4 Adding from_node for nodes in gm.module() (#155053)
Summary:
Adding "from_node" information that indicates which nodes are unlifted in `.module()` call.
The lifted nodes will have "ExportedProgram.module().unlift()" passname in the last entry of from_node.
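
A hedged sketch of inspecting this metadata on the unlifted module (the exact structure of the `from_node` entries is not asserted here):

```python
import torch
from torch.export import export

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(4, 4))

    def forward(self, x):
        return x @ self.w

ep = export(M(), (torch.randn(2, 4),))
gm = ep.module()  # unlifts parameters/buffers back onto the module
for node in gm.graph.nodes:
    # Nodes touched by unlifting should carry the unlift pass name in the
    # last "from_node" entry, per the summary above.
    print(node.name, node.meta.get("from_node"))
```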

Test Plan:
```
buck run fbcode//caffe2/test:test_export -- -r test_from_node_metadata_export
```

Rollback Plan:

Reviewed By: angelayi

Differential Revision: D75837494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155053
Approved by: https://github.com/angelayi
2025-06-05 20:11:56 +00:00
c8c892b4a5 [scan] disable functionalization key in backward tracing (#154343)
Previously, we didn't disable the functionalization key when materializing the backward graph. This causes the torch.zeros_like call for the case where grad is None to return a functional tensor that's not tracked by the proxy tensor mode.

This PR fixes it by putting the tracing code under the disable-functionalization context manager.

Fixes https://github.com/pytorch/pytorch/issues/153437.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154343
Approved by: https://github.com/zou3519
2025-06-05 20:06:33 +00:00
5e93abe3c0 Address docs for clip_grad functions (#155125)
This PR takes the opinionated stance that `torch.nn.utils.<func>` should be the preferred API over `torch.nn.utils.clip_grad.<func>`.
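
For example, the preferred spelling (a usage sketch, not a new API):

```python
import torch

model = torch.nn.Linear(10, 10)
model(torch.randn(4, 10)).sum().backward()

# Preferred: the top-level torch.nn.utils function...
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# ...rather than the submodule path torch.nn.utils.clip_grad.clip_grad_norm_.
```
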
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155125
Approved by: https://github.com/albanD, https://github.com/mikaylagawarecki, https://github.com/janeyx99
2025-06-05 19:22:09 +00:00
dd41a3907c [MPS] Fix unary/binary ops for 2**32+ elem tensors (#155183)
By using `TensorIterator::with_32bit_indexing()` primitive

Add `bind_tensors` helper function that correctly sets up MPS tensors originating from TensorIterator

TODO: Add comments to bind_tensors as well as a unit test, based on
```
python  -c "import torch;print((torch.rand(1, 1024, 1024, dtype=torch.bfloat16, device='mps') + torch.rand(5000, 1, 1, dtype=torch.bfloat16, device='mps')).sin())"
```

Fixes https://github.com/pytorch/pytorch/issues/154828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155183
Approved by: https://github.com/cyyever, https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #155150, #155178, #155184
2025-06-05 18:57:14 +00:00
05dd638ee9 Revert "Add dont constant fold flag (#154945)"
This reverts commit 196c95d463367f15999c0cddc9eb89031e9988ab.

Reverted https://github.com/pytorch/pytorch/pull/154945 on behalf of https://github.com/malfet due to This broke halide test sanity, see a3098a74d4/1 ([comment](https://github.com/pytorch/pytorch/pull/154945#issuecomment-2945598901))
2025-06-05 18:25:59 +00:00
a3098a74d4 Add pinned numpy and fix build (#155129)
Not sure why the online doc build passes but it fails locally with these broken strings...

Also pinning the numpy version, even though it is technically optional, to ensure users have the right version, since most users have numpy in their environment anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155129
Approved by: https://github.com/janeyx99, https://github.com/svekars
2025-06-05 17:44:18 +00:00
2481c4b2ea [cutlass backend] add teraflops and increase rep for benchmark script (#154944)
Differential Revision: [D75840023](https://our.internmc.facebook.com/intern/diff/D75840023/)

I think I will continue to use do_bench for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154944
Approved by: https://github.com/mlazos
2025-06-05 17:20:29 +00:00
be2ab96347 Inductor unit tests: cuda 12.6 -> 12.8 (#155056)
Fixes #154938

When we update the Triton version in CI, we'll require cuda >= 12.8 for certain AOTI tests to pass: these AOTI tests try to run nvcc on the triton-generated PTX, and triton-generated PTX is PTX 8.7, which requires CUDA 12.8

Regarding the revert & reland:
* This PR causes the python 3.13 version to be bumped from 3.13.2 to 3.13.3. test_deopt_from_append_list starts unexpectedly passing on 3.13.3, so I originally modified the test in https://github.com/pytorch/pytorch/pull/155167 to xfail only for <=3.13.2
* However there was a land race with https://github.com/pytorch/pytorch/pull/150796, which introduced another test that passes only for >=3.13.3.

Resolution:
* @guilhermeleobas reverted https://github.com/pytorch/pytorch/pull/150796, so I will reland this (and I've merged the test_deopt_from_append_list change into this PR). Based on Guilherme's feedback, I'm just skipping the test instead of selectively failing/passing it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155056
Approved by: https://github.com/atalman, https://github.com/nWEIdia
2025-06-05 17:17:27 +00:00
cadcb5d368 [inductor] disable compiler on the compiled_module_main (#155169)
Fixes https://github.com/pytorch/pytorch/issues/154536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155169
Approved by: https://github.com/jamesjwu, https://github.com/bdhirsh
2025-06-05 16:37:45 +00:00
13ea0f2c0a [dynamo][dynamic] Recompilation hint for nn module integer attributes (#154867)
For program like this

```
class Mod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.c = 0

    def forward(self, x):
        self.c += 1
        return x * self.c
```

You can check the recompile reasons at https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpzv9z6Q/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/856a95fd-0533-4abc-a213-1f73ae2cb766)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154867
Approved by: https://github.com/zou3519
2025-06-05 16:37:22 +00:00
a14f427db6 [BE] Update cudnn to 9.10.1.4 (#155122)
Follow up to #152782
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155122
Approved by: https://github.com/malfet, https://github.com/atalman
2025-06-05 16:07:25 +00:00
cd361fc247 [CI] Migrate focal (ubuntu 20.04) images to jammy (ubuntu 22.04) (#154437)
Fixes https://github.com/pytorch/pytorch/issues/154157

Inductor Workflows where moved from focal to jammy here: https://github.com/pytorch/pytorch/pull/154153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154437
Approved by: https://github.com/Skylion007, https://github.com/cyyever, https://github.com/davidberard98, https://github.com/huydhn
2025-06-05 15:24:07 +00:00
e895e9689c Update docs build to specify <3.13 in CONTRIBUTING (#155140)
Python 3.13 removed the deprecated imghdr module, so our docs build does not compile with 3.13+. Mention it in our contributing guide so people know before committing to the wrong version oop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155140
Approved by: https://github.com/drisspg, https://github.com/cyyever
ghstack dependencies: #155126
2025-06-05 15:16:48 +00:00
2f3f8339ec [BE] Document device memory apis in correct module (#155126)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155126
Approved by: https://github.com/msaroufim, https://github.com/Skylion007
2025-06-05 15:16:48 +00:00
7999735d23 [CUDA][MPS] Fix torch.arange bound validation for large float inputs (#154320)
Fixes #153133

Fixes an inconsistency in torch.arange on CUDA and MPS backends when using float32 and large input values. Previously, invalid ranges (e.g., start > end with a positive step) could silently return empty tensors due to precision loss in validation logic.

The fix introduces double precision validation for checking whether the step sign is consistent with the range direction.

This ensures torch.arange behaves consistently with CPU for large float32 inputs, and raises an appropriate error when the range is invalid.
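
As a rough illustration of the precision issue (the numbers are illustrative, not taken from the original issue): near 2**24, float32 can no longer distinguish adjacent integers, so a range that is invalid in double precision can look degenerate-but-valid in single precision.

```python
import torch

start, end = 16777217.0, 16777216.0  # start > end, with a positive step

# In float32 both endpoints round to the same value, so a naive float32 check
# cannot tell that the step sign contradicts the range direction; float64 can.
print(torch.tensor(start, dtype=torch.float32) == torch.tensor(end, dtype=torch.float32))  # tensor(True)
print(torch.tensor(start, dtype=torch.float64) > torch.tensor(end, dtype=torch.float64))   # tensor(True)
```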

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154320
Approved by: https://github.com/malfet
2025-06-05 14:51:25 +00:00
ed661a5f11 [MPS] Fix complex scalar binding to Metal tensors (#155184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155184
Approved by: https://github.com/dcci
ghstack dependencies: #155150, #155178
2025-06-05 14:34:57 +00:00
9bf6593e96 Fix docstring for torch.UntypedStorage.from_file (#155067)
Fixes #130629

Happy to revert the second commit if we think it's making the test too fragile for the future

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155067
Approved by: https://github.com/malfet
2025-06-05 14:30:49 +00:00
a1057cda31 Revert "Add CPython generator/contextlib tests (#150796)"
This reverts commit d5f642211f14593c8c78af98a1fb7cfb63039ce5.

Reverted https://github.com/pytorch/pytorch/pull/150796 on behalf of https://github.com/guilhermeleobas due to This is breaking tests on trunk. https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=3.13&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/150796#issuecomment-2944469866))
2025-06-05 13:51:54 +00:00
196c95d463 Add dont constant fold flag (#154945)
For support https://github.com/pytorch/ao/issues/2228
> What we want to do now is to enable FP8 quantization in PyTorch. And similar as INT8 quantization, we need to insert quantize and dequantize ops into the graph.
>
> However we met problems with these q/dq ops both in the PyTorch core and Torchao.
>
> PyTorch core:
>
> The quantize_per_tensor op does not support FP8. We want to fix it via https://github.com/pytorch/pytorch/pull/153601. And as you commented, the op is deprecated.
> Torchao:
>
> In the fusion pass in Inductor, we want to match the pattern fp8_weight -> torchao.dequantize_affine_float8 -> fp32_op and fuse it as fp8_weight -> weight_pack -> fp8_op. We have done so for INT8 PT2E quantization. However, the pattern matching pass is applied after a constant folding pass in Inductor:
> 100ec0b34a/torch/_inductor/fx_passes/freezing_patterns.py (L69C1-L74C1)
> After constant_fold(gm), the pattern will be folded as fp32_weight -> fp32_op. Then the original pattern cannot be found any more and the FP8 semantics is lost since the pattern is entirely in fp32 now.
> For INT8, the int8_weight -> quantized_decomposed.dequantize_per_channel -> fp32_op pattern won't be folded because we mark quantized_decomposed.dequantize_per_channel impure so that it won't be folded: 100ec0b34a/torch/_inductor/constant_folding.py (L139C1-L149C1) . But for the torchao.dequantize_affine_float8, we cannot do this because
> It is an op from Torchao, which is unknown to the constant folder
> It is decomposed to smaller ops, so we cannot put it in the list as a single op.
> So, we think an easy and short-term solution is to modify the ops in PyTorch core via https://github.com/pytorch/pytorch/pull/153601.
> However, if we want to resolve the issue with Torchao, we need to
> Add a method in the constant folder in Inductor to allow registration of impure ops

Based on [Jansel's reply](https://github.com/pytorch/ao/issues/2228#issuecomment-2914560340), this patch adds a don't-constant-fold flag.
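
A conceptual sketch of the kind of check involved; the set contents and names below are placeholders, not the actual flag or registration method added by this patch:

```python
import torch
import torch.fx as fx

# Hypothetical: the constant folder consults a registry of "don't fold"
# targets before folding a node, so dequantize-style ops survive folding and
# later fusion passes can still match the weight -> dequantize -> op pattern.
DONT_CONSTANT_FOLD = {torch.ops.aten._to_copy.default}

def is_foldable(node: fx.Node) -> bool:
    return node.op == "call_function" and node.target not in DONT_CONSTANT_FOLD
```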

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154945
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-05 13:42:44 +00:00
e01fde8213 Revert "[reland][dynamo] Record the pre-graph bytecode using fast record function event (#154974)"
This reverts commit bee9c70c5d4b681ec1f2adf92eca1205b372634a.

Reverted https://github.com/pytorch/pytorch/pull/154974 on behalf of https://github.com/malfet due to Broke inductor tests, see 3c72b9fd8f/1 ([comment](https://github.com/pytorch/pytorch/pull/154974#issuecomment-2944370617))
2025-06-05 13:36:21 +00:00
3c72b9fd8f Revert "SDPA support gfx950 (#155103)"
This reverts commit b9312c56bf5f277e341c0185da748e3475d0807f.

Reverted https://github.com/pytorch/pytorch/pull/155103 on behalf of https://github.com/malfet due to looks like it broke mi300 tests, see 9a4c08ddfc/1 ([comment](https://github.com/pytorch/pytorch/pull/155103#issuecomment-2944331460))
2025-06-05 13:33:17 +00:00
523b637cbe Revert "[test][dynamo] skip test_deopt_from_append_list on python>=3.13.3 (#155167)"
This reverts commit 1c828786c28b8cd2a6be2397cc2af65e3266c5fa.

Reverted https://github.com/pytorch/pytorch/pull/155167 on behalf of https://github.com/malfet due to This broke a bunch of 3.13 tests, see fa3c38c7ae/1 ([comment](https://github.com/pytorch/pytorch/pull/155167#issuecomment-2944318067))
2025-06-05 13:27:40 +00:00
f60b2712dd Revert "Inductor unit tests: cuda 12.6 -> 12.8 (#155056)"
This reverts commit bb43ced6e2c9e1cdc17923826aaf58466c2ffd4b.

Reverted https://github.com/pytorch/pytorch/pull/155056 on behalf of https://github.com/malfet due to This broke a bunch of 3.13 tests, see fa3c38c7ae/1 ([comment](https://github.com/pytorch/pytorch/pull/155167#issuecomment-2944318067))
2025-06-05 13:27:40 +00:00
9a4c08ddfc [MPS] Parametrize test_scaled_dot_product_attention_autocast (#155005)
Also moving comments inside the function scope for some of my previous regression tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155005
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-05 13:24:53 +00:00
fa3c38c7ae Add tensor overlap check for cross (#154999)
Fixes #132031

## Test Result

```python
In [1]: import torch
   ...: torch.manual_seed(0)
   ...: torch.cuda.manual_seed(0)
   ...: a = torch.randn(3, 4)
   ...: b = torch.randn(3, 4)
   ...: torch.cross(a, b, out=a)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 6
      4 a = torch.randn(3, 4)
      5 b = torch.randn(3, 4)
----> 6 torch.cross(a, b, out=a)

RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154999
Approved by: https://github.com/lezcano
2025-06-05 10:00:01 +00:00
5b65628906 Workflow to tag trunk commits with trunk/{commit-sha} tags (#155170)
This PR adds workflow to automate tagging commits on the `main` branch. The workflow includes validation and retry with exponential backoff.

The rationale for this is to work around the github limitation on using workflow_dispatch (requires branch or tag). We want to use workflow_dispatch to rerun CI workflows with parameters (trunk, pull, etc).

---

### Testing

Tested using almost identical workflow in a personal repo (the difference is in repository_owner check and backoff settings).

* successful tag push:
   https://github.com/izaitsevfb/deleteme/actions/runs/15454729765/job/43504630765

* validation: PR commit (fails)
   https://github.com/izaitsevfb/deleteme/actions/runs/15454743572/job/43504669720

* tagging of the old commit on main:
   https://github.com/izaitsevfb/deleteme/actions/runs/15453805748/job/43501885903

* tag already exists:
   https://github.com/izaitsevfb/deleteme/actions/runs/15454756077/job/43504706980

* invalid sha on workflow dispatch:
   https://github.com/izaitsevfb/deleteme/actions/runs/15454611077/job/43504286858

* retry with exponential backoff on failure (via tag rule blocklist):
   https://github.com/izaitsevfb/deleteme/actions/runs/15454768346/job/43504743486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155170
Approved by: https://github.com/huydhn
2025-06-05 09:50:58 +00:00
bee9c70c5d [reland][dynamo] Record the pre-graph bytecode using fast record function event (#154974)
reland of https://github.com/pytorch/pytorch/pull/154769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154974
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
2025-06-05 07:25:04 +00:00
be16f21ca6 [Graph Partition] add symints to get_graph_inputs (#154679)
During `codegen_inputs`, we check whether there are undefined symbols:
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1668-L1674)

Previously, for graph partition inputs, we did not explicitly add symints.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L3265-L3272)
We relied on the sizes/strides of TensorBox to codegen symint inputs. For example, a tensor with shape `[s0, 2]` will implicitly codegen `s0` as an input here. This works fine in most cases since a backed symint has to come from some tensor shape.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1624-L1632)

In rare cases, this does not work. One example is saved tensors for backward, where a tensor may have shape `[2*s0, 2]`. Since `2*s0` is an expression but not a symbol, `codegen_input_symbol_assignment` would not handle `s0`, and later there would be an error in `_verify_input_symbol_assignment`.

The fix is to add symints to `get_graph_inputs`. An alternative is to update `codegen_input_symbol_assignment`, but I want to keep the change limited to graph partition.
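
A hedged repro-style sketch of how a backed symint like `s0` can appear only inside an expression such as `2*s0` (the graph-partition machinery itself is not invoked explicitly here):

```python
import torch
import torch._dynamo

def f(x):
    # The tensor saved for backward (y) has shape [2*s0, 2]: the backward
    # graph then sees "2*s0" as an input size without a bare "s0" dimension.
    y = torch.cat([x, x], dim=0)
    return (y * y).sum()

x = torch.randn(5, 2, requires_grad=True)
torch._dynamo.mark_dynamic(x, 0)  # make dim 0 a backed symint s0
torch.compile(f)(x).backward()
```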

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154679
Approved by: https://github.com/eellison
2025-06-05 06:46:28 +00:00
d3c8f36ba0 Revert "[Intel GPU] Make SDPA output has the same stride as Query. (#154340)"
This reverts commit 0f10df71a66cb1b0c3659381b7db8e06d95f0d67.

Reverted https://github.com/pytorch/pytorch/pull/154340 on behalf of https://github.com/etaf due to This PR breaks hugging face E2E run on XPU. ([comment](https://github.com/pytorch/pytorch/pull/154340#issuecomment-2942954192))
2025-06-05 06:46:24 +00:00
bb43ced6e2 Inductor unit tests: cuda 12.6 -> 12.8 (#155056)
Fixes #154938

When we update the Triton version in CI, we'll require cuda >= 12.8 for certain AOTI tests to pass: these AOTI tests try to run nvcc on the triton-generated PTX, and triton-generated PTX is PTX 8.7, which requires CUDA 12.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155056
Approved by: https://github.com/atalman, https://github.com/nWEIdia
ghstack dependencies: #155167
2025-06-05 05:59:06 +00:00
1c828786c2 [test][dynamo] skip test_deopt_from_append_list on python>=3.13.3 (#155167)
Not sure why, but apparently this test starts passing on Python 3.13.3 (while it fails on Python <=3.13.2), and it causes unexpected passes on xfail-ed tests when newer versions of Python are used, e.g. in #155056.

Verified locally in a python 3.13.1 vs. python 3.13.3 conda env.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155167
Approved by: https://github.com/williamwen42
2025-06-05 05:59:06 +00:00
93012d2290 Revert "[forward fix] add support for MemoryFormat after type tightening (#154658)"
This reverts commit 0fdd568b785812da86e69d65632de77d2ee945c7.

Reverted https://github.com/pytorch/pytorch/pull/154658 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154658#issuecomment-2942752048))
2025-06-05 05:01:40 +00:00
5130ac64f4 Revert "Add randint_like tensor overload for high (#154899)"
This reverts commit 72fe1d5f42aa9bffa876932a3b4fcae052b99168.

Reverted https://github.com/pytorch/pytorch/pull/154899 on behalf of https://github.com/seemethere due to Failing internal tests see https://fburl.com/diff/bai044ob ([comment](https://github.com/pytorch/pytorch/pull/154899#issuecomment-2942740661))
2025-06-05 04:54:05 +00:00
80703ca332 [FlexAttention] Allow dispatch to SAC for flex (#150080)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150080
Approved by: https://github.com/zou3519
2025-06-05 04:34:27 +00:00
fa63de0866 Handle empty linemaps in PyCodeCache (#155064)
Some functions have empty linemaps, and if you call `PyCodeCache.stack_frames_for_code` on code in the wrong order, you'll trigger a "too many values to unpack" error: https://github.com/pytorch/pytorch/issues/154536

Specifically, if you populate PyCodeCache's linemap via caching and then request the stack frames of an inductor-generated output file that has an empty linemap, this function will try to unpack too many values.

Test plan:
```
import os

os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
os.environ["TORCHINDUCTOR_AUTOGRAD_CACHE"] = "1"

import torch

@torch.compile
def fn(x: torch.Tensor):
    (x_grad,) = torch.autograd.grad(x.sum(), x)
    return x_grad

x = torch.randn(10, 10, requires_grad=True)
result = fn(x)
```

Run this twice and see that everything works as expected.

It's hard to exactly pinpoint a good unit test for this: it requires a whole lot of moving parts to get the issue to trigger because:

- The callsite in question in dynamo, without caching, will always run before generating the code, so cls.linemaps[path] will be None most of the time.
- The inductor-generated output needs to call *back* into dynamo via `assert_size_stride`.
- In our test case, the CompiledBackward needs to not have linemaps and also be called in the middle of a graph break while compiling a different cached function. Caching switches the order in which the PyCodeCache linemap is populated (i.e. either before or after the graph break is evaluated), which causes the issue.

All these things need to interact together to create the bug, so it's a bit difficult to write a simple unit test.
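
A hedged sketch of the guard described above (names and structure are assumed, not the actual diff): bail out when a cached entry has no linemap instead of unpacking it.

```python
from typing import Optional

def stack_frames_for(linemaps: dict, path: str) -> Optional[tuple]:
    entry = linemaps.get(path)
    if not entry:  # covers both a missing entry and an empty linemap
        return None
    linemap, metadata = entry  # safe: only unpack a non-empty entry
    return linemap, metadata
```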

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155064
Approved by: https://github.com/bdhirsh
2025-06-05 03:54:35 +00:00
450180fbcd [c10d][fr] Add the log of thread name and thread id into fr (#155142)
Internal users have asked to include the thread id and thread name inside fr. This is useful in cases where collectives are launched not just on the main thread.

Differential Revision: [D75973919](https://our.internmc.facebook.com/intern/diff/D75973919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155142
Approved by: https://github.com/kwen2501
2025-06-05 03:33:01 +00:00
b9312c56bf SDPA support gfx950 (#155103)
Summary: Seems to run, just not at optimal performance; e.g., ck_tile doesn't seem to have the gfx942 optimizations: https://github.com/ROCm/composable_kernel/blob/develop/include/ck_tile/ops/fmha/block/variants.hpp#L27

Test Plan:
```
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           179.737 |             182.874 |          3106.6  |          359.662 |              205.506 |        382.334 |          375.776 |       22.1205 |       191.067 |           334.392 |                17.2841 |                 16.9877 |               8.63754 |                   15.1169 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |           617.271 |             623.38  |          7169.73 |          998.961 |              654.534 |        445.312 |          440.947 |       38.3387 |       275.164 |           419.96  |                11.6152 |                 11.5014 |               7.17719 |                   10.9539 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |           667.032 |             670.118 |         13031.8  |         1383.42  |              768.452 |        412.091 |          410.193 |       21.0928 |       198.694 |           357.703 |                19.5371 |                 19.4471 |               9.42    |                   16.9586 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          2074.64  |            2214.81  |         29186.9  |         3916.35  |             2404.29  |        529.978 |          496.437 |       37.6714 |       280.749 |           457.313 |                14.0684 |                 13.1781 |               7.45257 |                   12.1395 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          2456.6   |            2472.38  |         51095.8  |         5647.01  |             3008.09  |        447.574 |          444.718 |       21.5186 |       194.707 |           365.518 |                20.7994 |                 20.6666 |               9.0483  |                   16.9861 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |          8048.8   |            8070.96  |        113478    |        15580.8   |             9768.71  |        546.423 |          544.922 |       38.7569 |       282.274 |           450.218 |                14.0987 |                 14.06   |               7.2832  |                   11.6165 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           692.323 |             697.649 |          4241.81 |          1562.26 |              906.441 |        248.148 |          246.254 |       40.5012 |      109.968  |           189.531 |                6.12693 |                 6.08015 |               2.71518 |                   4.67963 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |          2263.22  |            2267.38  |          9482.64 |          7003.8  |             2765.5   |        303.636 |          303.079 |       72.4687 |       98.1174 |           248.489 |                4.1899  |                 4.18221 |               1.35393 |                   3.42891 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |          2553.94  |            2572.68  |         15909.8  |          5697.16 |             3284.77  |        269.073 |          267.112 |       43.193  |      120.621  |           209.206 |                6.22953 |                 6.18415 |               2.79259 |                   4.84352 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          8187.67  |            8201.71  |         35449.2  |         26424.3  |            10364.5   |        335.722 |          335.147 |       77.5413 |      104.025  |           265.21  |                4.32959 |                 4.32218 |               1.34154 |                   3.42025 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          9948.15  |            9815.47  |         62815.1  |         23741.9  |            12710     |        276.31  |          280.046 |       43.7598 |      115.778  |           216.269 |                6.31425 |                 6.39961 |               2.64575 |                   4.94217 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |         32187.6   |           32035.6   |        137832    |        102075    |            40623.4   |        341.595 |          343.216 |       79.7716 |      107.716  |           270.66  |                4.28216 |                 4.30248 |               1.35031 |                   3.39293 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+

```
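
For reference, a hedged micro-benchmark sketch along the lines of the first table row (batch=1, seq=4096, heads=16, head_dim=64); the dtype and timing method here are assumptions, not taken from the benchmark script that produced the table:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 16, 4096, 64, device="cuda", dtype=torch.float16)
for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH):
    with sdpa_kernel(backend):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        F.scaled_dot_product_attention(q, k, v)
        end.record()
        torch.cuda.synchronize()
        print(backend, f"{start.elapsed_time(end) * 1e3:.1f} us")
```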

Rollback Plan:

Differential Revision: D75934358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155103
Approved by: https://github.com/yoyoyocmu
2025-06-05 03:26:38 +00:00
a01bb9da14 [CI][CUDA] Re-enable the test-nan-assert on CUDA12 (#154448)
We need to reenable this test because there are recent changes that could be relevant to test_nan_assert.

I've already verified that there would be a hang if we don't remove the `pg._allgather_base(output, nan_tensor)` call between the `backend._set_enable_nan_check` calls.
Why was it "working" previously? Because previously only cu118 distributed was running, so this `backend._set_enable_nan_check` change was not exercised in the merge process (the skip logic is: if not CUDA 12 and above, skip).

Workaround #153479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154448
Approved by: https://github.com/kwen2501
2025-06-05 02:09:31 +00:00
5e03433443 Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit e5afbe31245287a92fe328c404b3557e5c5eca73.

Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/malfet due to Broke rocm, see 642687af29/1 ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-2942415600))
2025-06-05 01:38:13 +00:00
642687af29 [MPS][BE] Some refactor in preparation for 64-bit iterators (#155178)
set input/output tensors only once

Get rid of `is_storage_dense` predicate, as `iter.is_contiguous` serves the same purpose
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155178
Approved by: https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #155150
2025-06-05 01:24:31 +00:00
3398d1d459 support bmm and mm_plus_mm in generated templates cache (#154904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154904
Approved by: https://github.com/drisspg, https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #154891, #154892
2025-06-05 00:36:01 +00:00
21f45f7afb Add CPython int/float tests (#150795)
Tests:
* test_int.py
* test_int_literal.py
* test_float.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_int" "test_int_literal" "test_float"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150795
Approved by: https://github.com/williamwen42
2025-06-05 00:28:53 +00:00
d5f642211f Add CPython generator/contextlib tests (#150796)
Tests:
* test_generators.py
* test_generator_stop.py
* test_contextlib.py

Minor changes were made to each test to run them inside Dynamo. We
intentionally didn't copy the binary files stored in
`python/Lib/test/archivetestdata` for security reasons. There's a single
test that requires a binary file and it is skipped because of that.

The tests were downloaded from CPython 3.13 and the diff was generated
using `git diff` to apply the changes:

```bash
for f in "test_contextlib" "test_generators" "test_generator_stop"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150796
Approved by: https://github.com/williamwen42
2025-06-05 00:18:29 +00:00
fb5a787a8f [HOP] Added clone for outputs of create_bw_fn that are aliasing the inputs (#153932)
This PR fixes an issue with the new way of creating the backward graph introduced for cond. In particular, the issue arises when the backward function simply aliases the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153932
Approved by: https://github.com/ydwu4
2025-06-04 23:52:52 +00:00
b0a2ca65ef support more prologue functions in generated templates cache (#154892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154892
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #154891
2025-06-04 23:45:36 +00:00
51b4c51973 add missing check for caching triton template caching (#154891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154891
Approved by: https://github.com/eellison
2025-06-04 23:45:36 +00:00
1083bc749d [Memory Snapshot] Add Flag to Toggle Global and Local Callbacks for Annotations (#154932)
Summary:
There are cases where we want only local annotations for the memory snapshot, such as when executing inside a CUDA stream callback, which cannot execute CUDA operators. Otherwise CUDA errors occur: `Exception in RecordFunction callback: CUDA error: operation not permitted`.

However, we need an option to turn the callbacks on globally so that on-demand snapshots can get annotations. Additionally, there may be cases where auto-trace also wants annotations via record functions, so we expose the flag to auto-trace as well.
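
A hedged sketch of how memory-snapshot recording is typically driven from Python; the specific flag added by this PR for toggling global vs. local RecordFunction callbacks is internal and not shown here:

```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)
# ... run the workload whose allocations should be annotated ...
torch.cuda.memory._dump_snapshot("snapshot.pickle")
```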

Test Plan:
Run MVAI executable and see that the errors go away

Rollback Plan:

Differential Revision: D75831687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154932
Approved by: https://github.com/mzzchy, https://github.com/sanrise
2025-06-04 23:15:19 +00:00
7cf5b36ec2 Release GIL in PG destructor (#154976)
Summary: The Gloo PG doesn't release the GIL, which results in Python code hanging until the destructor completes. The destructor waits for all work on the PG to complete, which can take a long time.

Test Plan: Ran

```
$ pytest --log-cli-level=INFO -vs torchft/local_sgd_integ_test.py
```

with a large timeout on the async work. Call to `gil_scoped_release` doesn't show up in the gdb stack trace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154976
Approved by: https://github.com/d4l3k, https://github.com/dcci, https://github.com/fduwjj
2025-06-04 23:10:55 +00:00
c881f2ddf3 [reland][dynamo] Mark a vt unspecialized nn module variable source earlier (#155099)
Reland of https://github.com/pytorch/pytorch/pull/154780

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155099
Approved by: https://github.com/williamwen42
2025-06-04 23:05:36 +00:00
992be94dab [MPS][BE] Better error messages (#155150)
"Can't be indexed using 32-bit iterator" is not a really helpful error.
This PR distinguishes between the error coming from the old indexing helper function and the one from binaryTensorIterator.
It also adds the same warning to unary ops; otherwise they just run and return an incorrect value.

Test plan (manual; no machine with enough RAM to run it reliably in CI):
```
%  python  -c "import torch;print(torch.rand(1, 1024, 1024, dtype=torch.bfloat16, device='mps') + torch.rand(5000, 1, 1, dtype=torch.bfloat16, device='mps'))"
RuntimeError: add can't be indexed using 32-bit iterator for shape [1048576, 5000]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155150
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-06-04 22:53:51 +00:00
f5e2e4c4f1 [Inductor] Include math and torch in launcher scope (#154673)
Summary:
For grid computation, if sympy is involved, it is possible that math and torch are used.
We include the math and torch modules in the launcher scope to make sure those grids get computed correctly.
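
A hedged illustration of the idea (the expression and the scope contents are hypothetical): a generated grid expression may reference math or torch, so both modules need to be present in the scope used to evaluate it.

```python
import math
import torch

scope = {"math": math, "torch": torch, "s0": 1000}
grid = eval("(math.ceil(s0 / 128), 1, 1)", scope)
assert grid == (8, 1, 1)
```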

Test Plan: Check phabricator for internal cmd.

Differential Revision: D75642931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154673
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2025-06-04 22:32:19 +00:00
671553bd23 Update documentation wording for transformer-related layers (#155123)
<img width="947" alt="Screenshot 2025-06-04 at 1 33 53 PM" src="https://github.com/user-attachments/assets/4dbb66b3-43f4-4d04-afb5-dc80cec0f2cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155123
Approved by: https://github.com/albanD, https://github.com/jbschlosser
2025-06-04 22:20:32 +00:00
6f23ca53bb [dynamo] sample gb_registry json file for website testing purposes (#155160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155160
Approved by: https://github.com/StrongerXi, https://github.com/williamwen42
2025-06-04 22:14:48 +00:00
c8566a0b98 [export] Use patching in test (#155132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155132
Approved by: https://github.com/pianpwk
2025-06-04 21:41:26 +00:00
65a5eb8d27 Fix for ambiguity in linalg.norm()'s ord argument of +2 & -2 (#155148)
Fixes #136453

### Description
---
Fixed the ambiguity by adding a hyperlink to Wikipedia's SVD singular-values section, as per past discussion (by other contributors) on the thread above.

In the ord argument, for values `+2` and `-2`, the `singular value` now points to [this section of singular values on the wiki SVD page](https://en.wikipedia.org/wiki/Singular_value_decomposition#Singular_values,_singular_vectors,_and_their_relation_to_the_SVD).

### Why not mention SVD
---
For conciseness (expanding 'largest singular value' -> 'largest singular value of an SVD' is too much, I think, with respect to the rest of the table).

---

I hope this is satisfactory. Please let me know if I have missed anything essential; cheers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155148
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2025-06-04 21:15:20 +00:00
b084e1b81c [HOP] Rework Autograd DispatchKey for scan and map (#153336)
This PR introduces `py_autograd_impl` instead of `DispatchKey.Autograd` for some HOPs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153336
Approved by: https://github.com/ydwu4
2025-06-04 20:54:02 +00:00
0404785f3b [dynamo] [3/3] added cmd_update_gb_type which supports updating an existing gb_type properties and optional arg to change gb_type name (#154985)
The user can now use the terminal to update the registry whenever they update an existing gb_type's properties. Additionally, if the user changes the gb_type description itself, they can update the registry as well.

Terminal command template for updating existing gb_type: python [path to gb_id_mapping.py] update "existing_gb_type" [path to file where user added callsite]

Terminal command template for updating existing gb_type name (can also be used if the user changed the other properties as well including the gb_type name): python [path to gb_id_mapping.py] update "existing_gb_type" [path to file where user added callsite] --new_gb_type "new_name_for_existing_gb_type"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154985
Approved by: https://github.com/williamwen42
2025-06-04 20:10:02 +00:00
e5afbe3124 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the TFLOPS estimation from Triton if we don't have the info from the datasheet, because Triton's estimates are inaccurate. I have a backlog item to fix Triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```python
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-06-04 20:03:46 +00:00
4d576442e9 Fix incorrect get_default_qat_qconfig in prepare_qat_fx docs. (#155100)
Fixes #144522

## Description

The FX QAT docs for prepare_qat_fx incorrectly used get_default_qat_qconfig where get_default_qat_qconfig_mapping should be used to build a qconfig_mapping.

The previous example code incorrectly used `get_default_qat_qconfig`, resulting in a qconfig being incorrectly passed to `prepare_qat_fx`. `prepare_qat_fx` requires a `qconfig_mapping`, not a single `qconfig`.
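
A minimal sketch of the corrected usage described above (the backend string and toy model are assumptions):

```python
import torch
from torch.ao.quantization import get_default_qat_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_qat_fx

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU()).train()
# Pass a QConfigMapping (not a single QConfig) to prepare_qat_fx.
qconfig_mapping = get_default_qat_qconfig_mapping("x86")
example_inputs = (torch.randn(1, 4),)
prepared = prepare_qat_fx(model, qconfig_mapping, example_inputs)
```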

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155100
Approved by: https://github.com/jerryzh168
2025-06-04 18:51:40 +00:00
6c8241c089 [dynamo] [2/3] added add_new_gb_type functionality (#154886)
The user can now use the terminal to update the registry whenever they create a new unimplemented_v2() callsite.
Terminal command template: python [path to gb_id_mapping.py] add "new_gb_type" [path to file where user added callsite]
Before the user added a new gb_type:
<img width="619" alt="Screenshot 2025-06-02 at 1 33 54 PM" src="https://github.com/user-attachments/assets/7258cab1-a184-4200-9d56-7b21d243d6d8" />
After the user added a new gb_type:
<img width="366" alt="Screenshot 2025-06-02 at 1 34 47 PM" src="https://github.com/user-attachments/assets/5c383e94-268c-4f6d-9111-7b18c856222e" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154886
Approved by: https://github.com/williamwen42
ghstack dependencies: #154738
2025-06-04 18:44:37 +00:00
681a8189d7 [dynamo] [1/3] updated gbid mapping for initial registry creation (#154738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154738
Approved by: https://github.com/williamwen42
2025-06-04 18:44:37 +00:00
197080337b [AOTI] Extend torchgen to generate C shim with version number (#147745)
Summary: While it is OK to add a new arg with a default value to a fallback op in Python, it will be BC-breaking for the C shim. This PR adds an automated approach to update the C shim files when specifying a version number with a list of new args for the modified op. See https://github.com/pytorch/pytorch/pull/154848 as an example of how to do that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147745
Approved by: https://github.com/yushangdi
2025-06-04 18:40:34 +00:00
1d67849e43 [AOTInductor] Activate CPU test for package and update weights (#155078)
Summary:
It looks like CPU was enabled for update_constant_buffer in D71177509, so enable these tests as well.

Test Plan:
```
 buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_without_weight" -v
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_user_managed_weight" -v
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_update_weights" -v
```

Rollback Plan:

Differential Revision: D75908993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155078
Approved by: https://github.com/angelayi
2025-06-04 17:57:20 +00:00
956716880f [c10d][gloo] Enable using c10::Half for gloo (#153862)
Testing with https://github.com/pytorch/gloo/pull/446 shows that the numerical issues reported in https://github.com/pytorch/pytorch/issues/152300 are indeed resolved, and we added a unit test for it. Also update the gloo submodule to reflect the change on the gloo side.
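
A hedged, single-process sketch of the case this enables (init method and addresses are placeholders):

```python
import torch
import torch.distributed as dist

dist.init_process_group(
    "gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)
t = torch.ones(4, dtype=torch.half)  # c10::Half over the gloo backend
dist.all_reduce(t)
print(t)
dist.destroy_process_group()
```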

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153862
Approved by: https://github.com/d4l3k, https://github.com/clee2000, https://github.com/malfet
2025-06-04 17:53:08 +00:00
9eb7e67727 [PT2][memory] correct wait tensor output size (#153569)
This PR correctly handles the output buffer size of wait tensor nodes.
![image](https://github.com/user-attachments/assets/fdcc5eb7-58cf-42a2-84b2-ce949cb9db92)

See [this doc](https://docs.google.com/document/d/1lkKulwIb-fYL_p8jn1SD6Lh1PoAKBgpBsU5sAH80leI/edit?tab=t.0#bookmark=id.w3n4k1y4rdz8) with testing details [internal only]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153569
Approved by: https://github.com/eellison
2025-06-04 17:49:25 +00:00
34c6371d24 Add NVSHMEM to PYTORCH_EXTRA_INSTALL_REQUIREMENTS (#154568)
NVSHMEM 3.2.5 (released Mar 2025) has both cu11 and cu12 builds.
See:
https://pypi.nvidia.com/nvidia-nvshmem-cu12/
https://pypi.nvidia.com/nvidia-nvshmem-cu11/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154568
Approved by: https://github.com/atalman
ghstack dependencies: #154538
2025-06-04 17:43:24 +00:00
b3e666ae17 [easy] Bump STATIC_CUDA_LAUNCHER_VERSION=1 (#154861)
This turns on STATIC_CUDA_LAUNCHER internally for some low-risk entitlements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154861
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-06-04 17:38:06 +00:00
e9c31fb86d [torch.compile] handle a custom __delattr__ method correctly (#150899)
Fixes #150765
- handle a custom __delattr__ method correctly

Test:
```
import torch

class MyObject:
    def __init__(self, val):
        self.val = val
        # Flag to track deletion attempts instead of using print
        self.deletion_attempted = False

    def __delattr__(self, attr):
        if attr == "val":
            # Set flag instead of printing
            self.deletion_attempted = True
        else:
            super().__delattr__(attr)

@torch.compile(fullgraph=True, backend="eager")
def test(input_tensor):
    instance_a = MyObject(1)
    instance_b = MyObject(2)

    del instance_a.val
    del instance_b.val
    exists_a = hasattr(instance_a, 'val')
    exists_b = hasattr(instance_b, 'val')
    deletion_attempted_a = instance_a.deletion_attempted
    deletion_attempted_b = instance_b.deletion_attempted

    return input_tensor + 1, exists_a, exists_b, deletion_attempted_a, deletion_attempted_b

# Run the test
result = test(torch.ones(1))
print(f"Result tensor: {result[0]}")
print(f"val attribute still exists on instance_a: {result[1]}")
print(f"val attribute still exists on instance_b: {result[2]}")
print(f"Deletion was attempted on instance_a: {result[3]}")
print(f"Deletion was attempted on instance_b: {result[4]}")

```

output:
```
(base) sany@sandishs-Laptop pytorch % python3 test_delattr_fix.py
Result tensor: tensor([2.])
val attribute still exists on instance_a: True
val attribute still exists on instance_b: True
Deletion was attempted on instance_a: True
Deletion was attempted on instance_b: True
```

```
(pytorch-dev) sany@sandishs-Laptop pytorch % python3 -m pytest test/dynamo/test_repros.py::ReproTests::test_delattr_return -v
========================================================= test session starts =========================================================
platform darwin -- Python 3.12.5, pytest-8.3.5, pluggy-1.5.0 -- /Library/Frameworks/Python.framework/Versions/3.12/bin/python3
cachedir: .pytest_cache
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/dynamo/test_repros.py::ReproTests::test_delattr_return PASSED [0.0659s]                                                    [100%]

========================================================== 1 passed in 1.71s ==========================================================
(pytorch-dev) sany@sandishs-Laptop pytorch %
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150899
Approved by: https://github.com/jansel, https://github.com/StrongerXi
2025-06-04 17:27:20 +00:00
4405dc1487 Revert "Always set CPU affinity for benchmark jobs (#154569)"
This reverts commit 629fca295e1257c2c54d1b6316ed4fa00e6044d6.

Reverted https://github.com/pytorch/pytorch/pull/154569 on behalf of https://github.com/anijain2305 due to potentially causing compile time regressions, unsure ([comment](https://github.com/pytorch/pytorch/pull/154569#issuecomment-2940737778))
2025-06-04 16:52:15 +00:00
8f08f90b61 Bump pillow from 10.0.1 to 10.3.0 in /.github/requirements (#154416)
Bumps [pillow](https://github.com/python-pillow/Pillow) from 10.0.1 to 10.3.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/10.0.1...10.3.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 10.3.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-04 09:37:13 -07:00
aed938f3a8 Enable check_gomp for Ubuntu OSes (#155119)
And ARM platform
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155119
Approved by: https://github.com/atalman
2025-06-04 15:57:08 +00:00
20912673a6 Revert "Add __main__ guards to jit tests (#154725)"
This reverts commit 1a55fb0ee87eaa8b376aaa82d95d213fe0fbe64b.

Reverted https://github.com/pytorch/pytorch/pull/154725 on behalf of https://github.com/malfet due to This added 2nd copy of raise_on_run to common_utils.py which caused lint failures, see https://github.com/pytorch/pytorch/actions/runs/15445374980/job/43473457466 ([comment](https://github.com/pytorch/pytorch/pull/154725#issuecomment-2940503905))
2025-06-04 15:42:52 +00:00
6f93ce3c86 Revert "[Cutlass] fp8 dynamic shapes test (#154829)"
This reverts commit 36596ad2a009a0906848fa264954d4b200efc50e.

Reverted https://github.com/pytorch/pytorch/pull/154829 on behalf of https://github.com/seemethere due to This is failing internal tests see, [fburl.com/diff/3gomp7i3](https://fburl.com/diff/3gomp7i3). Please re-land this as a co-dev diff ([comment](https://github.com/pytorch/pytorch/pull/154829#issuecomment-2940494361))
2025-06-04 15:36:27 +00:00
3fa3dbdb1f Revert "[Cutlass] EVT dynamic shapes support (#154835)"
This reverts commit 4224a7df01a9607830da771fd4884c8eba150630.

Reverted https://github.com/pytorch/pytorch/pull/154835 on behalf of https://github.com/seemethere due to This is part of a stack that is failing internal tests see, [fburl.com/diff/3gomp7i3](https://fburl.com/diff/3gomp7i3). Please re-land this as a co-dev diff ([comment](https://github.com/pytorch/pytorch/pull/154835#issuecomment-2940463211))
2025-06-04 15:33:09 +00:00
3ce5102927 [ROCm] fix CI failures from inductor periodic (#154896)
Similar idea as https://github.com/pytorch/pytorch/pull/154497, but for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154896
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-04 15:28:43 +00:00
a99a01a677 Revert "[dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)"
This reverts commit cc96febb979da16b0a0b758020b330a49c72b7e7.

Reverted https://github.com/pytorch/pytorch/pull/154780 on behalf of https://github.com/seemethere due to This fails internal testing see, https://fburl.com/diff/b0yuxk4w ([comment](https://github.com/pytorch/pytorch/pull/154780#issuecomment-2940381691))
2025-06-04 15:03:34 +00:00
a0f2544502 Revert "[dynamo][dynamic] Recompilation hint for nn module integer attributes (#154867)"
This reverts commit 6c2f941e250ba34a920f476c8a9ee30e6153fc15.

Reverted https://github.com/pytorch/pytorch/pull/154867 on behalf of https://github.com/seemethere due to This fails internal testing see, https://fburl.com/diff/b0yuxk4w ([comment](https://github.com/pytorch/pytorch/pull/154780#issuecomment-2940381691))
2025-06-04 15:03:34 +00:00
1a55fb0ee8 Add __main__ guards to jit tests (#154725)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

In jit tests:

- Add and use a common raise_on_run_directly method for when a user directly runs a test file that should not be run this way. Print the file the user should have run instead.
- Raise a RuntimeError on tests which have been disabled (not run)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725
Approved by: https://github.com/Skylion007
2025-06-04 14:44:08 +00:00
3f34d26040 Add __main__ guards to distributed tests (#154628)
This is the first PR of a series in an attempt to re-submit #134592 as smaller PRs.

In distributed tests:

- Ensure all files which should call run_tests do call run_tests.
- Raise a RuntimeError on tests which have been disabled (not run)
- Remove any remaining uses of "unittest.main()"

Cc @wconstab @clee2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154628
Approved by: https://github.com/Skylion007
2025-06-04 14:39:57 +00:00
c8d44a2296 Add __main__ guards to fx tests (#154715)
This PR is part of a series attempting to re-submit #134592 as smaller PRs.

In fx tests:

- Add and use a common raise_on_run_directly method for when a user directly runs a test file that should not be run that way; it prints the file the user should have run instead.
- Raise a RuntimeError on tests which have been disabled (not run)
- Remove any remaining uses of "unittest.main()"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154715
Approved by: https://github.com/Skylion007
2025-06-04 14:38:50 +00:00
cf9cad31df Add __main__ guards to tests (#154716)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

Add missing `if __name__ == "__main__":` guards to some tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154716
Approved by: https://github.com/Skylion007
2025-06-04 14:38:13 +00:00
ca0c2985d3 [ONNX] Allow exporter to export SDPA to Attention onnx operator (#154596)
Fixes [#149662](https://github.com/pytorch/pytorch/issues/149662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154596
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-06-04 14:29:44 +00:00
31d12b3955 Fix avg_pool2d param kernel_size description (#154353)
Fixes part of #153149

## Test Result

![image](https://github.com/user-attachments/assets/216ffd2b-dd2b-4cf6-9fca-aeed075be5e7)

![image](https://github.com/user-attachments/assets/820cd184-1f8e-4a7a-b64e-15dfb9c7dad2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154353
Approved by: https://github.com/colesbury
2025-06-04 11:55:01 +00:00
2af78d368f Skip another test file that doesn't run gradcheck for slow gradcheck (#154852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154852
Approved by: https://github.com/albanD
2025-06-04 07:47:09 +00:00
0f10df71a6 [Intel GPU] Make SDPA output has the same stride as Query. (#154340)
Fixes [#153903](https://github.com/pytorch/pytorch/issues/153903).

Currently the output tensor of SDPA on XPU is always allocated with a contiguous stride, while the CPU/CUDA flash_attention and cudnn_attention backends allocate the output tensor with the same stride as Query.

This PR aligns XPU's behavior with CUDA/CPU to make XPU compatible with CPU/CUDA's modeling code.

The function `alloc_with_matching_layout` is copied from cudnn 8c16d0e404/aten/src/ATen/native/cudnn/MHA.cpp (L874)
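
A minimal eager-mode sketch of the "output strided like Query" allocation behavior this PR aligns XPU with (pure illustration, not the actual kernel code):

```python
import torch

# A non-contiguous query layout, e.g. (batch, seq, heads, head_dim) transposed
# into the (batch, heads, seq, head_dim) layout SDPA expects.
q = torch.randn(2, 128, 8, 64).transpose(1, 2)

# Allocate the output with exactly the same sizes/strides as the query,
# instead of forcing a contiguous layout.
out = torch.empty_strided(q.shape, q.stride(), dtype=q.dtype, device=q.device)
assert out.stride() == q.stride() and not out.is_contiguous()
```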

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154340
Approved by: https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/guangyey
2025-06-04 07:16:56 +00:00
1e20745532 [ez][AOTI] Fix index offset for Optional Tensor Return (#155073)
Summary: As title. See added test for more context.

Test Plan:
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_output_2

Rollback Plan:

Differential Revision: D75900658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155073
Approved by: https://github.com/angelayi
2025-06-04 06:22:46 +00:00
d2bfd97d71 [export] Refactor pt2 save/load (#152495)
Refactor the pt2 archive saving to consolidate the format of torch.export.save and torch._inductor.package.package_aoti.

This PR adds the following functions, which torch.export.save and AOTI packaging call into:
```python
package_pt2(
    f: FileLike,
    *,
    exported_programs: Optional[Union[ExportedProgram, dict[str, ExportedProgram]]] = None,
    aoti_files: Optional[Union[list[str], dict[str, list[str]]]] = None,
    extra_files: Optional[dict[str, Any]] = None,
) -> FileLike

@dataclass
class PT2ArchiveContents:
    exported_programs: dict[str, ExportedProgram]
    aoti_runners: dict[str, AOTICompiledModel]
    extra_files: dict[str, Any]

load_pt2(f: FileLike) -> PT2ArchiveContents
```

Power users directly call into these APIs if they want to bundle multiple exported programs, aoti files, or extra metadata.
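
A hedged usage sketch of the new entry points; the import location of `package_pt2`/`load_pt2` is an assumption, and `MyModel` is a placeholder:

```python
import torch
# Assumed import location for the new helpers; the module path may differ.
from torch.export.pt2_archive._package import package_pt2, load_pt2


class MyModel(torch.nn.Module):  # placeholder model
    def forward(self, x):
        return x.sin()


ep = torch.export.export(MyModel(), (torch.randn(2, 4),))
package_pt2(
    "model.pt2",
    exported_programs={"model1": ep},
    extra_files={"user_metadata.txt": "v1"},
)

contents = load_pt2("model.pt2")  # -> PT2ArchiveContents
restored_ep = contents.exported_programs["model1"]
```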

This is what the pt2 archive looks like ([spec](https://docs.google.com/document/d/1RQ4cmywilnFUT1VE-4oTGxwXdc8vowCSZsrRgo3wFA8/edit?tab=t.0)):
```
├── archive_format
├── version
├── .data
├── data
│   ├── aotinductor
│   │   └── model1
│   │       ├── model1.cpp
│   │       ├── model1.so  # currently AOTI automatically moves weights in here, TODO to move it out
│   │       ├── cg7domx3woam3nnliwud7yvtcencqctxkvvcafuriladwxw4nfiv.cubin
│   │       └── cubaaxppb6xmuqdm4bej55h2pftbce3bjyyvljxbtdfuolmv45ex.cubin
│   ├── weights
│   │  ├── model1.pt  # TODO to dedup weights between model1/model2
│   │  └── model2.pt
│   └── constants
│   │  ├── model1.pt  # TODO to dedup weights between model1/model2
│   │  └── model2.pt
│   └── sample_inputs
│      ├── model1.pt  # TODO to dedup weights between model1/model2
│      └── model2.pt
├── extra
│   └── user_metadata.txt
└── models
    ├── model1.json
    └── model2.json
```

Future todos:
- unbundle the weights -- instead of .pt, we can use bin files, which will also allow us to dedup weights if we store multiple models
- update aoti_compile_and_package to also save the exported program
- integrate TNR with this packaging flow

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152495
Approved by: https://github.com/yushangdi
2025-06-04 06:04:29 +00:00
75b24c273b Export torch::utils::tensor_to_numpy (#154178)
Fixes #154105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154178
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/youkaichao
2025-06-04 05:48:27 +00:00
7b074346e0 [Intel GPU] Support f32 intermediate dtype, headdim size <=576 and f32 causal mask for SDPA (#152091)
In OneDNN v3.7, SDPA has below defects:

1. The dtype of the intermediate value is the same as QKV, while PyTorch uses an FP32 intermediate dtype to ensure better accuracy.
2. Only headdim sizes <= 256 are supported.
3. Implicit causal mask is not supported when QKV is FP32; we need to build an attention mask explicitly with aten ops (see the sketch below).

OneDNN v3.8 has updates for these defects. Since these are tiny changes, I decided to put them in a single PR.
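
A small illustration of the explicit causal mask mentioned in item 3, built with plain aten ops (a sketch of the idea, not the actual XPU lowering):

```python
import torch
import torch.nn.functional as F

B, H, L, S, D = 1, 2, 8, 8, 16
q = torch.randn(B, H, L, D, dtype=torch.float32)
k = torch.randn(B, H, S, D, dtype=torch.float32)
v = torch.randn(B, H, S, D, dtype=torch.float32)

# Explicit additive causal mask: -inf above the diagonal, 0 elsewhere.
causal_mask = torch.zeros(L, S).masked_fill(
    torch.ones(L, S, dtype=torch.bool).triu(diagonal=1), float("-inf")
)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask)

# Should match the implicit-causal path.
ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
torch.testing.assert_close(out, ref)
```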

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152091
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/drisspg
2025-06-04 05:18:36 +00:00
4d93985d13 [c10d] Separate monitoring thread into a class in PGNCCL (#153977)
This is the start of a series of efforts to consolidate auxiliary threads in PGNCCL, aka the watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instance, i.e., if users create hundreds or thousands of PG or subPG instances, we end up with twice as many side threads, which is not efficient. We have an RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Right now both threads are assigned so many functionalities that it is hard to do the consolidation in one shot, so we will split it into at least two steps (PRs) to make it easier to test and review.

We did our first attempt in https://github.com/pytorch/pytorch/pull/153668, but we also want to see if we can make the monitoring thread a class. This PR does the first step of making the monitoring thread a class. The next step is to also extract the watchdog into a separate class so that we know its dependencies.

What we did in this PR:
1. Move all related variables and methods into a class named `HeartbeatMonitor`.
2. Correct some errors in the original logic inside the monitoring thread loop.
3. Move the error propagation check to the watchdog thread, where it is more relevant. This is totally fine since we rolled out EventCache fully, so watchdog hangs are rare now.

Today there are two major functions inside the heartbeat monitoring thread:
1. Check the heartbeat of the watchdog thread every 8 minutes. If no heartbeat is detected and we are sure the monitoring thread has not been stopped, we kill the program with SIG_ABORT.
2. Check TCPStore every 30 sec to see if any watchdog timeout happened on other ranks; if so, we initiate a dump signal on the current rank as well. (We do this only in the default PG.)

Differential Revision: [D75799278](https://our.internmc.facebook.com/intern/diff/D75799278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
2025-06-04 04:07:07 +00:00
ec35a36820 [ROCm][Windows] Fix building tests for multiple architectures (#154979)
Fixing building C10_CUDA_ALL_TEST_FILES and Caffe2_HIP_TEST_SRCS for multiple architectures

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154979
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-04 03:53:21 +00:00
72fe1d5f42 Add randint_like tensor overload for high (#154899)
Fixes #135664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154899
Approved by: https://github.com/StrongerXi
ghstack dependencies: #154863
2025-06-04 03:37:09 +00:00
6b0c6f2856 [BE] Delete pre-CUDA-10.1 code from SparseCUDABlas (#155079)
As the latest PyTorch is no longer buildable against CUDA-10, this is essentially dead code.

Made a small change to the hipify script to rename `cusparseGetErrorString` to `hipsparseGetErrorString`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155079
Approved by: https://github.com/atalman, https://github.com/cyyever
2025-06-04 03:29:24 +00:00
9f39028629 [MPS][BE] Move sigmoid op to Metal (#155080)
Fixes https://github.com/pytorch/pytorch/issues/154895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155080
Approved by: https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #154936, #155002, #155081
2025-06-04 03:28:11 +00:00
437df54cc8 [Inductor] Fix a few FX conversion bugs. (#154958)
# Feature
This PR fixes two bugs with Inductor's FX backend.
1. When extracting offsets from `ReinterpretView`'s, we accidentally took the offset of the parent layout instead of the view's layout. This case is triggered when multiple kernels write into the same buffer due to `torch.cat`.
2. In certain rare cases, `V.graph.graph_inputs` can contain a constant input value. In case this happens, create a new `sympy.Symbol` for the input, for compatibility with the existing `SymbolBuffer` abstraction mapping to an FX placeholder.  This case is triggered when calling `torch._inductor.compile` on  certain modules coming from `torch.export`.

# Test plan
Added a couple of tests exposing these bugs.
1. Concat with multiple kernels writing to the same buffer.
2. `Export` -> `torch._inductor.compile` with a constant input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154958
Approved by: https://github.com/jansel
2025-06-04 03:09:44 +00:00
3e57de1251 [ONNX] Create support for rotary embeddings (#154745)
This PR registers the RotaryEmbedding op in the `torch.ops.onnx` namespace and allows the exporter to recognize and export ONNX operators.

## Design

ONNX operators of their respective opset versions are implemented in torch/onnx/ops/_impl.py and are registered in the torch.ops.onnx namespace following this rule:

`OpType-version => torch.ops.onnx.OpType.opset{version}`

For example, `RotaryEmbedding-23` becomes `torch.ops.onnx.RotaryEmbedding.opset23`

This name is parsed by the exporter to create an onnx node in the graph without having to go through translation.

When users use the ops in the model, we provide more convenient, unversioned functions under `torch.onnx.ops` that will dispatch to the implementations based on user input (type and provided attributes). For example, users can directly call `torch.onnx.ops.rotary_embedding()` to use the op natively in their pytorch models. I chose snake case naming to make the functions more pythonic and aligned with other torch apis.
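
A hedged usage sketch; the function name `torch.onnx.ops.rotary_embedding` comes from this PR, but the argument names below follow the ONNX RotaryEmbedding-23 inputs and are assumptions:

```python
import torch


class RotaryBlock(torch.nn.Module):
    def forward(self, x, cos_cache, sin_cache, position_ids):
        # Unversioned convenience wrapper that dispatches to
        # torch.ops.onnx.RotaryEmbedding.opset23 under the hood.
        # Argument names are assumptions based on the ONNX-23 op inputs.
        return torch.onnx.ops.rotary_embedding(
            x, cos_cache, sin_cache, position_ids=position_ids
        )


# Exporting such a module should map the call directly onto an ONNX
# RotaryEmbedding node instead of a decomposed subgraph.
```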

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154745
Approved by: https://github.com/titaiwangms
2025-06-04 03:07:43 +00:00
37e6bf8adf Switch to _apply_to_tensors for dataclass input (#154897)
Fixes https://github.com/pytorch/pytorch/issues/153077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154897
Approved by: https://github.com/weifengpy
2025-06-04 02:19:52 +00:00
34e3930401 fix numpy compatibility for 2d small list indices (#154806)
Will fix #119548 and linked issues once we switch from warning to the new behavior,
but for now, given how much this syntax was used in our test suite, we suspect a silent change will be disruptive.
We will change the behavior after 2.8 branch is cut.
Numpy behavior was changed at least in numpy 1.24 (more than 2 years ago)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154806
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/albanD
2025-06-04 01:58:52 +00:00
e2760544fa [PT] expose FlightRecord API for building (#154866)
Summary: as titled

Test Plan:
CI

Rollback Plan:

Differential Revision: D75803611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154866
Approved by: https://github.com/fduwjj, https://github.com/d4l3k
2025-06-04 01:25:52 +00:00
d8e4c1c363 [BE] Define REGISTER_UNARY_TI_DISPATCH (#155081)
This creates a `_kernel_mps` function that takes an iterator and calls the stub for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155081
Approved by: https://github.com/dcci
ghstack dependencies: #154936, #155002
2025-06-04 01:15:37 +00:00
50de6ae253 Revert "[BE][Ez]: Fully type nn.utils.clip_grad (#154801)"
This reverts commit 9ce2732b685da527308dc2dc4b2eeb4e252f57d1.

Reverted https://github.com/pytorch/pytorch/pull/154801 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154801#issuecomment-2937886337))
2025-06-04 00:41:27 +00:00
40a8770154 Incorporate coalesce analysis in codegen (#153751)
This PR uses the coalescing information when generating a tiling. The previous tiling heuristic would have each dependency generate a tiling; we would then sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristic scores each tiling by its globally coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes.

In triton heuristics, for generating 3d tiled reductions, we take the same total block size that the 2d reduction would use, then distribute the block according to whichever block coalesces the most memory.

The motivating kernel is in https://github.com/pytorch/pytorch/issues/149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor.

While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
2025-06-04 00:22:57 +00:00
6c2f941e25 [dynamo][dynamic] Recompilation hint for nn module integer attributes (#154867)
For a program like this:

```
class Mod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.c = 0

    def forward(self, x):
        self.c += 1
        return x * self.c
```

You can check the recompile reasons at https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpzv9z6Q/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/856a95fd-0533-4abc-a213-1f73ae2cb766)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154867
Approved by: https://github.com/zou3519
ghstack dependencies: #154780
2025-06-04 00:05:53 +00:00
cbdacd32fe [AOTI][Intel GPU] Support multi_arch_kernel_binary option for XPU. (#154514)
Following the design of #154413, this PR add XPU support for generating kernel binary files that support multiple archs.

Fixes #154682, Fixes #154683, Fixes 154689, Fixes #154685 , Fixes #154690, Fixes #154681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154514
Approved by: https://github.com/desertfire, https://github.com/EikanWang
2025-06-03 23:02:00 +00:00
8f0e3f446d [Inductor UT] Reuse test_fused_attention.py for Intel GPU. (#154110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154110
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #154091
2025-06-03 23:01:05 +00:00
6c40e6606f [Inductor] Add attention pattern for model DistilBert in transformers==4.44.2. (#154091)
This PR adds an attention fusion pattern that matches the attention of
DistilBert in transformers==4.44.2 at
953196a43d/src/transformers/models/distilbert/modeling_distilbert.py (L212)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154091
Approved by: https://github.com/jansel, https://github.com/eellison
2025-06-03 23:01:05 +00:00
4224a7df01 [Cutlass] EVT dynamic shapes support (#154835)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154835
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #154775, #154761, #154829
2025-06-03 22:20:34 +00:00
36596ad2a0 [Cutlass] fp8 dynamic shapes test (#154829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154829
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #154775, #154761
2025-06-03 22:20:33 +00:00
1c2b9cecd2 [Cutlass] Support bias arg for fp8 GEMM (#154761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154761
Approved by: https://github.com/drisspg
ghstack dependencies: #154775
2025-06-03 22:20:27 +00:00
5735729597 [Cutlass] Cleanup gemm_template evt handling (#154775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154775
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-06-03 22:20:18 +00:00
71499fee6b [3/3] Add build rule and test for Graph in nativert (#154532)
We split the large PR for adding Graph.h and Graph.cpp to nativert into 3 smaller PRs:

1. Add header file
2. Add source file
3. **Add test and build rules**

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.

Differential Revision: [D75495273](https://our.internmc.facebook.com/intern/diff/D75495273/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154532
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #154530, #154531
2025-06-03 21:52:05 +00:00
b4c399d445 [2/3] Add source file for Graph in nativert (#154531)
We split the large PR for adding Graph.h and Graph.cpp to nativert into 3 smaller PRs:

1. Add header file
2. **Add source file**
3. Add test and build rules.

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.

Differential Revision: [D75492405](https://our.internmc.facebook.com/intern/diff/D75492405/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154531
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #154530
2025-06-03 21:51:52 +00:00
55873dcb0d [1/3] Add header file for Graph in nativert (#154530)
We split the large PR for adding Graph.h and Graph.cpp to `nativert` into 3 smaller PRs:
1. **Add header file**
2. Add source file
3. Add test and build rules.

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.

Differential Revision: [D75491860](https://our.internmc.facebook.com/intern/diff/D75491860/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154530
Approved by: https://github.com/SherlockNoMad
2025-06-03 21:51:47 +00:00
69a57d9486 add JSON output support for operator benchmark (#154410)
To better support the integration of operator benchmark performance data into the OSS benchmark database for the dashboard, I’ve added a JSON output format that meets the required specifications: https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database#output-format
Since the current operator benchmark already has a flag `--output-json` to support saving the results into a JSON file, I added a new flag `--output-json-for-dashboard` for this feature.
At the same time, I renamed `--output-dir` to `--output-csv` for a clearer and more intuitive name.
An example of the JSON output of the operator benchmark:
```
[
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M1_N1_K1_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 1, N: 1, K: 1, device: cpu"
      }
    },
    "model": {
      "name": "add_M1_N1_K1_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        2.074
      ],
      "target_value": null
    }
  },
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M64_N64_K64_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 64, N: 64, K: 64, device: cpu"
      }
    },
    "model": {
      "name": "add_M64_N64_K64_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        9.973
      ],
      "target_value": null
    }
  },
]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154410
Approved by: https://github.com/huydhn
2025-06-03 21:29:24 +00:00
8e1474d3c6 [inductor] small cleanups in torch/_inductor/codegen/mps.py (#154921)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154921
Approved by: https://github.com/jansel, https://github.com/Skylion007
2025-06-03 20:57:25 +00:00
cyy
debd095149 Avoid index integer overflow in gemm_notrans_ (#154809)
Use uint64_t index types to avoid
```
 torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long'
    #0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24
    #1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12
    #2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
    #3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154809
Approved by: https://github.com/soulitzer
2025-06-03 19:28:34 +00:00
10c3e6ec43 [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.
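
A hedged illustration of the new argument; the import mirrors what Inductor-generated wrappers already use, but passing the op name positionally and the exact error wording are assumptions:

```python
import torch
from torch._C._dynamo.guards import assert_size_stride

t = torch.empty(2, 3)
# Passes silently: size and stride match.
assert_size_stride(t, (2, 3), (3, 1), "aten.add.default")
# A mismatch would raise an assertion whose message now also mentions
# the operator name ("aten.add.default"), making the failing op easier to find.
```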

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

[inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) now extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that the generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-06-03 19:21:15 +00:00
cc96febb97 [dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)
I am working on providing some skip guard helper functions to allow users to reduce guard overhead. This is a refactor to allow that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154780
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-03 19:19:47 +00:00
ea7b233015 [flex attention][triton pin] triton_helpers shim for TMA apis (#154858)
Triton 3.4 will remove the experimental TMA apis: https://github.com/triton-lang/triton/pull/6488

To allow compatibility across different triton versions, we implement a shim layer which calls the new API if available, and otherwise falls back to the experimental API.

Test: `python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_causal_mask_cuda`, which previously failed with triton-lang/triton@cda4229558c5dca7f7c4734bedd3e596ebcae0b8 but now passes.

Note: we'll need to apply this for other things in inductor, this just does it for flex attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154858
Approved by: https://github.com/NikhilAPatel, https://github.com/drisspg
2025-06-03 19:15:48 +00:00
85fb13d0d1 [BE] Cleanup cuda 12.4 artifacts from scripts and workflows (#154893)
Remove artifacts. CUDA 12.4 was deprecated, hence there is no need to keep this code around.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154893
Approved by: https://github.com/nWEIdia, https://github.com/malfet, https://github.com/tinglvv
2025-06-03 18:43:40 +00:00
c014e9d7cd [inductor][test] test_padding.py: use inductor TestCase instead of dynamo TestCase (#154935)
test_pad_3d_tensor fails if you run it multiple times in a row, because the cache is populated and inductor skips the logic that increments the counter.

To fix this, switch these tests to use inductor's TestCase / run_tests instead of dynamo's - this way, a fresh inductor cache is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154935
Approved by: https://github.com/Skylion007
2025-06-03 18:36:44 +00:00
e8183f8d3d add #pragma once to stable/library.h (#154920)
This should have been there; it was an oversight that it was not! We do not want the same translation unit to process this header multiple times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154920
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-06-03 18:34:53 +00:00
6f7694f18f [dynamo] Reconstruct defaultdict properly (#154931)
`DefaultDictVariable` inherited `ConstDictVariable.reconstruct`, causing
dynamo to reconstruct a `DefaultDictVariable` into a plain dict rather than
a defaultdict. This patch fixes that.

Fixes #138412.
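
A small repro-style sketch of the behavior this fixes (whether this exact snippet compiles without graph breaks is an assumption):

```python
import collections
import torch


@torch.compile(fullgraph=True)
def fn(x):
    d = collections.defaultdict(list)
    d["out"].append(x + 1)
    return d


result = fn(torch.ones(2))
# With this fix the reconstructed object is a defaultdict again,
# not a plain dict, so missing keys keep working:
assert isinstance(result, collections.defaultdict)
_ = result["missing"]  # -> [] instead of KeyError
```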

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154931
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #154930
2025-06-03 18:18:40 +00:00
467235027c [AOTDispatch] Use the proper meta function for _amp_foreach_non_finite_check_and_unscale_ (#154930)
As title, this fixes part of #138412.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154930
Approved by: https://github.com/zou3519
2025-06-03 18:18:40 +00:00
462579af11 Update merge_rules.yaml (#155008)
- add new docs reviewers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155008
Approved by: https://github.com/malfet
2025-06-03 18:09:23 +00:00
f714599c57 [MPS][BE] Extend torch.special. to integer dtypes (#155002)
By changing the functor to looks as follows
```metal
struct xlog1py_functor {
  template <typename T, enable_if_t<is_floating_point_v<T>, bool> = true>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(c10::metal::xlog1py(a, b));
  }
  template <typename T, enable_if_t<is_integral_v<T>, bool> = true>
  inline float operator()(const T a, const T b) {
    return c10::metal::xlog1py(float(a), float(b));
  }
};
```

Repeat the same for `zeta`, `chebyshev_polynomial_[tuvw]_functor` and `hermite_polynomial_h[e]_functor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155002
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #154936
2025-06-03 17:52:41 +00:00
31405a69fb [typing] Add missing type annotations to torch.nn.init module (#154504)
## Summary

Adds missing type annotations to `torch.nn.init` and removes `# mypy: allow-untyped-defs` since all functions are now properly typed.

## Changes

- Added missing type annotations to initialization functions in the module.
- Added missing typing imports: `Any`, `Callable`, `Union`
- Removed `# mypy: allow-untyped-defs` comment
- Create Literal types for kaiming initialization mode and nonlinearity.
- Created `__all__`

## Why

Better IDE support, catches type errors earlier, and brings the module up to PyTorch's typing standards. No runtime changes - purely additive typing improvements.

Tested with existing test suite and lintrunner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154504
Approved by: https://github.com/Skylion007
2025-06-03 17:33:32 +00:00
40142978d7 Add type annotation to orthogonal_ (#154927)
Trivial change, but I want pyright to stop yelling at me
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154927
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-03 17:00:02 +00:00
1f131fe56b Update bug-report.yml (#154857)
Update issue template for binary data and numerical notes.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154857
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-03 16:13:07 +00:00
ff92b42fc3 [c10d][gloo] Integrate vendor generic FR into gloo (#152614)
This is a first quick prototype of FR integration for gloo. A few feature gaps:
- Input/Output numels for each collective
- Whether to use c10::Event or where to use it.
- Where to dump the FR traces. (The dump api is provided in this PR)

Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152614
Approved by: https://github.com/d4l3k
ghstack dependencies: #154929
2025-06-03 16:12:54 +00:00
283f876ab6 [PP] Fix disabled flaky tests (#154856)
Fix https://github.com/pytorch/pytorch/issues/154373, https://github.com/pytorch/pytorch/issues/154391, https://github.com/pytorch/pytorch/issues/154408, https://github.com/pytorch/pytorch/issues/154443, https://github.com/pytorch/pytorch/issues/154481

Because MultiProcContinousTest [now executes the tests with 8 GPUs instead of 2](https://github.com/pytorch/pytorch/pull/153653), our PP tests comparing gradients have become flakier due to the longer pipeline. The gradients are still close but we need to relax the tolerance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154856
Approved by: https://github.com/Skylion007
2025-06-03 15:55:29 +00:00
250e9af4da Removing per torch.compile audit. (#154572)
Removing https://pytorch.org/docs/stable/torch.compiler_best_practices_for_backends.html per torch.compile audit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154572
Approved by: https://github.com/williamwen42, https://github.com/svekars
2025-06-03 15:41:52 +00:00
3685b10170 Turn on compile with NVSHMEM (#154538)
Before:
`USE_NVSHMEM=1` needed to be explicitly set in the build environment.

After:
`USE_NVSHMEM=1` is the default for CUDA/Rocm on Linux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154538
Approved by: https://github.com/ngimel
2025-06-03 15:24:24 +00:00
a1a268aff5 [dtensor] fix simplefsdp mixed-precision training bugs (#154975)
This is a follow-up on the previous dtensor redistribute PR: https://github.com/pytorch/pytorch/pull/150740, which enables SimpleFSDP's mixed-precision training.

In the most recent integration in TorchTitan (https://github.com/pytorch/torchtitan/pull/1250), we found some discrepancies between SimpleFSDP's `fully_shard` and `replicate` modes when MPT is enabled. After debugging, I found the problem is in dtensor redistribute: `local_tensor` is taken out again from the original `input`, so the dtensor used for communication keeps its original precision instead of using `forward_dtype`.

This PR fixes this issue and corrects previously added test cases.

After fixing the bug, the loss curves of `fully_shard` and `replicate` mode match perfectly.

![loss](https://github.com/user-attachments/assets/a8faddae-a476-48c0-a411-3fe04d2233bd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154975
Approved by: https://github.com/tianyu-l
2025-06-03 14:47:36 +00:00
2608927cfb Solve for tilings (#153748)
Find variables that coalesce the reads and writes and score the total size. If uncoalesced memory expressions are found, look for additional tiling of variables which will coalesce memory accesses.

For instance, for the expression `(32*p0) // 2048`, tiling p0 by 64 will make this expression coalesced.
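
A small sympy sketch of why that tiling helps (illustration only; not the actual Inductor analysis code):

```python
import sympy

p0 = sympy.Symbol("p0", integer=True, nonnegative=True)
p0_outer, p0_inner = sympy.symbols("p0_outer p0_inner", integer=True, nonnegative=True)

index = sympy.floor(32 * p0 / 2048)               # the index: (32*p0) // 2048 == p0 // 64
tiled = index.subs(p0, 64 * p0_outer + p0_inner)  # tile p0 with an inner block of size 64

# Prints p0_outer + floor(p0_inner/64); the floor term is 0 for 0 <= p0_inner < 64,
# so after tiling the expression is just p0_outer: it advances by exactly 1 per
# outer tile step, which is the coalesced access pattern the analysis looks for.
print(sympy.simplify(tiled))
```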

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153748
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730
2025-06-03 14:37:30 +00:00
812deecaab Add option to define OpenBLAS version for manylinux Dockerfile_2_28_aarch64 (#150106)
Adds optional variable OPENBLAS_VERSION to `.ci/docker/common/install_openblas.sh` used to define which version of OpenBLAS to install. Adds argument to `Dockerfile_2_28_aarch64` image.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150106
Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet

Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
2025-06-03 14:35:54 +00:00
0adbde4d35 Analyze coalesced mem (#153730)
Analyze memory expressions to see if they contain a coalescing symbol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153730
Approved by: https://github.com/jansel
ghstack dependencies: #153723
2025-06-03 14:29:06 +00:00
e9266f807a [BE] Use vendored packaging for testing (#154946)
As the rest of torch uses it, tests should rely on it as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154946
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-03 14:22:53 +00:00
9cdce682a1 [MPS][BE] Reimplement log1p as Metal shader (#154936)
That should make it faster than the MPSGraph implementation, but it also
improves accuracy for small inputs by using the algorithm described in [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#1202), i.e. $log(1+x) = \frac{x * log(1+x)}{(1 + x) - 1}$ if $1 +x \neq 1$ else just $x$

Also tried using the first 3 terms of the Taylor series in Horner's form, which also seems to work fine, i.e. $log(1+x) \approx x * (1 -x (\frac{1}{2} -  \frac{x}{3}))$

Replaced the less accurate log1p implementation in `c10/metal/special_math.h` with the generic one.

Parametrized and modified the regression test to check accuracy for small values.
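
A quick Python illustration of the accuracy argument (double precision here rather than Metal floats, but the same cancellation behavior):

```python
import math


def log1p_goldberg(x: float) -> float:
    # log(1+x) = x * log(1+x) / ((1+x) - 1) when fl(1+x) != 1, else just x.
    y = 1.0 + x
    return x if y == 1.0 else x * math.log(y) / (y - 1.0)


x = 1e-12
print(math.log(1.0 + x))   # naive: loses most significant digits to the rounding in 1.0 + x
print(log1p_goldberg(x))   # agrees with the reference to (nearly) full precision
print(math.log1p(x))       # reference
```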

TODOs:
 - Do proper implementation for complex values as well, perhaps using 0408ba0a76/mlx/backend/metal/kernels/utils.h (L339)
 - May be implement it using Remez-like algorithm documented here 207f3b2b25/lib/msun/src/s_log1pf.c (L37)
 - Or use llvm's implementation from f393986b53/libclc/clc/lib/generic/math/clc_log1p.inc (L22)
 - Benchmark which algorithm is faster and delivers better accuracy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154936
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-06-03 14:10:13 +00:00
00dfd3891e [Tiling rewrite pt1] Normalize reads and writes to common iter space (#153723)
In order to take the globally best tiling, we need to normalize all the node read and writes to a common iteration space. This first pr finds a common split among nodes in a fused scheduler node, and then normalizes reads and writes to the common split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153723
Approved by: https://github.com/jansel
2025-06-03 14:04:34 +00:00
635b73e697 [dynamo][guards] Flush cache to more accurately measure guard overhead (#154764)
We observed that the guard overhead at runtime, measured from profiler traces, was
higher than what this profiling function reported at compile time.
After investigation, we found that f_locals were already in the cache,
which made the guard overhead appear much smaller when profiling
during compilation. To be more realistic, we flush the cache here.

Profiling the guard overhead during compilation (in addition to at
runtime) allows faster iteration time, and logging in tlparse and
internal databases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154764
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/StrongerXi
2025-06-03 11:50:57 +00:00
71a0af8a14 [TEST][Quantization] Skip test_learnable due to hypothesis (#152819)
As per comment in https://github.com/pytorch/pytorch/issues/111471#issuecomment-1866933243 the tests are failing due to hypothesis. This PR adds a skip to those tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152819
Approved by: https://github.com/eqy
2025-06-03 11:23:15 +00:00
ea5b9eca74 Combine sticky pgo key with job id (#154863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154863
Approved by: https://github.com/Mingming-Ding
2025-06-03 07:58:38 +00:00
a4da1d4a47 [Graph Partition] support standalone_compile (#154698)
For graph partition, `write_get_raw_stream_header_once` is done once so the autotune code may not have the header. This PR additionally calls `write_get_raw_stream_header` in `codegen_device_guard_enter` before `get_raw_stream` is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154698
Approved by: https://github.com/oulgen
2025-06-03 07:40:42 +00:00
d91c85babb [c10d][fr] Split cuda and non-cuda fr logic into two cpp file (#154929)
During the integration of FR with gloo, I found that putting all the logic inside one cpp file guarded by both build macros does not work with the current linkage setup in the bazel file. If we put the cpp in libtorch_cpu, the cuda-side build fails; if we put it in both, we get a complaint about `ld.lld: error: duplicate symbol: typeinfo for c10d::DebugInfoWriter`. To fix this, we move the common logic into another header file and use different cpp files for cpu and cuda so that FR can be used in both cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154929
Approved by: https://github.com/kwen2501
2025-06-03 07:00:14 +00:00
13044b2b04 Move c10/macros/Export.h to torch/standalone (#154850)
Summary: The goal of this PR and future follow-up PRs is to group a set of header files required by AOTInductor Standalone in a separate directory, ensuring they are implemented in a header-only manner.

Test Plan: CI

Differential Revision: D75756619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154850
Approved by: https://github.com/janeyx99
2025-06-03 06:18:59 +00:00
a7e496a896 Revert "[dynamo] Record the pre-graph bytecode using fast record function event (#154769)"
This reverts commit 409c396a48584de1ab14e1be6957663d548ad89e.

Reverted https://github.com/pytorch/pytorch/pull/154769 on behalf of https://github.com/seemethere due to This fails internal tests see [fburl.com/diff/67gyp7gp](https://fburl.com/diff/67gyp7gp) ([comment](https://github.com/pytorch/pytorch/pull/154769#issuecomment-2933629894))
2025-06-03 06:13:49 +00:00
b86aaaae0b Revert "[dynamo][guards] Flush cache to more accurately measure guard overhead (#154764)"
This reverts commit 7dee89913072f1499c5265d8e92d23c30fc6a7f1.

Reverted https://github.com/pytorch/pytorch/pull/154764 on behalf of https://github.com/seemethere due to This fails internal tests see [fburl.com/diff/67gyp7gp](https://fburl.com/diff/67gyp7gp) ([comment](https://github.com/pytorch/pytorch/pull/154769#issuecomment-2933629894))
2025-06-03 06:13:49 +00:00
d375e64279 [cutlass backend][forward fix] hex the cutlass key instead of decode (#154885)
This is mainly following how it is done for torch_key.

Error was:
```
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154885
Approved by: https://github.com/jingsh, https://github.com/mlazos
2025-06-03 06:00:16 +00:00
8af447224e Improve error message for torch.fft.ihfft2 when input's dtype is complex (#149692)
Fixes #149625

For the case mentioned in the issue, will get:

```
RuntimeError: Only supports floating-point dtypes, but found: ComplexDouble
```
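
A minimal repro sketch matching the case from the issue:

```python
import torch

x = torch.randn(4, 4, dtype=torch.complex128)
# Raises: RuntimeError: Only supports floating-point dtypes, but found: ComplexDouble
torch.fft.ihfft2(x)
```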

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149692
Approved by: https://github.com/malfet
2025-06-03 05:54:56 +00:00
295ea202f6 [inductor] Add kernel_hash_key to ChoiceCaller (#154470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154470
Approved by: https://github.com/mlazos
2025-06-03 04:01:49 +00:00
cyy
388912dd94 Remove AttributeError constructor (#154808)
It is a private API and uses C vsnprintf, which is not type safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154808
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-03 03:49:09 +00:00
ef92653022 Revert "Remove AttributeError constructor (#154808)"
This reverts commit 3239da0c732c4ad736df7081ea44c1cd79c01145.

Reverted https://github.com/pytorch/pytorch/pull/154808 on behalf of https://github.com/cyyever due to Need format code ([comment](https://github.com/pytorch/pytorch/pull/154808#issuecomment-2933286113))
2025-06-03 03:40:41 +00:00
b3cb0e83de [FSDP2] respect reshard_after_forward=True for root model (#154704)
resolve https://github.com/pytorch/pytorch/issues/154655

`fully_shard(root, reshard_after_forward=True)` didn't really reshard parameters after forward, because we assumed the root model would be used in backward immediately. The assumption becomes invalid in 2 cases:
* we have 3 roots for CLIP, T5, FLUX; we should reshard the parameters of CLIP and T5 immediately after their forward
* for a recommendation model, we may have multiple roots for the dense part

Change the default behavior to always respect `reshard_after_forward=True` (see the sketch below).
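
A hedged sketch of the multi-root setup this change cares about; `build_clip/build_t5/build_flux` are placeholder constructors, and the `fully_shard` import path is assumed to be the composable FSDP2 API:

```python
import torch
# Assumed import path for the composable FSDP2 API.
from torch.distributed.fsdp import fully_shard

clip, t5, flux = build_clip(), build_t5(), build_flux()  # placeholder constructors

# With this change, reshard_after_forward=True is respected for root modules,
# so CLIP's and T5's parameters are resharded right after their forward
# instead of staying gathered while waiting for a backward that comes much later.
for root in (clip, t5, flux):
    fully_shard(root, reshard_after_forward=True)
```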

Differential Revision: [D75663200](https://our.internmc.facebook.com/intern/diff/D75663200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154704
Approved by: https://github.com/mori360
2025-06-03 03:12:45 +00:00
ff35c0cdfd [inductor] Change _constexpr_to_value -> _unwrap_if_constexpr (#154905)
To adapt to the changes from: f480e2f697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154905
Approved by: https://github.com/davidberard98
2025-06-03 03:10:56 +00:00
cyy
e3cf73ee49 Move remaining CI jobs to VS 2022 (#154811)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154811
Approved by: https://github.com/huydhn
2025-06-03 02:21:24 +00:00
3239da0c73 Remove AttributeError constructor (#154808)
It is a private API and uses C vsnprintf, which is not type safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154808
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-03 02:18:51 +00:00
28cb3c0fe5 [test][inductor] attempt to fix duplicate registration issue (#154865)
Fixes #154216

In #154216, there's a duplicate registration error thrown from registering `test::foo` twice. I expect that this is caused by having two tests that both register a `test::foo` op in the same test file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154865
Approved by: https://github.com/NikhilAPatel, https://github.com/jingsh
2025-06-03 01:11:47 +00:00
6cb6da6ea2 [triton pin][test] relax codecache test checks for number of triton artifacts (#154879)
Triton has added another artifact that gets generated (triton-lang/triton#6992), so `test_cache_load_function` started failing as there are now 8 (instead of 7) artifacts.

Instead of figuring out a way to check exactly which set of artifacts will get generated, I instead modified the test to just check that there are _at least_ 6 artifacts, to account for different platforms (intel/amd/nvidia) and different triton versions (which may or may not have a `.source` artifact)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154879
Approved by: https://github.com/oulgen, https://github.com/masnesral
2025-06-03 00:52:54 +00:00
7f44b589be [dynamo] fix pruning locals with ShapeEnvSource (#154752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154752
Approved by: https://github.com/zhxchen17
2025-06-03 00:35:11 +00:00
47a142c3c2 [triton pin][tests] update inductor/profiler launch_(enter|exit)_hooks tests (#154894)
Fixes #154223

Triton has updated launch_(enter|exit)_hooks so that they are now in `knobs`. @danzimm already fixed this in #152457 - this just updates the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154894
Approved by: https://github.com/jingsh, https://github.com/NikhilAPatel
2025-06-03 00:14:14 +00:00
731acbfb0b [CI] Reuse old whl on PRs (#154662)
Turn off main branch only gating for reusing old whls
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154662
Approved by: https://github.com/huydhn
2025-06-03 00:10:39 +00:00
af9f18e87e [nativert] Free stale execution frames (#154636)
Summary:
This was implemented in SR because the caching of runtime instances built up and caused memory usage spikes after a large amount of traffic went through the model; once traffic went down, SR was still caching all the previous usage.

We need something similar on the Sigmoid side to make sure the static dispatch modules aren't hogging memory. Currently, all ExecutionFrame objects are being cached, and never freed if stale.

Test Plan:
Added extra execution frames in tmp commit D75257998 and ran local replayer test to confirm extra execution frames get cleaned up down to min size, which is set at 8

 {F1978532047}

Also tested by modifying load_net_predictor (modifications also in D75257998) to run benchmarkNumIterations twice - once with benchmarkNumThreads, and once with only one thread. Also set clearing interval at one second. Verified that execution frames get cleared when we drop down to one thread.

 {F1978558984}

```
buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details
```

Differential Revision: D75257992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154636
Approved by: https://github.com/zhxchen17, https://github.com/dolpm
2025-06-02 23:44:12 +00:00
37eb909c94 Revert "[Inductor] Add attention pattern for model DistilBert in transformers==4.44.2. (#154091)"
This reverts commit 7b25ff7cf2e6096c103da0068e417216a41be7a9.

Reverted https://github.com/pytorch/pytorch/pull/154091 on behalf of https://github.com/seemethere due to I root caused this PR to some failures, I tried to resolve with https://github.com/pytorch/pytorch/pull/154923 but it looks like there are more failures with my fix ([comment](https://github.com/pytorch/pytorch/pull/154091#issuecomment-2932848880))
2025-06-02 23:22:43 +00:00
ac65e94f45 Revert "[Inductor UT] Reuse test_fused_attention.py for Intel GPU. (#154110)"
This reverts commit 2dfc0e33273fe50dcbb3d363da02c8cc485b4adc.

Reverted https://github.com/pytorch/pytorch/pull/154110 on behalf of https://github.com/seemethere due to This is part of a stack with failures internally, I tried to resolve with https://github.com/pytorch/pytorch/pull/154923 but it looks like there are more failures ([comment](https://github.com/pytorch/pytorch/pull/154110#issuecomment-2932845168))
2025-06-02 23:20:11 +00:00
e3af628b0d Revert "Add CPython exception tests (#150789)"
This reverts commit 67fb9b7cc3f7d2ebbb104296f2b11776f4adbb22.

Reverted https://github.com/pytorch/pytorch/pull/150789 on behalf of https://github.com/seemethere due to This is failing upstream in trunk, see 67fb9b7cc3 ([comment](https://github.com/pytorch/pytorch/pull/150789#issuecomment-2932823586))
2025-06-02 23:12:15 +00:00
7dee899130 [dynamo][guards] Flush cache to more accurately measure guard overhead (#154764)
We observed that the guard overhead at runtime, measured from profiler traces, was
higher than what this profiling function reported at compile time.
After investigation, we found that f_locals were already in the cache,
which made the guard overhead appear much smaller when profiling
during compilation. To be more realistic, we flush the cache here.

Profiling the guard overhead during compilation (in addition to at
runtime) allows faster iteration time, and logging in tlparse and
internal databases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154764
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #154769
2025-06-02 23:01:58 +00:00
409c396a48 [dynamo] Record the pre-graph bytecode using fast record function event (#154769)
![image](https://github.com/user-attachments/assets/1d06618b-1c14-4ed5-ab7b-dcfecbb4d632)

Adds another event in the profiler traces. This can help us find models where pre-graph bytecode is very expensive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154769
Approved by: https://github.com/zou3519, https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel
2025-06-02 22:33:27 +00:00
f6b83d4cc6 sort iteration over index vars (#154846)
Fix for https://github.com/pytorch/pytorch/issues/154741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154846
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
2025-06-02 22:06:00 +00:00
d6420d4f85 [CI] Reuse old whl: replace the version (#154773)
Replace the git version, so whl name goes from `torch-something+git<old commit>` to `torch-something+git<new commit>`

Renamed a bunch of variables to hopefully be more clear

Tested on ef210ad54b
* Removed gating that prevents it from running on PRs (which is going to be merged soon)
* Removed gating that checks for which files can be changed (since this PR has stuff outside of the acceptable list)
* The above two allow the whl to be reused, and I added assert 1 == 2 in common_utils and checked that jobs failed (meaning they were using updated code despite not building)

Checked that the whl in the docker image has the right commit sha, didn't check torch.__version__ though
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154773
Approved by: https://github.com/malfet
2025-06-02 22:02:41 +00:00
e1644e40a7 [ez][TD] Fix TD indexer workflow (#154868)
Update docker image, and fix gpu flag env var

Example failure: https://github.com/pytorch/pytorch/actions/runs/15381170311/job/43272174443

Tested on 9cb28f03e5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154868
Approved by: https://github.com/Skylion007
2025-06-02 21:33:19 +00:00
104c31598f [cutlass backend][ez] Make load config from local more resilient (#154740)
Differential Revision: D75693211

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154740
Approved by: https://github.com/ColinPeppler
2025-06-02 21:12:12 +00:00
731e635c95 Add CPython math/cmath tests (#150794)
Tests:
* test_math.py
* test_cmath.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_math" "test_cmath"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150794
Approved by: https://github.com/zou3519
2025-06-02 20:49:44 +00:00
67fb9b7cc3 Add CPython exception tests (#150789)
----

* test_baseexception.py
* test_exceptions.py
* test_exception_variations.py
* test_raise.py
* test_sys.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:
```bash
for f in "test_raise" "test_sys" "test_exceptions" "test_baseexception" "test_exception_variations"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150789
Approved by: https://github.com/zou3519
2025-06-02 20:44:41 +00:00
48807d568e [CI][CUDA] Migrate remaining cu118 jobs to cu128 (#154169)
Contributing to the fix of #147383   and #154119

Additional steps required: cu118 in 3218b1b684/.github/workflows/lint.yml needs to be updated.
Make install_cuda.sh accept both 12.8 and 12.8.* as CUDA_VERSION argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154169
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman, https://github.com/tinglvv
2025-06-02 20:22:14 +00:00
9d3ad82ca7 [dynamo] Remove all skipIfTorchDynamo in test_tensor_creation_ops.py (#154693)
Looks like they are no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154693
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-06-02 20:14:35 +00:00
984b1a80e3 [ez] add docs for *eager_then_compile stances (#154818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154818
Approved by: https://github.com/williamwen42
ghstack dependencies: #154802, #154826, #154822, #154823, #154805
2025-06-02 19:04:35 +00:00
28f27886eb Vary batch size when running dynamic shapes benchmarks (#154805)
This better measures the actual runtime performance of dynamic shapes
where we aren't guaranteed to have similar shapes as the original hint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154805
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802, #154826, #154822, #154823
2025-06-02 18:56:18 +00:00
33f2d0ff45 add reference to stances from dynamic shapes doc (#154823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154823
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
ghstack dependencies: #154802, #154826, #154822
2025-06-02 18:47:19 +00:00
d99e9568ec Add docs for how to mark as unbacked (#154822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154822
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802, #154826
2025-06-02 18:30:57 +00:00
1258aac1c2 [dynamo] Upcast torch.Size + tuple to be of size torch.Size (#154830)
Fixes https://github.com/pytorch/pytorch/issues/154432
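For reference, a tiny illustration of the behavior this targets (based on the issue above; the exact reproducer there may differ):

```python
import torch

# In eager mode concatenating a torch.Size with a plain tuple stays a torch.Size;
# the fix makes torch.compile match instead of downgrading the result to a tuple.
s = torch.Size([2, 3]) + (4,)
print(type(s), s)  # <class 'torch.Size'> torch.Size([2, 3, 4])
```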

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154830
Approved by: https://github.com/StrongerXi, https://github.com/Skylion007, https://github.com/williamwen42
2025-06-02 17:57:23 +00:00
9fe1b40d17 [ez] add dynamic sources docs (#154826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154826
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802
2025-06-02 17:53:30 +00:00
69e22301da Revert "[inductor] Add kernel_hash_key to ChoiceCaller (#154470)"
This reverts commit 7a79de1c0f31200f95a48a9e69fbd2df2a3c735d.

Reverted https://github.com/pytorch/pytorch/pull/154470 on behalf of https://github.com/seemethere due to Failing internal inductor tests, author is aware and suggested revert. D75767762 ([comment](https://github.com/pytorch/pytorch/pull/154470#issuecomment-2931717432))
2025-06-02 17:43:23 +00:00
113224b530 Enable non blocking remote cache write (#154837)
Test Plan:
Ran
```
buck2 run mode/opt //scripts/oulgen:runner
```
twice
and got

https://fburl.com/scuba/pt2_remote_cache/u7u1uqh1

Differential Revision: D75770423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154837
Approved by: https://github.com/jamesjwu
2025-06-02 17:36:43 +00:00
67067512a1 Revert "[BE] Cleanup old ExecuTorch codegen and runtime code (#154165)"
This reverts commit 515c19a3856e953c0fe23a0ed4fa844f8eea34d8.

Reverted https://github.com/pytorch/pytorch/pull/154165 on behalf of https://github.com/seemethere due to This is failing when attempting to test against executorch main internally, author has acknowledged that this should be reverted ([comment](https://github.com/pytorch/pytorch/pull/154165#issuecomment-2931489616))
2025-06-02 16:28:46 +00:00
981bdb39ca Enable ConvTranspose3D for FP32 and Complex64 (#154696)
Fixes #154615

Enables ConvTranspose3D, since support appears to exist on both macOS 14 and 15.

For the half dtypes, the discrepancy between the CPU and GPU implementations is too large to conclude whether there is a bug in the implementation without a more rigorous study of the bounds on the expected error. So they are left unsupported for now, and an assert is added to notify the user if the op is called with fp16 or bf16 inputs.

Tests for ConvTranspose3D were enabled for the supported data types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154696
Approved by: https://github.com/malfet
2025-06-02 16:24:03 +00:00
77d85a4629 Symintify baddbmm (#154656)
Previously we would specialize on the shape in this if-statement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154656
Approved by: https://github.com/pianpwk
2025-06-02 15:23:14 +00:00
e22be781b7 Symintify repeat_interleave (#154660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154660
Approved by: https://github.com/pianpwk
2025-06-02 15:19:39 +00:00
cyy
f6275bf0fe Bump pocketfft submodule to the latest (#154845)
Fixes #154843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154845
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-02 14:54:13 +00:00
dfd6849e77 Update lint_urls.sh (#154838)
Do not match empty URL pieces like "https://".
Add headers for better handling of URLs like "https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154838
Approved by: https://github.com/Skylion007
2025-06-02 14:50:34 +00:00
c65e9ad77a Update slow tests (#154347)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154347
Approved by: https://github.com/pytorchbot
2025-06-02 11:30:56 +00:00
ff4515fde5 Add optional check_pinning argument to _validate_sparse_compressed_tensor/coo_args (#154759)
As in the title.

A prerequisite to https://github.com/pytorch/pytorch/pull/154638 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154759
Approved by: https://github.com/amjames, https://github.com/ngimel
ghstack dependencies: #154610
2025-06-02 10:17:07 +00:00
3f3c1f419f User-controlled sparse tensor validation when loading data from external storage (#154610)
This PR lets users control sparse tensor invariants validation (which can be expensive, especially for sparse tensors with many indices) when loading data from external sources.

By default, the validation of sparse tensor invariants is disabled.
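As a hedged illustration (not necessarily the exact switch this PR wires up), the existing invariant-check toggle can be used to opt back in around a load:

```python
import torch

# Opt back in to the (potentially expensive) invariant checks while loading;
# by default they stay disabled for performance. The file name is hypothetical.
with torch.sparse.check_sparse_tensor_invariants():
    st = torch.load("sparse_checkpoint.pt")
```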

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154610
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-06-02 10:17:07 +00:00
9258cfc227 [audio hash update] update the pinned audio hash (#154776)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154776
Approved by: https://github.com/pytorchbot
2025-06-02 05:36:13 +00:00
16d05e130c [CI][CUDA][UCC] Update test_c10d_ucc.py - remove xfailIfLinux because it now succeeds (#150979)
pytest -v test/distributed/test_c10d_ucc.py  -k test_save_load
============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.12.3, pytest-8.1.1, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/opt/pytorch/pytorch/.hypothesis/examples'))
rootdir: /opt/pytorch/pytorch
configfile: pytest.ini
plugins: anyio-4.9.0, hypothesis-6.130.13, flakefinder-1.1.0, rerunfailures-15.0, xdist-3.6.1, xdoctest-1.0.2, typeguard-4.3.0
collected 63 items / 62 deselected / 1 selected
Running 1 items in this shard

test/distributed/test_c10d_ucc.py::DistributedDataParallelTest::test_save_load_checkpoint PASSED [65.2581s]                                                                                               [100%]

================================================================================== 1 passed, 62 deselected in 68.78s (0:01:08)

@ptrblck @eqy @tinglvv @atalman @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150979
Approved by: https://github.com/eqy
2025-06-02 03:24:35 +00:00
cd3d2b75b3 Update README.md - James has the wrong github link. (#151473)
Unless I'm wrong, the James on the pytorch paper is not the account linked to in the README.md.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151473
Approved by: https://github.com/albanD
2025-06-02 01:53:44 +00:00
515c19a385 [BE] Cleanup old ExecuTorch codegen and runtime code (#154165)
Summary: These files were added to pytorch/pytorch before ExecuTorch was
open-sourced. Now is a good time to remove them from pytorch/pytorch, since
the code has already been moved to pytorch/executorch.

Test Plan: Rely on CI jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154165
Approved by: https://github.com/kimishpatel, https://github.com/Skylion007, https://github.com/cyyever
2025-06-02 01:47:02 +00:00
0d0058d90d Fix flaky test in test_custom_ops (#152484)
Hopefully fixes https://github.com/pytorch/pytorch/issues/151301, https://github.com/pytorch/pytorch/issues/151281 by making the ops have different names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152484
Approved by: https://github.com/zou3519
2025-06-02 01:45:28 +00:00
80af98c6c3 [BE]: Update nlohmann submodule to 3.12.0 (#154817)
This is mostly compiler fixes, C++20 fixes, and clang-tidy fixes. Should be entirely backwards compatible with our current version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154817
Approved by: https://github.com/jansel, https://github.com/malfet
2025-06-02 01:29:58 +00:00
2b2245d5db [BE]: Replace printf with fmtlib call (#154814)
Safer, faster, more concise, and better type checking. Also add a few misc changes in the file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154814
Approved by: https://github.com/jansel
2025-06-01 22:27:08 +00:00
206e9d5160 [BE]: Update cpp-httplib submodule to 0.20.1 (#154825)
Updates cpp-httplib to 0.20.1. This mostly brings in a bunch of CMake fixes, CXX compiler fixes, and bugfixes from upstream. It's a header-only library, so it should be pretty straightforward to upgrade.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154825
Approved by: https://github.com/malfet
2025-06-01 21:44:23 +00:00
064bb3cebc [BE]: Replace a couple of call sites with fmtlib printf (#154533)
This is a faster and memory-safe implementation of printf-style functions, coming from fmtlib.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154533
Approved by: https://github.com/cyyever, https://github.com/jansel
2025-06-01 21:16:34 +00:00
0350c7e72c [BE] Introduce torch.AcceleratorError (#152023)
It inherits from `RuntimeError` and carries an `error_code`, which in the case of CUDA holds the error returned by `cudaGetLastError`.

`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's  [`PyErr_SetString`](cb8a72b301/Python/errors.c (L282)), namely
- Convert cstr into Python string with `PyUnicode_FromString`
- Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](cb8a72b301/Python/errors.c (L32))
- Set `error_code` property using `PyObject_SetAttrString`
- decref all temporary references

Test that it works and captures CPP backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'

import torch

x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
    x[y] = 2
    print(x)
except torch.AcceleratorError as e:
    print("Exception was raised", e.args[0])
    print("Captured error code is ", e.error_code)
```

which produces following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0

Captured error code is  710
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: #154436
2025-06-01 21:02:43 +00:00
f7c09f864a [Docs] Reformat sparse example (#154785)
Not sure why, but rst fails to colorize multiline inputs while working fine for single-line commands.
Test plan:
| [Before](https://docs.pytorch.org/docs/main/sparse.html#construction)  | [After](https://docs-preview.pytorch.org/pytorch/pytorch/154785/sparse.html#construction) |
| ------------- | ------------- |
| <img width="466" alt="image" src="https://github.com/user-attachments/assets/96a5c52a-1804-4d05-a5cf-c10221aaddf6" />  | <img width="477" alt="image" src="https://github.com/user-attachments/assets/99565288-5c0b-4e8e-bd60-f016ebc207b5" />  |

Fixes https://github.com/pytorch/pytorch/issues/154779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154785
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2025-06-01 20:56:14 +00:00
c2e9115757 Fix typo in dcp module (#154815)
Fixed the docstring in `validate_checkpoint_id`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154815
Approved by: https://github.com/Skylion007
2025-06-01 18:18:45 +00:00
b90fc2ec27 [ez] delete code that died a long time ago (#154802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154802
Approved by: https://github.com/Skylion007
2025-06-01 14:57:03 +00:00
0cd18ba1ca [BE][Ez] Update deprecated pybind11 functions (#154798)
* getType() is deprecated; replace it with the new/proper static methods. These are backwards compatible with the old pybind11 versions we support, so break this off before we upgrade to pybind11 3.0 in #154115, where these methods are dropped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154798
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-06-01 06:17:50 +00:00
bfae151269 [BE][Ez]: Remove unneeded mypy suppressions (#154800)
Improvements in typing have made this suppression unnecessary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154800
Approved by: https://github.com/cyyever, https://github.com/jansel
2025-06-01 06:10:41 +00:00
9cbbc2593b test for 146431 (#154786)
Adds test for #146431 that was fixed by #154746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154786
Approved by: https://github.com/Skylion007, https://github.com/galv

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-06-01 04:17:54 +00:00
cyy
5616fa4a68 [Submodule] Bump flatbuffers to v24.12.23 (#143964)
This sub-module has not been updated for a long time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143964
Approved by: https://github.com/Skylion007
2025-06-01 02:25:57 +00:00
c33fc9dae3 [BE][Ez]: Update VulkanMemoryAllocator to 3.3.0 (#154796)
The last update to this submodule was 3 years ago; the API is pretty stable, and this is a minor version release update. Part of a series of PRs to eradicate low required CMake versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154796
Approved by: https://github.com/jansel
2025-06-01 00:30:56 +00:00
9ce2732b68 [BE][Ez]: Fully type nn.utils.clip_grad (#154801)
Fully types clip_grad and exposes typing annotations that were hidden by a bad decorator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154801
Approved by: https://github.com/jansel
2025-05-31 23:06:45 +00:00
dbad6d71c7 [BE][Ez]: Unskip conv1d MPS test (#154795)
Fixes an issue I noticed where the conv1d test is unconditionally skipped for complex types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154795
Approved by: https://github.com/jansel
2025-05-31 23:01:19 +00:00
b85c460749 [BE][Ez]: Update NVTX submodule to 3.2.1 (#154797)
Update NVTX3 submodule to 3.2.1.
* Mostly improved compiler support, Python support, and better CMake and C++ support.
* Also has a few new APIs to support fancy new features.
* This is a header-only library, so it should be an easy, non-invasive change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154797
Approved by: https://github.com/jansel
2025-05-31 23:01:13 +00:00
6a781619bf Temporarily disable sparse tensor validation when loading from external storage. (#154758)
As in the title per https://github.com/pytorch/pytorch/issues/153143#issuecomment-2917793067 .

The plan is to work out a solution that will allow (1) disabling the pinned memory check to fix the original issue and (2) switching off sparse tensor validation for maximal performance when loading sparse tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154758
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-05-31 19:45:44 +00:00
c99e91b1d7 [BE]Enhance _get_clean_triton.py to auto-generate launch_params if missing (#154666)
Previously, @Chillee wrote a script https://github.com/pytorch/pytorch/pull/125811 to remove inductor dependency for inductor compiled triton kernels. We'd like to automate the process of obtaining the launch parameters.

Added functionality to the torch/utils/_get_clean_triton.py to automatically generate the launch_params file if it does not exist and the auto_generate_params flag is set to True. This includes running the input file in a subprocess with the appropriate environment variable. Updated the get_clean_triton function and the main script to support this new feature, allowing users to disable auto-generation via a command-line argument.

# Test Plan
test embedding op in TritonBench
```
# generate inductor compiled triton kernels
TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_FX_GRAPH_CACHE=0 python run.py --op embedding  --mode fwd  --precision fp32 --metrics nsys_rep --only inductor_embedding  --num-inputs 1 --input-id 11
# run the script to get rid of inductor dependency. By default, triton_only_repro.py is the output file name.
python ~/pytorch/torch/utils/_get_clean_triton.py ~/tritonbench/torch_compile_debug/run_2025_05_29_14_47_50_497790-pid_849274/torchinductor/model__0_forward_1.0/output_code.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154666
Approved by: https://github.com/davidberard98
2025-05-31 19:27:56 +00:00
c014e4bcaa Fix typo in vec256 interleave2 (#154784)
Fix a typo where the elements in a vector are mislabeled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154784
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-05-31 14:17:10 +00:00
daff263062 [Functorch] Support Functorch for PrivateUse1 backend (#154700)
This PR enables functorch to be used in 3rd-party backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154700
Approved by: https://github.com/zou3519
2025-05-31 07:28:45 +00:00
15e9119a69 [BE] install_triton_wheel.sh update for internal dev (#154637)
internal devgpu gets mad at `pip install ...` but `python3 -m pip install ...` is fine
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154637
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-31 06:57:56 +00:00
7368eeba5e [dynamo][guards] Prevent LENGTH guard on nn modules (#154763)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154763
Approved by: https://github.com/williamwen42
2025-05-31 05:32:31 +00:00
7a79de1c0f [inductor] Add kernel_hash_key to ChoiceCaller (#154470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154470
Approved by: https://github.com/mlazos
2025-05-31 03:09:37 +00:00
bd10ea4e6c Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit ad26ec6abe51d528124bc5fbbacaa87aef077ab8.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923997777))
2025-05-31 02:14:24 +00:00
43390d8b13 ROCm Sparsity through HipSparseLT (#150578)
TLDR:

- This pull request introduces support for hipSPARSELt in ROCm; the current usage is semi-structured sparsity.
- Requires **ROCm 6.4** && **gfx942/gfx950**.
- The average performance uplift (compared to dense operations) is ~20% on ROCm 6.4, with further performance gains expected along the way.

### Dense vs. Sparse Performance Comparison

#### **NT (Row-major)**
**Average Uplift**: `1.20`

| M     | N      | K      | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-------|--------|--------|-------------------------|-------------------------------|--------|
| 14336 | 8      | 4096   | 20.05                   | 25.3                          | 1.26   |
| 4096  | 8      | 14336  | 21.07                   | 25.28                         | 1.20   |
| 3072  | 3072   | 10240  | 299.05                  | 351.82                        | 1.18   |
| 3072  | 1536   | 768    | 18.56                   | 20.05                         | 1.08   |
| 3072  | 17664  | 768    | 163.13                  | 173.91                        | 1.07   |
| 3072  | 196608 | 768    | 1717.30                 | 1949.63                       | 1.14   |
| 3072  | 24576  | 768    | 206.84                  | 242.98                        | 1.17   |
| 3072  | 6144   | 768    | 53.90                   | 56.88                         | 1.06   |
| 3072  | 98304  | 768    | 833.77                  | 962.28                        | 1.15   |
| 768   | 1536   | 768    | 8.53                    | 19.65                         | 2.30   |
| 768   | 17664  | 768    | 46.02                   | 46.84                         | 1.02   |
| 768   | 196608 | 768    | 463.15                  | 540.46                        | 1.17   |
| 768   | 24576  | 768    | 54.32                   | 59.55                         | 1.10   |
| 768   | 6144   | 768    | 19.47                   | 20.15                         | 1.03   |
| 768   | 98304  | 768    | 231.88                  | 258.73                        | 1.12   |

---

#### **NN (Row-major)**
**Average Uplift**: `1.13`

| M   | N      | K     | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-----|--------|-------|-------------------------|-------------------------------|--------|
| 768 | 1536   | 3072  | 27.50                   | 28.78                         | 1.05   |
| 768 | 17664  | 3072  | 125.06                  | 158.94                        | 1.27   |
| 768 | 196608 | 3072  | 1568.38                 | 1767.12                       | 1.13   |
| 768 | 24576  | 3072  | 171.05                  | 203.49                        | 1.19   |
| 768 | 6144   | 3072  | 58.72                   | 60.39                         | 1.03   |
| 768 | 98304  | 3072  | 787.15                  | 887.60                        | 1.13   |

-------------------------

This pull request introduces support for hipSPARSELt in ROCm, alongside various updates and improvements to the codebase and test suite. The changes primarily involve adding configuration flags, updating conditional checks, and ensuring compatibility with hipSPARSELt.
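Not part of the PR itself, but as a rough usage sketch: the kind of 2:4 semi-structured workload this backend accelerates looks roughly like the following (the API names come from the existing torch.sparse semi-structured support; shapes are illustrative).

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Build a 2:4-sparse fp16 weight by zeroing 2 of every 4 elements along each row.
W = torch.randn(3072, 768, dtype=torch.float16, device="cuda")
mask = torch.tensor([1, 1, 0, 0], dtype=torch.bool, device="cuda").tile(W.shape[0], W.shape[1] // 4)
W_sparse = to_sparse_semi_structured(W * mask)

X = torch.randn(768, 3072, dtype=torch.float16, device="cuda")
Y = W_sparse @ X  # dispatched to the sparse GEMM library (hipSPARSELt on supported ROCm GPUs)
```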

### ROCm and hipSPARSELt Support:

* [`BUILD.bazel`](diffhunk://#diff-7fc57714ef13c3325ce2a1130202edced92fcccc0c6db34a72f7b57f60d552a3R292): Added `@AT_HIPSPARSELT_ENABLED@` substitution to enable hipSPARSELt support.
* [`aten/CMakeLists.txt`](diffhunk://#diff-0604597797bb21d7c39150f9429d6b2ace10b79ab308514ad03f76153ae8249bR104-R110): Introduced a conditional flag to enable hipSPARSELt support based on ROCm version.
* [`aten/src/ATen/CMakeLists.txt`](diffhunk://#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777R37): Added `AT_HIPSPARSELT_ENABLED` configuration.
* [`aten/src/ATen/cuda/CUDAConfig.h.in`](diffhunk://#diff-8bb82da825ca87c28233abacffa1b0566c73a54990b7a77f3f5108d3718fea15R11): Defined `AT_HIPSPARSELT_ENABLED` macro.
* `caffe2/CMakeLists.txt`, `cmake/Dependencies.cmake`, `cmake/public/LoadHIP.cmake`: Included hipSPARSELt in the ROCm dependencies. [[1]](diffhunk://#diff-c5ee05f1e918772792ff6f2a3f579fc2f182e57b1709fd786ef6dc711fd68b27R1380) [[2]](diffhunk://#diff-12e8125164bbfc7556b1781a8ed516e333cc0bf058acb7197f7415be44606c72L1084-R1084) [[3]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5R153)

### Codebase Updates:

* [`aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp`](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6): Added hipSPARSELt support checks and initialization functions. Updated various methods to conditionally handle hipSPARSELt. [[1]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6) [[2]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R22-R67) [[3]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R78-R85) [[4]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R97-R109) [[5]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R183-R188) [[6]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L134-R200) [[7]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R213-R222) [[8]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L217-R285)

### Test Suite Updates:

* [`test/test_sparse_semi_structured.py`](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65): Added checks for hipSPARSELt availability and updated test conditions to skip tests not supported on ROCm. [[1]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65) [[2]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR228) [[3]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR239) [[4]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR250) [[5]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR579) [[6]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR624) [[7]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR661) [[8]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR695) [[9]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR730) [[10]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR755) [[11]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR771) [[12]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR809) [[13]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR844) [[14]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cL840-R854) [[15]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR1005)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150578
Approved by: https://github.com/jeffdaily
2025-05-31 02:03:40 +00:00
cyy
ad26ec6abe Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (its currently shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 01:54:35 +00:00
3e71016459 Revert "Aten vector default constructors set to 0, add fnmadd and fnmsub (#154298)"
This reverts commit 489afa829a248ca64c4b2dffe2e6d601b8816cf9.

Reverted https://github.com/pytorch/pytorch/pull/154298 on behalf of https://github.com/izaitsevfb due to breaks linux-jammy-aarch64-py3.10 / build ([comment](https://github.com/pytorch/pytorch/pull/154298#issuecomment-2923966688))
2025-05-31 01:51:59 +00:00
489afa829a Aten vector default constructors set to 0, add fnmadd and fnmsub (#154298)
Test Plan: The only functional change is zero-initialization instead of undefined-initialization. If tests pass, I think it should be fine.

Differential Revision: D75345074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154298
Approved by: https://github.com/swolchok
2025-05-31 01:32:45 +00:00
472773c7f9 [nativert] move OpKernelKind enum to torch (#154756)
Summary: att

Test Plan: ci

Differential Revision: D75703996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154756
Approved by: https://github.com/zhxchen17, https://github.com/cyyever
2025-05-31 01:31:29 +00:00
f01e628e3b Resubmit Remove MemPoolContext (#154042) (#154746)
Summary: Per title

Test Plan: Added tests + existing tests

Differential Revision: D75695030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154746
Approved by: https://github.com/malfet
2025-05-31 01:21:54 +00:00
932733e0e6 Fix memory leaks in mps_linear_nograph (#154765)
Fixes some memory leaks that were identified as part of the investigation of https://github.com/pytorch/pytorch/issues/154329. This doesn't appear to be the whole solution, but I wanted to merge it anyway since it's a quick fix.

In my tests I see roughly 3MB of unexpected memory growth before this change, and after this change I see 2.2MB of memory growth
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154765
Approved by: https://github.com/malfet
2025-05-31 00:46:12 +00:00
108422ac26 Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 78624679a876a21acb14bf075ba6beccff21b9a0.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923785799))
2025-05-31 00:28:03 +00:00
da4aacabac Add h100_distributed label (#154562)
Add h100_distributed label, testing distributed 3D composability tests on 8*H100 GPU node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154562
Approved by: https://github.com/seemethere
2025-05-31 00:17:43 +00:00
9b5308cd58 [upstream triton] support build with setup.py in ./python/ or in ./ (#154635)
Upstream Triton has moved setup.py from python/ to ./. This PR allows both layouts to be built by checking the location of setup.py and choosing the cwd of the build commands accordingly.
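Roughly, the location check amounts to something like this (a sketch with assumed paths, not the actual build script):

```python
import os
import subprocess

triton_root = "third_party/triton"  # assumed checkout location
# Newer Triton keeps setup.py at the repo root; older versions keep it under python/.
build_dir = triton_root if os.path.exists(os.path.join(triton_root, "setup.py")) \
    else os.path.join(triton_root, "python")
subprocess.check_call(["python3", "setup.py", "bdist_wheel"], cwd=build_dir)
```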
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154635
Approved by: https://github.com/atalman
2025-05-31 00:15:43 +00:00
b019a33f8f [ez][CI] Reuse old whl: remove old zip/whl (#154770)
Forgot that unzip doesn't get rid of the zip so the old one is still there

Unrelated: figure out how to update the git version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154770
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2025-05-31 00:13:24 +00:00
0fab32290a Revert "[draft export] avoid storing intermediate real tensors in proxies (#154630)"
This reverts commit 5acb8d50801e6d110790993464611314dd1bd54b.

Reverted https://github.com/pytorch/pytorch/pull/154630 on behalf of https://github.com/malfet due to This still ooms, at least occasionally see 78624679a8/1 ([comment](https://github.com/pytorch/pytorch/pull/154630#issuecomment-2923759745))
2025-05-31 00:07:56 +00:00
faf973da5e [refactor] move materialize_as_graph to _higher_order_ops/utils.py (#154070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154070
Approved by: https://github.com/zou3519
2025-05-31 00:06:44 +00:00
cyy
78624679a8 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (its currently shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 00:01:52 +00:00
5f1c3c67b2 [pgo] log dynamic whitelist in PT2 Compile Events (#154747)
Summary: logs the whitelist to PT2 Compile Events

Test Plan: loggercli codegen GeneratedPt2CompileEventsLoggerConfig

Reviewed By: bobrenjc93

Differential Revision: D75617963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154747
Approved by: https://github.com/angelayi
2025-05-30 23:54:24 +00:00
bbda22e648 [BE][Ez]: Optimize unnecessary lambda with operator (#154722)
Automated edits performed by FURB118. The `operator` module is implemented in C and is much faster when passed to another C function such as `sorted` or `max` as a `key=`.
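For example, the kind of rewrite FURB118 performs:

```python
import operator

pairs = [("a", 3), ("b", 1), ("c", 2)]

# Before: a Python-level lambda is invoked once per element.
sorted(pairs, key=lambda p: p[1])

# After: operator.itemgetter is implemented in C, so sorted() stays in C the whole time.
sorted(pairs, key=operator.itemgetter(1))
```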

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154722
Approved by: https://github.com/jansel
2025-05-30 23:47:10 +00:00
0f3db20132 [ez][CI] Do not reuse old whl if deleting files (#154731)
Thankfully very few commits actually delete files, so I don't think this has affected anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154731
Approved by: https://github.com/Skylion007
2025-05-30 22:35:13 +00:00
eb93c0adb1 [inductor][AMD] support special kwargs in AMD triton configs (#154605)
**Context**:

AMD triton kernels can be launched with special kwargs, like `waves_per_eu`. Triton configs with these kwargs look like this:

```
triton.Config({
    "BLOCK_SIZE": 64,
    "waves_per_eu": 2,
})
```

in comparison, nvidia's special kwargs are explicit parameters on the config, e.g. num_warps:

```
triton.Config(
    {"BLOCK_SIZE": 64},
    num_warps=4,
)
```

**Problem**: this causes custom triton kernels w/ PT2 to error out, because there's a kwarg in the triton.Config that doesn't appear in the kernel signature.

**Solution**: When splicing in the constexpr values into the arg list, ignore any values in the config kwargs list if they don't appear in the function signature.
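A minimal sketch of that filtering idea (an assumed helper, not the actual Inductor code):

```python
import inspect

def constexpr_args_from_config(kernel_fn, config_kwargs):
    # Only splice in config kwargs that are real parameters of the kernel;
    # backend-specific launch options such as `waves_per_eu` are skipped here
    # and left for the launcher to consume.
    params = inspect.signature(kernel_fn).parameters
    return {k: v for k, v in config_kwargs.items() if k in params}
```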

Differential Revision: [D75599629](https://our.internmc.facebook.com/intern/diff/D75599629/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D75599629/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154605
Approved by: https://github.com/njriasan
2025-05-30 22:24:32 +00:00
1193bf0855 Revert "convert inductor codecache to use getArtifactLogger (#153766)"
This reverts commit 5b6fd277f954b789649501e21e9689a42d565e13.

Reverted https://github.com/pytorch/pytorch/pull/153766 on behalf of https://github.com/malfet due to I want to revert this change as I'm 90+% certain it somehow broke testing ([comment](https://github.com/pytorch/pytorch/pull/153766#issuecomment-2923620806))
2025-05-30 22:20:07 +00:00
26aa8dcf27 [ONNX] Simplify onnx test dependencies (#154732)
Simplify onnx test dependencies and bump onnxscript to 0.3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154732
Approved by: https://github.com/Skylion007
2025-05-30 21:58:04 +00:00
5acb8d5080 [draft export] avoid storing intermediate real tensors in proxies (#154630)
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.

While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict, there's 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.

By avoiding storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation.

Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-05-30 21:06:55 +00:00
abc2264e8f remove another instance of mtia_workloadd from pytorch (#154739)
Summary: ^

Test Plan: CIs

Differential Revision: D75692171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154739
Approved by: https://github.com/sraikund16
2025-05-30 20:50:46 +00:00
22a4cabd19 [Inductor] Add NaN assert to returned values from generated code (#154455)
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.

Test Plan:
NaN asserts properly generated in example gemm script:

    vars = (buf1, primals_2, buf2, primals_1, )
    for var in vars:
        if isinstance(var, torch.Tensor):
            assert not var.isnan().any().item()
            assert not var.isinf().any().item()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
2025-05-30 20:32:56 +00:00
ed1ff7d0fb [BE][Ez]: Update mimalloc submodule to 2.2.3 (#154720)
Updating minor version of mimalloc. The old version is more than 2 years old, and the newer release has performance fixes and compiler fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154720
Approved by: https://github.com/jansel
2025-05-30 20:17:13 +00:00
2f03673ebf [BE][Ez]: Enable ClangFormat aten/src/core/Formatting.cpp (#154719)
Follow up to #152830 . Noticed the file was excluded from fromatting, opt in to clang-format since it's really close anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154719
Approved by: https://github.com/jansel
2025-05-30 19:52:43 +00:00
f57754e815 [Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#154618)
This is a follow-up PR of the reverted one https://github.com/pytorch/pytorch/pull/148981 re-opening for visibility :

Modified TorchInductor’s autotuning flow so that each best_config JSON file also includes the Triton “base32” (or base64) cache key.

Motivation

Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.

Also, since Triton already stores the cubin/hsaco in its cache, developers/researchers can avoid setting store_cubin = True: they can get the cubin/hsaco from the Triton cache, and with the code provided in this PR they can easily match the best_config with the right Triton cache directory for the “best” kernel.
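A hypothetical sketch of how that lookup could work; the file path and field name below are assumptions, not necessarily what the PR writes:

```python
import json
import os

best = json.load(open("model__0_forward/kernel.best_config"))  # hypothetical path
cache_key = best["triton_cache_hash"]  # base32 cache key recorded alongside the tuned config
# Triton's on-disk cache is keyed by the same hash, so the compiled cubin/hsaco and IRs
# for the winning config live under that directory.
print(os.path.join(os.path.expanduser("~/.triton/cache"), cache_key))
```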

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154618
Approved by: https://github.com/jansel
2025-05-30 19:30:25 +00:00
d6edefefbf [CUDA] Fixes for backwards in memefficient attn for large tensors (#154663)
followup to #154029.

@ngimel Backwards had the same problem as well so this PR fixes it and adds support for logsumexp computation in the forward pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154663
Approved by: https://github.com/ngimel
2025-05-30 19:30:07 +00:00
d89d213118 Fix test_tensorboard when started w/o tensorboard package (#154709)
If `TEST_TENSORBOARD == False` then `DataType` is not defined or imported. However, it is used unconditionally when defining the test with `parametrize`, which leads to a NameError crashing the test execution on start.

Provide a dummy to make it syntactically correct. Tests will be skipped on start.

```
  File "/dev/shm/build/pytorch-v2.2.1/test/test_tensorboard.py", line 885, in <module>
    class TestTensorProtoSummary(BaseTestCase):
  File "/dev/shm/build/pytorch-v2.2.1/test/test_tensorboard.py", line 889, in TestTensorProtoSummary
    (torch.float16, DataType.DT_HALF),
                    ^^^^^^^^
NameError: name 'DataType' is not defined
Got exit code 1, retrying...
test_tensorboard 1/1 failed! [Errno 2] No such file or directory: '/dev/shm/build/pytorch-v2.2.1/.pytest_cache/v/cache/stepcurrent/test_tensorboard_0_0dba8bc00bbe233f'
```
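A minimal sketch of the fallback described above (the module path and class contents are assumptions, not the exact patch):

```python
try:
    from tensorboard.compat.proto.types_pb2 import DataType
    TEST_TENSORBOARD = True
except ImportError:
    TEST_TENSORBOARD = False

    class DataType:  # dummy so parametrize at class-definition time doesn't raise NameError
        DT_HALF = None
        DT_FLOAT = None
        DT_INT8 = None
# Tests that reference DataType are still skipped when TEST_TENSORBOARD is False.
```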

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154709
Approved by: https://github.com/Skylion007
2025-05-30 19:18:43 +00:00
22641f42b6 [Binary-builds]Use System NCCL by default in CI/CD. (#152835)
Use system NCCL by default. The correct NCCL version is already built into the manylinux docker image.

Will follow up with a PR on detecting whether the user has NCCL installed and enabling USE_SYSTEM_NCCL by default in that case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152835
Approved by: https://github.com/malfet
2025-05-30 18:51:48 +00:00
967937872f [dynamo] Remove dead code path for torch.Tensor.view(*shape) (#154646)
This was introduced in early days of Dynamo, and looks like it's been
fixed since -- the regression test `test_transpose_for_scores` passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154646
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #154645
2025-05-30 18:50:58 +00:00
f9dc20c7a3 [dynamo] Fix syntax error in aot graph from kwarg-less torch.Tensor.[random_|uniform_] calls (#154645)
As title, fixes #151432, see more context in the issue discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154645
Approved by: https://github.com/zou3519
2025-05-30 18:50:58 +00:00
fb67fa9968 Revert "[Inductor] Add NaN assert to returned values from generated code (#154455)"
This reverts commit aec3ef100844631cb7c4ce2725157984eb9cebfe.

Reverted https://github.com/pytorch/pytorch/pull/154455 on behalf of https://github.com/malfet due to Looks like it broke inductor/test_compile_subprocess.py::CpuTests::test_AllenaiLongformerBase, see 35fc5c49b4/1(default%2C%20&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/154455#issuecomment-2923154249))
2025-05-30 18:45:01 +00:00
35fc5c49b4 Revert "[internal] Expose additional metadata to compilation callbacks (#153596)"
This reverts commit f889dea97dad3cc506d43e379a469334417040c8.

Reverted https://github.com/pytorch/pytorch/pull/153596 on behalf of https://github.com/izaitsevfb due to introduces bunch of callback-related failures on rocm ([comment](https://github.com/pytorch/pytorch/pull/153596#issuecomment-2923139061))
2025-05-30 18:39:27 +00:00
b6b9311f4f [BE][Ez]: Fix typo in dynamo utils #154639 (#154748)
Fixes a typo in #154639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154748
Approved by: https://github.com/ngimel
2025-05-30 18:39:01 +00:00
bbdf469f0e Add CPython dict tests (#150791)
Tests:
* test_dict.py
* test_ordered_dict.py
* test_userdict.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_dict" "test_ordered_dict" "test_userdict"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150791
Approved by: https://github.com/zou3519
2025-05-30 18:17:09 +00:00
2120eeb8de [BE][Ez]: Improve dynamo utils typing with TypeIs and TypeGuard (#154639)
Adds additional TypeIs and TypeGuard annotations to some _dynamo utils for extra type narrowing
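Illustrative only (not the exact helpers in torch._dynamo.utils), this shows the kind of narrowing these annotations enable:

```python
from typing import Union
from typing_extensions import TypeIs
import torch

def is_tensor(x: object) -> TypeIs[torch.Tensor]:
    return isinstance(x, torch.Tensor)

def fn(x: Union[torch.Tensor, int]) -> int:
    if is_tensor(x):
        return x.ndim  # type checkers narrow x to torch.Tensor here
    return x           # ...and to int here (TypeIs narrows both branches)
```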

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154639
Approved by: https://github.com/jansel
2025-05-30 18:09:50 +00:00
1b569e5490 Fix load_state_dict description (#154599)
Fixes #141364

Fix the missing description of the `assign` param.

## Test Result

### Before
![image](https://github.com/user-attachments/assets/5928c691-4e31-463b-aa0a-86eb8bb452e5)

### After
![image](https://github.com/user-attachments/assets/036631a2-0f20-4a71-95c3-2c0fd732293e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154599
Approved by: https://github.com/colesbury, https://github.com/mikaylagawarecki
2025-05-30 18:08:59 +00:00
30ac7f4d4e [EZ/Memory Snapshot] Remove Handle even if compile_context not set (#154664)
Summary: When setting the memory snapshot callback, we register and unregister callbacks for performance reasons. For ease of use, it makes sense to just remove all callbacks regardless of which flags are enabled. The enable stays behind a feature flag; this just changes the disable to ignore the flag itself.

Test Plan: Ran without any flags and saw all callbacks removed.

Differential Revision: D75636035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154664
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2025-05-30 18:08:37 +00:00
65d8dba735 [nativert] move layout planner settings to torch (#154668)
Summary: att

Test Plan: ci

Differential Revision: D75633031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154668
Approved by: https://github.com/zhxchen17
2025-05-30 17:33:27 +00:00
3bdceab124 [dynamo] fix: added star operator for graph_break_hints (#154713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154713
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2025-05-30 17:31:03 +00:00
802ffd06c8 [Export] Add math module for deserialization (#154643)
Summary: As title

Test Plan: ci

Differential Revision: D75580646

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154643
Approved by: https://github.com/yushangdi
2025-05-30 17:29:25 +00:00
fc0135ca11 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There have been a lot of dynamic shape fixes this year and tests pass, so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Differential Revision: [D75467694](https://our.internmc.facebook.com/intern/diff/D75467694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-30 17:23:36 +00:00
3027051590 [export] avoid float/bool specialization for scalar tensor construction (#154661)
Fixes #153411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154661
Approved by: https://github.com/angelayi
2025-05-30 17:18:21 +00:00
e7bf72c908 [multigraph] fix composabilty with aotautograd cache (#153526)
AOTAutogradCache uses FXGraphCache which uses the tracing context to get the ShapeEnv. Although the TracingContext global_context is cleared by the time we get around to reusing it, we don't actually need it. We just need the ShapeEnv in the TracingContext, which isn't cleared at the end of dynamo and does persist. This PR adds the tracing context manager around the specialized compile to ensure our caching infrastructure can get access to the ShapeEnv. A test was also added to prove correctness.
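A rough sketch (assumed function and argument names) of the idea: wrap the specialized compile in a tracing context so the caching layer can find the ShapeEnv.

```python
from torch._guards import TracingContext, tracing

def run_specialized_compile(backend, gm, fake_mode):
    # The ShapeEnv lives on fake_mode.shape_env; putting it in a TracingContext
    # lets FXGraphCache / AOTAutogradCache look it up during this compile.
    with tracing(TracingContext(fake_mode)):
        return backend(gm)
```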

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153526
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
ghstack dependencies: #153433, #153449
2025-05-30 16:56:17 +00:00
7183f52675 [dynamo] Support namedtuple subclass (#153982)
Fixes #133762. This involves
1. supporting tuple subclasses constructed inside the compile region.
2. handling the "fake" global scope associated with NamedTuple-generated
   `__new__`.
3. handling `namedtuple._tuplegetter` more faithfully.

Differential Revision: [D75488091](https://our.internmc.facebook.com/intern/diff/D75488091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153982
Approved by: https://github.com/jansel
ghstack dependencies: #154176
2025-05-30 16:14:37 +00:00
8002d22ce3 [dynamo] Trace into descriptor with __set__ (#154176)
As title, this patch basically implements
https://github.com/python/cpython/blob/3.11/Objects/object.c#L1371-L1452,
and makes the `__get__` handling more robust.

I ran into this while fixing #133762.

Differential Revision: [D75488090](https://our.internmc.facebook.com/intern/diff/D75488090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154176
Approved by: https://github.com/jansel
2025-05-30 16:14:37 +00:00
31f95b5d2e Revert "inductor codecache: include private inductor configs in cache key (#153672)"
This reverts commit 2c1cb38d9516e10474b4f12a2e839046648a71a8.

Reverted https://github.com/pytorch/pytorch/pull/153672 on behalf of https://github.com/malfet due to Looks like it regressed pr_time_benchmarks, see ba3f91af97/1 ([comment](https://github.com/pytorch/pytorch/pull/153672#issuecomment-2922759739))
2025-05-30 15:54:14 +00:00
4b1f047a33 Add CPython list/tuple tests (#150790)
Tests:
* test_list.py
* test_tuple.py
* test_userlist.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_raise" "test_list" "test_tuple" "test_userlist"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150790
Approved by: https://github.com/williamwen42
2025-05-30 15:53:38 +00:00
ba3f91af97 Type hints for distributions/utils (#154712)
Fixes #144196
Part of #144219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154712
Approved by: https://github.com/Skylion007
2025-05-30 15:50:31 +00:00
0f81c7a28d [CI] Pin the torchao version used when testing torchbench (#154723)
Summary: To fix a recent CI breakage. As a follow-up, the torchao pin in .github/ci_commit_pins/torchao.txt is 6 months old; we should bump it up once we verify this fix works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154723
Approved by: https://github.com/eellison
2025-05-30 15:04:26 +00:00
7e8532077f Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 1ece53b157db4425ad12cae31fb570c591dc19e7.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2922369830))
2025-05-30 13:16:33 +00:00
cyy
1ece53b157 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (its currently shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-30 11:25:30 +00:00
9d6f0d5991 avoid sym_max on nested int in is_contiguous. (#154633)
Calling is_contiguous will fail because sym_max is not supported for nested int; this addresses it in a way consistent with
make_contiguous_strides_for.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154633
Approved by: https://github.com/bobrenjc93
2025-05-30 09:59:33 +00:00
3c05167489 [Intel GPU] fix matmul accuracy when offset > 0 (#154495)
This PR makes matmul tensors contiguous if they are not 64-byte aligned. oneDNN requires a minimum alignment of 64: https://uxlfoundation.github.io/oneDNN/dev_guide_c_and_cpp_apis.html#intel-r-processor-graphics-and-xe-architecture-graphics
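A minimal sketch of the workaround described above (the helper name is made up; the real change lives in the XPU matmul path):

```python
import torch

def align_for_onednn(t: torch.Tensor, min_align: int = 64) -> torch.Tensor:
    # oneDNN wants base pointers aligned to 64 bytes; views created with a
    # nonzero storage offset can violate that, so copy into a fresh allocation.
    if t.data_ptr() % min_align != 0:
        return t.clone(memory_format=torch.contiguous_format)
    return t
```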

Fixes https://github.com/intel/torch-xpu-ops/issues/1656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154495
Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang
2025-05-30 09:53:51 +00:00
aec3ef1008 [Inductor] Add NaN assert to returned values from generated code (#154455)
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.

Test Plan:
NaN asserts properly generated in example gemm script:

    vars = (buf1, primals_2, buf2, primals_1, )
    for var in vars:
        if isinstance(var, torch.Tensor):
            assert not var.isnan().any().item()
            assert not var.isinf().any().item()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
2025-05-30 08:53:24 +00:00
dc82e911e7 remove allow-untyped-defs from torch/utils/data/datapipes/iter/filelister.py (#154624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154624
Approved by: https://github.com/Skylion007
2025-05-30 08:38:05 +00:00
639f459cb6 Revert "[Inductor] Add NaN assert to returned values from generated code (#154455)"
This reverts commit c3de2c7c6bc865b9fabd2db8f2af6383936aa653.

Reverted https://github.com/pytorch/pytorch/pull/154455 on behalf of https://github.com/huydhn due to Sorry for reverting your change, I am trying to see if it helps fix the broken trunk below. If it does not help, I will reland the PR ([comment](https://github.com/pytorch/pytorch/pull/154455#issuecomment-2921562089))
2025-05-30 08:11:22 +00:00
f889dea97d [internal] Expose additional metadata to compilation callbacks (#153596)
These hooks are used by internal stuck job detection to associate compilation events with the compile lease. Previously, we only had events for Dynamo and Inductor compilation. And recently, the callback handler was updated to ignore nested events. So the Inductor event was only really used by lazy backward.

Here, I remove the inductor event and add an explicit lazy backward one. Additionally, I add other runtime compilation events: autotuning and cudagraphs. I also expose the CompileId as a string to avoid imports; this will let internal UIs track each graph's contribution to the timeout.

```python
class CallbackTrigger(enum.Enum):
    # most common case, dynamo attempts to trace a new frame
    DYNAMO = 1
    # backward compilation can be deferred to runtime
    LAZY_BACKWARD = 2
    # some backends autotune at runtime
    TRITON_AUTOTUNING = 3
    # cudagraphs record at runtime
    CUDAGRAPH_RECORDING = 4
```

Differential Revision: [D75092426](https://our.internmc.facebook.com/intern/diff/D75092426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153596
Approved by: https://github.com/masnesral
2025-05-30 08:07:04 +00:00
208965a9d6 Fix unbacked symint error (#154672)
## Summary

@laithsakka and I spoke offline about this one; the TL;DR is that we wanted this
![image](https://github.com/user-attachments/assets/2e537612-3261-4fbe-a6b9-f8ff92ba3c37)

to also be true for Inductor. In that vein, we added two new APIs to size-vars, `guard_or_false` and `guard_or_true`, with the following semantics:

guard_or_false, guard_or_true:

These APIs may add guards but will never fail with data-dependent errors: they try to evaluate the expression (possibly adding guards), and if that fails due to data dependency they return False or True respectively instead of hard failing.

When to use this?

Performance optimizations that warrant a recompilation.

Take the general path and add a runtime check.
```
# Consider this branching:
if x == 0:
    return 1
else:
    return 10

# To make it data-dependent friendly, it can be written as follows:
if guard_or_false(x == 0):
    return 1
else:
    torch.check(x != 0)  # runtime check
    return 10
```

However, there is still one more API to add to make this example work: torch.check, which works with expressions. I will leave that to @laithsakka.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154672
Approved by: https://github.com/laithsakka
2025-05-30 07:45:01 +00:00
5a7442b91f remove allow-untyped-defs from torch/distributed/checkpoint/resharding.py (#154626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154626
Approved by: https://github.com/Skylion007
2025-05-30 07:43:04 +00:00
d66a55def0 remove allow-untyped-defs from torch/distributed/elastic/utils/logging.py (#154625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154625
Approved by: https://github.com/Skylion007
2025-05-30 07:37:56 +00:00
382b38ed1b remove allow-untyped-defs from torch/nn/utils/_expanded_weights/conv_expanded_weights.py (#154623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154623
Approved by: https://github.com/Skylion007
2025-05-30 07:32:57 +00:00
bcbd2a22b2 [Intel GPU] OneDNN primitive cache support for Int4 WOQ gemm on XPU (#147693)
* add onednn primitive cache for int4 gemm for xpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147693
Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/guangyey, https://github.com/ZhiweiYan-96

Co-authored-by: Yan, Zhiwei <zhiwei.yan@intel.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-05-30 07:26:36 +00:00
0df96e3921 remove allow-untyped-defs from torch/ao/quantization/stubs.py (#154622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154622
Approved by: https://github.com/Skylion007
2025-05-30 07:26:09 +00:00
30f7079c93 [FSDP2] allow different dtypes for no grad model params (#154103)
Fixes #154082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154103
Approved by: https://github.com/weifengpy
2025-05-30 07:00:54 +00:00
d173ba5a75 Revert "Remove MemPoolContext (#154042)"
This reverts commit 3b38989b5f8f918cf1ad38bdade059608544af4b.

Reverted https://github.com/pytorch/pytorch/pull/154042 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154042#issuecomment-2921401100))
2025-05-30 06:53:37 +00:00
0fdd568b78 [forward fix] add support for MemoryFormat after type tightening (#154658)
Summary:
fixes error:
```
    raise AssertionError(f"Unexpected type in c_type_for_prim_type: {type_=}")
AssertionError: Unexpected type in c_type_for_prim_type: type_=MemoryFormat
```

after https://github.com/pytorch/pytorch/pull/154371 | D75568111

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/test:test_custom_ops -- --exact 'deeplearning/aot_inductor/test:test_custom_ops - test_export_extern_fallback_nodes (deeplearning.aot_inductor.test.test_custom_ops.TestAOTInductorProxyExecutor)'
```

Differential Revision: D75617432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154658
Approved by: https://github.com/Camyll, https://github.com/atalman, https://github.com/malfet
2025-05-30 06:53:25 +00:00
a4b0023f3b [cutlass backend] Cache config generation locally and remotely (#154686)
Summary:
Trying to cache the json list of configs.

There is probably some more work left:
* preset
* filelock (?)
* for cases where we generate from scratch, save it to local as well (?)

Test Plan: tested offline

Reviewed By: coconutruben

Differential Revision: D75334439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154686
Approved by: https://github.com/coconutruben, https://github.com/ColinPeppler
2025-05-30 05:40:46 +00:00
ba51f4876d Revert "Enable C++ dynamic shape guards by default (#140756)"
This reverts commit dc0f09a4785349fc3b4e4d3dc3c02b018e5a0534.

Reverted https://github.com/pytorch/pytorch/pull/140756 on behalf of https://github.com/izaitsevfb due to seem to break dynamo tests ([comment](https://github.com/pytorch/pytorch/pull/140756#issuecomment-2921151663))
2025-05-30 03:52:02 +00:00
852b99eba0 Revert "[c10d] Separate monitoring thread into a class in PGNCCL (#153977)"
This reverts commit 0db9c64d68dcdf25210357c4f7a41618441091d4.

Reverted https://github.com/pytorch/pytorch/pull/153977 on behalf of https://github.com/izaitsevfb due to breaks lots of jobs internally, safer to revert, see D75628917 ([comment](https://github.com/pytorch/pytorch/pull/153977#issuecomment-2921146129))
2025-05-30 03:46:43 +00:00
20ee5f9044 remove allow-untyped-defs from elastic_distributed_sampler.py (#154620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154620
Approved by: https://github.com/Skylion007
2025-05-30 03:29:45 +00:00
9c06dff1ce [multigraph] use specializations in compile_and_call_fx_graph (#153449)
The goal of this multigraph work is to enable a compiled region that has a single dynamo trace but multiple backend specializations. This work was inspired by vLLM which does this in a somewhat hacky way where they use a custom backend to capture a dynamo graph and then manually invoke compile_fx multiple times to get specialized graphs.

There's really two parts of this work:

**The frontend changes:**
1) we introduce an optional kwarg `specialize_on` to mark_{dynamic,unbacked} that takes in a list of specializations. I debated other methods including specifying specializations via decorators, but ultimately decided this approach was more harmonious. The big issue with decorators is the difficulty of composing well with the rest of the torch.compile ecosystem including graph breaks, lazy initialization of variable trackers and symbolic variables, etc.

**The backend changes (this PR):**
1) We capture the backend_specialization specified in the mark_{dynamic,unbacked} API into a SymbolicContext. See changes in `/_dynamo/variables/builder.py`
2) After we are done dynamo tracing, we will lazily (more on this later) invoke `call_user_compiler` up to N + 1 times for N specializations and 1 generic graph. Under the hood this will call compile_fx, which composes nicely with both Async Compile and AOTAutogradCache. We do this by using a context manager to patch in specialization specific axioms into the ShapeEnv before invoking the user compiler.
3) When we have specializations, we install a lazy specialized dispatch function that checks each specialization and dispatches to the first one that matches. Instead of doing all of the specialization compiles up front, we do the compiles lazily. The first time a specialization is invoked, we will do the compilation and save it in a cache so subsequent invocations are fast. If none of the specializations match, we dispatch to the generic graph. I decided to do this over returning N different GuardedCodes since 1) it doesn't pollute the dynamo cache (eg. if you have 8 specializations, you would hit the cache limit) 2) it naturally incorporates the hierarchical lattice structure of the guards since the specializations are always necessarily stricter than the generic region's guards.

I benchmarked this PR stack with #152596 and found around a 50% reduction when dispatching to the specialized regions:

![495269647_576053105510082_9189856138964956774_n](https://github.com/user-attachments/assets/66030fed-d62e-4d87-940f-aa13c99b1a73)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153449
Approved by: https://github.com/zou3519
ghstack dependencies: #153433
2025-05-30 03:19:49 +00:00
c3de2c7c6b [Inductor] Add NaN assert to returned values from generated code (#154455)
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.

Test Plan:
NaN asserts properly generated in example gemm script:

    vars = (buf1, primals_2, buf2, primals_1, )
    for var in vars:
        if isinstance(var, torch.Tensor):
            assert not var.isnan().any().item()
            assert not var.isinf().any().item()

Differential Revision: D74691131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
2025-05-30 03:09:37 +00:00
4a302b5731 NativeRT readme (#154581)
Summary: att

Test Plan: ci

Differential Revision: D75557667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154581
Approved by: https://github.com/Skylion007, https://github.com/zhxchen17, https://github.com/yiming0416
2025-05-30 02:50:53 +00:00
adfd5b293a Enhance UT on elapsed_time for XPUEvent (#154494)
# Motivation
UT enhancement to guard against the incorrect elapsed time returned by XPU's Event.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154494
Approved by: https://github.com/EikanWang
2025-05-30 02:00:02 +00:00
0289313551 [AOTI] Support OptionalTensor return type in AOTI proxy executor (#154286)
Summary:

When a C++ custom op returns an uninitialized tensor, it will be marked as None in Python. For this scenario, the user should mark the possibly uninitialized return as Tensor? in the custom op schema.
This diff adds the `as_optional_tensor` type to the export schema and support for optional tensors in the AOTI proxy executor.
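A small sketch of the schema convention referred to above, written with the Python `torch.library` API purely for illustration (the op name and body are hypothetical; the PR itself targets C++ custom ops and the AOTI proxy executor):

```python
import torch

# "Tensor?" in the schema marks the return as possibly absent, which surfaces
# as None in Python when the op returns an undefined/uninitialized tensor.
torch.library.define("mylib::maybe_relu", "(Tensor x, bool produce) -> Tensor?")

@torch.library.impl("mylib::maybe_relu", "CompositeExplicitAutograd")
def maybe_relu(x, produce):
    return x.relu() if produce else None

print(torch.ops.mylib.maybe_relu(torch.randn(3), True))   # a Tensor
print(torch.ops.mylib.maybe_relu(torch.randn(3), False))  # None
```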

Test Plan:

```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_output
```

Differential Revision: D75262529

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154286
Approved by: https://github.com/desertfire
2025-05-30 01:53:00 +00:00
58ead04ee9 [dynamic shapes] unbacked safe unsqueeze (#154087)
Also ran into this working on https://github.com/SWivid/F5-TTS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154087
Approved by: https://github.com/laithsakka
2025-05-30 01:41:57 +00:00
172015fc11 [multigraph] add specialize_on kwarg to mark_{dynamic,unbacked} (#153433)
The goal of this multigraph work is to enable a compiled region that has a single dynamo trace but multiple backend specializations. This work was inspired by vLLM which does this in a somewhat hacky way where they use a custom backend to capture a dynamo graph and then manually invoke compile_fx multiple times to get specialized graphs.

There's really two parts of this work:

**The frontend changes (this PR):**
1) we introduce an optional kwarg `specialize_on` to mark_{dynamic,unbacked} that takes in a list of specializations (a usage sketch follows below). I debated other methods including specifying specializations via decorators, but ultimately decided this approach was more harmonious. The big issue with decorators is the difficulty of composing well with the rest of the torch.compile ecosystem including graph breaks, lazy initialization of variable trackers and symbolic variables, etc.

**The backend changes:**
1) We capture the backend_specialization specified in the mark_{dynamic,unbacked} API into a SymbolicContext. See changes in `/_dynamo/variables/builder.py`
2) After we are done dynamo tracing, we will lazily (more on this later) invoke `call_user_compiler` up to N + 1 times for N specializations and 1 generic graph. Under the hood this will call compile_fx, which composes nicely with both Async Compile and AOTAutogradCache. We do this by using a context manager to patch in specialization specific axioms into the ShapeEnv before invoking the user compiler.
3) When we have specializations, we install a lazy specialized dispatch function that checks each specialization and dispatches to the first one that matches. Instead of doing all of the specialization compiles up front, we do the compiles lazily. The first time a specialization is invoked, we will do the compilation and save it in a cache so subsequent invocations are fast. If none of the specializations match, we dispatch to the generic graph. I decided to do this over returning N different GuardedCodes since 1) it doesn't pollute the dynamo cache (eg. if you have 8 specializations, you would hit the cache limit) 2) it naturally incorporates the hierarchical lattice structure of the guards since the specializations are always necessarily stricter than the generic region's guards.

I benchmarked this PR stack with #152596 and found around a 50% reduction when dispatching to the specialized regions:

![495269647_576053105510082_9189856138964956774_n](https://github.com/user-attachments/assets/66030fed-d62e-4d87-940f-aa13c99b1a73)
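A hedged usage sketch of `specialize_on`; the exact form of each specialization entry (written here as predicate lambdas over the marked dimension) is inferred from the description above rather than copied from the PR:

```python
import torch

def f(x):
    return x * x

x = torch.randn(8, 16)
# Single dynamo trace with dim 0 dynamic, plus lazily-compiled specialized
# variants that are dispatched to when their predicate matches at runtime.
torch._dynamo.mark_dynamic(x, 0, specialize_on=[lambda s: s == 8, lambda s: s % 64 == 0])
compiled = torch.compile(f)
compiled(x)                     # generic dynamic graph; the s == 8 variant compiles lazily
compiled(torch.randn(64, 16))   # would dispatch to the s % 64 == 0 specialization
```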

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153433
Approved by: https://github.com/zou3519
2025-05-30 01:08:15 +00:00
9371491529 [Reland][pytorch] Patch the _is_conv_node function (#154473)
Summary: Add the conv padding ops in pytorch, the corresponding pr in torch ao is https://github.com/pytorch/ao/pull/2257

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_conv_padding_bn_relu (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```

Differential Revision: D75494468

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154473
Approved by: https://github.com/Skylion007
2025-05-30 00:41:03 +00:00
d6cb0fe576 [MPS] Extend index_copy support to complex dtypes (#154671)
Should have noticed it during the review
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154671
Approved by: https://github.com/dcci
ghstack dependencies: #154670
2025-05-30 00:28:13 +00:00
0134150ebb [MPS][BE] Do not copy sizes/strides unnecesserily (#154670)
Just pass them as args to `mtl_setArgs`; metaprogramming should deal with the rest.
Also use `mtl_dispatch1DJob` instead of computing the max threadgroup size by hand.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154670
Approved by: https://github.com/dcci
2025-05-30 00:28:13 +00:00
61bfb3df9f [a2av] Improve tuning for 4 GPUs (#154580)
### Problem
Running `nvshmem_all_to_all_vdev` on 4 x H100s (fully connected with NVSwitch).
Before:
```
Bytes: MiB, Time: us, BusBw: GB/s
0  32.29  16.23
1  33.01  31.76
2  33.01  63.54
4  33.83  123.97
8  49.83  168.34
16  80.82  207.59
32  178.66  187.82
64  335.79  199.86
128  646.72  207.54
256  1268.77  211.57
512  2511.14  213.80
1024  4998.31  214.82
2048  9964.49  215.51
4096  19892.34  215.91
```

215 GB/s does not reach the SOL of NV18 (350-400 GB/s).

### Change
If the number of peers decreases (say 8 to 4), we do not reduce the number of CTAs; instead, we shift more CTAs towards the data parallel dimension.

After:
```
Bytes: MiB, Time: us, BusBw: GB/s
0  25.01  20.96
1  25.70  40.80
2  25.76  81.42
4  28.87  145.26
8  40.79  205.64
16  61.46  272.97
32  111.82  300.06
64  202.40  331.57
128  382.56  350.84
256  739.11  363.19
512  1450.79  370.05
1024  2873.13  373.72
2048  5719.50  375.47
4096  11395.65  376.90
```

If we look at MoE related region, say 32 MB, we can see a 187 -> 300 GB/s improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154580
Approved by: https://github.com/ngimel
2025-05-30 00:26:13 +00:00
2c1cb38d95 inductor codecache: include private inductor configs in cache key (#153672)
Fixes https://github.com/pytorch/torchtitan/issues/1185

It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.

I'm not sure how worried we should be on the blast radius of this change. On the one hand:

(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)

(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.
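A toy illustration of the caching pitfall described above (the config names are hypothetical; this is not inductor's actual cache-key code):

```python
# Hypothetical inductor-style config dict; only the filtering behavior matters here.
configs = {"max_autotune": True, "_private_pipelining_pass": True}

old_key_fields = {k: v for k, v in configs.items() if not k.startswith("_")}  # before: private keys dropped
new_key_fields = dict(configs)                                                # after: private keys included

print(old_key_fields)  # {'max_autotune': True} -> same cache key whether the private pass is on or off
print(new_key_fields)  # includes '_private_pipelining_pass', so toggling it changes the key
```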

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
ghstack dependencies: #153766
2025-05-30 00:24:29 +00:00
5b6fd277f9 convert inductor codecache to use getArtifactLogger (#153766)
I'm not entirely sure of the background for why inductor codecache code uses default python logging instead of the new TORCH_LOGS-based artifact logging, but switching it over to artifact logging makes it easier to use nice testing utils in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153766
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2025-05-30 00:24:29 +00:00
eqy
818f76a745 [cuDNN] Allow cudnn attention or flash attention in test_export.py regex (#154458)
Analogous to #153272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154458
Approved by: https://github.com/drisspg
2025-05-29 23:51:09 +00:00
dc0f09a478 Enable C++ dynamic shape guards by default (#140756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140756
Approved by: https://github.com/anijain2305, https://github.com/laithsakka
ghstack dependencies: #151225
2025-05-29 23:44:43 +00:00
0c6c7780d9 [Inductor] Add envvar to disable decomposeK (#154421)
Summary: Add envvar to Inductor config to disable decomposeK autotuning choice

Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -- --exact 'caffe2/test/inductor:max_autotune - test_max_autotune_decompose_k_dynamic_False_sizes2 (caffe2.test.inductor.test_max_autotune.TestMaxAutotune)' --run-disabled`

Reviewed By: eellison

Differential Revision: D75174823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154421
Approved by: https://github.com/eellison
2025-05-29 23:34:41 +00:00
9ba67e99bb [dynamo] keep C++ symbolic shape guards disabled for benchmarks (#151225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151225
Approved by: https://github.com/anijain2305
2025-05-29 23:29:39 +00:00
d5e0704247 [ROCm] Update maxpool launch config (#154619)
* Better perf on MI300 with updated launch configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154619
Approved by: https://github.com/jeffdaily
2025-05-29 23:28:07 +00:00
43b18d098b Forward fix for test_frame_traced_hook in internal testing (#154641)
Summary: Fixes the newly-added dynamo test test_frame_traced_hook so it can run internally

Test Plan: This is a test change

Differential Revision: D75616787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154641
Approved by: https://github.com/Skylion007
2025-05-29 23:02:01 +00:00
b040d63ce4 Prevent SAC cache from being kept alive by reference cycle (#154651)
Fixes https://github.com/pytorch/pytorch/issues/154642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154651
Approved by: https://github.com/xmfan
2025-05-29 22:27:35 +00:00
7d17253af8 [BE]: Improve aten formatter with fmtlib (#152830)
Replaces stateful ostream output with stateless fmtlib, which is significantly faster and more contained. It is especially faster for the type of complex double formatting found here, since it uses the newer [DragonBox algorithm](https://github.com/jk-jeon/dragonbox) for faster floating point formatting (which is the main bottleneck here). This also enables some static checking of the formatting strings.

test plan: all tests pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152830
Approved by: https://github.com/cyyever, https://github.com/malfet, https://github.com/atalman
2025-05-29 22:11:30 +00:00
fdbf314278 [Inductor] Cache subgraph autotuning choices properly (#154067)
Differential Revision: D75170507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154067
Approved by: https://github.com/eellison
2025-05-29 22:01:44 +00:00
c7e8e8ee19 Add torch.profile benchmarking function to feedback_fns (#153579)
Summary: Updates some benchmarking code to have the option to use torch.profile, and passes in a thunk to benchmark_fns to get this information (this will be a different result from `timings`, which are already passed into those functions).

Test Plan: Existing unit tests.

Differential Revision: D74444990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153579
Approved by: https://github.com/coconutruben, https://github.com/masnesral, https://github.com/nmacchioni
2025-05-29 21:43:45 +00:00
1237f271aa [ROCm] MIOpen: Get current device from Torch rather than HIP in handle creation (#154549)
Get the current device from Torch rather than HIP in MIOpen handle creation. The device may have already been set on the Torch side; otherwise, the device is set to 0 for the handle. Also includes additional audits of the cudnn vs miopen Handle.cpp files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154549
Approved by: https://github.com/jeffdaily, https://github.com/cyyever

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-29 21:12:12 +00:00
08fdc64c86 [ROCm] Exposing Some MIOpen Symbols (#2176) (#154545)
This PR exposes some MIOpen symbols, namely:

1. `miopenDataType_t getMiopenDataType(const at::Tensor& tensor)`
2. `miopenHandle_t getMiopenHandle()`
3. `class TensorDescriptor`
4. `class Descriptor`
5. `class FilterDescriptor`
6. `struct ConvolutionDescriptor`
7. `struct DropoutDescriptor`
8. `struct RNNDescriptor`

to enable adding extensions that make use of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154545
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-29 21:10:45 +00:00
83a0e4e6f9 [Visualizer] Start at index with most events (#154571)
Summary: Oftentimes a single snapshot will contain multiple GPU traces based on what the process can see. In this case, let's just start with the GPU trace that has the highest amount of activity.

Test Plan:
Ran od with: https://www.35929.od.internalfb.com/pytorch_memory_visualizer/mvai_gpu_traces/tree/gpu_snapshot/fire-chujiechen-f701302011/1/rank-1_itrn-3.Mar_01_06_10_09.3747.snapshot.pickle
And it started at index 1 instead of 0

Differential Revision: D75555558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154571
Approved by: https://github.com/aaronenyeshi
2025-05-29 20:49:33 +00:00
2bc8fec744 deprecate MTIA_WORKLOADD from pytorch (#154627)
Differential Revision: D75612179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154627
Approved by: https://github.com/sraikund16
2025-05-29 20:30:40 +00:00
cb56df55dc [Inductor]Cleanup autotune_fallback_to_aten post-deprecation (#154331)
Fixes #153298

This PR is the 3rd and final step of #147479
All references to autotune_fallback_to_aten have been removed, and the feature is now deprecated.
All calls to should_fallback_to_aten() were also removed, as they were deemed unnecessary.

[henrylhtsang](https://github.com/henrylhtsang)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154331
Approved by: https://github.com/henrylhtsang
2025-05-29 20:29:58 +00:00
629fca295e Always set CPU affinity for benchmark jobs (#154569)
Because metrics like compilation time require CPU. I want to see if this helps fix https://github.com/pytorch/pytorch/issues/152566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154569
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-29 20:11:47 +00:00
3afbab66f7 [BE] Remove unused release scripts. Add clarifications for the branch cut process (#154649)
Scripts in ``scripts/release/promote/`` have not been used for a while.
We use the ones in test-infra [here](https://github.com/pytorch/test-infra/blob/main/release/) .
Hence this small cleanup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154649
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2025-05-29 19:49:37 +00:00
e8f5c24d17 [rocm]add device guard when initialize single stream (#154433)
Summary: AMD streams are lazily initialized, and sometimes (e.g. when we just want to do event recording on the stream) we might not be setting the device guard while it is initializing, which would lead to an invalid configuration error.

Differential Revision: D75456460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154433
Approved by: https://github.com/jeffdaily
2025-05-29 19:42:12 +00:00
20ec61a02f [BE] fix lint errors caused by const SROpFunctor fn (#154552)
Summary: Remove the const qualifier from the SR op functor, as suggested by CLANGTIDY.

Test Plan: arc lint -a -e extra --take CLANGTIDY caffe2/torch/fb/sparsenn/cpu_operators/to_dense_representation_cpu.cpp

Reviewed By: henryoier

Differential Revision: D75534056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154552
Approved by: https://github.com/Skylion007
2025-05-29 19:40:08 +00:00
5a21d6f982 [AOTI][reland] Support multi-arch when using package_cpp_only (#154608)
Summary: Reland https://github.com/pytorch/pytorch/pull/154414

Add support of multi_arch_kernel_binary in the package_cpp_only mode. More specifically, generate specific cmake targets to compile .ptx to .fatbin and embed them in the final shared library or binary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154608
Approved by: https://github.com/yushangdi
2025-05-29 19:32:33 +00:00
0db9c64d68 [c10d] Separate monitoring thread into a class in PGNCCL (#153977)
This is the start of a series of efforts to consolidate auxiliary threads in PGNCCL, aka the watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instance, i.e., if users create hundreds or thousands of PG or subPG instances, we end up with twice as many side threads, which is not efficient. We have an RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Both threads currently carry so many functionalities that it is hard to do the consolidation in one shot, so we will split it into at least two steps (PRs) to make it easier to test and review.

We made our first attempt in https://github.com/pytorch/pytorch/pull/153668, but we also want to see if we can make the monitoring thread a class. This PR is the first step: making the monitoring thread a class. The next step is to also extract the watchdog into a separate class so that we know its dependencies.

What we did in this PR:
1. Move all related variables and methods into a class named `HeartbeatMonitor`.
2. Correct some errors in the original logic inside the monitoring thread loop.
3. Move the error propagation check to the watchdog thread, which is more relevant. This is totally fine since we have fully rolled out EventCache, so watchdog hangs are rare now.

Today the heartbeat monitoring thread has two major functions:
1. Check the heartbeat of the watchdog thread every 8 minutes. If no heartbeat is detected and we are sure the monitoring thread has not been stopped, we kill the program via SIGABRT.
2. Check TCPStore every 30 seconds to see if any watchdog timeout has happened on other ranks; if so, we initiate a dump signal on the current rank as well. (We do this only in the default PG.)
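For illustration only, a Python sketch of the first responsibility above (the real `HeartbeatMonitor` is C++ inside ProcessGroupNCCL; the 8-minute interval comes from the description, everything else is assumed):

```python
import os
import signal
import threading

class HeartbeatMonitor:
    """Conceptual sketch: watch a counter the watchdog bumps, abort if it stalls."""

    def __init__(self, check_interval_s=8 * 60):
        self.check_interval_s = check_interval_s
        self.heartbeat = 0          # incremented by the watchdog thread
        self._stop = threading.Event()

    def run(self):
        last_seen = self.heartbeat
        while not self._stop.wait(self.check_interval_s):
            if self.heartbeat == last_seen:
                # No progress from the watchdog since the last check: assume it is
                # hung and kill the process so the job fails loudly.
                os.kill(os.getpid(), signal.SIGABRT)
            last_seen = self.heartbeat

    def stop(self):
        self._stop.set()
```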

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
2025-05-29 17:45:04 +00:00
6f992e1b3f [BE][AT] cleanup my old todo (#154542)
Summary: This TODO is very old and probably not needed anymore. Let's have CI figure out if removing it breaks anything.

Test Plan: CI

Differential Revision: D75491068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154542
Approved by: https://github.com/Skylion007
2025-05-29 17:22:01 +00:00
634ce22601 [MPSInductor] Fix codegen for nested multistage reductions (#154578)
Yet to write a unittest for it, but this fixes codegen for
```
python3 benchmarks/dynamo/torchbench.py --performance --only hf_T5  --backend inductor --inference --devices mps --float16
```

By correctly closing the triple-nested loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154578
Approved by: https://github.com/jansel, https://github.com/dcci
2025-05-29 17:09:25 +00:00
8883e494b3 [cutlass backend][ez] remove indent for cutlass config serialization (#154573)
Differential Revision: [D75566642](https://our.internmc.facebook.com/intern/diff/D75566642)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154573
Approved by: https://github.com/ColinPeppler
2025-05-29 17:00:52 +00:00
41092cb86c [MPS] index copy impl (#154326)
Second most requested op according to #154052

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154326
Approved by: https://github.com/malfet
2025-05-29 16:57:43 +00:00
733e684b11 Skip test file that doesn't run gradcheck for slow gradcheck (#154509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154509
Approved by: https://github.com/malfet
2025-05-29 16:32:26 +00:00
2c6f24c62d [ROCm] Updated default workspace for gfx95 (#153988)
Fixes test_cuda.py::test_cublas_workspace_explicit_allocation on gfx95

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153988
Approved by: https://github.com/jeffdaily
2025-05-29 16:22:17 +00:00
53b0f6f543 Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 4613081b729273a9273185e9ef7470ce76e22da2.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/malfet due to It broke windows debug builds, see ef1d45b12d/1 ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2919897160))
2025-05-29 16:14:28 +00:00
ef1d45b12d Cleanup parent fallback logic (#154006)
The `parent` in fallback_node_due_to_unsupported_type is a duplication of the `unsupported_output_tensor` logic, so remove it. Tested that the tests in test_add_complex give the same codegen. This fixes an issue in mx that @drisspg was running into.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154006
Approved by: https://github.com/drisspg
2025-05-29 13:40:36 +00:00
d6e29bf875 Reflect back mutation if we clone misaligned tensors (#154442)
Fix for https://github.com/pytorch/pytorch/issues/152425

Inductor specializes on whether or not a tensor is 16-byte aligned at the first invocation. Then, on subsequent invocations, if we inferred alignment but are passed a non-aligned tensor, we clone the tensor.

If we infer alignment, then run with an unaligned tensor and mutate the input, we need to reflect the mutation back to the input. This PR adds back that mutation.
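A conceptual Python sketch of the copy-back behavior this PR restores (not inductor's actual generated wrapper code; the 16-byte alignment constant below is an assumption):

```python
import torch

def run_compiled_kernel(kernel, x, assumes_aligned):
    # Conceptual sketch: if the compiled graph assumed an aligned input but got a
    # misaligned one, run on a clone and then copy the mutation back so the caller
    # still observes it.
    needs_clone = assumes_aligned and (x.data_ptr() % 16 != 0)  # 16 is an assumed alignment
    work = x.clone() if needs_clone else x
    kernel(work)              # kernel mutates `work` in place (e.g. add_)
    if work is not x:
        x.copy_(work)         # reflect the mutation back onto the original input
    return x
```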

We could also have been less aggressive about inferring alignment for mutated tensors, but that has a pretty big perf hit. See the following benchmark:
```
import torch

t = torch.rand(4096 * 4096, device="cuda", dtype=torch.float16)

@torch.compile(dynamic=False)
def foo(x):
    return x.add_(1)

import triton

print(triton.testing.do_bench(lambda: foo(t[:-1])))
torch._dynamo.reset()
print(triton.testing.do_bench(lambda: foo(t[1:])))
```
gives
```
0.04063070610165596
0.07613472988113162
```
So almost twice as slow for non-aligned tensors. Tensors changing alignment is a relatively rare case.

In the future, we could consider a multi-kernel approach, or codegening a Triton kernel that does most of the loads with aligned instructions and handles the unaligned portion in a prologue/epilogue. But it remains to be seen whether this is a huge issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154442
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-05-29 13:36:48 +00:00
3c74a72ea0 Keep XPU compatible with toolchain 2025.2 (#154359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154359
Approved by: https://github.com/EikanWang, https://github.com/cyyever
2025-05-29 11:12:07 +00:00
cd9ff41282 check fallback_value first. (#154493)
This is just a refactor, not a fix for any issue: we check fallback_value first and exit early, instead of repeatedly checking that it is not set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154493
Approved by: https://github.com/bobrenjc93
2025-05-29 09:06:43 +00:00
447b481c79 [AOTI] Save data sizes to constants_info (#154534)
Differential Revision: D75223179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154534
Approved by: https://github.com/muchulee8
2025-05-29 06:39:13 +00:00
9c7ed3e46e [debug_printer][BE] Fix float8 type printing for min/max value printing (#154466)
Summary:
ATT

GH Issue: https://github.com/pytorch/pytorch/issues/149008

**Previous:**
Debug printing failed for float8 types due to a limitation of the "min_all_cuda" implementation in ATen native:

 4b39832412/aten/src/ATen/native/cuda/ReduceMinValuesKernel.cu (L51)

Error:

Min value: Error: "min_all_cuda" not implemented for 'Float8_e4m3fn'

**Now:**
Example output paste: P1824621233
Unblocked debug printing for float8 tensors. We suggest printing the whole tensor value if numel <= threshold.
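For context, a tiny sketch of the limitation and the upcast-style workaround (illustrative only; it may not match the printer's exact code path):

```python
import torch

t = torch.randn(8).to(torch.float8_e4m3fn)
# On CUDA, t.min() / t.max() hit the "min_all_cuda not implemented for 'Float8_e4m3fn'"
# error quoted above; upcasting before the reduction sidesteps the missing kernels.
print(t.to(torch.float32).min(), t.to(torch.float32).max())
```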

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_float8_dtype_cuda
```

```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_fp8_cuda 2>&1 | tee fp8_example_printing.txt
```

Differential Revision: D74847967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154466
Approved by: https://github.com/jingsh, https://github.com/henrylhtsang
2025-05-29 05:48:02 +00:00
07343efc15 [cutlass backend] small refactor to flatten the ops to avoid nested for loops (#154576)
Differential Revision: [D75565429](https://our.internmc.facebook.com/intern/diff/D75565429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154576
Approved by: https://github.com/ColinPeppler
2025-05-29 04:42:58 +00:00
b394c6e89c [Inductor][CPP] Add block sparse for FlexAttention CPU (#147196)
## Overview
This PR optimizes the FlexAttention CPP template with block sparsity.
Block sparsity is natively supported by FlexAttention's block mask structures, so following the kv-block logic from `kv_indice` and `full_kv_indice` is the straightforward way to add this optimization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147196
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel
2025-05-29 02:57:02 +00:00
c0864bb389 Add a (t * 0) pattern (#153161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153161
Approved by: https://github.com/danielvegamyhre
2025-05-29 02:19:36 +00:00
316e7a9293 [BE][Ez]: Denote common types as TypeAlias (#154527)
Denotes common_types as TypeAlias. This triggered a Ruff rule since we named our TypeAlias in a non-standard way, so I added a file-wide Ruff suppression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154527
Approved by: https://github.com/benjaminglass1, https://github.com/aorenste
2025-05-29 02:00:13 +00:00
2d932a2e01 [ROCm] Fix 3D tensor perf degradation with NHWC format (#154522)
Co-author: @doru1004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154522
Approved by: https://github.com/jeffdaily
2025-05-29 01:33:49 +00:00
cyy
4613081b72 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as `CUDA::nvperf_host`, so that it is possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (whose currently shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-29 00:52:44 +00:00
946a4c2bdc BE: Type previously untyped decorators (#154515)
Summary: Cloned #153726 from Skylion007 and fixed internal typing issues.

Test Plan: Unit tests pass

Differential Revision: D75477355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154515
Approved by: https://github.com/Skylion007
2025-05-29 00:36:34 +00:00
ba0a91b3ea [4/n][Optimus][Auto-AC] Expose the config to skip the dynamo gaurds to avoid recompile (#154152)
Summary:
context: https://fb.workplace.com/groups/1075192433118967/permalink/1673720956599442/

Thanks Microve for raising the existing dynamo skip API in D75196435

Dynamic shapes trigger recompilation, increasing compilation time, so we expose a config that lets users skip the dynamo guards to avoid the recompile. Note that it may unnecessarily quantize nodes, which can impact NE, QPS, and memory savings; this needs verification.

Differential Revision: D75248430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154152
Approved by: https://github.com/bobrenjc93
2025-05-29 00:35:37 +00:00
22a1b3b5d0 use 4 elements per thread in no-cast elementwise kernel (#154558)
Reduce elems per thread to 4 in vectorized function also (only for unaligned inputs where there's no vectorization anyway). This slightly reduces binary size (by 4MB)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154558
Approved by: https://github.com/malfet
2025-05-29 00:32:44 +00:00
40abb2b403 Fix deprecated amp APIs in docs (#154553)
Update usage of deprecated amp APIs.
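For reference, the kind of migration involved (the exact doc snippets are not reproduced here): the device-specific `torch.cuda.amp` entry points are deprecated in favor of the device-generic `torch.amp` ones.

```python
import torch

# Deprecated spellings:
#   torch.cuda.amp.autocast(), torch.cuda.amp.GradScaler()
# Current spellings:
scaler = torch.amp.GradScaler("cuda")
with torch.amp.autocast("cuda", dtype=torch.float16):
    pass  # forward pass would go here
```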

Fixes https://github.com/pytorch/tutorials/issues/3331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154553
Approved by: https://github.com/Skylion007
2025-05-29 00:05:59 +00:00
2b3ac17aa2 [Cutlass] Remove spammy log for gemm extensions (#154548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154548
Approved by: https://github.com/henrylhtsang
2025-05-28 23:55:36 +00:00
81b7c96697 [dynamo, nested graph breaks] add skip_frame debugging function (#153773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153773
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510, #153772
2025-05-28 23:29:37 +00:00
6cda280483 [dynamo, nested graph breaks] remove block stack graph break in output_graph (#153772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153772
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510
2025-05-28 23:29:37 +00:00
bbd45f1f1f [dynamo, nested graph breaks] refactor codegen to minimize NULL codegen'ing (#153510)
Stop codegening NULLs that we need to pop later. Some output_graph.py changes to prepare for nested graph break support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153510
Approved by: https://github.com/jansel
ghstack dependencies: #151056
2025-05-28 23:29:37 +00:00
0f0d5749a0 [dynamo, nested graph breaks] small fixes to resume function generation (#151056)
Old: ~pack resume function stack + locals into a list: we need to be able to pass frame stack+locals in lists to hand off to nested functions in the future, so we implement this part first.~

We are no longer doing this right now since GraphModule/guard variable naming gets messed up. Going forward, our approach will be to keep the top frame unpacked, but pack the rest of the contents of other frames in a list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151056
Approved by: https://github.com/jansel
2025-05-28 23:29:37 +00:00
65b1aedd09 [Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)
Prepares for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

As far as I can tell, this PR should be functionally neutral. One argument was removed from a `cpp_wrapper` public API, but that argument was unused, and only had a single callsite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-05-28 23:25:17 +00:00
3e05a48927 Fix clamp type promotion in inductor decomposition (#154471)
Summary: As the title says, clamp type promotion should take the min/max args into consideration as well.
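A small illustration of the eager behavior the decomposition has to match (the internal test's exact failing case isn't shown here; the promoted dtype is my expectation of standard type promotion):

```python
import torch

x = torch.arange(4)              # int64 tensor
y = torch.clamp(x, max=2.5)      # float max participates in type promotion
print(y.dtype)                   # a floating dtype in eager; the inductor
                                 # decomposition must promote the same way
```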

Test Plan:
```
buck run fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_clamp_decomposition_cpu
python test/inductor/test_torchinductor.py -k test_clamp -v
```

Differential Revision: D75490124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154471
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2025-05-28 23:24:25 +00:00
d865b784e4 Support unbacked whitelist (#154295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154295
Approved by: https://github.com/angelayi
2025-05-28 23:01:22 +00:00
ef4d57329b [CAG] Support for call_module at copy paste aot bwd graph (#153827)
Adds support for `call_module` in `copy_paste_aot_backward_graph`, which was added recently with PT2.7.

The problem is observed with the HPU backend in the example repro below, due to the creation of fused modules.

```
import torch

device = 'cpu' #'hpu'
backend = 'inductor' #'hpu_backend'

def fn(t1):
    t1 = t1 * 1
    t1_grad = torch.ones_like(t1, device=device)
    t1.backward(t1_grad, retain_graph=True)
    return t1

t1 = torch.ones(1, requires_grad=True, device=device) #.squeeze()
compiled_fn = torch.compile(fn, backend=backend)
result = compiled_fn(t1)

with torch._dynamo.compiled_autograd._enable(torch.compile(backend=backend)):
    result_grad = torch.ones_like(result, device=device)
    result.backward(result_grad)

print(f'{result_grad=}')
print(f'{t1.grad=}')
```

With this change I'm getting the same results as on CPU; however, I'm facing the problem below when running with a scalar (the t1 tensor after squeeze):
`torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function getitem>(*(FakeTensor(..., device='hpu:0', size=()), 0), **{}): got IndexError('invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number')`

While on CPU there's the following warning and None is returned:
`repro.py:23: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at pytorch/build/aten/src/ATen/core/TensorBody.h:489.)
  print(f'{t1.grad=}')
t1.grad=None`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153827
Approved by: https://github.com/xmfan
2025-05-28 22:52:40 +00:00
d62a33c002 [ez] add docblock for _expandsums (#154397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154397
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398, #154396, #154399
2025-05-28 22:43:26 +00:00
0c00e32632 [ez] add docblock for _eval_is_non_overlapping_and_dense (#154399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154399
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398, #154396
2025-05-28 22:40:03 +00:00
0f56318152 [precompile] Add Exception type PackageError for unsupported precompile features. (#154430)
Summary:
Today when guard serialization fails, dynamo will raise an internal error like:

```
torch._dynamo.exc.InternalTorchDynamoError: RuntimeError: CLOSURE_MATCH guard cannot be serialized.
```

Adding a dedicated PackageError type to surface the error more clearly.

Test Plan: CI

Differential Revision: D75452124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154430
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-05-28 22:34:51 +00:00
11129d9317 Add new ops in fallback ops (#154251)
Fixes #ISSUE_NUMBER

## Background

Task: [T222738229](https://www.internalfb.com/intern/tasks/?t=222738229)

It's the first starter task on the project **_Enabling TorchNative Standalone on Whisper_**.  We are using cshim to create a layer of abstraction between _**libtorch**_ and **_AOTInductor generated artifacts_**.

So we needed to add an entry in the C shim for every API surface in libtorch. We only care about operators that AOTInductor does not handle, and for this task we only wanted to add entries for the following ops.

## What I've done?

4 new fallback ops are added that show up in the Whisper model. (torchgen/aoti/fallback_ops.py)

- aten.permute (default)
- aten.squeeze (dim)
- aten.abs (default)
- aten.hann_window (default)

Then I ran the command below to generate the new C shim header files, as described [here](7e86a7c015/torchgen/gen.py (L2424-L2436%20for%20details)):
`python torchgen/gen.py --update-aoti-c-shim`

Then, `python setup.py develop` to rebuild PyTorch

## Testing

Also 4 new tests have been added on test/inductor/test_aot_inductor.py

- test_proxy_executor_permute
- test_proxy_executor_abs
- test_proxy_executor_squeeze
- test_proxy_executor_hann

I ran these commands to test it (inside local pytorch root folder):

`python test/inductor/test_aot_inductor.py -k test_proxy_executor_permute`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_abs`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_squeeze`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_hann`

## NOTE:
I didn't see any particular ordering of the tests inside _test/inductor/test_aot_inductor.py_. That's why I added the new tests just after the test given in the example.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154251
Approved by: https://github.com/angelayi
2025-05-28 22:11:07 +00:00
d2f506cae8 [ca] disable ca for functorch grad and run all HOO tests (#154147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154147
Approved by: https://github.com/zou3519
ghstack dependencies: #154133
2025-05-28 22:06:13 +00:00
857f21631d [ca] fix hop_db tests (#154133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154133
Approved by: https://github.com/zou3519
2025-05-28 22:06:13 +00:00
ed348e7026 Add docblock for TrackedFake (#154396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154396
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398
2025-05-28 21:19:49 +00:00
d311b79c12 add docblock for _fast_expand (#154398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154398
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400
2025-05-28 21:16:47 +00:00
e7318b863d [ez] add docblock to cast_symbool_to_symint_guardless (#154400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154400
Approved by: https://github.com/laithsakka
2025-05-28 21:11:53 +00:00
f6dcc45c44 [Kineto x Insight] Add device to activity type map in pytorch (#154253)
Summary: Update the device-to-ActivityType map in PyTorch. This needs to be exported to GitHub.

Test Plan:
Run the on-demand e2e test; the insight profiler is triggered during profiling.
P1819539581: https://www.internalfb.com/intern/paste/P1819539581/
{F1978519960}

The insight profiler is not enabled when mtia_insight is not specified in the config.
{F1978527200}

Reviewed By: fenypatel99

Differential Revision: D75246621

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154253
Approved by: https://github.com/Skylion007
2025-05-28 20:36:19 +00:00
e25074d462 [c10d][CI] Change expected return code in Sandcastle for Nan tests (#154441)
Fixing internal error caused by #153167.

`skip_but_pass_in_sandcastle_if` returns exit code 0, but `test_nan_assert` expects exit code -6, so we need to set the expected return code conditionally on `IS_SANDCASTLE`.
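A minimal sketch of that conditional (variable names are illustrative, not the exact test code):

```python
import signal
from torch.testing._internal.common_utils import IS_SANDCASTLE

# test_nan_assert normally expects the process to die with SIGABRT (-6); when the
# test is skipped via skip_but_pass_in_sandcastle_if, it exits cleanly with 0 instead.
expected_return_code = 0 if IS_SANDCASTLE else -signal.SIGABRT
```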

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154441
Approved by: https://github.com/fduwjj, https://github.com/nWEIdia
ghstack dependencies: #153167
2025-05-28 20:35:52 +00:00
c381103fd7 Fix the logic of set_cpu_affinity (#154503)
While investigating https://github.com/pytorch/pytorch/issues/152566, I found two issues with how the CPU affinity is set in the benchmark job:

* The current logic doesn't work with cgroups slices, the mechanism behind multi-tenant runners:
    * Using `lscpu` returns all CPUs and not the available ones from cgroups.  On the other hand, `nproc` works correctly.  For example, on H100, `lscpu` returns 192 CPUs while `nproc` returns 24 (192 / 8)
    * Setting `taskset -c 0-N` blindly is wrong because CPU 0 is only available to the first tenant, aka alice.  For example, running `taskset -c 0 ls` on any other tenant will fail. To fix this, the IDs of the available CPUs can be fetched by calling `os.sched_getaffinity(0)`, as sketched below.
* The last bug is that `taskset` works with logical CPUs (https://www.man7.org/linux/man-pages/man1/taskset.1.html), so using the result from `test_inductor_get_core_number` is also wrong because that function returns the number of physical CPUs.
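A short Linux-only sketch of the corrected approach described above (illustrative; the benchmark scripts' actual code may differ):

```python
import os
import subprocess

# Logical CPU ids actually available to this cgroup/tenant (not a blind 0..N-1 range).
available_cpus = sorted(os.sched_getaffinity(0))
cpu_list = ",".join(str(c) for c in available_cpus)

# Pin the benchmark command to exactly those logical CPUs.
subprocess.run(["taskset", "-c", cpu_list, "echo", "pinned"], check=True)
```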

### Testing

CPU benchmark jobs look ok

* [aarch64 torch.compile benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2021%20May%202025%2016%3A40%3A28%20GMT&stopTime=Wed%2C%2028%20May%202025%2016%3A40%3A28%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=fix-cpu-affinity-cgroups&lCommit=9a6288e083d650c470623f5fe136b1060824021c&rBranch=main&rCommit=dec5ab8d984b8a608140911351d877b9ddb141c2)
* [x86 micro benchmark](https://hud.pytorch.org/benchmark/llms?startTime=Wed%2C%2021%20May%202025%2016%3A41%3A26%20GMT&stopTime=Wed%2C%2028%20May%202025%2016%3A41%3A26%20GMT&granularity=day&lBranch=main&lCommit=c1b7dbc52aaa49f4cd147bbe5935110a4a10e3e3&rBranch=refs/tags/ciflow/inductor-micro-benchmark-cpu-x86/154503&rCommit=9a6288e083d650c470623f5fe136b1060824021c&repoName=pytorch%2Fpytorch&benchmarkName=&modelName=All%20Models&backendName=All%20Backends&modeName=All%20Modes&dtypeName=All%20DType&deviceName=cpu%20(x86_64)&archName=All%20Platforms)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154503
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-28 19:38:20 +00:00
66f53889d5 [nativert] port semaphore to c10 util (#153504)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff adds a simple semaphore interface into c10 until C++20, where we get counting_semaphore.

We're gonna need an OSS build export to take a look at this...

Test Plan: CI

Differential Revision: D73882656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153504
Approved by: https://github.com/zhxchen17
2025-05-28 19:17:30 +00:00
24980d2641 [ROCm][CI] Update build-environment for mi300 workflows (#153134)
so their test times are tracked separately in https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/test-times.json. Currently, both MI200 and MI300 test times get combined into the same key `linux-focal-rocm-py3.10`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153134
Approved by: https://github.com/huydhn
2025-05-28 19:04:53 +00:00
d4ab8e74f3 Revert "Fix the Problems About Defining Static Variable in Inline Function (#147095)"
This reverts commit c6fc11af760d4ad1f01cc699a3c6488ab5f41770.

Reverted https://github.com/pytorch/pytorch/pull/147095 on behalf of https://github.com/izaitsevfb due to still fails to link internally at meta ([comment](https://github.com/pytorch/pytorch/pull/147095#issuecomment-2917221575))
2025-05-28 18:22:39 +00:00
1c7a70b483 [AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)
Differential Revision: [D75253009](https://our.internmc.facebook.com/intern/diff/D75253009/)

In general, we want to cache the cutlass kernels.

Also saw an error saying .o not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154155
Approved by: https://github.com/chenyang78
2025-05-28 17:35:19 +00:00
66ac724b56 pyfmt lint torch/_export/passes/replace_view_ops_with_view_copy_ops_pass.py (#154488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154488
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484, #154485, #154487
2025-05-28 17:07:15 +00:00
dfe0f48123 pyfmt lint torch/_export/serde/schema.py (#154487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154487
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484, #154485
2025-05-28 17:07:15 +00:00
92cebed1bd pyfmt lint torch/_export/serde/serialize.py (#154485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154485
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484
2025-05-28 17:07:07 +00:00
b4fe5ca58a pyfmt lint torch/utils/weak.py (#154484)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154484
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483
2025-05-28 17:06:58 +00:00
4de1b25df7 Remove empty files from execlude lint rule (#154483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154483
Approved by: https://github.com/Skylion007
2025-05-28 17:06:50 +00:00
70539308ac [dynamo] updating gb_type names for uniqueness (#154452)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154452
Approved by: https://github.com/williamwen42
2025-05-28 16:54:10 +00:00
e313152a33 SDPA fix memory efficient attention for large batch dim (#154029)
Fixes #146704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154029
Approved by: https://github.com/ngimel
2025-05-28 16:53:53 +00:00
3b38989b5f Remove MemPoolContext (#154042)
Removes MemPoolContext from custom user mempools. The ground truth for which pool should be used is the active pool in graph_pools, and MemPoolContext just introduced an opportunity for the pool pointed to by MemPoolContext and the active pool in graph_pools to go out of sync (see all the asserts in the code trying to ensure that never happens, and yet it still could happen in a multithreaded scenario; see my recent PR #153990).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154042
Approved by: https://github.com/albanD, https://github.com/syed-ahmed
2025-05-28 16:35:48 +00:00
d23aa7e182 Add deprecation warning for torch.ao.quantization (#153892)
Summary:
att

Test Plan:
```
(ao) $ PYTHONWARNINGS='default' python
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.ao.quantization.quantizer.xnnpack_quantizer import XNNPACKQuantizer
printing warning
*/anaconda3/envs/ao/lib/python3.10/site-packages/torch/ao/quantization/__init__.py:36: DeprecationWarning: torch.ao.quantization is deprecated. Plan is to
1. Remove eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead
2. Remove fx graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx, torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e)
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e)
see https://dev-discuss.pytorch.org/t/torch-ao-quantization-migration-plan/2810 for more details
  warnings.warn(
>>> a = XNNPACKQuantizer()
*/anaconda3/envs/ao/lib/python3.10/site-packages/torch/ao/quantization/quantizer/xnnpack_quantizer.py:281: DeprecationWarning: XNNPACKQuantizer is deprecated! Please use xnnpack quantizer in ExecuTorch (https://github.com/pytorch/executorch/tree/main/backends/xnnpack/quantizer) instead
  warnings.warn(f"{self.__class__.__name__} is deprecated! Please use xnnpack quantizer in ExecuTorch (https://github.com/pytorch/executorch/tree/main/backends/xnnpack/quantizer) instead", DeprecationWarning)
>>>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153892
Approved by: https://github.com/Skylion007
2025-05-28 16:25:30 +00:00
5bf74753f6 [precompile] Prune local scope variables for guard serialization. (#154431)
Summary: Prune unused local objects from the serialized local scope if they are not used in guard reconstruction. This is helpful when a user program takes things like local callable functions, or when the function call is recursive.

Test Plan:
test/dynamo/test_guard_serialization.py -k test_function_locals

Before pruning locals:
```
state = GuardsState(output_graph=OutputGraphGuardsState(local_scope={'x': tensor([ 0.0461,  0.4024, -1.0115]), 'g': <function ...aints=None, _guards=<torch._guards.GuardsSet object at 0x7fbccc7e9fc0>, _aotautograd_guards=[]), shape_code_parts=None)

    def pickle_guards_state(state: GuardsState) -> bytes:
        buf = io.BytesIO()
        pickler = GuardsStatePickler(buf)
        try:
            pickler.dump(state)
        except AttributeError as e:
>           raise torch._dynamo.exc.PackageError(str(e)) from e
E           torch._dynamo.exc.PackageError: Can't pickle local object 'TestGuardSerialization.test_function_locals.<locals>.foo'
```
After the diff
```
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D75452123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154431
Approved by: https://github.com/jansel
2025-05-28 16:03:02 +00:00
9db7bcb3fe [Dynamo] Introduce hook receiving list of traced code objects (#153622)
This PR:
* Expands `Hooks` with a new, optional `frame_traced_fn` field. It should be a callable receiving the list of traced code objects
* Maintains a list of `traced_code` objects in the `TracingContext` of an `OutputGraph`
    *  Whenever an `inline_call()` is encountered, the corresponding code object is added to this set
    * `OutputGraph`'s associated `f_code` is added to the list just before the hook is called

I believe use of this hook should enable the source code hashing that vLLM does in a better way than monkey-patching `inline_call()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153622
Approved by: https://github.com/jansel
2025-05-28 15:40:09 +00:00
476e0a643a [ez] add docblock for ShapeGuardPythonPrinter (#154403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154403
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384, #154385, #154402
2025-05-28 14:17:17 +00:00
473a93eb58 [ez] add docblock for _ShapeGuardPrinter (#154402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154402
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384, #154385
2025-05-28 14:13:22 +00:00
35a473e364 [ez] add docblock for guard_scalar (#154385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154385
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384
2025-05-28 14:10:07 +00:00
ee4f433963 [ez] add docblock for _guard_or (#154384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154384
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383
2025-05-28 14:06:29 +00:00
e9b97d19b1 [ez] Make SymNodeImpl comments less misleading (#154480)
As discussed in the DS workchat, it's easy for users to get confused by
guarding in these supposedly non-guarding methods. The TL;DR is that in the
case of non-Pythonic compilers like XLA, we actually do guard. I've
updated the comments accordingly to reduce confusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154480
Approved by: https://github.com/pianpwk, https://github.com/Skylion007
2025-05-28 14:04:32 +00:00
a75e3a02be Revert "[dynamo, nested graph breaks] small fixes to resume function generation (#151056)"
This reverts commit 28e7aa21c522e92ea01a62dfdc5e3b74e398d8f0.

Reverted https://github.com/pytorch/pytorch/pull/151056 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
9603d6382d Revert "[dynamo, nested graph breaks] refactor codegen to minimize NULL codegen'ing (#153510)"
This reverts commit 1fe98429222a8ba5e16dd9381f50a8fb90edcf0e.

Reverted https://github.com/pytorch/pytorch/pull/153510 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
5fd7004dc9 Revert "[dynamo, nested graph breaks] remove block stack graph break in output_graph (#153772)"
This reverts commit 9a66c30bdc563c62375e5030c4103b67515b8dac.

Reverted https://github.com/pytorch/pytorch/pull/153772 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
e86439ed5b Revert "[dynamo, nested graph breaks] add skip_frame debugging function (#153773)"
This reverts commit aadf9eae63c4793e1107a3b21ede30e5289eeaca.

Reverted https://github.com/pytorch/pytorch/pull/153773 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
203b0efd63 [PP] Allow unused kwargs in ZB path (#153498)
This fixes the case where an unused kwarg is passed to the PP stage forward: we would try to call `torch.autograd.grad()` and update its gradients even though it shouldn't have gradients, leading to this error:

```
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/stage.py", line 613, in
[rank3]:[rank3]: return lambda: stage_backward_input(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_backward.py", line 199, in stage_backward_input
[rank3]:[rank3]: dinputs = torch.autograd.grad(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/__init__.py", line 503, in grad
[rank3]:[rank3]: result = _engine_run_backward(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank3]:[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]:[rank3]: RuntimeError: One of the differentiated Tensors does not require grad
```
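
A minimal standalone reproduction of the underlying autograd error (a sketch of the failure mode, not the pipeline-parallel code path itself):

```python
import torch

# Asking autograd.grad() for an input that does not require grad raises the
# same RuntimeError the ZB path hit for unused kwargs.
x = torch.randn(3, requires_grad=True)
unused = torch.randn(3)  # stands in for an unused kwarg without gradients
y = (x * 2).sum()
torch.autograd.grad(y, (x, unused))
# RuntimeError: One of the differentiated Tensors does not require grad
```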

related issues: https://github.com/pytorch/torchtitan/issues/1188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153498
Approved by: https://github.com/kwen2501
2025-05-28 13:34:04 +00:00
cf7451f279 Fix signature of torch.sparse_coo_tensor() (#152681)
Fixes #145371

@pearu Searched all and find these codes, wondering whether is the root cause of the issue, could you have a review? Thanks a lot!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152681
Approved by: https://github.com/Skylion007, https://github.com/pearu, https://github.com/nikitaved
2025-05-28 13:16:41 +00:00
f58143b945 [Typing] Refactor torch.types.Device in torch/cuda/__init__.py (#153447)
Part of: #152952
Follow up: #153027

Here is the definition of `torch.types.Device`:

ab997d9ff5/torch/types.py (L74)

So `Optional[Union[Device, int]]` is equivalent to `torch.types.Device`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153447
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-28 10:09:31 +00:00
fdc339003b Revert "[AOTI] Support multi-arch when using package_cpp_only (#154414)"
This reverts commit a84d8c4a1cc515db274366537afd0b1492800c2d.

Reverted https://github.com/pytorch/pytorch/pull/154414 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm trunk job ([comment](https://github.com/pytorch/pytorch/pull/154414#issuecomment-2915597821))
2025-05-28 09:23:31 +00:00
853958f82c Fix: Replacements can cause runtime assertions to disappear and can cause invalid inductor code. (#153661)
Let's first explore a couple of problems related to replacements and runtime assertions.

#### example problem 1
Suppose we have a runtime assertion that u0==s0, where u0 is an input coming from mark_unbacked. A replacement u0=s0 will be added, so the function f(u0, s0) becomes f(s0, s0), and the assert is not inserted during insert_deferred_runtime_asserts.
The reason is that the insert_deferred_runtime_asserts logic inserts each assertion once all of its inputs are seen, but u0 will never be seen. The same thing can happen when we defer an assertion on backed symbols, i.e. s0==s2, etc.
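
A minimal sketch of this setup (shapes and names are illustrative, not the exact repro):

```python
import torch
import torch._dynamo

def f(x, y):
    # deferred runtime assertion: u0 == s0
    torch._check(x.size(0) == y.size(0))
    return x + y

x, y = torch.randn(4), torch.randn(4)
torch._dynamo.mark_unbacked(x, 0)     # x.size(0) becomes an unbacked symbol u0
torch.compile(f, dynamic=True)(x, y)  # the u0 == s0 assert is what the fix preserves
```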

#### example problem 2
Consider u0==s0, where u0 comes from a call to .item(). Imagine that later on s0 gets specialized to 2. In that case, s0 as an input won't be seen during insert_deferred_runtime_asserts and the assertion won't be inserted in the graph. Worse, Inductor will generate code that refers to s0 in the cpp wrapper while it does not exist, causing a failure.
internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1669766396994898/

## The solution:
The runtime-assertion insertion loops depend on detecting that the symbols used in the runtime assertions are seen; note that those symbols are either graph inputs or generated in the graph from data-dependent ops like .item().

The issues above happen when the symbols are graph inputs. In order to force those symbols to exist in the graph and to be seen by the runtime assertions, we do not apply replacements on placeholder expressions during codegen or during runtime-assertion insertion.

This should not have performance overhead: since we already optimized the graph with replacements, the only effect is that we no longer mistakenly drop graph inputs that are used in runtime assertions.
I added extended testing. A separate, unrelated follow-up I noticed is that we might want to rename unbacked symbols in runtime assertions when we do unbacked renaming, but that's a different issue.

Other approaches that did not work:
#### Ban replacements on unbacked symbols
1. Does not work when we defer runtime assertions on backed symbols, e.g. s0==s1. We could also ban such replacements,
but then problem 2 becomes more problematic.
2. It affects the quality of reasoning, in a bad way.

#### Apply specializations to runtime assertions before codegen
1. Can fix some issues, but may also lead to runtime assertions becoming no-ops.
2. Does not fix the issue of runtime assertions not being inserted during insert_deferred_runtime_asserts due to an input not being detected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153661
Approved by: https://github.com/jansel
2025-05-28 09:08:05 +00:00
aadf9eae63 [dynamo, nested graph breaks] add skip_frame debugging function (#153773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153773
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510, #153772
2025-05-28 08:54:09 +00:00
9a66c30bdc [dynamo, nested graph breaks] remove block stack graph break in output_graph (#153772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153772
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510
2025-05-28 08:54:09 +00:00
1fe9842922 [dynamo, nested graph breaks] refactor codegen to minimize NULL codegen'ing (#153510)
Stop codegening NULLs that we need to pop later. Some output_graph.py changes to prepare for nested graph break support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153510
Approved by: https://github.com/jansel
ghstack dependencies: #151056
2025-05-28 08:54:09 +00:00
28e7aa21c5 [dynamo, nested graph breaks] small fixes to resume function generation (#151056)
Old: ~pack resume function stack + locals into a list: we need to be able to pass frame stack+locals in lists to hand off to nested functions in the future, so we implement this part first.~

We are no longer doing this right now since GraphModule/guard variable naming gets messed up. Going forward, our approach will be to keep the top frame unpacked, but pack the rest of the contents of other frames in a list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151056
Approved by: https://github.com/jansel
2025-05-28 08:54:09 +00:00
cyy
9d04c0f352 Remove outdated CUDA 11 conditions (#154313)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154313
Approved by: https://github.com/eqy
2025-05-28 08:44:58 +00:00
1d9b7dd2d1 [PGO] suggest dynamic whitelist for recompilations (#154189)
Suggests `TORCH_COMPILE_DYNAMIC_SOURCES` based on tensor size changes in the PGO code state, including parameters.

Closing #153442 which took the dynamo guards approach.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154189
Approved by: https://github.com/bobrenjc93
2025-05-28 07:11:43 +00:00
fe760b6636 [ez] add docblock for _free_unbacked_symbols_with_path (#154383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154383
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381
2025-05-28 05:53:50 +00:00
8e25ba6963 [ez] add docblock for find_symbol_binding_fx_nodes (#154381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154381
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380
2025-05-28 05:44:26 +00:00
08c29deb5f [ez] add docblock to is_symbol_binding_fx_node (#154380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154380
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379
2025-05-28 05:41:19 +00:00
07405a6cff [ez] add docblock for free_unbacked_symbols (#154379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154379
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378
2025-05-28 05:37:25 +00:00
dcdaef5206 [ez] add docblock for free_symbols (#154378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154378
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377
2025-05-28 05:34:25 +00:00
abc3fdc7ac [ez] add docblock for _iterate_exprs (#154377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154377
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405
2025-05-28 05:28:58 +00:00
ab6cb85cb0 [ez] add docblock for _remove_effect_token_unbacked_bindings (#154405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154405
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404
2025-05-28 05:16:14 +00:00
fde8f6a8b8 [ez] add docblock for _suggest_torch_checks (#154404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154404
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376, #154386, #154401
2025-05-28 04:45:55 +00:00
b82fb57b67 [ez] add docblock for RuntimeAssert (#154401)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154401
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376, #154386
2025-05-28 04:43:22 +00:00
d64b4a91dd [ez] remove unused function _constrain_symbol_range (#154386)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154386
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376
2025-05-28 04:41:00 +00:00
ef90cc18d7 use definitely_contiguous for _prim_elementwise_meta short circuit (#153441)
This verifies that the check short circuit is not material. https://github.com/pytorch/pytorch/pull/153431
```
import torch
from torch.export import Dim, export
class MyModel(torch.nn.Module):
    def forward(self, x, ranks):
        first_k = ranks.max().item()
        torch._check_is_size(first_k)
        narrow = x.narrow(dim = 1, start = 0, length = first_k)
        lt = narrow < narrow.size(1)
        return lt
inps = (
    torch.randn((8, 16), device="cuda"),
    torch.arange(8, device="cuda", dtype=torch.int8)
)
spec = {
    "x": (Dim.AUTO, Dim.AUTO),
    "ranks": (Dim.AUTO,),
}
traced = export(MyModel(), inps, dynamic_shapes=spec, strict=True).run_decompositions({})

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153441
Approved by: https://github.com/jansel
ghstack dependencies: #153432
2025-05-28 03:41:26 +00:00
39df901b2a introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors; in that case we can't really evaluate is_contiguous. In many places in the code base we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors, so in that case we probably want to use the definitely_contiguous API.

This is applied to reshape in this PR and also to tensor metadata computation: the metadata now has an attribute that says the tensor is contiguous only when it is always contiguous, and we store that only if definitely_contiguous is true.
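
A minimal sketch of the intended semantics (the names and the three-valued hint are illustrative, not the actual PyTorch internals):

```python
from typing import Optional

def guarding_is_contiguous(hint: Optional[bool]) -> bool:
    # A guarding check must know the answer; with unbacked symbols the
    # answer may be unknown and evaluating it is a data-dependent error.
    if hint is None:
        raise RuntimeError("data-dependent: cannot guard on contiguity")
    return hint

def definitely_contiguous(hint: Optional[bool]) -> bool:
    # Treat "unknown" as False: take the contiguous fast path only when
    # contiguity is provable, and otherwise fall back to the general path.
    return hint is True
```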

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-28 03:41:26 +00:00
54f1f29fed [dynamo] dynamic gb_type -> static gb_type (#154435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154435
Approved by: https://github.com/williamwen42
2025-05-28 03:14:26 +00:00
f12ce4e36b [Intel GPU] convolution fusion at XPU backend (#154202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154202
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/etaf
ghstack dependencies: #140365
2025-05-28 03:14:18 +00:00
c6fc11af76 Fix the Problems About Defining Static Variable in Inline Function (#147095)
Refer to https://github.com/pytorch/pytorch/issues/125465 for more informations

- Remove unused header files
- Move the inline function that defines the static variable to .cc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147095
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-05-28 02:47:16 +00:00
855eff8e8e Don't CSE unbacked nodes (#154387)
* #154440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154387
Approved by: https://github.com/TroyGarden
ghstack dependencies: #154440
2025-05-28 02:21:56 +00:00
919a1a17e3 [ez] Replace misleading implementations with NYI (#154440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154440
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
2025-05-28 02:21:56 +00:00
a84d8c4a1c [AOTI] Support multi-arch when using package_cpp_only (#154414)
Summary: Add support of multi_arch_kernel_binary in the package_cpp_only mode. More specifically, generate specific cmake targets to compile .ptx to .fatbin and embed them in the final shared library or binary.

Differential Revision: [D75452096](https://our.internmc.facebook.com/intern/diff/D75452096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154414
Approved by: https://github.com/angelayi
ghstack dependencies: #154412, #154413
2025-05-28 01:20:38 +00:00
cde82d25b7 [AOTI] Add a multi_arch_kernel_binary option (#154413)
Summary: CUDA can support multi-arch with the fatbin format. Add this multi_arch_kernel_binary option, so the compiled model binary can run across different GPU archs.

Differential Revision: [D75452094](https://our.internmc.facebook.com/intern/diff/D75452094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154413
Approved by: https://github.com/angelayi
ghstack dependencies: #154412
2025-05-28 01:20:38 +00:00
4d8f3d537a [AOTI][refactor] Rename embed_cubin to embed_kernel_binary (#154412)
Summary: Rename as it is not CUDA specific.

Differential Revision: [D75452095](https://our.internmc.facebook.com/intern/diff/D75452095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154412
Approved by: https://github.com/angelayi
2025-05-28 01:20:28 +00:00
e79790e14b [ez] add docblock for _sympy_from_args (#154376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154376
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375
2025-05-27 23:43:13 +00:00
fe082c5ffe Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)
Trying to fix: https://github.com/pytorch/pytorch/issues/154157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154153
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/nike4949, https://github.com/cyyever
2025-05-27 23:16:21 +00:00
3f10c9d8af Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be run correctly (#153245)
Fixes #153239

Replaced the custom decorator with the common one, although the better way to skip the whole suite would be to add it to the skip list in run_test.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153245
Approved by: https://github.com/jeffdaily
2025-05-27 23:10:25 +00:00
4b39832412 [CI] Update torchbench pin (#154453)
Related to https://github.com/pytorch/pytorch/issues/154446
Pins torchbench repo to a https://github.com/pytorch/benchmark/pull/2620 which pins opacus to ``1.5.3`` version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154453
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-05-27 23:08:42 +00:00
247ea229ba Create issue template: Release highlight for proposed Feature (#154125)
Authors: @anitakat @atalman

This is related to: https://github.com/pytorch/pytorch/issues/152134 . Adding RFC template for feature submissions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154125
Approved by: https://github.com/anitakat, https://github.com/ZainRizvi, https://github.com/albanD
2025-05-27 22:45:21 +00:00
53affa273b [MTIA Aten Backend][1.3/n] Migrate remaining view ops, which all need explicit register in native_functions.yaml (#154337)
See context in D75266206.

This diff/PR migrates all the remaining view ops, which all need changes in `native_functions.yaml` and thus need to be exported to PR.

Ops covered by this diff:
- _reshape_alias
- unfold

internal: Also delete the entire aten_mtia_view_ops.cpp file, and update corresponding build config.

Differential Revision: [D75385411](https://our.internmc.facebook.com/intern/diff/D75385411/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154337
Approved by: https://github.com/nautsimon
ghstack dependencies: #154336
2025-05-27 22:18:12 +00:00
eaf355cb11 [BE] Clean up unused parameter input in AOTIModel (#154276)
Summary: As title

Test Plan: CI

Differential Revision: D74691763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154276
Approved by: https://github.com/Skylion007
2025-05-27 22:17:32 +00:00
241f8dc84d Revert "Remove outdated CUDA 11 conditions (#154313)"
This reverts commit 3936e6141c09dab94f21e4fdab7bea4bddf62ac2.

Reverted https://github.com/pytorch/pytorch/pull/154313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/154313#issuecomment-2914230005))
2025-05-27 21:54:41 +00:00
6be829535f [ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)
* Use non-temporal loads to improve the vectorized elementwise kernel performance on MI300
* Use thread_work_size of 8 or 16 for vectorized elementwise kernel

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153634
Approved by: https://github.com/jeffdaily
2025-05-27 20:49:32 +00:00
555fc05868 Revert "[Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)"
This reverts commit 6169ca0b65bcb382faa1a2287278b3717c18f127.

Reverted https://github.com/pytorch/pytorch/pull/154371 on behalf of https://github.com/benjaminglass1 due to Appears to have broken main ([comment](https://github.com/pytorch/pytorch/pull/154371#issuecomment-2913975736))
2025-05-27 20:39:09 +00:00
7359705232 Add CPython tests for unittest (#150788)
Tests:
* test_assertions.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150788
Approved by: https://github.com/williamwen42
2025-05-27 20:26:17 +00:00
12fc06d267 Add CPython complex tests (#152015)
Tests:
* test_complex.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152015
Approved by: https://github.com/williamwen42
2025-05-27 20:24:28 +00:00
3b218e56dc Add CPython tests for iter/sort (#150797)
Tests:
* test_iter.py
* test_sort.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150797
Approved by: https://github.com/williamwen42
2025-05-27 20:22:34 +00:00
4fd8a54a41 [ez] add docblock for is_accessor_node (#154375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154375
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
ghstack dependencies: #154374
2025-05-27 19:47:32 +00:00
b367e5f6a6 [ROCm][Windows] Fix building torch 2.8 wheel with ROCm (added hipblasLt and rocblas directories) (#153144)
Since rocblas.dll and hipblaslt.dll are copied to torch/lib, the rocblas and hipblaslt directories need to be stored there too (otherwise we get an error after wheel installation while searching for files in rocblas/library and hipblaslt/library, which don't exist). This PR fixes this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153144
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-27 19:40:28 +00:00
fa6ca59079 Revert "Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)"
This reverts commit 2bd95f3a1f07132aa00f5c438c5228866d7dd1f8.

Reverted https://github.com/pytorch/pytorch/pull/154153 on behalf of https://github.com/malfet due to Broke inductor tests, see b8452e55bc/1 ([comment](https://github.com/pytorch/pytorch/pull/154153#issuecomment-2913738047))
2025-05-27 19:23:28 +00:00
6169ca0b65 [Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)
Prepares for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

As far as I can tell, this PR should be functionally neutral. One argument was removed from a `cpp_wrapper` public API, but that argument was unused, and only had a single callsite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-05-27 19:17:41 +00:00
75bbd4989c [dynamo] Support using symint from dispatcher-style tensor subclass (#154130)
Fixes #146932.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154130
Approved by: https://github.com/laithsakka
2025-05-27 19:05:46 +00:00
8c0f07f944 Revert "[ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)"
This reverts commit 0d4de7872ac019abbd6e87b3391b2276d9d05bd4.

Reverted https://github.com/pytorch/pytorch/pull/153634 on behalf of https://github.com/malfet due to Broke inductor jobs, see b8452e55bc/1 ([comment](https://github.com/pytorch/pytorch/pull/153634#issuecomment-2913619071))
2025-05-27 19:02:59 +00:00
b8452e55bc [Kineto x Insight] Update Kineto submodule (#154426)
Summary: We add a new ActivityType::MTIA_INSIGHT in 20f652846f

Test Plan: CI

Differential Revision: D75454945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154426
Approved by: https://github.com/Skylion007
2025-05-27 18:29:29 +00:00
5075df6fee Make torch importable if compiled without TensorPipe (#154382)
By delaying the import and hiding it behind the `torch.distributed.rpc.is_tensorpipe_avaiable()` check.
Fixes https://github.com/pytorch/pytorch/issues/154300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154382
Approved by: https://github.com/Skylion007
ghstack dependencies: #154325
2025-05-27 18:13:38 +00:00
f472ea63bb [BE] Fix typos in SyntaxError description (#154436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154436
Approved by: https://github.com/seemethere, https://github.com/wdvr, https://github.com/ZainRizvi
2025-05-27 18:08:58 +00:00
cfbd99fdfd [Pytorch] Add option to CPU Blas GEMM to avoid output downcast (#154012)
Summary:
Dot product for a single output element consists of 3 steps (both input vectors have elements of type scalar_t):
1. elementwise vector multiply (scalar_t x scalar_t -> opmath_t)
2. vector reduction to a scalar value (opmath_t -> opmath_t)
3. optional downcast if opmath_t != out_t

The current blas kernel performs steps 1 and 2 correctly, but for step 3 it will always downcast to scalar_t even when opmath_t == out_t (and then upcast back to out_t), which results in precision loss. This diff fixes the precision loss in the BlasKernel.
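
A small illustration of why the spurious round trip in step 3 loses precision (dtypes and values are illustrative; the real kernel works on accumulated dot products):

```python
import torch

acc = torch.tensor(1.0009765625, dtype=torch.float32)  # opmath_t accumulator
roundtrip = acc.to(torch.bfloat16).to(torch.float32)    # unnecessary downcast + upcast
print(acc.item(), roundtrip.item())  # 1.0009765625 vs 1.0 -- the round trip drops bits
```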

Test Plan: Attention CI passes

Differential Revision: D75023858

topic: not user facing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154012
Approved by: https://github.com/Valentine233, https://github.com/aditew01, https://github.com/CaoE, https://github.com/drisspg
2025-05-27 17:43:21 +00:00
1ca082d9a1 [ez] Rewrite comment to be more friendly to non haskellers (#151421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151421
Approved by: https://github.com/aorenste
2025-05-27 17:32:34 +00:00
70fbd5e08c [ez] Add docblock for resolve_unbacked_bindings (#154374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154374
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
2025-05-27 17:05:49 +00:00
2560c1f3f0 add sticky cache pgo (#154418)
It's a reland of https://github.com/pytorch/pytorch/pull/154394 that hit some mergebot bug

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154418
Approved by: https://github.com/malfet
2025-05-27 16:40:18 +00:00
514409d032 update torchvision pin (#154255)
Fixes #153985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154255
Approved by: https://github.com/desertfire
2025-05-27 16:15:25 +00:00
0ddfd1ed43 [Intel GPU] Enable mkdnn._linear_pointwise at XPU backend (#140365)
# Motivation

This PR is intended to add post-op fusion support for Linear. The linear-pointwise fusion is expected to be used in graph mode, e.g. torch.compile. The FusionUtils.cpp file defines utility APIs for generating primitive attributes. These APIs would also be used for conv-pointwise fusion, which is in #140372.

# Validation
```bash
   python test/xpu/test_fusion.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140365
Approved by: https://github.com/etaf, https://github.com/guangyey, https://github.com/EikanWang
2025-05-27 15:57:15 +00:00
0d4de7872a [ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)
* Use non-temporal loads to improve the vectorized elementwise kernel performance on MI300
* Use thread_work_size of 8 or 16 for vectorized elementwise kernel

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153634
Approved by: https://github.com/jeffdaily
2025-05-27 15:38:43 +00:00
7ae204c3b6 [BE][CI][Easy] Run lintrunner on generated .pyi stub files (#150732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150732
Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/aorenste
2025-05-27 14:58:02 +00:00
0a7eef140b Add torch.Tensor._make_wrapper_subclass to torch/_C/__init__.pyi (#154022)
Fixes #153790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154022
Approved by: https://github.com/Skylion007
2025-05-27 14:10:00 +00:00
d88699308f [CI][MacOS] Move more dependencies to pypi (#154309)
Hopefully last step before all Mac build/tests could be switched away from conda
- Update the cmake version from 3.22 to 3.25, as 3.22 from PyPI seems to be unusable with python-3.12
- Add `--plat-name macosx_11_0_arm64` to the setup.py command
- Remove the `codesign` workaround for cmake (that was probably never really necessary)
- Install `libpng` and `jpeg-turbo` when building torchbench, and build torchaudio without OpenMP (to be fixed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154309
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-27 13:49:40 +00:00
11a51a11af Revert "introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)"
This reverts commit 5c6d7caaaa08f134c3b17ce032cb014527b53417.

Reverted https://github.com/pytorch/pytorch/pull/153432 on behalf of https://github.com/malfet due to Looks like it broke flex attention tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=g6.4xlarge&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/153432#issuecomment-2912562570))
2025-05-27 13:42:34 +00:00
c52a002a22 Add getDeviceProperties api to torch mtia device (#153577)
topic: not user facing

Test Plan: Internal benchmark.

Differential Revision: D74256550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153577
Approved by: https://github.com/nautsimon
2025-05-27 11:55:58 +00:00
2bd95f3a1f Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)
Trying to fix: https://github.com/pytorch/pytorch/issues/154157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154153
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/nike4949, https://github.com/cyyever
2025-05-27 11:53:47 +00:00
6f86c1ce1d Add pyrefly.toml (#154144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154144
Approved by: https://github.com/Skylion007
2025-05-27 10:16:30 +00:00
5c6d7caaaa introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors; in that case we can't really evaluate is_contiguous. In many places in the code base we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors, so in that case we probably want to use the definitely_contiguous API.

This is applied to reshape in this PR and also to tensor metadata computation: the metadata now has an attribute that says the tensor is contiguous only when it is always contiguous, and we store that only if definitely_contiguous is true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-27 08:54:31 +00:00
dec5ab8d98 [MTIA Aten Backend][1.2/n] Migrate as_strided to in-tree, and add unit tests (#154336)
See context in PR https://github.com/pytorch/pytorch/pull/153670

This diff migrates as_strided to in-tree. I found it's not covered by `test_kernel_eager_ci`, so I'm also adding unit tests.

Differential Revision: [D75385404](https://our.internmc.facebook.com/intern/diff/D75385404/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154336
Approved by: https://github.com/nautsimon
2025-05-27 06:32:38 +00:00
ef6306e1c6 Revert "[executorch hash update] update the pinned executorch hash (#153436)"
This reverts commit 8d6139b8d8a75aab5ead4262ff59d48615ebee31.

Reverted https://github.com/pytorch/pytorch/pull/153436 on behalf of https://github.com/malfet due to Broke ET sanity ([comment](https://github.com/pytorch/pytorch/pull/153436#issuecomment-2911206795))
2025-05-27 06:02:14 +00:00
870133b2a0 Use get_device_context in aoti runtime for XPU directly (#154360)
# Motivation
Reuse [c10::xpu::get_device_context](1bebe0424e/c10/xpu/XPUFunctions.h (L27)) directly to reduce overhead, as it returns a cached `sycl::context` managed by PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154360
Approved by: https://github.com/EikanWang
2025-05-27 05:55:59 +00:00
8d89cdceb6 fix a compilation issue when TORCH_XPU_ARCH_LIST is an empty string (#153604)
When `XPU_ARCH_FLAGS` is an empty string, compilation will fail on `C10_STRINGIZE(XPU_ARCH_FLAGS)` in file `torch/csrc/xpu/Module.cpp` on Windows.
This PR fixes this issue by setting `TORCH_XPU_ARCH_LIST` to `""` to avoid an empty string conversion in `C10_STRINGIZE()` when compiling without an AOT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153604
Approved by: https://github.com/guangyey, https://github.com/EikanWang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-05-27 05:26:46 +00:00
8d6139b8d8 [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-05-27 04:54:46 +00:00
912af9b2c2 update torchbench pin (#154256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154256
Approved by: https://github.com/huydhn
2025-05-27 04:40:54 +00:00
8d319607a7 [CPU][Brgemm] add s8s8 GEMM microkernel API (#154358)
As the title. `u8s8` and `u8u8` have already been supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154358
Approved by: https://github.com/leslie-fang-intel, https://github.com/Skylion007, https://github.com/Valentine233
2025-05-27 03:47:56 +00:00
f8010e7b93 [nativert] Move file_util to pytorch core (#153162)
Summary: fbcode//sigmoid/core/common -> fbcode//caffe2/torch/nativert/common

Test Plan: Github CI

Differential Revision: D74328089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153162
Approved by: https://github.com/zhxchen17
2025-05-27 03:42:47 +00:00
70d12ccc3f [Torch] Fix error message formatting in fp8 comparison logic (#153647)
Summary: Using `\` includes all the tabs from the next line in the error message.

Test Plan: Nothing, simply error message fixing

Reviewed By: exclamaforte

Differential Revision: D74539234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153647
Approved by: https://github.com/exclamaforte
2025-05-27 02:51:05 +00:00
100ec0b34a [Inductor] Allow passing in custom lowering dict to register_lowering() (#154344)
This PR adds support for passing a custom lowering dict to `register_lowering()`, which allows systems (e.g. Helion, https://github.com/pytorch-labs/helion/pull/80) that use Inductor to maintain their own lowering dict instead of using the Inductor global `lowerings` dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154344
Approved by: https://github.com/jansel
2025-05-27 01:35:26 +00:00
cyy
3936e6141c Remove outdated CUDA 11 conditions (#154313)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154313
Approved by: https://github.com/eqy
2025-05-27 00:30:14 +00:00
6006352ed3 [BE] Refactor manywheel build scripts (#154372)
1. Remove `CentOS Linux` cases, since it's deprecated
2. Remove logic for old CUDA versions
3. Remove logic for `CUDA_VERSION=12.4` since we deprecated CUDA 12.4 support
4. Simplify setting `USE_CUFILE=1` - only supported on CUDA 12.6 and 12.8 builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154372
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-05-26 23:17:23 +00:00
b643076e4e Revert "[executorch hash update] update the pinned executorch hash (#153436)"
This reverts commit b6868f290e4882f9c895b1c9476327974288eaba.

Reverted https://github.com/pytorch/pytorch/pull/153436 on behalf of https://github.com/malfet due to Broke ET sanity ([comment](https://github.com/pytorch/pytorch/pull/153436#issuecomment-2910692163))
2025-05-26 22:09:16 +00:00
aaf5cc13d9 [EASY] use guard_or_false instead of gso in Meta converter (#154234)
This was added in https://github.com/pytorch/pytorch/pull/141659; the current change keeps the same intention:
"I do not want to fail here if I can't tell whether the size is zero or not."
I am not familiar enough with the code to know whether we need a runtime check here, but looking at the current
implementation it seems that guard_or_false is appropriate to match the current behaviour and has the same effect as guard_size_oblivious here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154234
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164, #154167, #154172
2025-05-26 21:59:52 +00:00
e33feddb72 used guard_or_false instead of guard_size_oblivious inside maybe_reduce (#154172)
This was added in https://github.com/pytorch/pytorch/pull/119562.
The idea in this loop seems to be the following:
```
    if (TORCH_GUARD_SIZE_OBLIVIOUS(size.sym_eq(1))) {
      // NB: we could short circuit this once needs_reduce is true but there's
      // no point since the reduction function will guard on this anyway
      if (!c10::guard_or_false(size.sym_eq(target), __FILE__, __LINE__)) {
        needs_reduce = true;
      }
    } else {
      if (!size.sym_eq(target).expect_true(__FILE__, __LINE__)) {
        fail();
      }
    }
  ```
  1. If we know size == 1:
       1.1: if we know for sure that size == target --> no reduce needed.
       1.2: if we know for sure that size != target --> we do the reduction.
       1.3: if we cannot tell whether size == target or not --> we do the reduction.
  2. If we do not know whether size == 1:
     we add a runtime assertion that size == target, and we fail at runtime if size is not equal to target.

We could have simplified 1.1 and always done the reduction there, since doing 1.3 without runtime checks implies
that it is safe, but I suspect the reason for the distinction is performance.

Anyway, using TORCH_GUARD_OR_FALSE instead of TORCH_GUARD_SIZE_OBLIVIOUS here is appropriate:
there is really no clear reason for size-oblivious reasoning, or for this logic not to apply when size is not size-like.
Size is always >= 0 anyway, but bad reasoning can prevent us from inferring that, even though we know it is true here.

 python test/dynamo/test_misc.py -k test_validate_outputs_unbacked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154172
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164, #154167
2025-05-26 21:59:52 +00:00
ab5137b048 used guard_or_false instead of guard_size_oblivious in is_int_or_symint (#154167)
This is a short circuit that we should not fail on. Before this PR we would not fail on u0 or u0+u1,
but only if they are size-like; we would still fail on u0-u1, etc., for no reason.
guard_or_false seems appropriate for that reason.

This was added in https://github.com/pytorch/pytorch/pull/122145; there were no unit tests for me to verify
why it was added, and I could not repro using the associated issue (the example does not work).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154167
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164
2025-05-26 21:59:45 +00:00
1da2cc52bc [EASY] remove guard_size_oblivious from is_nonzero proxy call check (#154164)
This was added in https://github.com/pytorch/pytorch/pull/149637;
torch._check can handle unbacked symbols, so there is no need for size-oblivious reasoning here.

Note this does not make is_nonzero unbacked-friendly, but that is a different story.
I ran the test added in https://github.com/pytorch/pytorch/pull/149637 for verification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154164
Approved by: https://github.com/aorenste, https://github.com/bobrenjc93
ghstack dependencies: #154154
2025-05-26 21:59:29 +00:00
f8a2998832 [EASY] used guard_or_false instead of guard_sizes_oblivious in pointless_view (#154154)
The change is direct and clear: the optimization removes pointless_view iff all sizes are the same; if not, we want to return false. There is no need for size-oblivious reasoning.

This was added in https://github.com/pytorch/pytorch/pull/139136; run the existing tests that were added in that PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154154
Approved by: https://github.com/bobrenjc93
2025-05-26 21:59:21 +00:00
e89ee1e217 Pin almalinux version to 8.10-20250519 (#154367)
This PR pins Almalinux version to latest supported 8.10

This is related to: https://github.com/pytorch/pytorch/pull/154364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154367
Approved by: https://github.com/jeanschmidt, https://github.com/wdvr, https://github.com/malfet, https://github.com/huydhn
2025-05-26 20:08:20 +00:00
839c9c6156 Use property instead of ClassVar for Uniform.arg_constraints and Wishart.arg_constraints (#154361)
Fixes #154355

For these two distributions, the constraints depend on the actual values, and so `arg_constraints` cannot be a `ClassVar`.
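
A minimal sketch of the pattern (an illustrative toy class, not the actual torch.distributions source):

```python
from torch.distributions import constraints

class ToyUniform:
    def __init__(self, low, high):
        self.low, self.high = low, high

    @property
    def arg_constraints(self):
        # A class-level ClassVar could not reference self.low, so a
        # value-dependent constraint has to live behind a property.
        return {
            "low": constraints.real,
            "high": constraints.greater_than(self.low),
        }
```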

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154361
Approved by: https://github.com/Skylion007
2025-05-26 17:48:28 +00:00
3f64502c98 Revert "Re-enable FakeTensor caching for SymInts (#152662)"
This reverts commit 7d11c61c26c596076613aa0111892f7cbccae32e.

Reverted https://github.com/pytorch/pytorch/pull/152662 on behalf of https://github.com/malfet due to Looks like it broke bunch of inductor tests, see 187d38185e/1 ([comment](https://github.com/pytorch/pytorch/pull/152662#issuecomment-2910293593))
2025-05-26 17:13:22 +00:00
187d38185e [cutlass backend] Do not raise hard error when re worker has cuda compilation error (#154173)
fbcode specific

Differential Revision: D75262641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154173
Approved by: https://github.com/bertmaher
2025-05-26 17:10:36 +00:00
f55f2f42a7 Add missing docstring for sym_ite (#154201)
`sym_ite` is listed in [the reference page](https://docs.pytorch.org/docs/stable/torch.html) and has no documentation.
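
For reference, a tiny usage sketch of the function being documented (plain Python booleans shown; with a SymBool the branch stays symbolic):

```python
import torch

# sym_ite(b, t, f) behaves like `t if b else f`, but also accepts SymBool/SymInt
print(torch.sym_ite(True, 1, 2))   # 1
print(torch.sym_ite(False, 1, 2))  # 2
```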

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154201
Approved by: https://github.com/Skylion007
2025-05-26 15:59:21 +00:00
02445ec8f0 Almalinux image, install glibc-langpack-en (#154364)
After update to: https://hub.docker.com/layers/amd64/almalinux/8/images/sha256-4f63eb966695df3c993deeacec7c73d87728e2ea66d3b48fed4b40cb547fa7c2

Started seeing warning: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
and random Segfaults when using python like:
https://github.com/pytorch/test-infra/actions/runs/15216565225/job/42901732536
```
+++ python -c 'import torch'
./check_binary.sh: line 258:  2276 Segmentation fault      (core dumped) python -c 'import torch'
```

Installing langpack does  resolve these issues: https://github.com/pytorch/test-infra/actions/runs/15256338815/job/42904808826#step:15:2311

Almalinux Docker build without setlocale warning:
https://github.com/pytorch/pytorch/actions/runs/15030284546/job/42240978131

Almalinux Docker build with setlocale warning:
https://github.com/pytorch/pytorch/actions/runs/15246391200/job/42873875745#step:3:7180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154364
Approved by: https://github.com/Skylion007, https://github.com/jeanschmidt
2025-05-26 15:56:42 +00:00
4b0ee3f4f2 [BE] Do not templatize unnecessarily (#154305)
`${{ os.runner }}` would always evaluate to macOS for those files,
and the architecture is always ARM64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154305
Approved by: https://github.com/atalman
2025-05-26 15:00:48 +00:00
7ab4fae62a Fix s390x vectorization compilation in inductor (#153946)
Fix s390x vectorization compilation in inductor.

One of the failing tests is
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_add_complex_cpu
but it is still disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153946
Approved by: https://github.com/malfet, https://github.com/jgong5
2025-05-26 12:54:25 +00:00
1bebe0424e Fix platform detection in MKLDNN CMake file (#142067)
When building PyTorch with `USE_XPU=True` and Clang,
the user sees misleading errors related to incorrect platform
detection that assumes that all users that are not using the GNU
compilers are on Windows. We can fix this by simply using CMake's
builtin platform detection variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142067
Approved by: https://github.com/EikanWang, https://github.com/min-jean-cho, https://github.com/guangyey
2025-05-26 06:09:37 +00:00
21e42c5d62 More descriptive error message for torch.nanmean() with complex dtypes (#153252)
Fixes #153132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153252
Approved by: https://github.com/colesbury
2025-05-26 05:42:57 +00:00
b6868f290e [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-05-26 04:43:10 +00:00
7d11c61c26 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There have been a lot of dynamic shape fixes this year and tests pass, so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-26 04:17:56 +00:00
062387fb53 [SymmMem] Speed up tests (#153677)
Use `MultiProcContinousTest` to avoid re-creating the ProcessGroup in each test instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153677
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/ngimel
ghstack dependencies: #153653
2025-05-26 03:39:11 +00:00
8c16d0e404 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return code for *negative* distributed tests, which check the effectiveness of NaN asserts, watchdog throws, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.
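
A minimal sketch of that kind of check (generic Python, not the actual test helper added by this PR):

```python
import signal
import subprocess
import sys

# Run a child that aborts, then assert on the process return code instead of
# catching an exception; on POSIX a SIGABRT death yields returncode == -6.
proc = subprocess.run([sys.executable, "-c", "import os; os.abort()"])
assert proc.returncode == -signal.SIGABRT
```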

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-26 00:56:05 +00:00
b04852e404 Fix deterministic indexing with broadcast (#154296)
Fixes #79987, now for real.
Also removed thrust sort path that was needed for cuda <=11.2 because we no longer support it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154296
Approved by: https://github.com/soumith
2025-05-25 21:14:50 +00:00
c3100067ae [ONNX] Update onnx to 1.18 (#153746)
Update onnx python package to 1.18.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153746
Approved by: https://github.com/titaiwangms, https://github.com/cyyever, https://github.com/malfet
2025-05-25 20:58:47 +00:00
43b2716e89 PYFMT lint grandfathered files 1 (#154261)
lint:
- test/test_fake_tensor.py
- test/test_flop_counter.py
- torch/_export/verifier.py

with the same rules as other files. It was a nightmare for me to update tests in one of the skipped files
without being able to lint them locally like other files with lintrunner -a.
Note that those files have active development; they are not old, untouched files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154261
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-05-25 17:36:14 +00:00
5677ab9aab [BE] Correctly pass exceptions raised from rpc_init to CPython (#154325)
By decorating function body with `HANDLE_TH_ERRORS`

Partially addresses https://github.com/pytorch/pytorch/issues/154300

I.e., after that change, importing torch no longer crashes but raises a readable (and actionable) exception:
```
>>> import torch
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    import torch
  File "/Users/malfet/git/pytorch/pytorch/torch/__init__.py", line 2134, in <module>
    from torch import _VF as _VF, functional as functional  # usort: skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/functional.py", line 8, in <module>
    import torch.nn.functional as F
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/__init__.py", line 8, in <module>
    from torch.nn.modules import *  # usort: skip # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/modules/__init__.py", line 2, in <module>
    from .linear import Bilinear, Identity, LazyLinear, Linear  # usort: skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/modules/linear.py", line 7, in <module>
    from torch.nn import functional as F, init
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/functional.py", line 11, in <module>
    from torch._jit_internal import (
    ...<5 lines>...
    )
  File "/Users/malfet/git/pytorch/pytorch/torch/_jit_internal.py", line 42, in <module>
    import torch.distributed.rpc
  File "/Users/malfet/git/pytorch/pytorch/torch/distributed/rpc/__init__.py", line 37, in <module>
    from torch._C._distributed_rpc import (  # noqa: F401
    ...<33 lines>...
    )
ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154325
Approved by: https://github.com/Skylion007
2025-05-25 17:01:45 +00:00
31ae07b5e7 [CI] Do not install libuv on MacOS (#154307)
It's a tensorpipe submodule and is built from source.
Same for `dataclasses`, as it's only needed for Python 3.6.
Also get rid of `nvidia-ml-py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154307
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #154304
2025-05-25 15:30:38 +00:00
6968386385 [BE] Sort requirements files alphabetically (#154304)
Using `sort` tool
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154304
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-25 15:30:38 +00:00
ed27ee8355 Bump setuptools from 70.0.0 to 78.1.1 in /tools/build/bazel (#154075)
Bumps [setuptools](https://github.com/pypa/setuptools) from 70.0.0 to 78.1.1.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pypa/setuptools/blob/main/NEWS.rst">setuptools's changelog</a>.</em></p>
<blockquote>
<h1>v78.1.1</h1>
<h2>Bugfixes</h2>
<ul>
<li>More fully sanitized the filename in PackageIndex._download. (<a href="https://redirect.github.com/pypa/setuptools/issues/4946">#4946</a>)</li>
</ul>
<h1>v78.1.0</h1>
<h2>Features</h2>
<ul>
<li>Restore access to _get_vc_env with a warning. (<a href="https://redirect.github.com/pypa/setuptools/issues/4874">#4874</a>)</li>
</ul>
<h1>v78.0.2</h1>
<h2>Bugfixes</h2>
<ul>
<li>Postponed removals of deprecated dash-separated and uppercase fields in <code>setup.cfg</code>.
All packages with deprecated configurations are advised to move before 2026. (<a href="https://redirect.github.com/pypa/setuptools/issues/4911">#4911</a>)</li>
</ul>
<h1>v78.0.1</h1>
<h2>Misc</h2>
<ul>
<li><a href="https://redirect.github.com/pypa/setuptools/issues/4909">#4909</a></li>
</ul>
<h1>v78.0.0</h1>
<h2>Bugfixes</h2>
<ul>
<li>Reverted distutils changes that broke the monkey patching of command classes. (<a href="https://redirect.github.com/pypa/setuptools/issues/4902">#4902</a>)</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>Setuptools no longer accepts options containing uppercase or dash characters in <code>setup.cfg</code>.</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="8e4868a036"><code>8e4868a</code></a> Bump version: 78.1.0 → 78.1.1</li>
<li><a href="100e9a61ad"><code>100e9a6</code></a> Merge pull request <a href="https://redirect.github.com/pypa/setuptools/issues/4951">#4951</a></li>
<li><a href="8faf1d7e0c"><code>8faf1d7</code></a> Add news fragment.</li>
<li><a href="2ca4a9fe47"><code>2ca4a9f</code></a> Rely on re.sub to perform the decision in one expression.</li>
<li><a href="e409e80029"><code>e409e80</code></a> Extract _sanitize method for sanitizing the filename.</li>
<li><a href="250a6d1797"><code>250a6d1</code></a> Add a check to ensure the name resolves relative to the tmpdir.</li>
<li><a href="d8390feaa9"><code>d8390fe</code></a> Extract _resolve_download_filename with test.</li>
<li><a href="4e1e89392d"><code>4e1e893</code></a> Merge <a href="https://github.com/jaraco/skeleton">https://github.com/jaraco/skeleton</a></li>
<li><a href="3a3144f0d2"><code>3a3144f</code></a> Fix typo: <code>pyproject.license</code> -&gt; <code>project.license</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4931">#4931</a>)</li>
<li><a href="d751068fd2"><code>d751068</code></a> Fix typo: pyproject.license -&gt; project.license</li>
<li>Additional commits viewable in <a href="https://github.com/pypa/setuptools/compare/v70.0.0...v78.1.1">compare view</a></li>
</ul>
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154075
Approved by: https://github.com/Skylion007

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-05-25 15:13:03 +00:00
c113cf5a8f [BE] Remove unused conda-env-Linux-X64 (#154303)
According to https://github.com/search?type=code&q=conda-env-++repo%3Apytorch%2Fpytorch it's not referenced anywhere and was replaced with `conda-env-ci` a while ago.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154303
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-25 14:24:28 +00:00
d8aed0703e [BE][Ez]: Enable ruff rule PLW1507. os.environ is not copied (#154120)
Enables a ruff rule that checks against copying os.environ, since it's actually a proxy object, not a dict, so a shallow copy is a no-op, which is rarely the desired behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154120
Approved by: https://github.com/malfet
2025-05-25 14:22:57 +00:00
54932d865e Revert "[c10d] Add support for testing SIGABRT return (#153167)"
This reverts commit 03e102dbe8cbffc2e42a3122b262d02f03571de7.

Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to It broke lint ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2907820789))
2025-05-25 13:17:27 +00:00
c4ef4090c5 Fix segfault on exit in CachingHostAllocator by signaling background thread to exit (#154117)
Fixes #152008

This PR fixes a segmentation fault that occurred when exiting the program due to improper background thread management in CachingHostAllocator.

Previously, the background thread continued running and called process_events() even after the allocator object was destroyed, leading to a crash on exit.

f12d8d60b1/aten/src/ATen/core/CachingHostAllocator.h (L218)

```cpp
// Launch the background thread and process events in a loop.
static bool background_thread_flag [[maybe_unused]] = [this] {
  getBackgroundThreadPool()->run([&]() {
    while (true) {
      process_events();  // <-- This line may cause segfault on exit
      std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
  });
  return true;
}();
```

The fix adds a mechanism to signal the background thread to exit before the object is destructed, ensuring the thread stops safely.
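As a hedged, Python-level analogue of that pattern (not the actual C++ allocator code), the owner sets a stop flag and joins the thread before its state is torn down:

```python
import threading
import time

class BackgroundWorker:
    """Toy analogue: a stop event lets the owner ask the loop to exit first."""

    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while not self._stop.is_set():
            # process_events() would run here in the real allocator
            time.sleep(0.0001)

    def close(self):
        self._stop.set()     # signal the loop to exit ...
        self._thread.join()  # ... and wait for it before state is destroyed
```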
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154117
Approved by: https://github.com/ngimel, https://github.com/cyyever
2025-05-25 07:46:12 +00:00
9d922b55ef [Distributed][CI] Rework continuous TestCase (#153653)
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple test classes in one file).

2. The child processes now run an infinite loop, monitoring test IDs passed from the main process via a task queue. Reciprocally, the child processes inform the main process of each test's completion via a completion queue (see the sketch below).

3. Added a test template.
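A minimal sketch of that task/completion-queue loop (generic multiprocessing code; `run_test` is a hypothetical stand-in, not the actual harness):

```python
def run_test(test_id):
    # Hypothetical per-test entry point; the real harness dispatches to unittest.
    print(f"running {test_id}")

def worker(task_queue, completion_queue):
    # Child-process loop: task_queue/completion_queue are multiprocessing.Queue
    # objects shared with the main process. Run test IDs until a None sentinel.
    while True:
        test_id = task_queue.get()
        if test_id is None:
            break
        run_test(test_id)
        completion_queue.put(test_id)
```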

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
2025-05-25 03:49:29 +00:00
03e102dbe8 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return code from *negative* distributed tests, which check for the effectiveness of NaN asserts, watchdog throws, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-25 03:48:34 +00:00
10c51b11ff Bump protobuf version and refactor tensorboard tests (#154244)
In preparation for https://github.com/pytorch/pytorch/pull/153746, I am bumping protobuf to 5.29.4 and fixing the tensorboard tests first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154244
Approved by: https://github.com/malfet, https://github.com/cyyever
2025-05-25 00:50:07 +00:00
53ecb8159a Introduce statically_known_false (#154291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154291
Approved by: https://github.com/mengluy0125
2025-05-24 14:23:55 +00:00
2dfc0e3327 [Inductor UT] Reuse test_fused_attention.py for Intel GPU. (#154110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154110
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/EikanWang
2025-05-24 09:51:33 +00:00
cyy
8fe7ec6721 Add /Zc:preprocessor for torch libraries in MSVC builds (#147825)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147825
Approved by: https://github.com/janeyx99
2025-05-24 06:57:46 +00:00
6503b4a96e Update to using mypy 1.15 (#154054)
The BC break isn't real - mypy decided to start complaining about the way we were typing that function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154054
Approved by: https://github.com/Skylion007
2025-05-24 04:30:57 +00:00
76ed9db468 [cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)
Also enables unified workspaces by default for non-FBCODE use cases.
Default Lt workspace size is also updated to match cuBLAS logic for default, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).

Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-24 03:43:35 +00:00
1ab2993345 Add a link to transformer_building_blocks tutorial (#154281)
Cross-link to https://docs.pytorch.org/tutorials/intermediate/transformer_building_blocks.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154281
Approved by: https://github.com/mikaylagawarecki
2025-05-24 02:50:24 +00:00
e904d01c16 Make inductor UT to be generic (#154196)
# Motivation
https://github.com/pytorch/pytorch/pull/151773 introduces the UT `test_triton_template_generated_code_caching`, which fails on XPU;
https://github.com/pytorch/pytorch/pull/153895 introduces the UT `test_mutation_rename`, which fails on XPU.

Fixes https://github.com/pytorch/pytorch/issues/154218

# Additional Context
With this PR, both failing UTs pass on a local machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154196
Approved by: https://github.com/jansel
2025-05-24 02:47:46 +00:00
a19f2cdf29 [draft export] skip when no LOC found (#154190)
Couldn't repro error, but verified fix with @ColinPeppler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154190
Approved by: https://github.com/ColinPeppler
2025-05-24 02:29:34 +00:00
975bbc63db [MPS][BE] Move fmod/remainder to Metal ops (#154280)
This accomplishes following:
 - Fixes a correctness problem with large integer types (though it probably makes the op slower; this could not be avoided if one wants to compute an accurate answer)
 - Makes op faster for floating point types (as Metal kernel invocation is faster than creating MPSGraph)
 - Eliminates need for several correctness workarounds

Fixes https://github.com/pytorch/pytorch/issues/154171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280
Approved by: https://github.com/dcci
ghstack dependencies: #154275, #154290
2025-05-24 01:45:33 +00:00
8f08bdb7f2 [MPS][BE] Code dedup (#154290)
Eliminate some copy-pasta by introducing `REGISTER_FLOAT_BINARY_OP` and `REGISTER_INTEGER_BINARY_OP` macros
Use `_METAL_310_PLUS` to guard bfloat dtype use
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154290
Approved by: https://github.com/yangw-dev, https://github.com/wdvr
ghstack dependencies: #154275
2025-05-24 01:41:31 +00:00
e5f63f4f66 [CI] Move Mac testing to 3.12 (#154177)
Prep step to completely move away from Conda during the builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154177
Approved by: https://github.com/huydhn, https://github.com/cyyever, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271, #154269, #154270
2025-05-24 01:41:20 +00:00
11a490f32f [CI] Reuse old whl on more workflows (#154285)
Still only on the main branch, not PRs, so that we can monitor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154285
Approved by: https://github.com/malfet
2025-05-24 01:25:35 +00:00
308beeeb56 [dynamo] Use UUID for compiled function variable names. (#154148)
Summary:
We previously assigned each compiled function variable a name based on an in-process global counter. This works fine within the same process, but when we're trying to serialize the state with precompile, we need a way to load these compiled functions back without colliding with the existing global scope.

Changing the counter to a true global UUID seems to resolve this issue.

For example, the new variable name will look like:
```
__compiled_fn_0_7ce7d872_4fe8_4174_b8fd_2496b09b8b43
```
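For illustration only, a name of that shape can be generated like this (a sketch, not the exact dynamo code):

```python
import uuid

def compiled_fn_name(counter: int) -> str:
    # Produces names like "__compiled_fn_0_7ce7d872_4fe8_4174_b8fd_2496b09b8b43"
    return f"__compiled_fn_{counter}_{str(uuid.uuid4()).replace('-', '_')}"
```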

Test Plan: CI

Differential Revision: D75244901

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154148
Approved by: https://github.com/jansel
2025-05-24 01:08:42 +00:00
7ba6fb69e6 [Inductor][CPP] Enable vectorized fp8 E5M2 quant dequant (#153365)
**Summary**
This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E5M2` `quant` from `float32` and `dequant` to `float32`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e5m2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153365
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #152417, #152418, #153364
2025-05-23 23:20:02 +00:00
84b657d0b5 Add Vectorized FP8 E5M2 (#153364)
**Summary**
This PR mainly adds the `Vectorized<Float8_e5m2>` class to support the vectorization of `FP8 E5M2` with:

- Conversion to/from `Vectorized<float>`
- Common vectorized methods like `mul`, `abs`, `eq`, etc.

**Test Plan**
```
./build/bin/vec_test_all_types_AVX512 --gtest_filter=FP8E5M2Test.*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153364
Approved by: https://github.com/jgong5, https://github.com/CaoE, https://github.com/vkuzo
ghstack dependencies: #152417, #152418
2025-05-23 23:11:25 +00:00
b77a6504fa [Inductor][CPP] Enable vectorized fp8 quant dequant (#152418)
**Summary**
This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E4M3` `quant` from `float32` and `dequant` to `float32`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e4m3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152418
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/CaoE
ghstack dependencies: #152417
2025-05-23 23:05:17 +00:00
080b74ce67 Add Vectorized FP8 E4M3 (#152417)
**Summary**
This PR mainly adds the `Vectorized<Float8_e4m3fn>` class to support the vectorization of `FP8 E4M3` with:

- Conversion to/from `Vectorized<float>`
- Common vectorized methods like `mul`, `abs`, `eq`, etc.

**Test Plan**
```
./build/bin/vec_test_all_types_AVX512 --gtest_filter=FP8E4M3Test.*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152417
Approved by: https://github.com/mingfeima, https://github.com/CaoE, https://github.com/yanbing-j, https://github.com/jgong5, https://github.com/vkuzo
2025-05-23 22:56:56 +00:00
bab59d3c28 Upgrade to CUDA 12.8.1 for nightly binaries (#152923)
Upgrade current CUDA 12.8 builds to 12.8.1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152923
Approved by: https://github.com/atalman
2025-05-23 22:37:05 +00:00
f0b2706914 remove sleef_arm target (#154166)
Summary:
X-link: https://github.com/pytorch/executorch/pull/11082

We shouldn't need an ARM-specific variant; we have select() where we need it.

Test Plan: CI

Reviewed By: nlutsenko

Differential Revision: D74356413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154166
Approved by: https://github.com/kimishpatel, https://github.com/malfet, https://github.com/Skylion007
2025-05-23 22:16:01 +00:00
86a160353e [BE] Don't run windows builds in pull.yml (#154264)
We already run Windows builds and tests [during trunk.yml](c13eeaa718/.github/workflows/trunk.yml (L115-L130)).

Spot-checking failures of this job in pull.yml shows that most of the time it fails, the failure correlates with other build jobs failing as well, so it's not offering much unique signal.

Given that we'll run this job before merging the PR as part of trunk.yml anyway, the trade-off of getting a Windows build signal a little earlier doesn't seem worth the infra investment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154264
Approved by: https://github.com/malfet
2025-05-23 22:03:19 +00:00
65f0cf3df5 [mergebot] Do not block on autoformat workflow (#154236)
Helps with https://github.com/pytorch/pytorch/issues/154084

Merge sometimes fails due to autoformat failing. I believe it's because the author doesn't have write perms/workflow-running perms -> needs approval for workflows. On merge, the bot adds the merge label -> triggers the autoformat workflow -> needs approval (even though it will end up getting skipped because the label doesn't match) -> merge sees this and fails.

So I put an ugly exception for the workflow in mergebot.

Some restrictions to keep in mind:
* Need to check out the PR's code changes to run lint/format on them -> possible security issue if someone modifies a linter/formatter
* The (third party) reusable action used in the autoformat workflow requires the trigger to be pull_request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154236
Approved by: https://github.com/malfet
2025-05-23 22:00:34 +00:00
bb17f9c98b [AOTAutogradCache] Fix CHROMIUM_EVENT_LOG being none (#154258)
It turns out that if you import a name that's None at import time in Python, and later update the value, the one you imported stays None:

```
import torch
from torch._dynamo.utils import CHROMIUM_EVENT_LOG
class Foo:
  pass
torch._dynamo.utils.CHROMIUM_EVENT_LOG =  Foo()

print(CHROMIUM_EVENT_LOG) # None
```

This fixes the bug so we get AOTAutogradCache instant events again.
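For reference, the usual fix for this Python pattern is to look the attribute up on the module at call time instead of binding the name at import time; a minimal sketch (not necessarily the exact change in this PR):

```python
import torch._dynamo.utils as dynamo_utils

def get_chromium_log():
    # Reading the attribute off the module at call time observes any later
    # reassignment, unlike `from torch._dynamo.utils import CHROMIUM_EVENT_LOG`.
    return dynamo_utils.CHROMIUM_EVENT_LOG
```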

Differential Revision: [D75305770](https://our.internmc.facebook.com/intern/diff/D75305770/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154258
Approved by: https://github.com/oulgen
2025-05-23 21:53:31 +00:00
0e4f1b8a06 [CI] Update MacOS conda requirements (#154270)
Pick package versions which are compatible with both 3.9 and 3.12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154270
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271, #154269
2025-05-23 21:44:50 +00:00
5db1503846 [CI] Update MacOS numba and scipy versions (#154269)
Pick versions that are supported by both 3.9 and 3.12.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154269
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271
2025-05-23 21:44:49 +00:00
aa3eab2ce6 Fix tcp init when using port 0 (#154156)
I hit this in tests when calling `init_process_group(init_method="tcp://localhost:0", ...)`. You can't use port 0 due to a bug in the conditional, and you get the error `ValueError: Error initializing torch.distributed using tcp:// rendezvous: port number missing`.
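The pitfall behind it is the usual Python truthiness one; a small self-contained sketch of the distinction (using urllib parsing, not the actual rendezvous code):

```python
from urllib.parse import urlparse

url = urlparse("tcp://localhost:0")

# Buggy style: port 0 is falsy, so an explicitly given port is treated as missing.
if not url.port:
    print("port number missing")  # fires even though a port *was* supplied

# Correct style: only a truly absent port should be rejected.
if url.port is None:
    print("port number missing")
```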

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154156
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
2025-05-23 21:41:58 +00:00
3c0b93afc5 Re-enable link linter (#153280)
And make the URL linter always succeed for now.
I'll monitor the logs manually and experiment with it further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153280
Approved by: https://github.com/albanD
2025-05-23 20:56:25 +00:00
6f34d141ab [MPS][BE] Delete complex_div (#154275)
An absolute no-op: delete `complex_div` from `UnaryKernel.metal` and use identical one from `c10/metal/utils.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154275
Approved by: https://github.com/dcci
2025-05-23 20:53:50 +00:00
dec6a47996 [BE] Delete unused pip-requirements-iOS.txt (#154271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154271
Approved by: https://github.com/clee2000
ghstack dependencies: #154237, #154268
2025-05-23 20:08:19 +00:00
acd0873d3b [CI] Fix TestDynamoTimed.test_ir_count for 3.12 (#154268)
Python 3.12 emits the same bytecode as 3.13 for the code in question.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154268
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237
2025-05-23 20:08:19 +00:00
28af44285b Revert "[c10d] Add support for testing SIGABRT return (#153167)"
This reverts commit 499a76b844bbcbc5465cb76c617b3076c1b0fd65.

Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to Broke lint, see fe784c5a2c/1 ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2905623868))
2025-05-23 19:44:08 +00:00
fe784c5a2c Fix torchbind path in AOTI package loader (#154265)
Summary: as titled, fix the path in the package loader and fix the test to take the additional dir into consideration.

Test Plan:
```
buck run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:torchbind
```

Reviewed By: angelayi

Differential Revision: D75308904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154265
Approved by: https://github.com/clee2000, https://github.com/malfet
2025-05-23 19:32:53 +00:00
90855835ff Revert "[AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)"
This reverts commit 269fa8028f68b29176e21886108634f48b1eced7.

Reverted https://github.com/pytorch/pytorch/pull/154155 on behalf of https://github.com/henrylhtsang due to mistake in PR ([comment](https://github.com/pytorch/pytorch/pull/154155#issuecomment-2905514934))
2025-05-23 19:08:40 +00:00
3b21d79225 [export] Move PT2ArchiveWriter/Reader to torch/export (#153795)
Summary:
Before:
`from sigmoid.core.package.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_sigmoid_package`
After:
`from torch.export.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_pt2_package`

By merging the two PT2ArchiveReader/Writers, into using the native PytorchFileReader/Writer, the open source PT2 archive also changed to have an additional folder. However this PR still maintains support for loading an old PT2 archive which does not have the additional folder.

Before:
```
├── archive_format
├── byteorder
├── .data
│   ├── serialization_id
│   └── version
├── data
│   ├── aotinductor

```
After:
```
├── tmp
│   ├── archive_format
│   ├── byteorder
│   ├── .data
│   │   ├── serialization_id
│   │   └── version
│   ├── data
│   │   ├── aotinductor
```

Test Plan:
`buck2 test //sigmoid/...`
https://www.internalfb.com/intern/testinfra/testrun/5348024839248187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153795
Approved by: https://github.com/zhxchen17
2025-05-23 19:04:36 +00:00
499a76b844 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return code from *negative* distributed tests, which check for the effectiveness of NaN asserts, watchdog throws, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-23 19:04:28 +00:00
561a11aa68 Revert "Patch the _is_conv_node function (#153749)"
This reverts commit c985cec5b2545d46af682d486b18866eee5dffd5.

Reverted https://github.com/pytorch/pytorch/pull/153749 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153749#issuecomment-2905504697))
2025-05-23 19:04:20 +00:00
4ff19ecf66 Revert "[export] Move PT2ArchiveWriter/Reader to torch/export (#153795)"
This reverts commit 7e80f23516a86e18ae5bc5579d3005c1e7610102.

Reverted https://github.com/pytorch/pytorch/pull/153795 on behalf of https://github.com/malfet due to Looks like it broke lots of tests, see ec368a1903/1 ([comment](https://github.com/pytorch/pytorch/pull/153795#issuecomment-2905415496))
2025-05-23 18:29:08 +00:00
ec368a1903 Add sitemap (#154158)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154158
Approved by: https://github.com/albanD
2025-05-23 18:01:00 +00:00
0d62fd5c3c [MTIA Aten Backend][2/n] Migrate clamp ops(clamp.out/clamp_min.out/clamp_max.out) from out-of-tree to in-tree (#154015)
Summary:
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This PR
1. Migrate 3 clamp ops from out-of-tree to in-tree (had to migrate the 3 ops together, because clamp.out calls all 3 stubs, which are also called by the other 2 ops):
- clamp.out
- clamp_min.out
- clamp_max.out
2. Also enabled structured kernel codegen for MTIA, which is needed by clamp.
3. Also introduced the `--mtia` flag to torchgen to prevent OSS from generating MTIA code. (Otherwise we get a link error like `lib/libtorch_cpu.so: undefined reference to at::detail::empty_mtia`.)

Differential Revision: D74674418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154015
Approved by: https://github.com/albanD, https://github.com/nautsimon
2025-05-23 17:59:47 +00:00
bcb2125f0a [BE][CI] Update expecttest version to 0.3.0 (#154237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154237
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/atalman
2025-05-23 17:27:41 +00:00
cae25ef4e5 [c10d] Enhance Error Logging in new_subgroups() for Non-Divisible World Sizes (#154124)
Summary: The error caused by the world size not being divisible by `group_size` is a common issue encountered by end-users when utilizing applications built on top of `new_subgroups()`. However, these applications may employ different variable names, such as `num_trainers_per_group`, which can make the current error messages less effective despite being correct. To address this, we have improved the error messages to display the actual numbers involved, thereby enhancing their clarity and usefulness.
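For instance, a check of that kind might look like the following (an illustrative sketch, not the exact message added in the diff):

```python
def check_group_size(world_size: int, group_size: int) -> None:
    # Surfacing the actual numbers helps users of wrappers that rename the
    # argument (e.g. num_trainers_per_group) see what went wrong.
    if world_size % group_size != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by group_size ({group_size})"
        )
```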

Test Plan: contbuild & OSS CI

Differential Revision: D75226925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154124
Approved by: https://github.com/wz337
2025-05-23 17:12:43 +00:00
e927ba6dbd [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we tune the cutlass backend kernels on 3 swizzles. These are runtime params, so the variants share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
The winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on {top X winner configs} x {swizzles}. In other words, we can use a greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.
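A hedged sketch of the greedy two-stage (prescreening) idea; `configs`, `swizzles`, and `benchmark` are hypothetical stand-ins for the inductor machinery:

```python
def prescreen_autotune(configs, swizzles, benchmark, top_x=8, screen_swizzle=2):
    # Stage 1: screen every config at one fixed swizzle and keep the top X.
    finalists = sorted(configs, key=lambda cfg: benchmark(cfg, screen_swizzle))[:top_x]
    # Stage 2: exhaustively autotune only {finalists} x {swizzles}.
    return min(
        ((cfg, sw) for cfg in finalists for sw in swizzles),
        key=lambda pair: benchmark(*pair),
    )
```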

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-23 17:12:25 +00:00
04a6fe7914 Update provenance tracking doc (#154062)
Summary: Update the doc to reflect the changes in https://github.com/pytorch/pytorch/pull/153584/files#diff-e0cdb58c0f84f56f20c5433339b6d83c470dcde47847e2328effea6bedd4cd27 and https://github.com/pytorch/tlparse/pull/110

Test Plan: CI

Differential Revision: D75155981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154062
Approved by: https://github.com/svekars, https://github.com/desertfire
2025-05-23 17:09:52 +00:00
7d8ea5db69 Disable cache and utilization stats uploading steps on s390x (#150297)
There are no AWS credentials available on s390x runners. These steps are failing anyway due to that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150297
Approved by: https://github.com/seemethere
2025-05-23 16:49:38 +00:00
7e80f23516 [export] Move PT2ArchiveWriter/Reader to torch/export (#153795)
Summary:
Before:
`from sigmoid.core.package.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_sigmoid_package`
After:
`from torch.export.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_pt2_package`

By merging the two PT2ArchiveReader/Writers, into using the native PytorchFileReader/Writer, the open source PT2 archive also changed to have an additional folder. However this PR still maintains support for loading an old PT2 archive which does not have the additional folder.

Before:
```
├── archive_format
├── byteorder
├── .data
│   ├── serialization_id
│   └── version
├── data
│   ├── aotinductor

```
After:
```
├── tmp
│   ├── archive_format
│   ├── byteorder
│   ├── .data
│   │   ├── serialization_id
│   │   └── version
│   ├── data
│   │   ├── aotinductor
```

Test Plan:
`buck2 test //sigmoid/...`
https://www.internalfb.com/intern/testinfra/testrun/5348024839248187

Differential Revision: D74616598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153795
Approved by: https://github.com/zhxchen17
2025-05-23 15:40:25 +00:00
214e4cef9f Fix RMSNorm doc rendering (#154205)
By removing `::func::` decorator which adds unneeded parenthesis

Test plan: Check https://docs-preview.pytorch.org/pytorch/pytorch/154205/generated/torch.nn.RMSNorm.html#rmsnorm
that now renders as
<img width="704" alt="image" src="https://github.com/user-attachments/assets/443f605d-75a6-41ef-8971-21e7dc8ef9f6" />

Fixes https://github.com/pytorch/pytorch/issues/154184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154205
Approved by: https://github.com/mikaylagawarecki
2025-05-23 15:39:29 +00:00
9e089bb5b6 change guard_or impl for better perf and simplicity (#153674)
PR time benchmarks have been showing regressions as we move to guard_or_false; the reason is that the previous implementation does not cache.
This new approach propagates the fallback value to eval and returns it, allowing eval to cache and reducing log spam and complexity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153674
Approved by: https://github.com/bobrenjc93
2025-05-23 15:24:28 +00:00
4b7abce6a4 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output.  In this case we shouldn't cache at all because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.

Added a test which checks for this case.
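As a rough sketch of the bypass condition (a hypothetical helper, not the FakeTensorMode internals):

```python
def should_bypass_cache(input_symbols: set, output_symbols: set) -> bool:
    # If the output carries a symbol that cannot be derived from the inputs
    # (a freshly created unbacked symbol), caching would detach that symbol
    # from the op that produced it, so we refuse to cache this (op, args) pair.
    return not output_symbols.issubset(input_symbols)
```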

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.

The latest version of this also:
1. Addresses the problem that caused #153891.
    The issue was that, with caching, ops are required to support `__eq__`. Unfortunately _RecordFunction is minimalistic and doesn't support that, so in the off-chance that two keys hash to the same value, the `__eq__` check would raise an exception.

    Apparently this was much more common on MacOS where memory patterns end up with more reuse (so the object IDs are the same and give you the same hash value for objects that use pointer hash).

    Tested locally on MacOS where running
```
python test/inductor/test_torchinductor.py GPUTests
```
was pretty much guaranteed to fail (at least for me) somewhere around test 100-200 and passed all 800 tests after this change.

Another way to test this is to run the inductor tests with `torch._subclasses.fake_tensor._DispatchCacheKey.__hash__` monkey-patched to return a constant (causing all values to hash-collide) but this can't really be checked-in since it causes the cache lookup to turn into an O(n) lookup which takes a crazy long time to run through all the tests...

2. Folds in #153780 to ensure that exceptions raised from the op don't include the context from the cache key bypass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-23 15:03:31 +00:00
866142ff16 Revert "Update the heuristic for AArch64 bmm/baddbmm (#149122)"
This reverts commit d759a517af3e6b2337bf8f8e0d1734e64e470f1b.

Reverted https://github.com/pytorch/pytorch/pull/149122 on behalf of https://github.com/jeanschmidt due to breaking internal models, @malfet may you help merge this? ([comment](https://github.com/pytorch/pytorch/pull/149122#issuecomment-2904703075))
2025-05-23 14:54:54 +00:00
5859582ee4 [BE][MPS] Delete unused complex_mul_out (#154175)
It's no longer called, after `mul` has been migrated to binary op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154175
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-05-23 13:44:24 +00:00
2225231a14 Enable AArch64 CI scripts to be used for local dev (#143190)
- Allow user to specify custom ComputeLibrary directory, which is then built rather than checking out a clean copy
- Remove `setup.py clean` from the build. The CI environment should be clean already; removing this enables incremental rebuilds
- Use all cores for building ComputeLibrary

Mostly a port of https://github.com/pytorch/builder/pull/2028 with the conda part removed, because aarch64_ci_setup.sh has changed and can now handle being called twice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143190
Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet

Co-authored-by: David Svantesson-Yeung <David.Svantesson-Yeung@arm.com>
2025-05-23 12:09:59 +00:00
25149cd173 [c10d] Add more tests to prevent extra context (#154174)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Loop a bunch of sync ops and see if any of them creates extra context.
Requires nvml to check number of processes resident on a device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154174
Approved by: https://github.com/atalman
2025-05-23 09:54:01 +00:00
ba5d45d22e Add assertion to align with cuda (#153233)
Fixes #153137

Aligned batch_norm_cpu_out assertion to [batch_norm_cuda_out](a7ea115494/aten/src/ATen/native/cuda/Normalization.cu (L436)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153233
Approved by: https://github.com/malfet
2025-05-23 07:32:43 +00:00
5623d30228 [Minimizer] Gracefully exit when there is no discrepancy in block mode (#154076)
Summary:
Previously, when there was no discrepancy in results for block mode, net_min_base would throw an OOB error.

This occurs because _block_traverse_impl returns an OOB index after exhausting subgraphs all the way down to a single node.

There is also an issue where we may get an unsound subgraph (i.e. mark an earlier node as the "end" even if the correct end is later). This is due to an incorrect check (start_idx == mid): there can still be two values left when the program prematurely returns.

Test Plan:
Buck UI: https://www.internalfb.com/buck2/52524c26-ace5-4593-8a4b-843a54eb206a
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3096224973363310
Network: Up: 0B  Down: 15MiB  (reSessionID-cd404e97-395f-49fc-8381-373e90a1378f)
Executing actions. Remaining     0/1
Command: test.
Time elapsed: 53.7s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D75143242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154076
Approved by: https://github.com/jfix71
2025-05-23 06:42:07 +00:00
8342b9371e [ROCm] Prefer hipblaslt for gfx1200, gfx1201 (#153610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153610
Approved by: https://github.com/jeffdaily, https://github.com/atalman
2025-05-23 06:01:53 +00:00
26471fc203 [aoti] Initial Metal support (#153959)
An example generated file: P1816629015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #153964
2025-05-23 05:45:35 +00:00
b33b7d5c8c [aoti] Add MPS runner and shim (#153964)
Added AOTIModelContainerRunnerMps and a shim for mps fallback ops.
I also added a mps-specific shim which contains one operator, which will be used to set arguments being passed to the Metal kernel:

```
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg(
    AOTIMetalKernelFunctionHandle func,
    unsigned idx,
    AtenTensorHandle tensor);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-23 05:45:35 +00:00
269fa8028f [AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)
Differential Revision: [D75253009](https://our.internmc.facebook.com/intern/diff/D75253009/)

In general, we want to cache the cutlass kernels.

We also saw an error saying the .o file was not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154155
Approved by: https://github.com/chenyang78
2025-05-23 04:51:36 +00:00
5bb156a7fd [dynamo] raise observed exception for module attribute errors (#153659)
Fixes https://github.com/pytorch/pytorch/issues/153605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153659
Approved by: https://github.com/StrongerXi
2025-05-23 03:56:26 +00:00
db1f33147b [audio hash update] update the pinned audio hash (#154001)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154001
Approved by: https://github.com/pytorchbot
2025-05-23 03:51:21 +00:00
c1055f41a6 Data dependent free reshape. (#153198)
#### change 1: if compute_strides fails for a reshape, just clone.

Let's consider the most general case: if torch.compile is asked to reshape [u0, u1][u3, u4] -> [u5, u6], what should it do?
The shape is general enough to represent both contiguous and non-contiguous tensors, tensors where a clone-free reshape can happen and others where it cannot.  The current algorithm will fail due to data-dependent errors.

The general idea is that if it is impossible to tell whether the reshape can happen in place (because for some concrete inputs
it can and for others it cannot), then it is OK to take the general path and clone, instead of failing or asking the user to give hints.
**Because the user wants a single graph (a single compilation)**, and this is the only way it can be done.
Had this been a view, the user would be explicitly asking for a copy-free reshape, and we would fail, asking for more
information (hints in torch._check form).

With this change, reshape works as follows (a rough sketch follows the side note below):
1. if we know the input is contiguous we convert the reshape to a view.
2. if compute_strides succeeds we use a view. (compute_strides was changed to not fail when unbacked symbols are present; instead it just returns nullptr if it can't compute the strides, meaning we should use a clone).
3. if neither 1 nor 2 works, clone and then use a view.

Side note: having a view does not mean that inductor will not clone; inductor has a pass that converts all views back to reshapes and its own logic for dealing with those.
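
A rough eager-mode analogue of the decision above (the real change lives in compute_strides/reshape_symint; `can_view_without_copy` below is an illustrative stand-in):

```python
import torch

def can_view_without_copy(x: torch.Tensor, shape) -> bool:
    # Stand-in for compute_strides: ask whether a copy-free reshape exists
    # by checking if .view() would succeed on this layout.
    try:
        x.view(shape)
        return True
    except RuntimeError:
        return False

def reshape_sketch(x: torch.Tensor, shape) -> torch.Tensor:
    # 1. Known-contiguous input: the reshape is just a view.
    if x.is_contiguous():
        return x.view(shape)
    # 2. A copy-free reshape is provably possible: still a view.
    if can_view_without_copy(x, shape):
        return x.view(shape)
    # 3. Otherwise take the general path: clone into a contiguous buffer, then view.
    return x.clone(memory_format=torch.contiguous_format).view(shape)
```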

#### change 2: skip _reshape_view_helper and fall back to simpler logic if it fails.
We trace _reshape_view_helper when doing fake tensor tracing, but not during proxy tracing; hence such tracing won't affect the graph (it only computes the output shapes of several operations). We should not fail there, because it should always be possible for us to pass it in the case of reshape.

I.e. by the time reshape_symint was called we would have either cloned, or compute_strides succeeded, so the view should pass. What I did is the following: we run _reshape_view_helper; if we fail due to unbacked symbols we call _view_simple, which will always succeed for reshapes (it might fail for views when it is impossible to do the view, in which case we throw the DDE that was thrown by the original algorithm).

Ideally I would want to register _view_simple as the meta for view and avoid calling _reshape_view_helper completely, but I am running into some issues with the dispatcher with subclasses and I do not have time to debug it. Namely, one test
would end up calling some C++ view function that does not support symints during meta dispatch when I register a
python meta decomposition:
```python test/dynamo/test_subclasses.py SubclassTests.test_subclass_views_dynamic_True ```
https://github.com/pytorch/pytorch/issues/153303. I will follow up with that change in a separate PR.  cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @bdhirsh

Two other alternatives to registering _view_simple as meta and to the try/catch approach in this PR are:
1. call _view_simple if any input is dynamic, see #153521
2. if we make is_compiling work for framework code tracing (it does not work right now), we can call _view_simple
if is_compiling.

#### Note:
Reshape can still fail when is_contiguous is called; the next PR will handle that by calling is_known_contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153198
Approved by: https://github.com/etaf, https://github.com/bobrenjc93
2025-05-23 01:45:16 +00:00
f74842d665 [DTensor] enable SimpleFSDP's composability with Tensor Parallel (#152286)
This PR adds support for SimpleFSDP's composability with Tensor Parallel + torch.compile.

`_StridedShard` is used in SimpleFSDP/FSDP2 to support correct distributed checkpointing when FSDP+TP is applied. Previously, `_StridedShard` was not guarded by torch.compile. This PR adds `_StridedShard` as an additional placement type to be guarded by torch.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152286
Approved by: https://github.com/bdhirsh
2025-05-23 01:40:38 +00:00
7509b150af Don't upload compiler benchmark debug info to the benchmark database (#153769)
During our debug session, @wdvr and I found out that the benchmark database is growing much faster than we expect.  After taking a closer look, the majority of the records come from the TorchInductor benchmark, and the top 3 are all debug information not used by any dashboard atm.  In a period of 7 days, there are close to 6 million records ([query](https://paste.sh/GUVCBa0v#UzszFCZaWQxh7oSVsZtfZdVE))

```
Benchmark,Metric,Count
"TorchInductor","user_stack","1926014"
"TorchInductor","reason","1926014"
"TorchInductor","model","1926014"
```

Let's skip uploading them to avoid bloating the database.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153769
Approved by: https://github.com/malfet
2025-05-23 01:18:26 +00:00
768cb734ec cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-23 00:51:20 +00:00
3c0cbf4b44 Update GH action to use the correct label (#154126)
Update GH action to use the correct label for the docathon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154126
Approved by: https://github.com/AlannaBurke, https://github.com/clee2000
2025-05-23 00:29:43 +00:00
31f3ee0966 [BE][Ez]: Enable PT014 check for duplicate parameterize test cases (#154118)
Enables Ruff rule [PT014](https://docs.astral.sh/ruff/rules/pytest-duplicate-parametrize-test-cases/), which flags cases where a user specifies two duplicate test cases in pytest.parametrize; this is likely an error since it tests the same thing twice.
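
For illustration, a minimal example of the pattern the rule flags (hypothetical test):

```python
import pytest

# PT014 would flag the duplicate "2": the second entry tests nothing new.
@pytest.mark.parametrize("x", [1, 2, 2])
def test_square(x):
    assert x * x == x**2
```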
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154118
Approved by: https://github.com/malfet
2025-05-23 00:00:53 +00:00
7b25ff7cf2 [Inductor] Add attention pattern for model DistilBert in transformers==4.44.2. (#154091)
This PR adds an attention fusion pattern that matches the attention of
DistilBert in transformers==4.44.2 at
953196a43d/src/transformers/models/distilbert/modeling_distilbert.py (L212)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154091
Approved by: https://github.com/jansel, https://github.com/eellison
2025-05-22 23:37:03 +00:00
59c5fff2aa Revert "[DDP] rebuilt bucket order when find_unused_parameters=true (#153404)"
This reverts commit a79e621c1c11bcef5f816b9770b751237b84f620.

Reverted https://github.com/pytorch/pytorch/pull/153404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153404#issuecomment-2902741300))
2025-05-22 22:26:59 +00:00
f2cce45657 [libc++ readiness][caffe2] No reason to check for "ext/stdio_filebuf.h" (#154080)
Summary: There should be no reason to check for existence of this GNU C++ header here in this file. It doesn't include it. Removing this condition to make it build under libc++.

Differential Revision: D75179136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154080
Approved by: https://github.com/soumith
2025-05-22 22:23:39 +00:00
c985cec5b2 Patch the _is_conv_node function (#153749)
Summary: torch.ops.aten.conv2d.padding is also a conv2d node

Differential Revision: D74898941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153749
Approved by: https://github.com/andrewor14, https://github.com/Skylion007
2025-05-22 22:17:02 +00:00
413664b3c5 catch CSE recursion depth errors (#154039)
Fixes #153777

CSE is an optimization and shouldn't block a compile if it hits recursion depth limits. Unfortunately we can't write this iteratively due to a dependency on `ast.unparse`, which necessarily needs to do recursion. This PR catches recursion depth errors and opts out of CSE when they occur.
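
A minimal sketch of the approach, with illustrative names rather than the actual Inductor internals:

```python
import ast

def cse_or_passthrough(expr_src: str) -> str:
    # CSE is best-effort: if the expression tree is too deep for the
    # recursive ast-based pass, keep the original source unchanged.
    try:
        tree = ast.parse(expr_src, mode="eval")
        return ast.unparse(tree)  # stand-in for the real CSE rewrite
    except RecursionError:
        return expr_src
```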

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154039
Approved by: https://github.com/Microve
2025-05-22 20:17:19 +00:00
cad0727fe1 Rename the provenance tracing artifact name for kernel <-> post_grad nodes mapping (#154046)
Summary:
Context:

Recently we've added support for a couple more kernel types other than inductor-generated triton kernels,

such as cpu cpp kernels and extern kernels.

The name that appears in the tlparse chrome link can be confusing to users.

Rename from

`inductor_triton_kernel_to_post_grad_nodes.json`

to `inductor_generated_kernel_to_post_grad_nodes.json`

Test Plan: CI

Differential Revision: D75159042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154046
Approved by: https://github.com/yushangdi
2025-05-22 19:20:56 +00:00
4277907d02 [binary builds] Linux aarch64 CUDA builds. Make sure tag is set correctly (#154045)
1. This should set the Manylinux 2.28 tag correctly for CUDA Aarch builds.
I believe we used to have something similar in the old script:
https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/build_aarch64_wheel.py#L811

``Tag: cp311-cp311-linux_aarch64 ``-> ``Tag: cp311-cp311-manylinux_2_28_aarch64``

2. Remove the section for CUDA 12.6, since we are no longer building CUDA 12.6 aarch64 builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154045
Approved by: https://github.com/Camyll, https://github.com/malfet
2025-05-22 18:36:13 +00:00
788d9cb2d7 [3/n][Optimus][Auto-AC][reland] Support any fp8 quantization type and set scaling as the default" (#154057)
Summary:
This is a reland of D74910193.
We change the dtype to torch.float8_e5m2 in the unit test since the original dtype is not supported there.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization
```

Differential Revision: D75169792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154057
Approved by: https://github.com/Mingming-Ding
2025-05-22 18:26:34 +00:00
c2660d29a5 [ROCm] Added unit test to test the cuda_pluggable allocator (#154041)
Added a unit test that uses the cuda_pluggable allocator and replicates the apex setup.py to build the nccl_allocator extension.

This test checks whether https://github.com/pytorch/pytorch/pull/152179 helps to build the cuda pluggable allocator in ROCm/Apex.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154041
Approved by: https://github.com/atalman, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-05-22 18:22:15 +00:00
5b8f422561 [PT2][Optimus] Fix a typo in decompose_mm (#154048)
Summary: As titled

Differential Revision: D75160513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154048
Approved by: https://github.com/Mingming-Ding
2025-05-22 18:11:40 +00:00
633ed01145 [MPS] Add support for two more isin variants (#154010)
`isin_Tensor_Scalar_out` is just a redispatch to eq/neq
`isin_Scalar_Tensor_out` redispatches back to generic `isin` op, but needs a small tweak to handle float scalars
Make sure that `out` is resized to an expected value in `isin_Tensor_Tensor_out_mps`

Add unittests to validate that, but skip them on MacOS-13, where MPS op just returns garbage

Before this change both of those failed
```python
>>> import torch
>>> t = torch.tensor([0, 1, 2], device='mps')
>>> torch.isin(t, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Tensor_Scalar_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
>>> torch.isin(1, t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Scalar_Tensor_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
```
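
For reference, the Scalar variants are semantically just elementwise membership checks; a quick sanity check of the expected behavior (illustrative, runnable on any device that supports the ops):

```python
import torch

t = torch.tensor([0, 1, 2])
assert torch.equal(torch.isin(t, 1), t.eq(1))  # Tensor_Scalar variant
assert torch.isin(1, t).item()                 # Scalar_Tensor variant
```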

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154010
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/manuelcandales
ghstack dependencies: #153970, #153971, #153997
2025-05-22 17:59:35 +00:00
7421c21b5e remove unused code. (#153979)
Remove the unused cmake code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153979
Approved by: https://github.com/albanD
2025-05-22 17:50:11 +00:00
fc859077a0 [export][cond] support merging constant ints as unbacked symint (#152742)
@pianpwk points out that this will be helpful to address several data dependent issues in huggingface [models](e23705e557/src/diffusers/schedulers/scheduling_euler_ancestral_discrete.py (L332)) with the following pattern:
```python
idx = 0 if u0 else 1
return x[idx]
```
We could preserve the conditional with a cond.
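
A rough illustration (not from the PR) of preserving that data-dependent choice with the cond HOP instead of a Python branch:

```python
import torch

def pick(x: torch.Tensor, u0: torch.Tensor):
    # u0 is a boolean scalar tensor; both branches return tensors with
    # matching metadata so the conditional can stay in the graph.
    idx = torch.cond(u0, lambda: torch.tensor(0), lambda: torch.tensor(1), ())
    return x[idx]
```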

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152742
Approved by: https://github.com/zou3519
2025-05-22 17:25:38 +00:00
025c5cc048 Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)"
This reverts commit d23762974eae105aad837188d5d2254ea9783b37.

Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/yangw-dev due to sorry the pr is failed internally [D75155648](https://www.internalfb.com/diff/D75155648) ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2901916364))
2025-05-22 16:52:04 +00:00
7d3dab6b90 Revert "[BE]: Type previously untyped decorators (#153726)"
This reverts commit b7d08defe9cfe1595ff680f845b39f5e03a89555.

Reverted https://github.com/pytorch/pytorch/pull/153726 on behalf of https://github.com/yangw-dev due to sorry, it seems like your pr failed typecheck error internally, [D75155486](https://www.internalfb.com/diff/D75155486) ([comment](https://github.com/pytorch/pytorch/pull/153726#issuecomment-2901911114))
2025-05-22 16:49:08 +00:00
a15550b776 [Cutlass] Use env var for EVT flag (#154099)
Swaps out the hard-coded flag for an environment variable in the inductor config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154099
Approved by: https://github.com/eellison
2025-05-22 16:36:57 +00:00
a82c8891d5 Revert "[aoti] Add MPS runner and shim (#153964)"
This reverts commit 918ae5d36188f419a47f3b1315f9fb373035ed66.

Reverted https://github.com/pytorch/pytorch/pull/153964 on behalf of https://github.com/angelayi due to broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153964#issuecomment-2901876832))
2025-05-22 16:35:59 +00:00
47a01f3efb Revert "[aoti] Initial Metal support (#153959)"
This reverts commit 28bcd9eb30336b370298dbe9677b95019882f2a8.

Reverted https://github.com/pytorch/pytorch/pull/153959 on behalf of https://github.com/angelayi due to previous PR broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153959#issuecomment-2901825315))
2025-05-22 16:17:07 +00:00
f419373dd3 [inductor] lowering for fractional_max_pool3d (#148630)
Also adds a lowering with a reduction for large window sizes for
fractional_max_pool2d.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148630
Approved by: https://github.com/eellison
2025-05-22 16:06:29 +00:00
9a8c42ff94 Get rid of unused code in linters (#154043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154043
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2025-05-22 15:24:54 +00:00
35ddad284d update mutation renames (#153895)
Thanks to @PaulZhang12 for the original find. When we finalize a multi-template buffer, we need to reflect mutation renaming in its dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153895
Approved by: https://github.com/PaulZhang12
2025-05-22 14:54:39 +00:00
6cd9d66b7f Allow higher fp16 tolerance for phlippe_resnet on CUDA 12.8 (#154109)
After https://github.com/pytorch/pytorch/pull/154004, one of the models, `phlippe_resnet`, needs a higher tolerance for fp16 on CUDA 12.8.  I can reproduce it locally with:

```
python benchmarks/dynamo/torchbench.py --accuracy --timing --explain --print-compilation-time --inductor --device cuda --training --amp --only phlippe_resnet

E0522 02:47:12.392000 2130213 site-packages/torch/_dynamo/utils.py:2949] RMSE (res-fp64): 0.00144, (ref-fp64): 0.00036 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000, use_larger_multiplier_for_smaller_tensor: 0
```

I'm not sure what exactly happens behind the scenes, but this should help fix the CI failure.

Also remove some left over expected accuracy results for CUDA 12.4 which we are not using anymore on CI for benchmark jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154109
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-22 14:25:12 +00:00
4439255148 [aotd] Support saved tensors hooks in aot_autograd (#150032)
https://github.com/pytorch/pytorch/issues/148222

Goal:

At the moment, autograd saved tensors hooks are run in eager after the compiled forward.
They are executed at the same time for all saved tensors.
Hooks can be used to reduce the amount of memory used for saved tensors, by doing quantization or offloading to CPU.
This is suboptimal for optimizing peak memory.
A better solution is to put the hooks in the graph, as close as possible to the last usage of the tensor.

The goal is to get user-specified autograd saved tensors hooks into the graph.

Logic:

UX:
The user specifies hooks with torch.autograd.graph.saved_tensors_hooks(pack_gm, unpack_gm),
where pack_gm and unpack_gm are torch.fx.GraphModule.
Then AotAutograd will retrace those graph modules, doing decompositions and functionalization in aot_autograd, and inline the resulting graphs into the forward epilogue and backward prologue.

The user may want to use control logic in the hooks, for example applying quantization only for specific dtypes and sizes.

This is also possible: the user can put the logic into a torch.fx.wrap function and use symbolic trace to make a GraphModule.

In that case AotAutograd caching will only work if the user explicitly sets the "user_cache_hash" metadata on the torch.fx.wrap call_function node.

If this metadata is set, the aot_autograd cache can use the saved cache artifact.
If the metadata is not set, the cache is bypassed.
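
For concreteness, a minimal eager-mode sketch of the kind of hooks being targeted (the float8 pack/unpack mirrors the traced graphs shown below; `model` and `inp` are placeholders):

```python
import torch

def pack_float8(x: torch.Tensor):
    # Save activations in float8 to cut saved-tensor memory.
    return (x.dtype, x.to(torch.float8_e4m3fn))

def unpack_float8(packed):
    dtype, x = packed
    return x.to(dtype)

model = torch.nn.Linear(16, 16)   # placeholder
inp = torch.randn(4, 16)          # placeholder

with torch.autograd.graph.saved_tensors_hooks(pack_float8, unpack_float8):
    out = model(inp)
out.sum().backward()
```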

Dynamo:
Dynamo traces the pack and unpack hooks, installs them as subgraphs, and explicitly adds them to the output_graph (as those subgraphs are not otherwise used, they would not be copied into the result by default).

The complexity here is that at this point we do not have example inputs for the hooks.
We trace pack_hook with some Tensor from the inputs.
The resulting subgraphs are added to the hashing of the AotAutograd cache.

In AotAutograd we retrace the graph with the true saved tensors coming from the partitioner.

Backwards Compatibility:
As current hooks are executed in eager mode and not all of them will be traceable, we only try to put into the graph hooks that the user has explicitly marked with the annotation (@_inlineable_saved_tensors_hooks).
For other hooks, or if compiled autograd is enabled, we keep the same logic.

Recompilations:
Hooks are guarded with a lambda guard matching the function id, to cause recompilation if the user reruns the compiled function.

Aot_autograd:
After the partitioner has prepared the forward and backward modules, we trace the Dynamo-prepared graphs for the pack and unpack hooks and inline them in the epilogue of the forward and the prologue of the backward. Forward outputs and backward inputs are changed transparently for the user.

We do not try to put the hooks close to the last usage etc., relying on inductor to do this optimization.

```
INFO: TRACED GRAPH
 ===== Forward graph pre saved_tensors_hooks inlining 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
        return (view, add, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== Backward graph pre saved_tensors_hooks inlining 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
        return (view, add, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== saved_tensors_pack_hook add 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class pack_float8(torch.nn.Module):
    def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn);  x_1 = None
        return (torch.float32, _to_copy)

INFO: TRACED GRAPH
 ===== saved_tensors_unpack_hook add 3 =====
 <eval_with_key>.22 from /data/users/ivankobzarev/a/pytorch/torch/fx/experimental/proxy_tensor.py:1225 in wrapped class pack_float8(torch.nn.Module):
    def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn);  x_1 = None
        return (torch.float32, _to_copy)

INFO: TRACED GRAPH
 ===== Forward graph 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add, dtype = torch.float8_e4m3fn)

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2]);  add = None
        return (view, _to_copy, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== Backward graph 3 =====
 <eval_with_key>.21 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", add_packed_2: "f8e4m3fn[s0, s1][s1, 1]cuda:0", tangents_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add_packed_2, dtype = torch.float32);  add_packed_2 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        add_7: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(tangents_1, _to_copy);  tangents_1 = _to_copy = None
        return (None, None, add_7)

```

Differential Revision: [D72187044](https://our.internmc.facebook.com/intern/diff/D72187044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150032
Approved by: https://github.com/bdhirsh
2025-05-22 14:09:38 +00:00
f12d8d60b1 Add hint message when parameters is empty in clip_grad_norm_ (#151529)
Fixes #148259

## Changes

- Print a warning message when the `parameters` generator is exhausted

## Test Result
### print warning
```python

import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

model = SimpleModel()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(16, 10)
targets = torch.randn(16, 1)

outputs = model(inputs)
loss = criterion(outputs, targets)
optimizer.zero_grad()
loss.backward()

params_to_clip = model.parameters()

for p in params_to_clip:
    print(p.shape)

max_norm = 1.0
norm_type = 2.0
total_norm = nn.utils.clip_grad_norm_(params_to_clip, max_norm, norm_type)
print(f"total_norm: {total_norm}")
```

```bash
/home/zong/code/pytorch/torch/nn/utils/clip_grad.py:222: UserWarning: `parameters` is an empty generator, no gradient clipping will occur.
  warnings.warn(
total_norm: 0.0
```

### UT

```bash
pytest test/test_nn.py -k test_clip_grad_norm
```

![image](https://github.com/user-attachments/assets/0aa0f06c-e0a5-43cf-9a97-d7c2747c9180)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151529
Approved by: https://github.com/jbschlosser
2025-05-22 11:23:39 +00:00
40e6ca24ef Update CPU Inductor merge rules by adding more CPP Template (#152086)
**Summary**
Add more CPP templates to the CPU Inductor merge rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152086
Approved by: https://github.com/atalman
2025-05-22 09:46:26 +00:00
2f57ee579d S390x update docker image (#153619)
Add ninja-build for pytorch tests.
Switch to gcc 14 due to fix for precompiled headers and s390x vectorization interaction.
Disable -Werror when building onnxruntime.
Pin onnx version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153619
Approved by: https://github.com/huydhn
2025-05-22 09:34:46 +00:00
d7a83ab67b Fix lr_scheduler unexpectedly calls step() when init argument last_epoch is larger than -1 (#149312)
Fixes #102261

## Changes

- Use the flag `_is_initial` instead of the `self.last_epoch == 0` condition to decide whether `lr` should be the initial value
- Add test for `ExponentialLR` checkpoint usecase

## Test Result

```python
pytest -s test/optim/test_lrscheduler.py  -vv
```

![image](https://github.com/user-attachments/assets/6fd32bcc-b4fb-4421-b891-620bd4900dc1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149312
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-05-22 08:42:37 +00:00
423fc671e9 [Cutlass] Support float8_e4m3fn GEMM (#153890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153890
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-05-22 08:37:33 +00:00
c1b7dbc52a [dynamo] unimplemented -> unimplemented_v2 in variables/dict.py (#154040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154040
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-05-22 06:46:10 +00:00
a664cfdf95 Add C10_NODEPRECATED check for xpu (#153935)
# Motivation
Add the `C10_NODEPRECATED` check for XPU. This disallows the XPU codebase from using `c10::optional`.

What does the torch-xpu-ops commit update change?
It deprecates `c10::optional`, `c10::nullopt`, `c10::make_option`, using the std counterparts instead.

# Additional Context
This PR depends on
https://github.com/intel/torch-xpu-ops/pull/1683
https://github.com/intel/torch-xpu-ops/pull/1690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153935
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-22 06:44:04 +00:00
482e5b6660 [inductor] Added precompilation_timeout_seconds into a config instead of hardcoded (#153788)
Fixes #153392

- Updated config.py to add the timeout as a config var that can be tuned (default is 3600s); see the snippet below.
- Passed the var as a kwarg when calling the instance.
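
A sketch of how the knob could be set (the config name comes from this PR's title; treat the exact location as an assumption):

```python
import torch._inductor.config as inductor_config

# Raise the precompilation timeout from the 3600 s default for very large
# autotuning jobs (the value here is arbitrary).
inductor_config.precompilation_timeout_seconds = 7200
```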

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153788
Approved by: https://github.com/henrylhtsang
2025-05-22 06:44:02 +00:00
7128b50a65 [CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 (#151594)
This PR moves distributed cuda CI job from cuda 11.8 to cuda 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so temporarily skip them after creating the issues.

https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
https://github.com/pytorch/pytorch/issues/153122 CUDA context related
https://github.com/pytorch/pytorch/issues/153517  NCCL regression, future NCCL may fix it
https://github.com/pytorch/pytorch/issues/154073 skip test_symmetric_memory for cuda 12.6 before it is fixed

See: https://github.com/pytorch/pytorch/issues/147383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
2025-05-22 06:33:29 +00:00
4bcff4af99 Move prologue_supported_inputs computations to def_kernal (#150869)
This avoids replaying load_input on a cache hit in the generate_code_cache.
The idea is that if a template has prologue_loads_all_inputs = True, it means that
all inputs are loaded and hence there is no need to replay.

Effect on the current benchmark on a local run on dev server.
18549985383 -> 15072230073
25697270062 -> 20738613297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150869
Approved by: https://github.com/eellison
2025-05-22 06:24:44 +00:00
4421aee558 torch.compile: Supress stdout / stderr output from subprocesses when local (#153837)
Summary:
This output is extremely noisy - i.e. on a 96-core machine with 8 ranks, you
can get ~700 duplicate sets of logs from each worker.

Differential Revision: D74907920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153837
Approved by: https://github.com/aorenste, https://github.com/masnesral
2025-05-22 05:49:43 +00:00
f2af30fee5 Add a HOP to bypass tracing of a wrapper function while tracing the wrapped function (#153487)
Usage:
```python
from torch._higher_order_ops.wrap import dynamo_bypassing_wrapper

# Your ordinary function wrapper
def my_hop_fn_impl(fn, *args, k=1, **kwargs):
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        if isinstance(out, tuple):
            return (out[0] + k,)
        return out + k

    return wrapper

# Calling `my_hop_fn` instead of the impl directly captures a HOP into the dynamo graph
def my_hop_fn(fn, *args, k=1, **kwargs):
    return dynamo_bypassing_wrapper(
        functools.partial(my_hop_fn_impl, k=k), fn, *args, **kwargs
    )
```

Notes:
- The dynamo captured graph now stashes arbitrary callable objects (the wrapper_fn) - this is equivalent to what SAC does today with policy_fn.
- The `wrapper_fn` passed to `dynamo_bypassing_wrapper ` should have signature `Callable -> Callable`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153487
Approved by: https://github.com/ydwu4
2025-05-22 04:24:38 +00:00
669b176d4c [Graph Partition] support removed arguments, NoneLayout, and mutation (#153899)
Graph partition relies on `read_writes` to collect partition inputs and outputs. There are three edge cases:

1. `NoneLayout` is not allocated so it cannot become a partition input or output.
2. Codegen may decide that a buffer is internal to a kernel (e.g., a triton kernel). One example is buffers internal to a FusedSchedulerNode. These buffers are never actually allocated as a `buf_id`.
3. We should use mutation_real_name for graph partition inputs and outputs to match the behavior of other codegen.

This PR supports these 3 cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153899
Approved by: https://github.com/eellison
2025-05-22 04:24:31 +00:00
d1fe198df6 [cond] support output the same unbacked symbol from two branches (#148206)
Previously, we didn't track the unbacked symbols leaked out of true_branch and false_branch if they had the same shape expr. This caused the fake output of the cond operator itself to not set up its unbacked_bindings meta properly (because they were ignored).

In this PR, we also check whether there are leaked unbacked symbols, create new unbacked symbols for them, and track them as outputs of cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148206
Approved by: https://github.com/zou3519
2025-05-22 03:39:43 +00:00
fe285b9560 [aoti] fix corner case in unbacked replacements for atomically_apply_size_hint (#153768)
## PR
There are a few cases that my previous PR (#153220) didn't cover.
1. The LHS/RHS matters. Today, if you do `torch._check(lhs == rhs)` then it will show up as a deferred runtime assert with `Eq(lhs, rhs)`.
2. There can be transitive replacements. For example, expr1 -> expr2 -> u0. `test_size_with_unbacked_add_expr_transitive` tests for this.
3. An unbacked symint expr may not have a replacement that's purely a symbol, for instance, it could be another expression. `test_size_with_unbacked_add_and_mul_expr` tests for this.

## Device assertion msg

```
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [4,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [6,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
```

## Autotuning code setup
This is the autotuning code for a concat kernel which takes input tensors (`in_buf`) and writes them to the (`out_buf`).

It's important to note that the sizes of `in_buf0` and `in_buf1` don't match along dim=0. This is bad because all concat inputs must share the same size for each dim except for the concat dim (here that's dim=1).
```
in_buf0 = generate_example_value(size=(u1 + s0, 256))   # concrete size is (17900, 256)
in_buf1 = generate_example_value(size=(u0, 10))         # concrete size is (8192, 10)
...
out_buf = generate_example_value(size=(u1 + s0, 266))   # concrete size is (17900, 256+10)
triton_poi_fused_cat_1.run(in_buf0, in_buf1, ..., out_buf, xnumel=(u1 + s0) * 266 ...)
```

If we look into the kernel code, you'll see that `tmp9` loads `in_buf1` (our incorrectly shaped input tensor). There is also a mask to prevent OOB loads.
- `tmp6`  makes sure we're only loading with the `xindex` from 256 to 264.
- `xmask` makes sure we're only loading with the `xindex` within `xnumel`.
- `tmp6 & xmask` together is essentially checking `0 ≤ x0 < u1 + s0` and `256 ≤ x1 < 264`.

The mask logic is correct, however, `in_buf1` has the shape `[8192, 10]` this means any load where `8192 ≤ x0 < u1 + s0` will be an OOB load.
```
def triton_poi_fused_cat_1(in_buf0, in_buf1, ... out_buf, xnumel, XBLOCK):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)
    xmask = xindex < xnumel
    x0 = (xindex % 264)
    x1 = xindex // 264
    ...
    tmp6 = x0 >= tl.full([1], value=256)
    tmp9 = tl.load(in_buf1 + (x1), tmp6 & xmask)
    # device assertion is thrown here
    tl.device_assert(((0 <= tl.broadcast_to(tmp13, [XBLOCK])) & (tl.broadcast_to(tmp13, [XBLOCK]) < ks0)) | ~(xmask & tmp6), "index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153768
Approved by: https://github.com/jingsh
2025-05-22 02:05:37 +00:00
a264af8c71 Support fp8 output of _scaled_mm for CPU (#153600)
This PR adds support for fp8 output of torch._scaled_mm for CPU, and creates related UTs with fp8 and bf16/fp16/fp32 outputs.
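
An illustrative call shape (not from the PR's tests; the argument names and layout/shape constraints here are assumptions and may differ by backend):

```python
import torch

a = torch.randn(32, 64).to(torch.float8_e4m3fn)
b = torch.randn(16, 64).to(torch.float8_e4m3fn).t()   # column-major second operand
scale_a = torch.tensor(1.0)
scale_b = torch.tensor(1.0)

# With this PR, out_dtype may itself be an fp8 type on CPU.
out = torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.float8_e4m3fn)
```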

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153600
Approved by: https://github.com/leslie-fang-intel, https://github.com/mingfeima, https://github.com/jansel
2025-05-22 01:15:39 +00:00
254293b777 Add flag _metrics_log_runtime to disable runtime metric logging by default (#153506)
https://github.com/pytorch/pytorch/pull/152708 expanded support of `get_estimated_runtime` to many more types of `SchedulerNodes`. This caused an increase in compile time because we're always calling `get_estimated_runtime` to populate the metrics table. This PR adds a flag for this logging, which reduces the instruction count by 8%. Long term, we should probably merge metrics.py with TORCH_LOGS/tlparse (suggestion from @xmfan).

Update: added support for TORCH_LOGS for the metrics logging.

Test Plan:
mm_loop.py and many existing tests cover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153506
Approved by: https://github.com/eellison
2025-05-22 01:02:11 +00:00
261897734a Revert "cpp_wrapper: build non-performance-sensitive code at O1 (#148773)"
This reverts commit 3c89cfd46075e62c1725b43557612901a9cbb6fa.

Reverted https://github.com/pytorch/pytorch/pull/148773 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems that pr_time_benchmark is regressed after this land ([comment](https://github.com/pytorch/pytorch/pull/148773#issuecomment-2899545140))
2025-05-22 00:11:14 +00:00
7ef2c62fd3 [ROCm][Inductor][CK] Add ck-tile based universal gemm kernels to torch.mm autotune choices (#152341)
This PR adds code generation for CK-tile based universal gemm kernels to the CK backend for Inductor, and adds these kernels to autotune choices.

Unlike legacy-CK based kernels (which are generated by parsing the CK instances from CK library), we generate the set of instances by manually specifying the tuning parameters.

This PR introduces a new template for code generation, and compilation/autotuning is handled by the existing infrastructure.

Points of discussion:

* For simplicity and reduced coupling with CK, the instance filter checks only data type and layout, and doesn't check the alignment requirement - meaning that more instances will be compiled than necessary - while keeping the code generation independent from internal CK logic which checks the alignment validity at runtime
* CK-tile instances are enabled whenever legacy-CK instances are enabled. A config knob could be introduced to differentiate between the instance types if that's needed
* Whether gemm problem size K is ever dynamic, since whenever it's not a compile-time constant, we need to perform a runtime dispatch between several kernels

**Testing**

Use the existing tests in `test/inductor/test_ck_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152341
Approved by: https://github.com/chenyang78
2025-05-21 23:59:16 +00:00
87fc5af1f6 [c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 (#154055)
Work around issues like #153960, #152623

NCCL 2.26 seems to introduce random hang in non-blocking API mode. This PR opts out of non-blocking mode to work around it. Previously torch turned it on by default in eager init (i.e. `device_id` passed) to avoid init overhead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154055
Approved by: https://github.com/atalman
2025-05-21 23:46:52 +00:00
fae6f6c9ca [aot] fix deepcopying of aot bwd containing real tensors (#153999)
Previously, when we lowered the backward ahead of time due to symints, the post-grad passes would leave the bw_module in a non-runnable state. This caused issues when compiled autograd tried to trace at runtime. So we had inductor operate on a deepcopy of bw_module.

But with https://github.com/pytorch/pytorch/issues/153993, we see that deepcopying real tensors will fail under fake mode due to the device type mismatch between the fake tensors ("meta" device) and the real tensor. So by disabling fake mode, we avoid these errors. This change is a strict improvement over current, but it does reveal that this deepcopy can theoretically cause OOMs.

FIXES https://github.com/pytorch/pytorch/issues/153993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153999
Approved by: https://github.com/jamesjwu, https://github.com/bdhirsh
2025-05-21 23:30:02 +00:00
67f9feeee7 remove TestCustomOp.test_impl_device_cpu from dynamo expected failures (#154049)
Fixes https://github.com/pytorch/pytorch/issues/153763 maybe?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154049
Approved by: https://github.com/StrongerXi
2025-05-21 23:20:30 +00:00
5ee1242310 Follow up to #152209, remove compat patch after docker image rename (#152958)
Remove the compat patch that lets PRs that haven't rebased past #152209 still have docker images.

Merge ~2 weeks after the above PR was merged.  ~80% of PRs have a merge base that is <2 weeks old

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152958
Approved by: https://github.com/huydhn
2025-05-21 23:11:29 +00:00
d82610c2af docs: fix "should not to be" typo in register_buffer docstring (#153817)
Corrects a small grammatical error in `register_buffer` docstring, from "...  should not to be ..." to "...  should not be ...". Docs-only change, so no runtime behavior, tests, or APIs are affected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153817
Approved by: https://github.com/mikaylagawarecki
2025-05-21 22:46:50 +00:00
b967b7b11e Update rnn.py, fix torch.nn.RNN document error (#153620)
I found the same issue as #147490 (@jibril-b-coulibaly).

There's an equivalent in the [doc-string](https://docs.pytorch.org/docs/stable/generated/torch.nn.RNN.html#rnn) of `torch.nn.RNN`:

```python
# Efficient implementation equivalent to the following with bidirectional=False
def forward(x, hx=None):
    if batch_first:
        x = x.transpose(0, 1)
    seq_len, batch_size, _ = x.size()
    if hx is None:
        hx = torch.zeros(num_layers, batch_size, hidden_size)
    h_t_minus_1 = hx
    h_t = hx
    output = []
    for t in range(seq_len):
        for layer in range(num_layers):
            h_t[layer] = torch.tanh(
                x[t] @ weight_ih[layer].T
                + bias_ih[layer]
                + h_t_minus_1[layer] @ weight_hh[layer].T
                + bias_hh[layer]
            )
        output.append(h_t[-1])
        h_t_minus_1 = h_t
    output = torch.stack(output)
    if batch_first:
        output = output.transpose(0, 1)
    return output, h_t

```

However there's something wrong.

1. Like mentioned in #147490, line 499 is wrong

fb55bac3de/torch/nn/modules/rnn.py (L499)

The **input for RNNCell should be different** for different layers.

2. The code contains several hidden **reference-related issues** that may result in unintended modifications to tensors. For example in line 504, this causes all elements in the final output list to point to the same tensor.

fb55bac3de/torch/nn/modules/rnn.py (L504)

3. Some variables are not **defined**. Despite being a relatively minor issue in documentation, it can lead to significant confusion for those who are new to the concept. For example `weight_ih` in line 499

fb55bac3de/torch/nn/modules/rnn.py (L499)

So, I wrote a runnable version to make it clearer:

```python
# Efficient implementation equivalent to the following with bidirectional=False
rnn = nn.RNN(input_size, hidden_size, num_layers)
params = dict(rnn.named_parameters())
def forward(x, hx=None, batch_first=False):
    if batch_first:
        x = x.transpose(0, 1)
    seq_len, batch_size, _ = x.size()
    if hx is None:
        hx = torch.zeros(rnn.num_layers, batch_size, rnn.hidden_size)
    h_t_minus_1 = hx.clone()
    h_t = hx.clone()
    output = []
    for t in range(seq_len):
        for layer in range(rnn.num_layers):
            input_t = x[t] if layer == 0 else h_t[layer - 1]
            h_t[layer] = torch.tanh(
                input_t @ params[f"weight_ih_l{layer}"].T
                + h_t_minus_1[layer] @ params[f"weight_hh_l{layer}"].T
                + params[f"bias_hh_l{layer}"]
                + params[f"bias_ih_l{layer}"]
            )
        output.append(h_t[-1].clone())
        h_t_minus_1 = h_t.clone()
    output = torch.stack(output)
    if batch_first:
        output = output.transpose(0, 1)
    return output, h_t
```

This code can reproduce the computation of torch.nn.RNN.

For example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, num_layers = 3, 5, 2
rnn = nn.RNN(input_size, hidden_size, num_layers)
params = dict(rnn.named_parameters())
x = torch.randn(10, 4, 3)

official_imp = rnn(x)
my_imp = forward(x)

assert torch.allclose(official_imp[0], my_imp[0])
assert torch.allclose(official_imp[1], my_imp[1])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153620
Approved by: https://github.com/mikaylagawarecki
2025-05-21 22:45:28 +00:00
5b6e551c0f [AOTI][refactor] Fix an anonymous namespace issue (#154033)
Summary: Remove anonymous namespace in model_container.h to fix the following compiler warning,
```
warning: ‘torch::aot_inductor::AOTInductorModelContainer’ has a field ‘torch::aot_inductor::AOTInductorModelContainer::constant_folded_’ whose type uses the anonymous namespace [-Wsubobject-linkage]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154033
Approved by: https://github.com/chenyang78
2025-05-21 22:29:09 +00:00
d356ca2466 [map] add inductor support by lowering to while_loop (#150971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150971
Approved by: https://github.com/zou3519
ghstack dependencies: #151034
2025-05-21 22:19:47 +00:00
cf1b38a017 [map] make proxy mode re-dispatch to fake key (#151034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151034
Approved by: https://github.com/zou3519
2025-05-21 22:19:47 +00:00
c13eeaa718 Move inductor benchmark jobs to CUDA 12.8 (#154004)
For benchmark jobs, we usually want to run with the latest supported CUDA version to get the best performance. This is a request coming from NVIDIA, where they are running inductor benchmarks on Blackwell with CUDA 12.8 (the minimum supported version) and looking for an apples-to-apples comparison.

This also cleans up references to CUDA 12.4, which has been sunset in PyTorch CI.

### Testing

- H100 benchmark https://github.com/pytorch/pytorch/actions/runs/15151424588
- Micro benchmark https://github.com/pytorch/pytorch/actions/runs/15151445957 (I just realize that this is still running on A100, @yanboliang Do you want to run on H100 now that we have capacity there?  It would also solve the problem of GPU memory)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154004
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/seemethere
2025-05-21 22:17:10 +00:00
053ca7439a [cutlass backend] Add serializer for cutlass ops (#153894)
Differential Revision: [D74524786](https://our.internmc.facebook.com/intern/diff/D74524786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153894
Approved by: https://github.com/ColinPeppler, https://github.com/mlazos
2025-05-21 22:01:40 +00:00
401fa87ace make only current thread allocate to pool in NcclPG (#153990)
Follow-up to #153356 that fixes NCCL allocation to the pool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153990
Approved by: https://github.com/kwen2501
2025-05-21 21:57:37 +00:00
28bcd9eb30 [aoti] Initial Metal support (#153959)
An example generated file: P1816629015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #153964
2025-05-21 21:55:59 +00:00
918ae5d361 [aoti] Add MPS runner and shim (#153964)
Added AOTIModelContainerRunnerMps and a shim for mps fallback ops.
I also added a mps-specific shim which contains one operator, which will be used to set arguments being passed to the Metal kernel:

```
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg(
    AOTIMetalKernelFunctionHandle func,
    unsigned idx,
    AtenTensorHandle tensor);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-21 21:55:59 +00:00
0b79a8c1a9 [dynamo] renamed _fn for more clarity and put a comment of user compiler user (#154026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154026
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-05-21 21:12:51 +00:00
0e5f2339d0 [ROCm][Windows] Run hipcc with compatibility flags. (#153986)
See also https://github.com/ROCm/TheRock/issues/590. Including the `-Wno-ignored-attributes` flag here avoids 700MB of log warning spam while compiling and the `-fms-extensions` seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html.

Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com>
Co-authored-by: Scott Todd <scott.todd0@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153986
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily

Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com>
2025-05-21 20:26:52 +00:00
3c89cfd460 cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-21 20:23:04 +00:00
4c6f0fe22f [dynamo] Properly handle torch.script.jit under @staticmethod (#153984)
Fixes #153607.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153984
Approved by: https://github.com/williamwen42
2025-05-21 19:45:06 +00:00
b184e3da9c [easy] Fix internal only test (#154035)
Internally static cuda launcher isn't enabled, so we need to always enable it

Differential Revision: [D75146584](https://our.internmc.facebook.com/intern/diff/D75146584/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154035
Approved by: https://github.com/Skylion007
ghstack dependencies: #153565
2025-05-21 19:00:55 +00:00
8e6e79fc1b [hop_schema] support gen_schema for invoke_subgraph (#152984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152984
Approved by: https://github.com/zou3519
ghstack dependencies: #151067, #152974
2025-05-21 18:55:46 +00:00
9c33899196 [hop_schema] add HopSchemaGenerator to make it easier to create hop schema (#152974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152974
Approved by: https://github.com/zou3519
ghstack dependencies: #151067
2025-05-21 18:55:46 +00:00
1e0f19e173 auto functionalize base_hop (#151067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151067
Approved by: https://github.com/zou3519
2025-05-21 18:55:46 +00:00
11c0ffefcd Cache code generation during triton template expansion and enable it for mm_template. (#151773)
In a model, we see ~40% of the time spent in mm/addmm tuning. The model has 2000 mms,
many of which receive the same input shapes.

With autotune enabled, this becomes expensive. While we already cache autotuning results, we
did not previously cache the generation of the Python code and the loading for each config that we autotune on.

This diff handles the code generation part (template expansions); a previous diff handled the loading part.
This is expected to save 20% on the model I am working on.

How do we do the caching?
For a given configuration and input layout, the generated code is always the same. One caveat is that
some other information collected during code generation is input dependent (namely, it depends on input
names and symbol names in the inputs), not just on the layout.
To handle those, we use a record-and-replay approach, where we record the functions that are called during
code generation that affect those outputs and replay them on a cache hit.

Effect on the current benchmark on a local run on a dev server:
mm_loop: 24115830838 -> 18362098019
mm_loop_dynamic: 30506097176 -> 25697270062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151773
Approved by: https://github.com/eellison
2025-05-21 18:55:41 +00:00
aec7cc60d7 add graph_code_verbose_log artifact for fx passes (#153775)
Fixes #153646

This PR refactors the logging behavior in the FX pass insert_deferred_runtime_asserts and runtime_assert.py to separate verbose/intermediate graph logs from the final output graph log. All verbose logs generated during the FX pass are now routed to a new artifact logger, graph_code_verbose, while only the final output graph remains logged to the original graph_code artifact.

Changes

- Added a new artifact logger: [graph_code_log = torch._logging.getArtifactLogger(__name__, "graph_code_verbose")]
- Updated all verbose/intermediate FX pass logs in [insert_deferred_runtime_asserts] to use the new graph_code_verbose artifact.
- Ensured that only the final output graph is logged to the original graph_code artifact.
- No changes to the FX pass logic or output—only logging behavior is affected.

Notes
This change is backward-compatible and does not affect the functional behavior of FX passes.
No changes to user-facing APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153775
Approved by: https://github.com/williamwen42
2025-05-21 18:31:59 +00:00
d23f4ae7b5 s390x: use qemu issue workaround for runner registration too (#154030)
s390x: use qemu issue workaround for runner registration too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154030
Approved by: https://github.com/seemethere
2025-05-21 18:30:25 +00:00
bb7e30c165 [MegaCache] Make MegaCache generic to allow external plugins registration (#152977)
Implements #152976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152977
Approved by: https://github.com/oulgen
2025-05-21 18:18:47 +00:00
c31e239910 [precompile] Add BundledAOTAutogradCacheEntry (#152840)
Finally, this PR adds BundledAOTAutogradCacheEntry. A BundledAOTAutogradCacheEntry is an AOTAutogradCacheEntry that saves the entire CompiledFxGraph directly in the entry.

This has some advantages:
- No more dependency on FxGraphCache at all
- Clearing FxGraphCache does not result in AOTAutogradCache miss
- Simpler logic, as BundledAOTAutogradCacheEntry has everything you need to load a full compiled python wrapper from a dynamo output

We plan to use BundledAOTAutogradCacheEntry for precompile. There's also a question of whether we want to use it for regular caching — the main disadvantage of this is having to save the same CompiledFxGraph twice, once in Inductor cache and once for AOTAutogradCache. With MegaCaching, this *could* be a regression in total cache size (as well as a minor cold start regression, as you have to save the same graph twice). I will import this and measure the mega cache space complexity, and if it looks good I'll enable it by default for caching as well.

On warm start, if AOTAutogradCache hits, you won't have to load inductor at all, so warm start overhead should be unaffected.
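
As a rough conceptual sketch of the difference (the class and field names below are illustrative, not the actual PyTorch definitions):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ReferencingAOTAutogradEntry:
    # Classic layout: stores only a key into FxGraphCache, so clearing
    # FxGraphCache also invalidates this entry.
    fx_graph_cache_key: str
    autograd_metadata: Any

@dataclass
class BundledAOTAutogradEntry:
    # Bundled layout: the compiled graph artifact is stored inline, so the
    # entry is self-contained and independent of FxGraphCache.
    compiled_fx_graph: bytes  # serialized compiled wrapper/graph
    autograd_metadata: Any
```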

Differential Revision: [D74593304](https://our.internmc.facebook.com/intern/diff/D74593304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152840
Approved by: https://github.com/zhxchen17
2025-05-21 18:08:42 +00:00
3eb8fa081a Revert "[3/n][Optimus][Auto-AC] Support float8_e4m3fn quantization type and set scaling as the default (#153802)"
This reverts commit 32b1baa981fe53d13d77acbee509c51087abf107.

Reverted https://github.com/pytorch/pytorch/pull/153802 on behalf of https://github.com/malfet due to It breaks ROCM testing, see d23762974e/1 ([comment](https://github.com/pytorch/pytorch/pull/153802#issuecomment-2898695702))
2025-05-21 17:20:31 +00:00
d23762974e [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we are tuning the cutlass backend kernels on 3 swizzles. These are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
The winner of the {configs} x {swizzles} autotuning is the same as what a greedy search finds: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a greedy algorithm to reduce autotuning time.
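
A minimal sketch of the greedy two-stage idea (the `benchmark` hook and function name are hypothetical; swizzle 2 and a top-K of roughly 5-10 follow the observations above):

```python
def two_stage_autotune(configs, swizzles, benchmark, top_k=10):
    """benchmark(config, swizzle) -> measured runtime in ms (hypothetical hook)."""
    prescreen_swizzle = 2  # hardcoded prescreening swizzle, per the observations
    # Stage 1: rank every config at the fixed swizzle and keep the top K.
    finalists = sorted(configs, key=lambda c: benchmark(c, prescreen_swizzle))[:top_k]
    # Stage 2: fully autotune only the finalists across all swizzles.
    return min(
        ((c, s) for c in finalists for s in swizzles),
        key=lambda cs: benchmark(*cs),
    )
```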

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-21 17:12:05 +00:00
72a3c8dfa8 [AOTI][reland] Add an option to specify custom op C shim (#153968)
Summary: Reland https://github.com/pytorch/pytorch/pull/153851 after fixing a fuzzer test issue.

Add an option to tell AOTInductor codegen to generate C shim functions for certain custom ops instead of relying on ProxyExecutor. The lib that defines the custom ops needs to implement the corresponding C shim functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153968
Approved by: https://github.com/hl475
2025-05-21 15:57:57 +00:00
b7d08defe9 [BE]: Type previously untyped decorators (#153726)
This fixes decorator typing which unmasks a lot of typing issues in the codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153726
Approved by: https://github.com/albanD
2025-05-21 15:56:19 +00:00
6c2c527cd6 [BE] Remove extra semicolons from SymmetricMemory.hpp (#154034)
Fixes
```
In file included from /Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.cpp:1:
/Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.hpp:77:4: warning: extra ';' after member function definition [-Wextra-semi]
   77 |   };
      |    ^
/Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.hpp:81:4: warning: extra ';' after member function definition [-Wextra-semi]
   81 |   };
      |    ^
2 warnings generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154034
Approved by: https://github.com/Skylion007
2025-05-21 14:33:30 +00:00
33767eb391 Add option to statically launch user defined triton kernels (#153725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153725
Approved by: https://github.com/oulgen, https://github.com/Mingming-Ding, https://github.com/jansel
ghstack dependencies: #153565
2025-05-21 14:33:15 +00:00
b73d77900e [CI] Run limited h100 tests every 6 hours (#153900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153900
Approved by: https://github.com/Skylion007
2025-05-21 13:40:03 +00:00
11f8511455 Update torch-xpu-ops commit pin (#153902)
Update the torch-xpu-ops commit pin to defce46ae7, which includes:

- Resolve the aten::gamma accuracy gap compared to scipy
- Optimize layernom_vectorized_impl by using adaptive wg selection for small shapes
- [Intro async flag and use current stream avoid stream sync](https://github.com/intel/torch-xpu-ops/pull/1546)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153902
Approved by: https://github.com/Skylion007, https://github.com/EikanWang
2025-05-21 13:29:41 +00:00
616e20b4f1 [Intel GPU] scalar tensor case handling in addmm, baddmm (#153051)
# Motivation
This PR adds scalar tensor (`t.numel() == 1 and t.sizes().empty() == true and t.dim() == 0`) handling in addmm and baddbmm. The issue was found during the oneDNN upgrade: the new version of oneDNN requires the post-op binary (`self` in addmm) to have the same dimensionality as the output tensor, so we now need to explicitly expand the shape of the `self` tensor. The former oneDNN version handled the broadcasting internally.

This PR could fix the issues in https://github.com/intel/torch-xpu-ops/issues/1612 and the CI error in https://github.com/pytorch/pytorch/pull/151767.

# Implementation
We treat the scalar tensor as a normal tensor by `unsqueeze`-ing it into a 1-dimensional tensor. Combined with the existing shape handling logic, it is then further `unsqueeze`d to a 2D or 3D shape.
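
A small usage sketch of the case being handled (shapes are arbitrary; the explicit unsqueeze/expand mirrors the broadcasting the backend now has to perform, and the device falls back to CPU when XPU is unavailable):

```python
import torch

device = "xpu" if getattr(torch, "xpu", None) and torch.xpu.is_available() else "cpu"
m1 = torch.randn(4, 8, device=device)
m2 = torch.randn(8, 5, device=device)
beta, alpha = 2.0, 1.0

# 0-dim "scalar tensor" used as self: numel()==1, dim()==0, empty sizes.
self_scalar = torch.tensor(3.0, device=device)
out = torch.addmm(self_scalar, m1, m2, beta=beta, alpha=alpha)

# Conceptually, the backend must broadcast self up to the output shape:
ref = beta * self_scalar.unsqueeze(0).unsqueeze(0).expand(4, 5) + alpha * (m1 @ m2)
torch.testing.assert_close(out, ref)
```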

UT testing:
```
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmm_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmv_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float16
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153051
Approved by: https://github.com/EikanWang, https://github.com/guangyey
2025-05-21 12:24:37 +00:00
afd7a13bca Migrate to new Windows Arm64 runners (#152099)
This PR moves the Windows Arm64 nightly jobs to the new runner image, see [arm-windows-11-image](https://github.com/actions/partner-runner-images/blob/main/images/arm-windows-11-image.md )

Fixes #151671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152099
Approved by: https://github.com/seemethere
2025-05-21 09:13:15 +00:00
ffd49d538e [BE][Ez]: Improve typing in torch/modules/container.py (#153728)
Adds some missing type annotations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153728
Approved by: https://github.com/albanD
2025-05-21 07:15:00 +00:00
a636a92ee9 [MTIA ATen Backend] Migrate "_unsafe_view" and "view" ops from out-of-tree to pytorch in-tree (#153670)
Summary:
# Context
The MTIA New Aten Backend work is essentially to move MTIA operators from PyTorch out-of-tree to in-tree, with the following benefits:
1. Avoid duplicate code copied from pytorch, e.g. view ops implementation, util functions.
2. Utilize TensorIterator and structured kernel codegen, avoid manual implementation of broadcasting, dtype casting, asserting, etc.
3. Eliminate MTIA's own codegen flow, which is unnecessary complexity.
4. Overall make MTIA's aten backend more pytorch native.

Differential Revision: D74672464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153670
Approved by: https://github.com/albanD, https://github.com/nautsimon
2025-05-21 05:20:45 +00:00
dcb3edd30d [AOTI][XPU] Refactor AOTInductor runtime API for Intel GPU. (#153929)
Simplify and improve code format for sycl_runtime_wrappers.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153929
Approved by: https://github.com/desertfire
ghstack dependencies: #153924
2025-05-21 03:52:54 +00:00
531d8f5fb6 Revert "[cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)"
This reverts commit 2b43d635d31f1338743885efd1a259f43bd2ee65.

Reverted https://github.com/pytorch/pytorch/pull/153556 on behalf of https://github.com/eqy due to reverting, will add disable for reduced precision reduction ([comment](https://github.com/pytorch/pytorch/pull/153556#issuecomment-2896257521))
2025-05-21 02:09:11 +00:00
1478d0185c Revert "[CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594)"
This reverts commit 8cabd23b3d357ec38a400978bb5423efcb433f2a.

Reverted https://github.com/pytorch/pytorch/pull/151594 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a distributed test in trunk ([comment](https://github.com/pytorch/pytorch/pull/151594#issuecomment-2896230131))
2025-05-21 01:45:20 +00:00
daa68e7a93 Update USE_XCCL option if USE_XPU is OFF (#153936)
# Motivation
Disable `USE_XCCL` when `USE_XPU` is turned `OFF` to ensure configuration consistency. This is required because XCCL depends on XPU functionality.
In particular, ensure that `USE_XCCL` is correctly set to `OFF` when [caffe2_update_option(USE_XPU OFF)](1075bb37d3/cmake/Dependencies.cmake (L97)) is invoked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153936
Approved by: https://github.com/Skylion007
2025-05-21 01:32:41 +00:00
cf6e5d1881 Revert "[cuBLASLt] relax addmm cuBLASLt constraint (#153675)"
This reverts commit f9bb7cf72a43bec17b5fc2ccbe865aa130e760be.

Reverted https://github.com/pytorch/pytorch/pull/153675 on behalf of https://github.com/eqy due to incorrect, cuBLASLt doesnt handle beta != 1.0 but this appears untested ([comment](https://github.com/pytorch/pytorch/pull/153675#issuecomment-2896188784))
2025-05-21 01:20:10 +00:00
fe49b11e09 Add memory reporting for XPU to Memory Profiler (#152842)
Adds support for XPU profile_memory in Pytorch Profiler.

Currently, when `profile_memory=True` is passed to `torch.profiler.profile`, there is no XPU memory reported. For example, the profiling table printed by the code below is missing any `XPU Mem` columns:

<details><summary>profiling.py</summary>
<p>

```python
import torch
import torch.nn as nn
import torch.optim as optim

from torch.profiler import profile, ProfilerActivity

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.conv1 = nn.Conv1d(20,20,15,padding="same")
        self.flatten = nn.Flatten()
        self.net1 = nn.Linear(2048, 4096)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(4096, 5)

    def forward(self, x):
        res = self.conv1(x)
        res = self.flatten(res)
        res = self.net1(res)
        return self.net2(self.relu(res))

def demo_basic():
    model = ToyModel().to("xpu")
    loss_fn = nn.MSELoss().to("xpu")
    optimizer = optim.SGD(model.parameters(), lr=0.001)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU], profile_memory=True) as prof:
        for epoch in range(10):
            optimizer.zero_grad()
            outputs = model(torch.randn(20, 2048).to("xpu"))
            labels = torch.randn(20, 5).to("xpu")
            loss_fn(outputs, labels).backward()
            optimizer.step()
    print(prof.key_averages().table(max_name_column_width=100, sort_by="xpu_time_total", row_limit=100))

if __name__ == "__main__":
    demo_basic()
```
</p>
</details>

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg       CPU Mem  Self CPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                            gemm_kernel         0.00%       0.000us         0.00%       0.000us       0.000us       1.501ms        44.73%       1.501ms      25.024us           0 b           0 b            60
    autograd::engine::evaluate_function: AddmmBackward0         0.12%       1.067ms        30.47%     260.929ms      13.046ms       0.000us         0.00%       1.009ms      50.448us           0 b           0 b            20
                                         AddmmBackward0         0.09%     744.983us        15.99%     136.944ms       6.847ms       0.000us         0.00%     784.640us      39.232us           0 b           0 b            20
                                               aten::mm        15.41%     131.956ms        15.79%     135.167ms       3.379ms     784.640us        23.37%     784.640us      19.616us           0 b           0 b            40
                                           aten::linear         0.02%     156.361us        20.58%     176.187ms       8.809ms       0.000us         0.00%     741.760us      37.088us           0 b           0 b            20
                                            aten::addmm        20.25%     173.371ms        20.52%     175.723ms       8.786ms     741.760us        22.10%     741.760us      37.088us           0 b           0 b            20
                                Optimizer.step#SGD.step         0.40%       3.429ms         5.55%      47.509ms       4.751ms       0.000us         0.00%     488.960us      48.896us           0 b           0 b            10
                                    aten::_foreach_add_         4.81%      41.162ms         5.15%      44.080ms       4.408ms     488.960us        14.57%     488.960us      48.896us           0 b           0 b            10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n...         0.00%       0.000us         0.00%       0.000us       0.000us     422.880us        12.60%     422.880us      42.288us           0 b           0 b            10
autograd::engine::evaluate_function: ConvolutionBack...         0.03%     280.041us         4.36%      37.328ms       3.733ms       0.000us         0.00%     356.320us      35.632us           0 b           0 b            10
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 856.227ms
Self XPU time total: 3.357ms
```

This PR updates the XPUCachingAllocator.cpp to report allocation events to the Profiler, and causes these to be printed in the table:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg       CPU Mem  Self CPU Mem       XPU Mem  Self XPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                            gemm_kernel         0.00%       0.000us         0.00%       0.000us       0.000us       1.436ms        43.64%       1.436ms      23.939us           0 b           0 b           0 b           0 b            60
    autograd::engine::evaluate_function: AddmmBackward0         0.13%       1.186ms        29.92%     262.875ms      13.144ms       0.000us         0.00%       1.005ms      50.272us           0 b           0 b     320.94 Mb      -4.69 Mb            20
                                         AddmmBackward0         0.09%     815.288us        16.48%     144.802ms       7.240ms       0.000us         0.00%     790.720us      39.536us           0 b           0 b     325.47 Mb           0 b            20
                                               aten::mm        15.86%     139.342ms        16.26%     142.875ms       3.572ms     790.720us        24.03%     790.720us      19.768us           0 b           0 b     325.47 Mb     325.47 Mb            40
                                           aten::linear         0.02%     182.856us        20.46%     179.775ms       8.989ms       0.000us         0.00%     669.440us      33.472us           0 b           0 b       3.13 Mb           0 b            20
                                            aten::addmm        20.10%     176.607ms        20.40%     179.210ms       8.961ms     669.440us        20.34%     669.440us      33.472us           0 b           0 b       3.13 Mb       3.13 Mb            20
                                Optimizer.step#SGD.step         0.42%       3.692ms         5.61%      49.267ms       4.927ms       0.000us         0.00%     486.640us      48.664us           0 b           0 b           0 b           0 b            10
                                    aten::_foreach_add_         4.83%      42.439ms         5.19%      45.574ms       4.557ms     486.640us        14.79%     486.640us      48.664us           0 b           0 b           0 b     -20.00 Kb            10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n...         0.00%       0.000us         0.00%       0.000us       0.000us     420.960us        12.79%     420.960us      42.096us           0 b           0 b           0 b           0 b            10
autograd::engine::evaluate_function: ConvolutionBack...         0.04%     310.719us         4.47%      39.279ms       3.928ms       0.000us         0.00%     339.520us      33.952us           0 b           0 b      -2.89 Mb      -3.12 Mb            10
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 878.627ms
Self XPU time total: 3.291ms
```

These XPU memory numbers match the same profiling results on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152842
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-05-21 01:19:19 +00:00
8817e5ac80 Render Example: and not Example:: in docs (#153978)
Everything here is a grep except the changes in tools/autograd/load_derivatives.py which I manually corrected.

The correct notation is:
```
Example::

    >>> ...
```

It is common and wrong to have:
```
Example::
    >>> ...
```

In the wrong example, we get these pesky double colons:
![image](https://github.com/user-attachments/assets/20ffd349-68bb-4552-966c-e23923350476)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153978
Approved by: https://github.com/soulitzer, https://github.com/malfet
2025-05-21 01:03:26 +00:00
0959869683 Bump triton pin for the release 3.3.1 of triton (#153951)
Triton is pointing to the latest triton pin: https://github.com/triton-lang/triton/tree/release/3.3.x
XPU is pointing to the latest XPU pin: https://github.com/intel/intel-xpu-backend-for-triton/commits/release/3.3.x/

This version contains the fix for the compilation issue for RTX 5090 GPUs with Compute Capability = 120: https://github.com/triton-lang/triton/pull/6771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153951
Approved by: https://github.com/davidberard98
2025-05-21 00:27:39 +00:00
32b1baa981 [3/n][Optimus][Auto-AC] Support float8_e4m3fn quantization type and set scaling as the default (#153802)
Summary:
1. Customers can now test with float8_e4m3fn.
2. To play it safe, we set the scaling version as the default.

Test Plan:
### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization
```

Buck UI: https://www.internalfb.com/buck2/f679f362-8bf4-454c-87df-a85cbc2ab2a8
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549861047443
Network: Up: 16KiB  Down: 3.9MiB  (reSessionID-98badbfd-76f7-487f-ab1c-1ec4f850614d)
Analyzing targets. Remaining     0/281
Executing actions. Remaining     0/5957                                                                                                   7.3s exec time total
Command: test.     Finished 3 local, 1 remote
Time elapsed: 1:29.7s
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D74910193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153802
Approved by: https://github.com/nareshrajkumar866, https://github.com/Hahu803, https://github.com/Mingming-Ding
2025-05-21 00:21:54 +00:00
58dc80dff6 [MPSInductor] Fix indexing calculation (#153997)
By using the `c10::metal::floor_divide` primitive.

This fixes the `test_flip_cat_mps` test, and makes `doctr_reco_predictor` and `doctr_det_predictor` pass accuracy checks (at least locally; a workflow dispatch has been scheduled to validate it in CI).

Before this change, the following script generated different compile and eager results:
```python
import torch

def foo(unsqueeze, unsqueeze_1):
    cat_1 = torch.ops.aten.cat.default([unsqueeze, unsqueeze_1], 1)
    view = torch.ops.aten.view.default(cat_1, [4])
    slice_5 = torch.ops.aten.slice.Tensor(view, 0, 0, 3)
    rev_1 = torch.ops.aten.flip.default(slice_5, [0])
    return rev_1

if __name__ == "__main__":
    x = torch.arange(1.0, 3.0, device='mps').reshape(2, 1)
    y = torch.arange(5.0, 7.0, device='mps').reshape(2, 1)

    rc, (kernel,) = torch._inductor.utils.run_and_get_kernels(torch.compile(foo), x, y)
    print(kernel)
    print("Compile: ", rc)
    print("Eager: ", foo(x, y))
```
After this change
```
'''
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        constant float* in_ptr1,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp6 = in_ptr0[1 + (c10::metal::floor_divide((-1)*x0, 2))];
        auto tmp11 = in_ptr1[1 + (c10::metal::floor_divide((-1)*x0, 2))];
        auto tmp0 = (2 + ((-1)*x0)) % (2);
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 0;
        auto tmp3 = tmp1 >= tmp2;
        auto tmp4 = 1;
        auto tmp5 = tmp1 < tmp4;
        auto tmp7 = tmp5 ? tmp6 : 0.0;
        auto tmp8 = tmp1 >= tmp4;
        auto tmp9 = 2;
        auto tmp10 = tmp1 < tmp9;
        auto tmp12 = tmp8 ? tmp11 : 0.0;
        auto tmp13 = tmp5 ? tmp7 : tmp12;
        out_ptr0[x0] = static_cast<float>(tmp13);
    }
'''
Compile:  tensor([2., 5., 1.], device='mps:0')
Eager:  tensor([2., 5., 1.], device='mps:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153997
Approved by: https://github.com/dcci
ghstack dependencies: #153970, #153971
2025-05-21 00:03:46 +00:00
fc33da410f Add torch/header_only_apis.txt and enforce they're tested (#153635)
This PR adds enforcement of testing header only APIs.

The benefit of torch/header_only_apis.txt is twofold:
1) this gives us a clear view of what we expect to be header only
2) this allows us to enforce testing

The enforcement added in this PR is very basic: we literally string-match that a symbol in `torch/header_only_apis.txt` appears in a cpp test. This is meant to be a first step in verifying our APIs are properly tested, and it can get fancier over time. For now, I've added myself as a codeowner to learn what to look out for in terms of proper tests. Over time, I anticipate we can automate more steps, but right now let's just get something out the door.
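
A rough sketch of what such a string-match check can look like (the test directory glob and script layout are assumptions for illustration, not the actual linter):

```python
from pathlib import Path

def check_header_only_apis(repo_root="."):
    root = Path(repo_root)
    symbols = [
        line.strip()
        for line in (root / "torch" / "header_only_apis.txt").read_text().splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    # Naive string-match enforcement: every listed symbol must appear in at
    # least one C++ test source file.
    test_sources = "\n".join(
        p.read_text(errors="ignore") for p in root.glob("test/cpp/**/*.cpp")
    )
    missing = [s for s in symbols if s not in test_sources]
    if missing:
        raise RuntimeError(f"header-only APIs without a cpp test: {missing}")
```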

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153635
Approved by: https://github.com/albanD
ghstack dependencies: #153965
2025-05-20 23:42:24 +00:00
41a9aa6564 Remove janky (though at times useful) dlclose test (#153975)
This test was never the shining star in class, but it helped check that we can properly delete a stable library. Now that we are running it in CI, this is not a good test to annoy people with, as dlclose + parallelism is likely not the move. I will miss it locally though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153975
Approved by: https://github.com/jbschlosser
2025-05-20 23:26:42 +00:00
7b7604fdb4 Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)"
This reverts commit 0c04492e3b142854fad8356a2a4d74f12e2c6c5d.

Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/malfet due to Breaks lint, see 3742b7fb3a/1 ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2896031661))
2025-05-20 23:12:11 +00:00
3742b7fb3a Treat dim=[] same as dim=None (#153570)
Fixes https://github.com/pytorch/pytorch/issues/153568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153570
Approved by: https://github.com/ngimel
2025-05-20 22:44:29 +00:00
f7b8eadd9d Add codeowner for merge rules (#152354)
To ensure changes to merge rights are properly reviewed
Also make the codeowner file valid by removing invalid users
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152354
Approved by: https://github.com/malfet
2025-05-20 22:24:23 +00:00
0c04492e3b [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we are tuning the cutlass backend kernels on 3 swizzles. These are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
The winner of the {configs} x {swizzles} autotuning is the same as what a greedy search finds: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-20 22:19:02 +00:00
2c2524f74b [AOTI] Generate unique cubin file names when package_cpp_only (#153948)
Summary:
* When package_cpp_only is specified, generate kernel file names with unique kernel names to make the final packaged files more readable. Assert on unique_kernel_names in case it was somehow explicitly set to False.
* Fix a rocm test skip, see https://github.com/pytorch/pytorch/pull/153828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153948
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-05-20 22:07:53 +00:00
8cabd23b3d [CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594)
This PR moves distributed cuda CI job from cuda 11.8 to cuda 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after filing the issues below.

https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
https://github.com/pytorch/pytorch/issues/153122 CUDA context related
https://github.com/pytorch/pytorch/issues/153517  NCCL regression, future NCCL may fix it

See: https://github.com/pytorch/pytorch/issues/147383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever
2025-05-20 21:56:47 +00:00
2b43d635d3 [cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)
Also enables unified workspaces by default for non-FBCODE use cases.
The default Lt workspace size is also updated to match the cuBLAS default logic, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).

Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-20 21:51:49 +00:00
aeb734f519 [nativert] Move GraphSignature to pytorch core (#152969)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

Added an in-memory representation for input and output specs of a graph. The GraphSignature class models the input and output specs of an exported graph produced by torch.export, which holds the graph information deserialized from the pt2 archive package. Runtime relies on the GraphSignature for weight name lookup and weight loading.

The serialization schema is defined in torch/_export/serde/schema.py
See more at: https://docs.pytorch.org/docs/stable/export.html#torch.export.ExportGraphSignature

Test Plan: Added tests under `test/cpp/nativert/test_graph_signature.cpp`

Differential Revision: D73895378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152969
Approved by: https://github.com/swolchok
2025-05-20 21:49:56 +00:00
8f943046f8 [BE] light cleanups to linter logic (#153965)
Some BE cleanup on other lint things I saw while working on the top of this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153965
Approved by: https://github.com/soulitzer
2025-05-20 21:28:48 +00:00
deaf6c2f2f Address the ignored warnings for -Wmissing-field-initializers in the file fbcode/caffe2/aten/src/ATen/native/cuda/RowwiseScaledMM.cu (#153958)
Summary:
the error message  https://www.internalfb.com/sandcastle/workflow/698057942249983018/artifact/actionlog.698057942382778255.stderr.1?selectedLines=66-66-70-148 from D74892646

When switching the host compiler to Clang, maybe we should only silence these warnings in this file.

Test Plan: sandcastle_green

Differential Revision: D75029051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153958
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-05-20 21:25:56 +00:00
6cb7e4b5a5 [EZ] Update mps xfail reason (#153971)
cummin is not implemented

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153971
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #153970
2025-05-20 21:15:14 +00:00
03859242ce [Testing] Fix test_deterministic_... on MPS (#153970)
By decorating emitted kernels with `'''` rather than `"""`

To match regex in `torch._inductor.utils.run_and_get_kernels`
This fixes `test_deterministic_codegen_mps`, `test_deterministic_codegen_on_graph_break_mps` and `test_deterministic_codegen_with_suffix_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153970
Approved by: https://github.com/dcci, https://github.com/jansel
2025-05-20 21:15:14 +00:00
3aa95b252a Fix test_side_stream_backward_overlap flakiness (#153963)
Fixes https://github.com/pytorch/pytorch/issues/153927

Although the autograd backward should always execute SideBw before MainBw, there is still a small chance the recorded events won't be in that order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153963
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
ghstack dependencies: #151079, #153412
2025-05-20 21:02:56 +00:00
500a710422 Revert "Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be run correctly (#153245)"
This reverts commit 2e56ce097a201ff3c69610cea953a9efce17d1b1.

Reverted https://github.com/pytorch/pytorch/pull/153245 on behalf of https://github.com/yangw-dev due to tests failed internally [D75078034](https://www.internalfb.com/diff/D75078034) ([comment](https://github.com/pytorch/pytorch/pull/153245#issuecomment-2895785642))
2025-05-20 20:45:55 +00:00
179e7d8624 Fix vs2022 caused AVX512 illegal instruction issue. (#153480)
Fixes #145702

Add `/d2implyavx512upperregs-` to disable an over-aggressive compiler optimization that caused AVX512 registers to be used on AVX2 machines.

Reference to: https://github.com/pytorch/pytorch/issues/145702#issuecomment-2874029459

Local test passed:
<img width="1208" alt="image" src="https://github.com/user-attachments/assets/26f4cb91-6bb5-416f-aa35-c899eb1489b2" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153480
Approved by: https://github.com/Blackhex, https://github.com/cyyever, https://github.com/atalman
2025-05-20 20:37:00 +00:00
996c4d803d Removing conda references from PyTorch Docs (#152702)
Addresses #148339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152702
Approved by: https://github.com/svekars, https://github.com/albanD, https://github.com/atalman
2025-05-20 20:33:28 +00:00
05bc78e64f [submodule] Update fbgemm pinned version (#153950)
Summary:
Update fbgemm pinned version in PyTorch.
Related update in fbgemm: D74434751

Included changes:
Update fbgemm external dependencies directory in setup.py
Add DISABLE_FBGEMM_AUTOVEC flag to disable fbgemm's autovec

Test Plan: PyTorch OSS CI

Differential Revision: D75073516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153950
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-20 20:24:27 +00:00
823a35807c [CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)
For #152816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101
Approved by: https://github.com/Skylion007
2025-05-20 20:19:03 +00:00
e8f8baf71f set CUDA_MODULE_LOADING for older drivers only (#152695)
`CUDA_MODULE_LOADING=LAZY` is the default for all drivers shipped with CUDA >=12.2 and we should check the driver version before setting the env variable.

(the `LOG(WARNING)` has to be removed before merging)
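A minimal sketch of the kind of driver check described above; it assumes `nvidia-smi` is on PATH and that the r535 driver line roughly corresponds to CUDA 12.2, so treat the threshold as illustrative:
```python
# Hedged sketch: only set CUDA_MODULE_LOADING for drivers older than the
# CUDA 12.2 driver line, which already defaults to lazy loading.
import os
import subprocess

def _driver_version() -> tuple[int, ...]:
    # Illustrative helper; queries the driver version via nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return tuple(int(p) for p in out.strip().splitlines()[0].split("."))

# Drivers shipped with CUDA >= 12.2 (roughly r535 and newer) already default
# to lazy loading, so only set the env var for older drivers.
if _driver_version() < (535,):
    os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")
```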

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152695
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/nWEIdia
2025-05-20 19:34:40 +00:00
7587350458 Make python_agnostic cpp extension tests standalone (#153274)
Related: #148920

This PR:
* Introduces a new file `test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py` with testing that follows the usual python testing patterns
    * This replaces the testing for python_agnostic in `test/test_cpp_extensions_aot.py`

After this PR, it is now possible to run:
```
python test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py
```

and the test will build the prerequisite wheel before running the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153274
Approved by: https://github.com/janeyx99, https://github.com/cyyever
ghstack dependencies: #153264
2025-05-20 19:18:09 +00:00
3ecd444004 Support independent builds for cpp extension tests + apply to libtorch_agnostic tests (#153264)
Related: #148920

This PR:
* Provides a helper `install_cpp_extension(extension_root)` for building C++ extensions. This is intended to be used in `TestMyCppExtension.setUpClass()` (a sketch follows this list)
    * Updates libtorch_agnostic tests to use this
* Deletes preexisting libtorch_agnostic tests from `test/test_cpp_extensions_aot.py`
    * Fixes `run_test.py` to actually run tests in `test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py` to avoid losing coverage. This wasn't being run due to logic excluding tests that start with "cpp"; this is fixed now
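A minimal sketch of the intended usage; the import path of `install_cpp_extension` and the extension module name are assumptions here:
```python
# Hedged sketch; the import location of install_cpp_extension is assumed.
import os
import unittest

from torch.testing._internal.common_utils import install_cpp_extension  # assumed path

class TestMyCppExtension(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Build the extension once for the whole test class, so individual
        # tests can simply import and exercise it.
        extension_root = os.path.join(os.path.dirname(__file__), "..")
        install_cpp_extension(extension_root)

    def test_extension_imports(self):
        import libtorch_agnostic  # noqa: F401  # assumed extension module name
```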

After this PR, it is now possible to run:
```
python test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py
```

and the test will build the `libtorch_agnostic` extension before running the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153264
Approved by: https://github.com/janeyx99
2025-05-20 19:18:09 +00:00
f1f54c197d [c10d] Simplify new_subgroups() by using new_subgroups_by_enumeration() (#153843)
Summary: The code changes in each file of the diff include removing the `subgroups` and `cur_subgroup` variables, and replacing the while loop with a call to `new_subgroups_by_enumeration()`.
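For context, a minimal sketch of the two public APIs involved (not the internal refactor itself); it assumes an already-initialized 4-rank default process group split into subgroups of size 2:
```python
# Hedged sketch: illustrates the two APIs the refactor consolidates around.
import torch.distributed as dist

def make_subgroups(world_size: int = 4, group_size: int = 2):
    # new_subgroups() splits the default group into equally sized subgroups.
    cur_subgroup, subgroups = dist.new_subgroups(group_size=group_size)

    # new_subgroups_by_enumeration() does the same given explicit rank lists,
    # which is what new_subgroups() now delegates to internally.
    ranks = [list(range(i, i + group_size)) for i in range(0, world_size, group_size)]
    cur_subgroup2, subgroups2 = dist.new_subgroups_by_enumeration(ranks)
    return cur_subgroup, cur_subgroup2
```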

Test Plan: contbuild & OSS CI

Differential Revision: D75007368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153843
Approved by: https://github.com/Skylion007, https://github.com/wz337
2025-05-20 19:15:20 +00:00
2d20106922 [inductor] Support cutlass backend with remote execution (#153844)
Meta-internal builds need to use RE to build with nvcc, since the
trainers do not have nvcc (and its attendant build toolchain) installed.

This diff enables building using an RE service (via the same code path used for
Triton)

Differential Revision: [D74907192](https://our.internmc.facebook.com/intern/diff/D74907192/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153844
Approved by: https://github.com/henrylhtsang
2025-05-20 19:05:23 +00:00
e0f8174001 [triton][fb] Move build_paths into triton_utils (#153652)
Summary: TSA, this is just a small cleanup

Test Plan: CI

Differential Revision: D74835506

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153652
Approved by: https://github.com/Skylion007
2025-05-20 18:59:50 +00:00
f9bb7cf72a [cuBLASLt] relax addmm cuBLASLt constraint (#153675)
`beta == 1.0` doesn't seem to be required anymore

https://github.com/pytorch/pytorch/issues/153590

`self.dim() == 1` restriction seems to still hold but not sure if that's due to a lack of handling on the PyTorch side or the cuBLASLt side, will investigate
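A minimal sketch of the kind of call that can now take the Lt path (whether cuBLASLt is actually selected still depends on backend heuristics):
```python
# Hedged sketch: addmm with beta != 1.0, which previously excluded the Lt path.
import torch

if torch.cuda.is_available():
    m, k, n = 64, 128, 32
    bias = torch.randn(m, n, device="cuda", dtype=torch.float16)
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    # beta != 1.0 no longer disqualifies the cuBLASLt backend.
    out = torch.addmm(bias, a, b, beta=0.5, alpha=1.0)
```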

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153675
Approved by: https://github.com/Skylion007
2025-05-20 18:43:38 +00:00
7c9d94e9bb Redirect mobile_optimizer.rst to executorch (#153664)
Redirect mobile_optimizer.rst to executorch

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153664
Approved by: https://github.com/byjlw, https://github.com/malfet
2025-05-20 18:13:45 +00:00
0087f5f0af [AOTI][XPU] Embed SPIR-V files into .so (#153924)
Following the design of #150739, this PR supports embedding kernel SPIR-V files so AOTI is one step closer to generating a single binary.
Fixes #153829
Fixes #153830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153924
Approved by: https://github.com/desertfire
2025-05-20 17:38:53 +00:00
335c89c6f1 [Monitoring] enable local logs and add mac test monitoring (#153454)
Enable running the upload utilization logic using a local pointer instead of reading from S3; this could be useful for ROCm too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153454
Approved by: https://github.com/huydhn
2025-05-20 17:14:40 +00:00
b910d37ec6 [cutlass backend] Reduce log level for cutlass runtime error (#153457)
Want to make sure we always call self.cleanup_run_fn() even if we crash.

I think this is the reason why sometimes we get
```
in _dlclose
TypeError: 'NoneType' object is not callable
```

Differential Revision: [D74629230](https://our.internmc.facebook.com/intern/diff/D74629230/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153457
Approved by: https://github.com/ColinPeppler
2025-05-20 17:03:17 +00:00
6b5b69a468 [Memory Snapshot] Fix RecordFunction Callback Handling (#153839)
Fixes #153571
Summary:
1. Set annotation callback to global to include all threads
2. Only init callbacks when enable == true and callbacks are empty under mutex
3. When enable == false, check if callbacks are present and if so remove them and set handle to 0 under mutex

We don't expect memory snapshots to be called from several different threads (they are almost always called just from main), but we add thread safety for the off chance that users do call them from different points of entry.
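A language-agnostic sketch of the enable/disable logic in points 2 and 3 above, written here in Python with a lock standing in for the C++ mutex; the real code registers RecordFunction callbacks in C++:
```python
# Hedged sketch of the guarded enable/disable pattern.
import threading

_lock = threading.Lock()
_handle = 0
_callbacks = []

def set_enabled(enable: bool) -> None:
    global _handle
    with _lock:
        if enable and not _callbacks:
            _callbacks.append(lambda evt: None)  # register annotation callback
            _handle = 1                          # remember registration handle
        elif not enable and _callbacks:
            _callbacks.clear()                   # remove callbacks
            _handle = 0                          # reset handle
```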

Test Plan: Ran basic snapshot and saw that the callbacks were registered properly

Reviewed By: ngimel

Differential Revision: D74771491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153839
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-05-20 17:01:00 +00:00
ddfaab3b56 [aoti] Reset expr when generating cpp code (#153898)
Maybe fixes https://github.com/pytorch/pytorch/issues/153896

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153898
Approved by: https://github.com/desertfire
2025-05-20 16:31:25 +00:00
5163bf0069 [CUDA][cuBLAS][cuBLASLt] avoid polluting prefer cuBLAS/Lt setting across tests (#153655)
Some tests may not set the preferred backend, which leads to unexpected behavior when multiple tests are run vs. standalone

Tests that should exercise both backends should explicitly parametrize this setting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153655
Approved by: https://github.com/ngimel
2025-05-20 16:18:35 +00:00
a7c01d7f13 [Inductor] Subgraph check output strides (#153755)
Make sure the output strides of the subgraph are consistent with the original gm. Without checking strides, it was possible for the subgraph to produce NaNs via a reinterpret tensor on the subgraph output, which itself was not contiguous.

Differential Revision: [D74691119](https://our.internmc.facebook.com/intern/diff/D74691119/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153755
Approved by: https://github.com/eellison
ghstack dependencies: #153754
2025-05-20 16:07:18 +00:00
63e5d46478 [Inductor] Subgraph support dynamic input expressions (#153754)
Support subgraph choice taking in inputs that have dynamic dimensions. Testing with decomposeK subgraph decomp

Differential Revision: [D74484741](https://our.internmc.facebook.com/intern/diff/D74484741/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153754
Approved by: https://github.com/eellison
2025-05-20 16:07:18 +00:00
2e56ce097a Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be run correctly (#153245)
Fixes #153239

Replaced the custom decorator with the common one, although the better way to skip the whole suite would be to add it to the skip list in run_test.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153245
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jeffdaily
2025-05-20 15:46:21 +00:00
ef958fa152 [cuDNN][cuDNN frontend] upgrade cuDNN frontend submodule to 1.12 (#153888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153888
Approved by: https://github.com/Skylion007
2025-05-20 15:08:37 +00:00
3102ae6798 Revert "[AOTI] Add an option to specify custom op C shim (#153851)"
This reverts commit 365ac49840105918c604a6b1c7e81c1ca59e37fb.

Reverted https://github.com/pytorch/pytorch/pull/153851 on behalf of https://github.com/malfet due to Looks like it broke fuzzer test, but I could be wrong, see c4d1ff02f8/1 ([comment](https://github.com/pytorch/pytorch/pull/153851#issuecomment-2894619773))
2025-05-20 14:23:50 +00:00
c4d1ff02f8 [Lint] Update clang-format to 19.1.4 (#153889)
All changes other than the one to `tools/linter/adapters/s3_init_config.json` are generated by newer clang-format
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153889
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-05-20 14:12:46 +00:00
d869ea11e0 [BE]: Update fmtlib submodule to 11.2.0 (#153853)
Update fmtlib to 11.2.0 with a lot of miscellaneous fixes for various compilers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153853
Approved by: https://github.com/malfet
2025-05-20 14:11:18 +00:00
4b759d98f8 Recheck autotune cache on static cuda launcher load (#153565)
When loading statically launchable triton kernels from FxGraphCache, since we don't instantiate a CachingAutotuner like we do normally, we need to recheck the autotune cache based on the existing compile results. If we get a hit, we take the compile result whose config matches the best config.

Sometimes the best config will have come from coordinate descent tuning. In this case, FxGraphCache today does not cache the resulting triton kernel, with or without the static cuda launcher, because coordinate descent tuning happens at runtime and the best config may not be one of the precompiled configs.
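A rough sketch of the matching step described above, with hypothetical `CompileResult`/`pick_compile_result` names standing in for the real internals:
```python
# Hedged sketch with hypothetical names; the real logic lives in the inductor
# runtime and is more involved than this.
from dataclasses import dataclass

@dataclass
class CompileResult:          # hypothetical stand-in
    config: dict
    kernel: object

def pick_compile_result(compile_results, cached_best_config):
    """On a static-launcher load, re-check the autotune cache and pick the
    precompiled result whose config matches the cached best config."""
    if cached_best_config is None:
        return None  # no autotune-cache hit; fall back to benchmarking
    for result in compile_results:
        if result.config == cached_best_config:
            return result
    # The best config came from coordinate-descent tuning and was never
    # precompiled, so there is nothing to reuse.
    return None
```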

Test Plan:
New unit test that failed before

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153565
Approved by: https://github.com/aorenste
2025-05-20 14:00:43 +00:00
d68d4d31f4 [Cutlass] EVT tests update (#153926)
Fixes internal EVT tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153926
Approved by: https://github.com/williamwen42
2025-05-20 10:03:10 +00:00
d44074f01a [Dynamo] Fix einops regression (#153925)
Fixes https://github.com/pytorch/pytorch/issues/153476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153925
Approved by: https://github.com/williamwen42
2025-05-20 09:52:42 +00:00
44f19c7179 Record the XPU and XCCL build settings in the compiled binary (#147161)
Fixes #ISSUE_NUMBER

Currently the XPU and XCCL build settings are not recorded in the compiled binary and are not shown by `torch.__config__.show()`, which is a quick way to check whether the binary has been built with such support.

Below is the output adding them (see end of last line):

```
Python 3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__config__.show())
PyTorch built with:
  - GCC 13.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2025.1-Product Build 20250203 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX512
XPU backend  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=RelWithDebInfo, COMMIT_SHA=43eb39d7c832b5560f7bfa8d29cc7919ac21c0ca, CXX_COMPILER=/home/pkourdis/compilers/gcc-13.3.0/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=OFF -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-error=redundant-move -DUSE_XPU -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.7.0, USE_CUDA=0, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=1, USE_MPI=0, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=0, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=1, USE_XPU=1,
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147161
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-05-20 09:21:39 +00:00
1075bb37d3 Revert "Fix fake tensor caching when output has unbacked (#153034)"
This reverts commit cb5f31a4a164a4fa1eaa627f9b15cdc18aa95ef1.

Reverted https://github.com/pytorch/pytorch/pull/153034 on behalf of https://github.com/malfet due to Seems to have introduced flakiness in MacOS inductor tests, see https://github.com/pytorch/pytorch/issues/153891 ([comment](https://github.com/pytorch/pytorch/pull/153034#issuecomment-2893059329))
2025-05-20 06:02:38 +00:00
9849c79fa2 Revert "FakeTensorMode dispatch shouldn't include bypass in exception context (#153780)"
This reverts commit aa84c037f0f473c91a79f48a5f278b7243f64b0e.

Reverted https://github.com/pytorch/pytorch/pull/153780 on behalf of https://github.com/malfet due to Reverting to clearly revert https://github.com/pytorch/pytorch/pull/153034, that seems to have introduced flakiness in MacOS inductor tests, see https://github.com/pytorch/pytorch/issues/153891 ([comment](https://github.com/pytorch/pytorch/pull/153780#issuecomment-2893053304))
2025-05-20 05:59:42 +00:00
365ac49840 [AOTI] Add an option to specify custom op C shim (#153851)
Summary: Add an option to tell AOTInductor codegen to generate C shim functions for certain custom ops instead of relying on ProxyExecutor. The lib that defines custom ops need to implement corresponding C shim functions.

Differential Revision: [D75014177](https://our.internmc.facebook.com/intern/diff/D75014177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153851
Approved by: https://github.com/hl475
2025-05-20 05:12:09 +00:00
89ebd29fdc [Dynamo] added warning message for tracing lru_cache wrapped functions (#153744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153744
Approved by: https://github.com/williamwen42
2025-05-20 04:08:29 +00:00
5ef90e14a3 [export] Remove unused constants (#153800)
An internal test case ran into a weird issue when exporting, where the model imported a file which creates tensor constants upon importing [(code ptr)](https://fburl.com/code/xwmhxm7n). This causes the tracer to create some tensor constants even though they're not used in the model code. This PR updates the lift_constant_tensors pass to remove constant nodes that are not being used instead of lifting them as tensor constants.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153800
Approved by: https://github.com/dolpm, https://github.com/pianpwk
2025-05-20 03:15:27 +00:00
a79e621c1c [DDP] rebuilt bucket order when find_unused_parameters=true (#153404)
Differential Revision: D72437251

Enable rebuilding the bucket order when find_unused_parameters=true.

It should always be better than not rebuilding the bucket order when find_unused_parameters=True:

1. for cases where bucket order in the first iteration is the same as the parameter order, rebuilding bucket order will not change anything

2. for cases where bucket order in the first iteration is not the same as the parameter order, there could be two cases:
    a. bucket order will not change after 1st iteration even the graph is dynamic and there is unused parameter, in this case, rebuilding bucket order will have performance gain
    b. bucket order change after 1st iteration due to dynamic graph, in this case, both parameter order and bucket order in 1st iteration are not ideal, so rebuilding bucket order or not does not matter

Rebuilding the bucket order when find_unused_parameters=true helps case 2.a, and it does not hurt the other cases in 1 and 2.b.
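A minimal usage sketch of the flag in question (the bucket-order rebuild itself happens inside DDP after the first iteration); it assumes the process group has already been initialized:
```python
# Hedged sketch: DDP with find_unused_parameters=True now also rebuilds the
# bucket order after the first iteration.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap(model: nn.Module) -> DDP:
    # Assumes torch.distributed has already been initialized.
    return DDP(model, find_unused_parameters=True)
```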

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153404
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2025-05-20 02:45:01 +00:00
8b94d30b26 [Testing] Benchmark more tests for MPSInductor (#153897)
And report HF tests as HF tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153897
Approved by: https://github.com/dcci
2025-05-20 02:41:38 +00:00
1627951f24 [3.13] Remove all profiler related skips (#153857)
As underlying issue were fixed by https://github.com/pytorch/pytorch/pull/153848

Fixes https://github.com/pytorch/pytorch/issues/142166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153857
Approved by: https://github.com/williamwen42
ghstack dependencies: #153848
2025-05-20 01:19:52 +00:00
b15720118a Revert "Cache code generation during triton template expansion and enable it for mm_template. (#151773)"
This reverts commit 9180bb187c0e4c3ab3654e765fe33ad4c75a2b1a.

Reverted https://github.com/pytorch/pytorch/pull/151773 on behalf of https://github.com/malfet due to It broke ROCm, see f9aa3bae8c/1 ([comment](https://github.com/pytorch/pytorch/pull/151773#issuecomment-2892587039))
2025-05-20 00:42:53 +00:00
f9aa3bae8c [Inductor][XPU] Fallback bmm to mm when batch == 1, align with cuda. (#153770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153770
Approved by: https://github.com/NikhilAPatel, https://github.com/EikanWang, https://github.com/jansel
2025-05-19 23:56:20 +00:00
d81217be2e Revert "Improve torch.ops typing (#153558)"
This reverts commit c5cba39d469151895cd0ecf7673b98e5072b69c2.

Reverted https://github.com/pytorch/pytorch/pull/153558 on behalf of https://github.com/yangw-dev due to Your diff will not be landed to fbcode since we suspect it caused the following breakage in an internal test:[D75007157](https://www.internalfb.com/diff/D75007157) for instance: tests_gpu/lookup_gpu_index_test.py:232:8 Undefined attribute [16]: torch._ops._OpNamespace has no attribute simple_index_mm_batch ([comment](https://github.com/pytorch/pytorch/pull/153558#issuecomment-2892506789))
2025-05-19 23:32:36 +00:00
701e22112d [PT2][Optimus][Observability] Refactor the logging to avoid excessive tlparse log (#153584)
Summary: context: https://fb.workplace.com/groups/943185660584207/permalink/1215335930035844/

Test Plan:
before: aps-aps-ig_v4_2t_2_make_baseline_30batch-735703723-f735706162

tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-aps-ig_v4_2t_2_make_baseline_30batch-735703723-f735706162/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000&fbclid=IwZXh0bgNhZW0CMTEAAR575JfJZUtE7kQCqzIZVCYomv1q03JzuMFVok8qDA_FuGC8oZ6rhhb2EziSQA_aem_abITQJZQP45t51_r-J-cFw

Differential Revision: D74776025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153584
Approved by: https://github.com/jamesjwu
2025-05-19 22:57:29 +00:00
c3e14ecdcd [CachingHostAllocator] guard accesses to use_host_register by mutex (#153845)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153845
Approved by: https://github.com/mradmila, https://github.com/jeffdaily
2025-05-19 22:39:13 +00:00
41564803c2 [Docs] Mention version.txt change for patch releases (#153860)
Part of https://github.com/pytorch/pytorch/issues/151425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153860
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2025-05-19 22:35:33 +00:00
08e716fc70 [BE] Fix -Wextra-semi warning (#153887)
Introduced by https://github.com/pytorch/pytorch/pull/153645

Semicolon is not needed after closing curly bracket defining a class method.

Not sure why CI did not catch it, but my local builds are now erroring out with
```
[19/97] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/passes/dead_code_elimination.cpp.o
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/jit/passes/dead_code_elimination.cpp:4:
/Users/nshulga/git/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.h:356:64: warning: extra ';' after member function definition [-Wextra-semi]
  356 |   ValueAndMemoryLocationSet(const AliasDb* db) : aliasDb_(db){};
      |                                                                ^
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153887
Approved by: https://github.com/wdvr, https://github.com/davidberard98
2025-05-19 22:25:03 +00:00
f419067e50 [ROCm] improve sparse addmm, enable complex (#153262)
PR to:
- enable complex data types for sparse matmul on ROCm
- fix sparse addmm/baddbmm on ROCm
- fix sparse hipification for ROCm
- fix/enable sparse tests on ROCm (~40 tests total):
```
test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_*
test_sparse.py::TestSparseCUDA::test_sparse_matmul_cuda_*
test_sparse_csr.py::TestSparseCSRCUDA::test_mm_cuda_float64
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCS*
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153262
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-05-19 22:23:18 +00:00
91cc93deae [CI] Reuse old whl (#153838)
~50% of commits on main only touch python files unrelated to the object files in the whl, meaning that we could reuse old whls and put the current commit's python files into the whl.  This PR does that in CI by identifying a previous job whose artifact and whl binaries can be reused (a sketch of the file check follows the list below).  See https://docs.google.com/document/d/1nQ1FNJqnJuSFRiM2HvQ27zg6Vm-77n7LECp30zYfTDk/edit?tab=t.icom2lesr6es for more details.

To reuse:
* the changed files between the whl's commit and the current commit can only be python files in test/ or torch/ and not in torch/csrc
* not on main branch or release branch
* ci-force-rebuild not on PR
* special abort issue is closed
* artifact should exist
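A rough sketch of the changed-files condition above; the real CI check covers more cases than this:
```python
# Hedged sketch of the "only python files in test/ or torch/, excluding
# torch/csrc" condition for reusing a previously built whl.
def can_reuse_whl(changed_files: list[str]) -> bool:
    for f in changed_files:
        if not f.endswith(".py"):
            return False
        if not (f.startswith("test/") or f.startswith("torch/")):
            return False
        if f.startswith("torch/csrc/"):
            return False
    return True

# Example:
# can_reuse_whl(["torch/_inductor/utils.py", "test/test_ops.py"])  -> True
# can_reuse_whl(["torch/csrc/autograd/engine.cpp"])                -> False
```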

Pros:
* build time -> 6 min whenever this can be done

Cons:
* not sure if I have the right files
* version + whl name still remains the same

Testing:
Unfortunately this PR's changed files are not on the list of acceptable changed files for reusing the whl, so I've been mangling it on other PRs to get things like https://github.com/pytorch/pytorch/actions/runs/15119214901/job/42497650394?pr=147470 (It is enabled on linux-focal-cuda12.6-py3.10-gcc11 / build and there are changes in common_utils.py to make sure the copying of python takes effect)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153838
Approved by: https://github.com/malfet
2025-05-19 21:47:33 +00:00
c0343b1539 Fix profiler on cpython-3.13 (#153848)
Per [PEP 667](https://peps.python.org/pep-0667/), `PyFrame_GetLocals` no longer returns a dict but rather an instance of `PyFrameLocalsProxy_Type`, so calling `PyDict_GetItemString` is no longer valid (it will always return None) and must be replaced with `PyMapping_GetItemString`.

Tested by partially reverting https://github.com/pytorch/pytorch/pull/141674; the full revert will be done in a follow-up PR.

Fixes https://github.com/pytorch/pytorch/issues/148273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153848
Approved by: https://github.com/Skylion007
2025-05-19 21:20:53 +00:00
8c40c9ffcb Revert "[CI] Reuse old whl (#153838)"
This reverts commit cc48550e6f6fa8888b7d90d030b36c4e6d6581ab.

Reverted https://github.com/pytorch/pytorch/pull/153838 on behalf of https://github.com/clee2000 due to testing on main is hard ([comment](https://github.com/pytorch/pytorch/pull/153838#issuecomment-2892272494))
2025-05-19 21:13:27 +00:00
a237831bc2 [JIT] Optimize DCE by storing a MemoryLocations for an entire set<Value*> (#153645)
Summary:
**TL;DR**: make DCE faster by replacing a Set<Value*> with a MemoryLocations sparse bitset (representing all the memory locations stored by the collection of all values in the set).

**Details**
The goal of this PR is to optimize this function from AliasDb:

```
bool AliasDb::writesToAlias(Node* n, const ValueSet& vs) const {
  const auto writtenTo = getWrites(n);
  if (writtenTo.empty()) {
    return false;
  }

  MemoryLocations locs;
  for (const auto v : vs) {
    auto it = elementMap_.find(v);
    if (it != elementMap_.end()) {
      const auto& vlocs = memoryDAG_->getMemoryLocations(it->second);
      if (writtenTo.intersects(vlocs)) {
        return true;
      }
    }
  }

  return false;
}
```

In the DCE use case, we have a ValueSet of live values, into which we insert `Value*`s; and sometimes need to check whether a node mutates any of the live values using `writesToAlias`.

Looping through all the values in the ValueSet and indexing into the elementMap_ is slow; so if we can pre-compute the MemoryLocations set, this speeds up the function. In some large model examples, I see ~15-25x speedups from this change.

**Implementation**: To avoid exposing too many details of AliasDb, I introduce a friend class `ValueAndMemoryLocationSet`, which is an insert-only set of Values, which also maintains the corresponding MemoryLocations.

Then in AliasDb, I use `ValueAndMemoryLocationSet` if we're using AliasDb for analysis, and otherwise use a `Set<Value*>` if we don't have AliasDb.

Test Plan: Rely on unit tests.

Differential Revision: D74827086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153645
Approved by: https://github.com/eellison
2025-05-19 21:04:59 +00:00
cc48550e6f [CI] Reuse old whl (#153838)
~50% of commits on main only touch python files unrelated to the object files in the whl, meaning that we could reuse old whls and put the current commit's python files into the whl.  This PR does that in CI by identifying a previous job whose artifact and whl binaries can be reused.  See https://docs.google.com/document/d/1nQ1FNJqnJuSFRiM2HvQ27zg6Vm-77n7LECp30zYfTDk/edit?tab=t.icom2lesr6es for more details.

To reuse:
* the changed files between the whl's commit and the current commit can only be python files in test/ or torch/ and not in torch/csrc
* not on main branch or release branch
* ci-force-rebuild not on PR
* special abort issue is closed
* artifact should exist

Pros:
* build time -> 6 min whenever this can be done

Cons:
* not sure if I have the right files
* version + whl name still remains the same

Testing:
Unfortunately this PR's changed files are not on the list of acceptable changed files for reusing the whl, so I've been mangling it on other PRs to get things like https://github.com/pytorch/pytorch/actions/runs/15119214901/job/42497650394?pr=147470 (It is enabled on linux-focal-cuda12.6-py3.10-gcc11 / build and there are changes in common_utils.py to make sure the copying of python takes effect)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153838
Approved by: https://github.com/malfet
2025-05-19 20:56:44 +00:00
9180bb187c Cache code generation during triton template expansion and enable it for mm_template. (#151773)
In a model, we see ~40% of the time spent in mm/addmm tuning. The model has 2000 mms,
many of which receive the same input shapes.

With autotune enabled, this becomes expensive. While we already cache autotuning results, we
did not previously cache the generation of the python code and the loading for each config that we autotune on.

This diff handles the code generation part (template expansions); a previous diff handled the loading part.
This is expected to save 20% on the model I am working on.

How do we do the caching?
For a given configuration and input layout, the generated code is always the same. One caveat is that
some other information collected during code generation is input dependent (namely, it depends on input
names and symbol names in the inputs), not just the layout.
To handle those we use a record-and-replay approach, where we record the functions that are called during
code generation that affect those outputs and replay them on a cache hit.
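A toy sketch of the record-and-replay idea, with hypothetical names; the real implementation records specific inductor-internal calls, not arbitrary functions:
```python
# Hedged, toy record-and-replay cache: on a miss we record the side-effecting
# calls made during generation; on a hit we replay them instead of regenerating.
class RecordReplayCache:
    def __init__(self):
        self._cache = {}  # key -> (generated_code, recorded_calls)

    def generate(self, key, generate_fn, side_effect_fn):
        if key in self._cache:
            code, recorded = self._cache[key]
            for args in recorded:          # replay input-dependent effects
                side_effect_fn(*args)
            return code

        recorded = []
        def recording_side_effect(*args):
            recorded.append(args)          # record while generating
            side_effect_fn(*args)

        code = generate_fn(recording_side_effect)
        self._cache[key] = (code, recorded)
        return code
```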

Effect on the current benchmark on a local run on dev server.
mm_loop. 24115830838 -> 18362098019
mm_loop_dynamic 30506097176-> 25697270062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151773
Approved by: https://github.com/eellison
2025-05-19 20:38:04 +00:00
1ccacc028d Revert "[CI] Reuse old whl (#153838)"
This reverts commit 0716acff3a3692daaf31d97f91ae5aee70f10f24.

Reverted https://github.com/pytorch/pytorch/pull/153838 on behalf of https://github.com/clee2000 due to forgot to comment some stuff out ([comment](https://github.com/pytorch/pytorch/pull/153838#issuecomment-2892195387))
2025-05-19 20:33:14 +00:00
6383ddcfa4 Update serialization docs (#153631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153631
Approved by: https://github.com/albanD
2025-05-19 20:22:07 +00:00
2fcbb903cb [BE][EZ] Delete unsued conda-env-IOS.txt (#153849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153849
Approved by: https://github.com/janeyx99, https://github.com/seemethere, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-05-19 20:06:38 +00:00
6ae0c42278 [CUDA][cuBLASLt] Respect allow[FP16/BF16]ReductionCuBLAS in cuBLASLt (#153095)
cuBLASLt matmuls have been silently allowing all reduction types, which meant that e.g., `allow_fp16_reduced_precision_reduction = False` had no effect.

In practice, split-K with reduced-precision reductions was unlikely to happen, as the default `CUBLASLT_WORKSPACE_SIZE` of 1MiB tends to prevent it.

However, this isn't guaranteed, and we are on the path to increasing the default workspace size following #151163.

This setting is effectively already tested in e.g., `test_cublas_addmm_size_100_cuda_float16` and `test_cublas_addmm_size_100_cuda_bfloat16` but the backend selection is not deterministic. Running the full `test_matmul_cuda.py` seems to exercise the Lt interface, but running a standalone test does not (apparently due to spurious alignment differences).
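A minimal sketch of the toggles whose behavior this PR makes cuBLASLt respect:
```python
# Hedged sketch: with this change the Lt backend also honors these flags.
import torch

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

if torch.cuda.is_available():
    a = torch.randn(128, 256, device="cuda", dtype=torch.float16)
    b = torch.randn(256, 64, device="cuda", dtype=torch.float16)
    out = a @ b  # any split-K reduction must now accumulate in fp32
```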

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153095
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-19 20:05:37 +00:00
e581e1c0f4 [BE][Ez]: Propagate some nodiscard in RNN (#153836)
Follow-up to @cyyever's #153805 to propagate `[[nodiscard]]` from the empty() method call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153836
Approved by: https://github.com/eqy
2025-05-19 19:55:45 +00:00
0716acff3a [CI] Reuse old whl (#153838)
~50% of commits on main only touch python files unrelated to the object files in the whl, meaning that we could reuse old whls and put the current commit's python files into the whl.  This PR does that in CI by identifying a previous job whose artifact and whl binaries can be reused.  See https://docs.google.com/document/d/1nQ1FNJqnJuSFRiM2HvQ27zg6Vm-77n7LECp30zYfTDk/edit?tab=t.icom2lesr6es for more details.

To reuse:
* the changed files between the whl's commit and the current commit can only be python files in test/ or torch/ and not in torch/csrc
* not on main branch or release branch
* ci-force-rebuild not on PR
* special abort issue is closed
* artifact should exist

Pros:
* build time -> 6 min whenever this can be done

Cons:
* not sure if I have the right files
* version + whl name still remains the same

Testing:
Unfortunately this PR's changed files are not on the list of acceptable changed files for reusing the whl, so I've been mangling it on other PRs to get things like https://github.com/pytorch/pytorch/actions/runs/15119214901/job/42497650394?pr=147470 (It is enabled on linux-focal-cuda12.6-py3.10-gcc11 / build and there are changes in common_utils.py to make sure the copying of python takes effect)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153838
Approved by: https://github.com/malfet
2025-05-19 19:26:08 +00:00
674a85cf26 Revert "[Distributed][CI] Rework continuous TestCase (#153653)"
This reverts commit 0d5c628a6e96e0a960af39d1d0de4bf04df69c39.

Reverted https://github.com/pytorch/pytorch/pull/153653 on behalf of https://github.com/kwen2501 due to More fixes needed ([comment](https://github.com/pytorch/pytorch/pull/153653#issuecomment-2891931028))
2025-05-19 18:29:27 +00:00
0d5c628a6e [Distributed][CI] Rework continuous TestCase (#153653)
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple TestCase classes in one file).

2. The child processes now run an infinite loop, monitoring test IDs passed from the main process via a task queue. Reciprocally, the child processes inform the main process of a test's completion via a completion queue.

3. Added a test template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
2025-05-19 18:20:42 +00:00
c54b9f2969 [Monitoring] Add util for linux build (#153456)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153456
Approved by: https://github.com/huydhn
2025-05-19 17:28:17 +00:00
be36bacdaa [pytorch] Delete TorchScript based Android demo app and point user to ExecuTorch (#153767)
Summary: A retry of #153656. This time start from co-dev to make sure we capture internal signals.

Test Plan: Rely on CI jobs.

Differential Revision: D74911818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153767
Approved by: https://github.com/kirklandsign, https://github.com/cyyever, https://github.com/Skylion007
2025-05-19 17:20:36 +00:00
6487ea30b3 [c10d] Fix new_subgroups(group=) bug (#153798)
Summary: The bug, introduced in https://github.com/pytorch/pytorch/pull/152765, was caused by passing the `group` parameter to the `get_rank()` function, which caused the function to return the rank of the entire group instead of the rank of the current process. The fix involves removing the `group` parameter from the `get_rank()` function call.

Test Plan: contbuild & OSS CI

Differential Revision: D74964213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153798
Approved by: https://github.com/Skylion007
2025-05-19 17:01:10 +00:00
b0e5402377 Revert "Recheck autotune cache on static cuda launcher load (#153565)"
This reverts commit 02af4e88e4e76309672dbc9b5970ae630df525c7.

Reverted https://github.com/pytorch/pytorch/pull/153565 on behalf of https://github.com/malfet due to Looks like it broke ROCM, see ee72c53c88/1 ([comment](https://github.com/pytorch/pytorch/pull/153565#issuecomment-2891673913))
2025-05-19 16:52:48 +00:00
ee72c53c88 Enable ruff check for all ipynb files (#153820)
Fixes #146411, following #148654

After testing, it seems this can be enabled for all ipynb files.

```bash
lintrunner --take RUFF --all-files
Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153820
Approved by: https://github.com/Skylion007
2025-05-19 16:45:26 +00:00
ed5f4a4fa8 Replace size() checks with empty() (#153805)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153805
Approved by: https://github.com/nareshrajkumar866, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-05-19 16:20:57 +00:00
0ec8fe46d7 cleanup, refactor and add missing self._dde_suppressed checks (#152657)
so two things other than cleanups and refactoring
1) do not use propagate_real_tensors to resolve eval under guard_or_true/guard_or_false .
2) do not guard for dimensions of type  DimDynamic.OBLIVIOUS_SIZE under guard_or_true/guard_or_false .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152657
Approved by: https://github.com/pianpwk
2025-05-19 16:15:14 +00:00
dccd19c2ef [Inductor] Construct subgraph with benchmarking args not example_inputs (#153753)
If the inputs to a subgraph have FlexibleLayout, the subgraph does not currently freeze the layouts. Therefore, the generated `example_inputs` might not be consistent in layout with the `args` passed in for benchmarking.

Differential Revision: [D74900879](https://our.internmc.facebook.com/intern/diff/D74900879/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153753
Approved by: https://github.com/eellison
2025-05-19 15:58:40 +00:00
7a46f4bde0 Enable accelerator to perform streaming backward (#153412)
Also see https://github.com/pytorch/pytorch/pull/142097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153412
Approved by: https://github.com/albanD
ghstack dependencies: #151079
2025-05-19 15:52:42 +00:00
c5cba39d46 Improve torch.ops typing (#153558)
Fixes a longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.

Decisions made along the way:

1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.

The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.
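A generic sketch of decision 2 above (class-level constants plus a `__getattr__` that returns a single type); the class bodies here are illustrative, not the actual `_Ops` implementation:
```python
# Hedged, illustrative pattern: special cases become class attributes so that
# __getattr__ only ever has to return one type, which type checkers can follow.
class _OpNamespace:
    def __init__(self, name: str) -> None:
        self.name = name

class _Ops:
    # Former __getattr__ special cases, now plain class attributes.
    aten = _OpNamespace("aten")
    prims = _OpNamespace("prims")

    def __getattr__(self, name: str) -> _OpNamespace:
        # Only reached when normal attribute lookup fails, so a single
        # return type is enough here.
        ns = _OpNamespace(name)
        setattr(self, name, ns)  # cache for subsequent lookups
        return ns
```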

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153558
Approved by: https://github.com/rec, https://github.com/Skylion007, https://github.com/cyyever
2025-05-19 14:52:32 +00:00
3cd5b3b1e7 [AOTI] Skip a rocm test (#153828)
Summary: Skip test_aot_inductor_package.test_compile_after_package. https://github.com/pytorch/pytorch/pull/150739 added an opt-in feature which doesn't work for rocm yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153828
Approved by: https://github.com/malfet
2025-05-19 14:13:19 +00:00
02af4e88e4 Recheck autotune cache on static cuda launcher load (#153565)
When loading statically launchable triton kernels from FxGraphCache, since we don't instantiate a CachingAutotuner like we do normally, we need to recheck the autotune cache based on the existing compile results. If we get a hit, we take the compile result whose config matches the best config.

Sometimes the best config will have come from coordinate descent tuning. In this case, FxGraphCache today does not cache the resulting triton kernel, with or without the static cuda launcher, because coordinate descent tuning happens at runtime and the best config may not be one of the precompiled configs.

Test Plan:
New unit test that failed before

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153565
Approved by: https://github.com/aorenste
2025-05-19 12:50:22 +00:00
c45515c2ed Update slow tests (#153815)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153815
Approved by: https://github.com/pytorchbot
2025-05-19 11:15:25 +00:00
4f1a52fba4 [xla hash update] update the pinned xla hash (#153816)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153816
Approved by: https://github.com/pytorchbot
2025-05-19 11:05:51 +00:00
f3daedb263 [BE]: Remove redundant copy (#153629)
Add typing and remove redundant copy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153629
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-05-19 08:25:20 +00:00
5506baa4ed Refactoring FSDP2 (_composable/fsdp) test cases to be device agnostic (#149848)
The motivation for this PR is to refactor the existing test cases in the folder test/distributed/_composable/fsdp/, or FSDP2 (as referred to in torchtitan), to be device agnostic such that any accelerator type is supported (e.g. CUDA, HPU, XPU, etc.)

The changes are in line with previously merged changes for fsdp (present in the folder test/distributed/fsdp/ ) test cases: https://github.com/pytorch/pytorch/pull/139184/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149848
Approved by: https://github.com/kwen2501, https://github.com/guangyey
2025-05-19 05:46:51 +00:00
6f835a4769 [amd] fix tunableop gemm (#153764)
Summary: TunableOp on AMD has had a perf regression for a while. It turns out that the TunableOp code path first runs the tuned GEMM and then runs the heuristics GEMM (so it runs two GEMMs).

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 buck test @//mode/opt-amd-gpu -c fbcode.rocm_arch=mi300 -c fbcode.rocm_ck_rtz=true fbcode//accelerators/workloads/microbench/RE:test_emu_v1p4 -- --exact 'accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)' --run-disabled
```

Before the diff
```
  File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/ecc11ed52295855f/accelerators/workloads/microbench/RE/__test_emu_v1p4__/test_emu_v1p4#link-tree/accelerators/workloads/microbench/RE/test_emu_v1p4.py", line 47, in test_gemm
    self.assertTrue(result < AMD_GEMM_BASELINE * AMD_GEMM_THRESHOLD)

Buck UI: https://www.internalfb.com/buck2/b4b8dfca-0301-4c5d-83d6-d866d840c42d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14355223896396807
Network: Up: 10MiB  Down: 1.9GiB  (reSessionID-23b213fe-a460-4788-86c6-a52343ff10f4)
Loading targets.   Remaining      0/5144                                      93161 dirs read, 753263 targets declared
Analyzing targets. Remaining      0/70523                                     2837379 actions, 3262810 artifacts declared
Executing actions. Remaining      0/472286                                    217:26:58.1s exec time total
Command: test.     Finished 122 local, 522 remote, 199785 cache (99% hit)     211:26:30.5s exec time cached (97%)
Time elapsed: 12:50.2s
Test execution completed but the tests failed
Tests finished: Pass 0. Fail 1. Fatal 0. Skip 0. Build failure 0
1 TESTS FAILED
  ✗ accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)

Run $ fdb buck test <args> to debug accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)
      ^^^ just prefix your previous command! ($ fdb !!)
Learn more at https://fburl.com/fdb
```

After the diff
```
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Reviewed By: henryoier, henryhu6

Differential Revision: D74910115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153764
Approved by: https://github.com/yangsiyu007, https://github.com/xw285cornell
2025-05-19 04:07:48 +00:00
2ade886412 [XPU] [Windows] Auto turn on kineto XPU build when the compiler version supports it. (#153681)
Since SYCL compiler 20250101, the dependency on the Level Zero header has been removed, so we can turn on kineto XPU by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153681
Approved by: https://github.com/chuanqi129, https://github.com/cyyever, https://github.com/EikanWang
2025-05-19 03:07:14 +00:00
1bc5762495 [Intel GPU][Inductor] Fallback embedding_dense_backward on XPU (#151637)
Reopen #146888; now the modification only affects the XPU device. We do not want to decompose embedding_dense_backward for torch.compile. Current XPU devices have hardware limitations on atomic ops, so we fall back to eager, where sort can be used to implement this op. hf_T5 amp bf16 training in torchbench gets a 2x improvement on Max 1550. ~~I also align with cuda on gelu decomposition in _addmm_activation~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151637
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang
2025-05-19 02:19:37 +00:00
74d0300804 Change unsafe_marked_cacheable_functions to a dictionary, so that you can specify a static cache key (#152486)
Fixes https://github.com/pytorch/pytorch/issues/152434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152486
Approved by: https://github.com/oulgen
2025-05-19 02:16:33 +00:00
694748dd9d [MPSInductor] Fix conv_transpose channels last (#153787)
Regardless of the input layout, transposed convolution always returns a contiguous tensor on MPS.
Added a test to validate that.
This fixes torch.compile for the SegmentAnything network.
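A minimal sketch of the behavior being validated, guarded on MPS availability:
```python
# Hedged sketch: transposed convolution returns a contiguous tensor on MPS
# regardless of the input memory format.
import torch

if torch.backends.mps.is_available():
    conv = torch.nn.ConvTranspose2d(3, 8, kernel_size=3).to("mps")
    x = torch.randn(1, 3, 16, 16, device="mps").to(memory_format=torch.channels_last)
    out = conv(x)
    assert out.is_contiguous()  # contiguous even for channels-last input
```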

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153787
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #153786
2025-05-19 02:01:48 +00:00
6fe5d9215f [EZ][MPS] Enable rsub op (#153786)
Nothing really to enable; just add it to native functions, and the TensorIterator abstraction takes care of the rest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153786
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/dcci
2025-05-19 02:01:48 +00:00
a2d0ef242d [AOTI] Embed cubin files into .so (#150739)
Summary: Embed cubin files so AOTI is one step closer to generating a single binary. Controlled by a flag, off by default.

Differential Revision: [D72535357](https://our.internmc.facebook.com/intern/diff/D72535357)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150739
Approved by: https://github.com/angelayi
2025-05-19 01:11:46 +00:00
a8986963da Fix some CMake issues (#153686)
These issues were discovered when trying CMake 3.27:
1. set C++ language on HIP sources.
2. add missing link to gtest_main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153686
Approved by: https://github.com/Skylion007
2025-05-19 00:31:34 +00:00
75eb2f3ff6 Revert "[Dynamo] added warning message for tracing lru_cache wrapped functions (#153744)"
This reverts commit aac30ef50366b03f0ef2d1e770f45a3465f6ea66.

Reverted https://github.com/pytorch/pytorch/pull/153744 on behalf of https://github.com/jeanschmidt due to Need to revert as it is breaking internal signals: [D74935585](https://www.internalfb.com/diff/D74935585) ([comment](https://github.com/pytorch/pytorch/pull/153744#issuecomment-2889187038))
2025-05-18 20:13:00 +00:00
cb57b19c3a [ATen-CPU] Use math.h for GeLU as well as cmath (#153742)
Summary:
## Context

See https://github.com/pytorch/pytorch/pull/149164 for more context.

Originally, this fix worked but more recently including `cmath` by itself no longer provides access to math constants on Windows platforms. I found that including `math.h` resolves this.

I'm not sure exactly what changed, but this PR updates the header to use both includes to fix the symbols not being found. It might be a bug introduced by a recent Windows update.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153742
Approved by: https://github.com/swolchok, https://github.com/Skylion007
2025-05-18 19:06:45 +00:00
aa84c037f0 FakeTensorMode dispatch shouldn't include bypass in exception context (#153780)
In the FakeTensor cache when we get a bypass exception while computing the cache key (call this exc_1) we need to dispatch to the original operation.

It's possible for the dispatch to the original operation to get its own exception which we want to bubble up to the caller (call this exc_2).

If we directly dispatch from within the handler for exc_1 then exc_2 will have a `__context__` of exc_1 - which can cause deviations between cached and non-cached behavior - so we need to be a bit careful when we call the dispatch.
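
A minimal self-contained sketch of the "dispatch outside the except block" idea described above (the cache and key helper are toy stand-ins, not the real FakeTensor cache API):

```py
class _BypassDispatchCache(Exception):
    """Stand-in for the internal bypass exception (exc_1) described above."""

_cache = {}

def _compute_cache_key(op, args):
    try:
        return (op, hash(args))          # toy key; the real one inspects FakeTensors and guards
    except TypeError as e:
        raise _BypassDispatchCache() from e

def cached_dispatch(op, args):
    key = None
    try:
        key = _compute_cache_key(op, args)
    except _BypassDispatchCache:
        pass                             # swallow exc_1; decide what to do *outside* the handler
    if key is None:
        # Dispatching out here keeps any exception raised by op (exc_2) from carrying
        # `__context__ = exc_1`, matching the non-cached behavior.
        return op(*args)
    if key not in _cache:
        _cache[key] = op(*args)
    return _cache[key]
```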

Testing:
test_aotdispatch.py::TestAOTExport::test_aot_export_predispatch_outdtype fails before this change and passes after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153780
Approved by: https://github.com/oulgen
2025-05-18 17:21:46 +00:00
68034198e5 [HOP] Mutation and alias rework (#146658)
This PR reworks the way the input mutations and various aliases are checked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146658
Approved by: https://github.com/ydwu4
2025-05-18 08:05:22 +00:00
0e805aad7f [ONNX] Support float4 (#151069)
- Support exporting float4 models (note: currently we use IR version 10 universally in the exporter, which does not include float 4 support. Eventually when onnx runtime and the ecosystem moves to support the new IR version 11 we should bump our version to 11 in the exporter as well)
- The shape of the type is set according to https://github.com/pytorch/pytorch/pull/148791#discussion_r2038704986 (added last dim with size 2)
- Use ml_dtypes types when converting to numpy for consistency with ONNX IR

Fix https://github.com/pytorch/pytorch/issues/150202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151069
Approved by: https://github.com/titaiwangms
2025-05-18 03:19:35 +00:00
8568dbce1d [inductor] Clean typing in codegen/common.py and codecache.py (#150767)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150767
Approved by: https://github.com/aorenste
2025-05-17 13:56:50 +00:00
27f7b65a69 [BE] Ensure generated stub files by gen_pyi are properly formatted (#150730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150730
Approved by: https://github.com/aorenste
2025-05-17 12:30:40 +00:00
7ebea09986 [Cutlass] Enable fusion with FusedSchedulerNodes (#153588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153588
Approved by: https://github.com/eellison
ghstack dependencies: #152815
2025-05-17 12:29:10 +00:00
f604732e2e [Cutlass] E2E Tests for EVT (#152815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152815
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-05-17 12:29:10 +00:00
b4fb801b2d [export] Move PT2 constants to torch::_export (#153206)
Test Plan:
`buck2 test //sigmoid/...`
https://www.internalfb.com/intern/testinfra/testrun/1970325119807758

Differential Revision: D74417085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153206
Approved by: https://github.com/zhxchen17, https://github.com/dolpm
2025-05-17 08:21:59 +00:00
40339c1e99 Revert "[CUDA][cuBLAS][cuBLASLt] avoid polluting prefer cuBLAS/Lt setting across tests (#153655)"
This reverts commit 3bde364996d53571a9fb799f5951a203a352ed18.

Reverted https://github.com/pytorch/pytorch/pull/153655 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/153655#issuecomment-2888212597))
2025-05-17 08:11:54 +00:00
9b2a45ac7d Refactor torch/utils/data/datapipes/gen_pyi.py with torchgen (#150626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150626
Approved by: https://github.com/aorenste
2025-05-17 06:21:41 +00:00
eqy
e802b29ed4 [SDPA][EZ] Abate narrowing conversion warning spam in flash_api.cpp (#153643)
for messages like
```/workspace/pytorch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp:1396:38: warning: narrowing conversion of ‘(char)(& q)->at::Tensor::<anonymous>.at::TensorBase::get_device()’ from ‘char’ to ‘c10::DeviceIndex’ {aka ‘signed ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153643
Approved by: https://github.com/Skylion007
2025-05-17 02:07:35 +00:00
aac30ef503 [Dynamo] added warning message for tracing lru_cache wrapped functions (#153744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153744
Approved by: https://github.com/williamwen42
2025-05-17 00:43:18 +00:00
e88c4db302 [BE]: Update ruff linter to 0.11.10 (#153625)
Fixes a bug with #153543 where I forgot to add pyproject.toml to the list of files RUF can scan and also updates it to the latest version (which is just minor bugfixes).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153625
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-05-17 00:39:47 +00:00
clr
a952f42bdb dynamo: Log if we're using dynamic shapes via set_feature_usage (#153490)
This makes it extremely clear if a specific model didn't use dynamic shapes and
should have (except it had a bad config option).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153490
Approved by: https://github.com/jansel
2025-05-16 23:59:00 +00:00
1e9666b32d Add cudaLaunchKernel to cuda_to_hip_mappings (#153690)
Summary: as $title

Test Plan:
Used in D74789639

Rollback Plan:

Reviewed By: cenzhaometa

Differential Revision: D74789639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-16 23:37:11 +00:00
cyy
7ae7324ac4 [submodule] Update google benchmark to v1.9.3 (#153676)
And remove `include_directories`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153676
Approved by: https://github.com/Skylion007
2025-05-16 23:31:53 +00:00
59c3463653 [Inductor] Fallback bmm to mm when batch == 1 (#153572)
Summary:
This change introduces a fallback path from `bmm` to `mm` when the batch dimension is `1`.
The motivation is to unlock specialized `mm` kernel paths (e.g., `decomposeK`, `persistent+TMA`, etc.) which often don't have `bmm` equivalents.

### Rationale

- **No regression:** On shapes where the fallback triggers, we see no performance loss.

- **Performance wins:** On select shapes (especially with large `K`), we observe measurable speedups by triggering `mm`-specific optimizations.
  For example, on `bmm` shapes of the form `(1, H, K, H)` where `H ∈ {16, 32, 48, 64}` and `K ∈ {4096 ... 32768}`, we see an **average speedup of 10%**.

- **Prevalence in prod:** Internal workloads frequently emit `bmm` ops with `batch=1`, making this fallback broadly useful in practice.
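
A minimal sketch of the fallback idea (the real change lives in the Inductor lowering, not in user code):

```py
import torch

def bmm_maybe_as_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a: (B, M, K), b: (B, K, N); when B == 1, route through mm to unlock
    # mm-only kernel paths such as decomposeK or persistent+TMA templates.
    if a.size(0) == 1 and b.size(0) == 1:
        return torch.mm(a.squeeze(0), b.squeeze(0)).unsqueeze(0)
    return torch.bmm(a, b)

a, b = torch.randn(1, 64, 8192), torch.randn(1, 8192, 64)
torch.testing.assert_close(bmm_maybe_as_mm(a, b), torch.bmm(a, b))
```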

Test Plan:
contbuild & OSS CI

Tests in test/inductor/test_torchinductor.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153572
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
2025-05-16 22:35:03 +00:00
76f182f8e0 [cutlass backend] Reduce log level for cutlass compilation error (#153397)
Differential Revision: [D74596410](https://our.internmc.facebook.com/intern/diff/D74596410/)

This change should only affect the cutlass backend. We know we are going to hit CUDA compilation errors, and we already do a good job handling and caching them, so reduce the logging level there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153397
Approved by: https://github.com/ColinPeppler, https://github.com/Skylion007
2025-05-16 21:46:14 +00:00
3bde364996 [CUDA][cuBLAS][cuBLASLt] avoid polluting prefer cuBLAS/Lt setting across tests (#153655)
Some tests may not set the preferred backend, which leads to unexpected behavior when multiple tests are run vs. standalone

Tests that should exercise both backends should explicitly parametrize this setting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153655
Approved by: https://github.com/ngimel
2025-05-16 21:31:13 +00:00
084c4aa614 Revert "Reapply "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)" (#153656)"
This reverts commit 7ed377f5776578aec4a6a9bc4eeef221a6b80a77.

Reverted https://github.com/pytorch/pytorch/pull/153656 on behalf of https://github.com/larryliu0820 due to Still being used internally so can't remove ([comment](https://github.com/pytorch/pytorch/pull/153656#issuecomment-2887665403))
2025-05-16 21:00:11 +00:00
e4a636df80 [dynamo] Make OptimizedModule more robust in attribute reads and writes (#153637)
Fixes #138157.

Differential Revision: [D74834872](https://our.internmc.facebook.com/intern/diff/D74834872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153637
Approved by: https://github.com/williamwen42
2025-05-16 20:29:19 +00:00
1748fa529a Revert "cleanup, refactor and add missing self._dde_suppressed checks (#152657)"
This reverts commit f7fb2f66e3b60b6e3d8b3ac78aa435b76f49bc11.

Reverted https://github.com/pytorch/pytorch/pull/152657 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/152657#issuecomment-2887539146))
2025-05-16 19:42:20 +00:00
62d8e3cb40 [BE][MPS] Cleanup log ops migration (#153727)
Introduced by https://github.com/pytorch/pytorch/pull/153398

Work around an internal compiler error on MacOS-13 by providing a boolean specialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153727
Approved by: https://github.com/Skylion007
2025-05-16 19:32:17 +00:00
cf226cb4d4 [BE]: Enable misc RUF rules and fix pyproject.toml indent (#153624)
Enables a variety of misc ruff rules and fixes some incorrect indentation in the file. Now that we recently updated ruff, we can enable these lints. I've already applied most of them; now that they are out of preview, they can be applied as stable lints.

Including:
* Do not bother typing a union with Never, as it gets cancelled out
* Simplify nested Literal into a single Literal
* Properly use packaging to parse versions instead of `map(int(` (see the sketch below)
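
Illustrative only: the `packaging` pattern the last bullet refers to, which handles pre-release and local version segments that break the `map(int, ...)` approach:

```py
from packaging.version import Version

ver = Version("2.1.0a0+git1234abc")   # a typical nightly-style version string (made up here)
print(ver.release)                    # (2, 1, 0)
print(ver.release >= (2, 1))          # True
```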
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153624
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-16 19:29:16 +00:00
f7fb2f66e3 cleanup, refactor and add missing self._dde_suppressed checks (#152657)
Two things beyond the cleanups and refactoring:
1) do not use propagate_real_tensors to resolve eval under guard_or_true/guard_or_false.
2) do not guard for dimensions of type DimDynamic.OBLIVIOUS_SIZE under guard_or_true/guard_or_false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152657
Approved by: https://github.com/pianpwk
2025-05-16 19:10:04 +00:00
c2dda47bc5 Revert "[dynamo] Make OptimizedModule more robust in attribute reads and writes (#153637)"
This reverts commit 2ce0b66db8b6a22e90b430a73b8914c2d73512e9.

Reverted https://github.com/pytorch/pytorch/pull/153637 on behalf of https://github.com/malfet due to Looks like it broke slow tests, see cda572b053/1 ([comment](https://github.com/pytorch/pytorch/pull/153637#issuecomment-2887449037))
2025-05-16 18:49:57 +00:00
cda572b053 codecache: Remove cpp_prefix.h duplication per build, then precompile it (#144293)
Prior to this PR, `_inductor/codegen/cpp_prefix.h` was copied into a new temporary directory on every inductor run utilizing the CPP backend (i.e. CPU-only), then included in the output source code. Instead, this PR puts it in an appropriate place in the torch includes, and includes it from there. This allows us to precompile it in cpp_wrapper and AOT inductor mode, saving significant compilation time.

Due to difficulties getting this to work in FBCode, the precompilation itself is only enabled in OSS PyTorch.

Differential Revision: [D69420620](https://our.internmc.facebook.com/intern/diff/D69420620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144293
Approved by: https://github.com/desertfire
2025-05-16 17:41:36 +00:00
befb5bd52a [dynamic shapes] simplify int(x / y) pattern (#153477)
Fixes #138853

Summary: Converts `TruncToInt(IntTrueDiv(x / y))` to `x // y` when x is divisible by y, which helps detect symint specializations we previously missed.
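
A plain-Python illustration of why the rewrite is sound (the actual change operates on symbolic expressions inside the shape environment; this is only the arithmetic idea):

```py
def simplify_trunc_div(x: int, y: int) -> int:
    # When x is exactly divisible by y, truncating division and floor division agree,
    # so TruncToInt(IntTrueDiv(x, y)) can be rewritten as x // y.
    if y != 0 and x % y == 0:
        return x // y
    return int(x / y)   # otherwise keep the original truncating form

assert simplify_trunc_div(12, 4) == 12 // 4 == int(12 / 4)
```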

Differential Revision: D74664734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153477
Approved by: https://github.com/bobrenjc93
2025-05-16 17:32:15 +00:00
3aa84775e7 [hipify] Replace cuda error cudaErrorContextIsDestroyed (#153576)
Summary: The CUDA symbol cudaErrorContextIsDestroyed is not converted to hipErrorContextIsDestroyed. Add this conversion.

Test Plan: CI

Differential Revision: D74542735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153576
Approved by: https://github.com/xw285cornell, https://github.com/cyyever
2025-05-16 16:19:42 +00:00
a060f3d272 Rewrite autograd producer consumer stream sync logic (#151079)
Also see previous work https://github.com/pytorch/pytorch/pull/142097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151079
Approved by: https://github.com/albanD
2025-05-16 15:42:22 +00:00
2ce0b66db8 [dynamo] Make OptimizedModule more robust in attribute reads and writes (#153637)
Fixes #138157.

Differential Revision: [D74834872](https://our.internmc.facebook.com/intern/diff/D74834872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153637
Approved by: https://github.com/williamwen42
2025-05-16 15:17:07 +00:00
f66a159db5 [Set] Raise TypeError if set is called with the wrong number of arguments (#152990)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152990
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906, #152989, #152907, #152908
2025-05-16 14:28:32 +00:00
5a0ca65555 [Set] Add correct set/frozenset __init__ behavior (#152908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152908
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906, #152989, #152907
2025-05-16 14:28:32 +00:00
053025494f [Set] Raise KeyError on empty set.pop() (#152907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152907
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906, #152989
2025-05-16 14:28:32 +00:00
5964cb5eb1 [Set] Update set.union and set.update to support *args (#152989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152989
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906
2025-05-16 14:28:32 +00:00
4759922c5e [Set] Add set.intersection(_update) (#152906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152906
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905
2025-05-16 14:28:32 +00:00
ca96d55322 [Set] Add set.difference(_update) (#152905)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152905
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903
2025-05-16 14:28:32 +00:00
5c6830ced0 [Set] Raise KeyError if elem not contained in the set (#152903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152903
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902
2025-05-16 14:28:32 +00:00
574f4c507a [Set] Add set.issubset and set.issuperset (#152902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152902
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901
2025-05-16 14:28:32 +00:00
5926b7a38f [Set] Add set.symmetric_difference(_update) (#152901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152901
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904
2025-05-16 14:28:32 +00:00
fe51ce62ca [Set] Raise TypeError if number of arguments mismatch (#152904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152904
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988
2025-05-16 14:28:32 +00:00
481c345f49 [Set] Raise TypeError if argument is unhashable (#152988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152988
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987
2025-05-16 14:28:32 +00:00
cf7021a0ee [Set] Handle exception in ConstantVariable operation (#152987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152987
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #150792
2025-05-16 14:28:32 +00:00
477f13c3fb [Set] Add CPython set tests (#150792)
Tests:
* test_set.py

This PR adds test_set.py from the CPython 3.13 branch and ~400 files to test/dynamo_expected_failures. Most of these are expected to be fixed in upcoming PRs. Only minimal changes were made to test_set.py to enable compilation with Dynamo using the PYTORCH_TEST_WITH_DYNAMO=1 environment variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150792
Approved by: https://github.com/anijain2305
2025-05-16 14:28:32 +00:00
6592086ac3 Add metal kernel for log ops (#153398)
Move unary log ops to metal kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153398
Approved by: https://github.com/kulinseth, https://github.com/malfet
2025-05-16 14:25:28 +00:00
8ca985b365 [Break XPU] Skip newly added test case on XPU that failed because torch._C._scatter not implemented. (#153685)
Fixes #153608
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153685
Approved by: https://github.com/malfet
2025-05-16 14:15:50 +00:00
9ccd601a14 [easy] Fix endif comments in functional_base.h (#153696)
The first one of these confused me on #152388. Happened to notice the second.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153696
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-16 14:08:41 +00:00
3443627e07 Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473)"
This reverts commit 4f4ecc583e0f48ad2d062a53bf91c61ab40b4948.

Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))
2025-05-16 08:29:26 +00:00
86c6f71ddb Revert "[Ez][BE]: Remove accidental classvar (#153540)"
This reverts commit e0dece510b703376d50a5d6536be6c601ca67d9e.

Reverted https://github.com/pytorch/pytorch/pull/153540 on behalf of https://github.com/jeanschmidt due to Broken internal tests, @albanD may you help the author get his PR merged? D74804063 ([comment](https://github.com/pytorch/pytorch/pull/153540#issuecomment-2886011101))
2025-05-16 08:26:37 +00:00
4d073af58c Revert "[inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)"
This reverts commit 725bbb6b5fffa2f2d219a0692ed27e376c9dd48a.

Reverted https://github.com/pytorch/pytorch/pull/152353 on behalf of https://github.com/jeanschmidt due to seems to have broken a few internal tests, @jansel may you help the author get his PR merged? ([comment](https://github.com/pytorch/pytorch/pull/152353#issuecomment-2885997862))
2025-05-16 08:20:39 +00:00
741539a790 Split out second pass of LayerNorm for profiler attribution reasons (#153578)
Summary:
Split out second pass of LayerNorm so it's more likely to show up in
profiler output. In my testing with perf, the samples from the lambda in the
current implementation are attributed somewhat haphazardly.

Differential Revision: D74181627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153578
Approved by: https://github.com/hl475
2025-05-16 08:07:13 +00:00
a9adc9a9b6 [Linter] Add linter to detect device-bias hard code in test cases. (#152948)
Since XPU does not gate community pull requests, we’ve observed that contributors often hardcode "cuda" in functions decorated with @requires_gpu() when adding new test cases. This causes the tests to fail on XPU and breaks XPU CI.
This PR adds a linter to detect such issues automatically. An example is shown below.

```
  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode device='cuda'

        11670  |                .contiguous()
        11671  |            )
        11672  |
    >>> 11673  |        inp = torch.rand((64, 64), device="cuda") * 2 - 1
        11674  |        boundaries = torch.tensor([-0.9, -0.8, 0.1, 0.2, 0.5, 0.9])
        11675  |
        11676  |        self.common(fn, (inp, boundaries), check_lowp=False)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode .cuda() call

        11700  |            self.assertEqual(ref, res)
        11701  |
        11702  |            for offset2 in (0, 1, 2, 3, 4):
    >>> 11703  |                base2 = torch.randn(64 * 64 + 64, dtype=torch.float32).cuda()
        11704  |                inp2 = torch.as_strided(base2, (64, 64), (64, 1), offset2)
        11705  |                ref2 = fn(inp2)
        11706  |                res2 = fn_c(inp2)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode torch.device('cuda:0')

        11723  |            return x.sin() + x.cos()
        11724  |
        11725  |        base = torch.randn(
    >>> 11726  |            64 * 64 + 64, dtype=torch.float32, device=torch.device("cuda:0")
        11727  |        )
        11728  |
        11729  |        inp1 = torch.as_strided(base, (32, 32), (32, 1), 4)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode .to('cuda') call

        11771  |            torch.manual_seed(42)
        11772  |            base = torch.randn(64 * 64 + 64, dtype=torch.float32, device=self.device)
        11773  |            torch.manual_seed(42)
    >>> 11774  |            base_ref = torch.randn(64 * 64 + 64, dtype=torch.float32).to("cuda")
        11775  |
        11776  |            inp = torch.as_strided(base, size, stride, offset)
        11777  |            inp_ref = torch.as_strided(base_ref, size, stride, offset)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152948
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/malfet, https://github.com/jansel
2025-05-16 08:03:54 +00:00
658d17dfb5 [ONNX] Add test for decomp_table update (#153671)
Added a test to strengthen the case for cherry-picking #153168. The original PR didn’t include this test since the fix for decomp_table and the registry was already covered by existing tests. However, it's reasonable to include a dedicated test for the specific issue (https://github.com/pytorch/pytorch/issues/150367 ) when considering the cherry-pick.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153671
Approved by: https://github.com/justinchuby
2025-05-16 08:00:16 +00:00
3fe42d4d5d [export] Dynamo symint support (#152677)
Basically adds native _IntWrapper support to dynamo. Here's my process of trying to make symint input support work on dynamo, and how I ended up with this approach [(doc)](https://docs.google.com/document/d/1GvNRQd8BnxlMay_hrEVgEta6VUeUW_hcFeRuB7q1nDY/edit?tab=t.0).

What I did was, before passing inputs to dynamo.export, I first wrap them with a class, `_IntWrapper`. When processing dynamic shapes, I then add the corresponding dynamic shape specification to the `dynamism` field stored on the `_IntWrapper`. If no dynamism is specified, the value gets unwrapped back to an integer. During dynamo tracing, when we encounter an `_IntWrapper`, we convert it to a symint if the dynamism was specified as `Dim.DYNAMIC/AUTO`. Dynamo then traces a graph that contains symint inputs, which gets passed to AOTAutograd and so on.
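
A hedged sketch of the wrapper described above; the field names follow the description, but this is not the actual implementation:

```py
class _IntWrapper:
    """Carries an int plus the dynamic-shape spec attached to it before export."""

    def __init__(self, val: int):
        self.val = val
        self.dynamism = None  # later set from dynamic_shapes, e.g. Dim.DYNAMIC or Dim.AUTO

def wrap_int_args(args):
    # Before dynamo.export: wrap plain ints (but not bools) so tracing can later
    # turn the DYNAMIC/AUTO ones into SymInts and unwrap the rest.
    return [
        _IntWrapper(a) if isinstance(a, int) and not isinstance(a, bool) else a
        for a in args
    ]
```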

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152677
Approved by: https://github.com/pianpwk
2025-05-16 07:51:50 +00:00
d965fa2c4b [CUDA][cuBLAS] Remove IS_ARM64 skip in test_matmul_cuda.py (#153660)
Original skip seems stale and the test appears to run fine on Grace + Hopper and Grace + Blackwell

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153660
Approved by: https://github.com/Skylion007
2025-05-16 07:31:16 +00:00
1503b3f897 [DSD] Don't pop tensors if they are on Meta device (#153185)
DSD currently pops tensors if they are on the Meta device. This forbids the use case where users would like DCP to directly initialize the tensors when loading.

This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which is based on the above behavior, is not realistic, and is not used anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
2025-05-16 07:18:39 +00:00
1a722f62c2 [Quant][X86] add an op to compute uint8 batch norm 2d (#152811)
**Summary**
This PR adds a new op, `onednn.qbatch_norm2d`, which accepts uint8 inputs on CPU device (instead of QuantizedCPU).
The new op is implemented with AVX512 instructions and provides performance similar to its counterpart for the QuantizedCPU device, `quantized.batch_norm2d`.
The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported).

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_batch_norm_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152811
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168, https://github.com/jgong5
ghstack dependencies: #152411
2025-05-16 06:13:40 +00:00
7e16cb99b6 [FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)
Fixes #147336

## Context

NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.

Bringing this to the attention of triton developer @davidberard98 he identified the memory layout of the tensor in HBM to be causing non-pipelined loads into SRAM, causing the slowdown.

To summarize:

In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation.

This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403))).

i.e., when loading 4 bytes of contiguous data from a tensor stored in row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new location in the col-major layout in SRAM. Thus the load is not a candidate for pipelining w/ cp.async and just moves data to registers then performs a series of single byte stores.

## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the necessary memory layout to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs
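
A minimal sketch (shapes assumed) of forcing a tensor's last two dims into column-major layout ahead of time; the actual enforcement happens inside FlexAttention, not in user code:

```py
import torch

def to_col_major(v: torch.Tensor) -> torch.Tensor:
    # Same logical tensor, but the sequence dim becomes stride-1 so the fp8 GEMM's
    # right operand can be streamed into SRAM with pipelined loads.
    return v.transpose(-2, -1).contiguous().transpose(-2, -1)

v = torch.randn(2, 8, 128, 64, dtype=torch.float16)    # (B, H, S, D), fp16 stand-in for fp8
print(v.stride()[-2:], to_col_major(v).stride()[-2:])   # (64, 1) -> (1, 128)
```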

## Benchmarks
Rerunning the repro we see fp8 runtime is reduced from 120% of bf16 to 76% of bf16 runtime.

Before fix:

```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```

After fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-05-16 04:56:50 +00:00
459ce6c12a [export] Flatten frame local logs (#153627)
Summary:
Some new errors have been showing up on the PT2 dashboard with
```
Invalid type for lengths: Expected BlobReference or torch.Tensor, got: Tensor(shape: torch.Size([10]), stride: (1,), storage_offset: 0)
```
This is caused by [this piece of code](https://fburl.com/code/5nbi9on7) which maps over a set of nodes (in this case type `IDListFeatureListField`) and turns the results into strings to be displayed later. However during pytree.tree_map we call pytree.tree_unflatten which will call the class's init function, which calls `assert_blob` (https://fburl.com/code/h3ainrn9). Because we've mapped over the values and converted them to strings, the assert_blob fails.

I initially thought to disable the assert_blob while tracing (D74684309), but I think we should actually flatten the list first, because tlparse expects plain string outputs instead of the actual structure.

Test Plan: `buck2 run mode/opt sigmoid/inference/ts_migration:pt2i_readiness_main -- --test_suite ads_all --mode test_full_model --model_id 542947220` fails with something else 😅

Differential Revision: D74744326

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153627
Approved by: https://github.com/yiming0416
2025-05-16 04:45:09 +00:00
7ed377f577 Reapply "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)" (#153656)
This reverts commit ae0e8f0c7316addab3f415dc767a9d34f58b0dae.

Keep android/libs/fbjni because it's being used by other components of
PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153656
Approved by: https://github.com/malfet
2025-05-16 04:35:42 +00:00
56e1c236bf [Dynamo] Catch unserialisable NN modules (#153503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153503
Approved by: https://github.com/c00w, https://github.com/jansel
2025-05-16 02:55:28 +00:00
d1f1ff8610 [ddp] propagate use_python_reducer to C++ reducer (#152735)
The C++ reducer is silently incorrect under CA; its implementation no-ops the collective. I'm guessing it was no-op'd because, in DDP + python reducer, the C++ reducer is still being initialized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152735
Approved by: https://github.com/fegin
ghstack dependencies: #153300, #152689
2025-05-16 01:38:03 +00:00
1b4749f748 [ca][dtensor] run real PG dtensor tests under CA (#152689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152689
Approved by: https://github.com/bdhirsh
ghstack dependencies: #153300
2025-05-16 01:38:03 +00:00
5aea57d653 [ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)
I slap disable on the recomputation hook, otherwise the partitioner may save less/more activations and mismatch with the expected eager count in checkpoint. See code comment `Note: [compiled autograd and checkpoint unpack hook]`.

This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.

This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region and when the backward executes the unpack hooks, dynamo tried to trace them. This messed up the internal state tracking of checkpointing, some raising the _StopRecomputationError and others raising the same count mismatch error as CA.

FIXES https://github.com/pytorch/pytorch/issues/127115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
2025-05-16 01:37:48 +00:00
cyy
9d3b6ee4c1 [submodule] Update gtest to v1.17.0 (#153618)
And remove some outdated CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153618
Approved by: https://github.com/malfet
2025-05-16 01:24:19 +00:00
d1dd2c1fc8 gloo: cuda (#153406)
This enables Gloo CUDA when used with a backend that supports GPUDirect, which is currently only the IBVERBS backend.

This requires some changes to Gloo which are in https://github.com/pytorch/gloo/pull/441

Since we're now depending on gloo_cuda we need to split ProcessGroupGloo into two pieces, one with the CPU bits (libtorch_cpu) and one with CUDA kernels in libtorch_cuda. This unfortunately requires some major refactoring as some CPU code is shared across both.

The gloo submodule is updated to depend on the new Gloo changes

Test plan:

```py
import os
import time

transport = "TCP"
#transport = "IBVERBS"

os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")

device = "cpu"

iters = 10
warmup_iters = 2

for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)

    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    start = time.perf_counter()

    for i in range(iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    dur = (time.perf_counter() - start)
    qps = iters/dur

    bandwidth_gb = t.nbytes * iters / dur / 1e9

    gb = t.nbytes / 1e9

    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj
2025-05-16 01:13:13 +00:00
ab757dcddc [MPS][Testing] Add GoogleFnet, YituTechConvBert and Super_SloMo to benchmarks (#153658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153658
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/cyyever
ghstack dependencies: #153657
2025-05-16 01:09:31 +00:00
754b758ea1 [BE] Extend empty_gpu_cache to mps (#153657)
And replace `if: elif:` with `getattr()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153657
Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/ZainRizvi
2025-05-16 01:08:54 +00:00
2489b6470b [c10d] Allow split_group to work with non nccl backends (#152175)
Summary:
Currently things are hardcoded to only work with nccl backend. Extend it
to allow NCCL + custom plugin backend.

The split-specific methods/attributes have not been added to the base
Backend and Options as some of them are specific to backend implementations.
Instead, explicit checks have been added to the split_group method for the
expected methods and attributes.

I am open to making them part of the base Backend if folks prefer.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152175
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
2025-05-16 00:15:29 +00:00
cb5f31a4a1 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output.  In this case we shouldn't cache at all because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.
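
A minimal sketch of the new rule (the symbol sets here are illustrative stand-ins; the real check walks the FakeTensor inputs and outputs):

```py
def should_cache(input_symbols: set, output_symbols: set) -> bool:
    # If the output contains a symbol (e.g. a fresh unbacked SymInt) that cannot be
    # traced back to the inputs, caching the entry would be meaningless -- skip it.
    return output_symbols <= input_symbols

assert should_cache({"s0", "s1"}, {"s0"})   # cache: output symbols all come from inputs
assert not should_cache(set(), {"u0"})      # bypass: unbacked symbol appears only in the output
```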

Added a test which checks for this case.

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-15 23:18:52 +00:00
e7a40fb301 [Async TP] Fix dim swapping before reduction in fused_scaled_matmul_reduce_scatter (#153595)
## Summary
- The unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` was not running for some reason when #149247 was merged, giving false green CI signals. When it was run manually recently, the test failed, highlighting a bug causing incorrect numerics when `scatter_dim=1`.
- This PR fixes the bug, which was related to how we swap dims 0<=>scatter_dim at the beginning of the custom op (for more efficient cross-device data movement I believe), then swap it back prior to reduction.
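
A heavily hedged sketch of just the dim-swapping pattern described above; it is not the fused op itself, and `do_comm_and_mm` is a stand-in for the real communication + GEMM:

```py
import torch

def swap_scatter_dim(x: torch.Tensor, scatter_dim: int, do_comm_and_mm):
    x = x.transpose(0, scatter_dim).contiguous()   # move scatter_dim to the front for data movement
    y = do_comm_and_mm(x)                          # comm + matmul operate on the dim-0-major layout
    return y.transpose(0, scatter_dim)             # the bug was in restoring this before the reduction
```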

## Test plan
- I confirmed the unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` is now passing.
- I confirmed e2e training w/ torchtitan looks good ([logs](https://www.internalfb.com/phabricator/paste/view/P1812054188))
- I analyzed the tlparse to verify the fused_all_gather_matmul and fused_scaled_matmul_reduce_scatter both appear at least once in the post grad graphs ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpVbUsdG/dedicated_log_torch_trace_65oh3qj_.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000))

## Next steps
1. I think for async TP `fused_scaled_matmul_reduce_scatter` we may only need `scatter_dim_after_maybe_reshape` and not `orig_scatter_dim` after all. I can confirm this and refactor if it is the case.
2. This op is specifically designed for async TP, and many of the arguments don't make sense for a user trying to use this as a standalone op. IMO we should have separate standalone custom op without all the extra function args and internal logic that doesn't apply to non-async TP cases.
3. In a follow up PR I want to add shape annotations to each line (e.g. `# (B, T, H)` etc) to make this easier to debug in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153595
Approved by: https://github.com/fegin
2025-05-15 21:44:57 +00:00
ea17cd067d Add vec_reduce_all specialization for std::plus on AArch64 (#152388)
AArch64 has an instruction for this.

Differential Revision: [D73817183](https://our.internmc.facebook.com/intern/diff/D73817183/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152388
Approved by: https://github.com/Skylion007
ghstack dependencies: #152365, #152366
2025-05-15 21:26:18 +00:00
b972435158 vec::map: directly process reduced-precision floats when reasonable (#152366)
The immediate motivation is to make map support match
ExecuTorch so we can delete ExecuTorch-specific mapping functions, but
this should also straightforwardly improve performance.

Testing: there is existing coverage for this in
vec_test_all_types.cpp. Verified that it really does cover the newly
enabled "don't convert through float" paths by temporarily adding a
TORCH_INTERNAL_ASSERT(false).

Differential Revision: [D73802126](https://our.internmc.facebook.com/intern/diff/D73802126/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152366
Approved by: https://github.com/malfet
ghstack dependencies: #152365
2025-05-15 21:26:18 +00:00
e4adf5df39 [ROCm] cpp_extension allow user to override default flags (#152432)
We need -fgpu-rdc for projects such as DeepEP + rocSHMEM. The default of -no-gpu-rdc doesn't work for such cases.

As per https://github.com/pytorch/pytorch/pull/152432#issuecomment-2840899088:
"rocshmem shares the same global variable in different files, as deepEP uses CUDAExtention to build the project 65e2a700f0/setup.py (L51) and depends on rocshmem, this -fgpu-rdc is needed. The current logic in Pytorch prevents users from overriding this flag."

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152432
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-15 21:06:18 +00:00
b8fad785d5 Change trigger for autoformat, use --all-files (#153289)
Change trigger for auto format to be pull_request b/c the reusable action used gets the pr number from the pull_request event context, but only run it if ciflow/autoformat is attached to the PR.  Tested this on a different PR, and it seems to be working

Changed tag name because ciflow prefixed labels have special handling

Also change it to run on all files so it mimics the normal CI lintrunner call, and because lintrunner, either by itself or with -m mergebase, can miss some things. I don't know whether it would miss anything for formatting, but it does for lint checks; formatting also seems to take less time than a normal lint run. I'm not sure whether making suggestions on non-edited files is a concern; I didn't really test that part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153289
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-15 20:38:33 +00:00
90deff6d59 Refactor tests in test_max_autotune into a few separate test cases. (#153486)
Summary: To support running a subset of these tests with the remote autotuning utilities, I've split out some of the tests into separate classes so that I can derive from the "main" TestMaxAutotune class when creating new tests for remote. I'm not 100% sure what some of these tests do, so please suggest if another grouping / naming might make more sense. The remaining tests in TestMaxAutotune all smelled relevant to me.

Test Plan: existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153486
Approved by: https://github.com/eellison
2025-05-15 20:35:22 +00:00
a2e2f908fd add is_vec_specialized_for (#152365)
Let people detect at compile time whether Vectorized is specialized for a given type. See vec_base.h.

Differential Revision: [D73802129](https://our.internmc.facebook.com/intern/diff/D73802129/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152365
Approved by: https://github.com/jgong5, https://github.com/malfet
2025-05-15 20:21:48 +00:00
ae0e8f0c73 Revert "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)"
This reverts commit b22f01fcb9d69bb7d77e08d69004c7265ef7fa4a.

Reverted https://github.com/pytorch/pytorch/pull/153633 on behalf of https://github.com/malfet due to But libtorch build regressions are real, fbjni is still used for C++ builds ([comment](https://github.com/pytorch/pytorch/pull/153633#issuecomment-2884951805))
2025-05-15 20:16:05 +00:00
b03e4f53d2 [Monitoring] enable windows monitoring test (#153453)
Enable utilization monitoring for Windows tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153453
Approved by: https://github.com/huydhn
2025-05-15 20:03:07 +00:00
f7ecc091a0 c10d/TCPStore: better logs on remote shutdown (#153586)
This makes it more obvious what's going on when TCPStore shuts down while waiting on a remote key and also shows the remote address.

Test plan:

```
[W514 18:33:36.536327028 TCPStore.cpp:138] [c10d] recvValueWithTimeout failed on SocketImpl(fd=3, addr=[localhost]:34658, remote=[localhost]:1234): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
```

```py
import os
rank = int(os.environ["RANK"])

import time
from torch import distributed as dist

store = dist.TCPStore(
    host_name="localhost",
    port=1234,
    is_master=(rank == 0),
    wait_for_workers=False,
)

time.sleep(1)

print("starting")

if rank != 0:
    store.get("foo")
else:
    time.sleep(1)

print("done")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153586
Approved by: https://github.com/XilunWu
2025-05-15 20:02:51 +00:00
064f4c18f9 [Monitoring] Enable perf tests (#153452)
Enable monitoring for more perf tests. Currently, for perf, we collect usage data every 4 seconds and aggregate every 15 seconds.

We can reduce the frequency if the monitoring does not affect the perf tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153452
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2025-05-15 19:19:19 +00:00
a4c828199e [BE] Add __all__ to torch/nn/functional.pyi and torch/return_types.pyi (#150729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150729
Approved by: https://github.com/aorenste
2025-05-15 19:01:57 +00:00
b22f01fcb9 Delete TorchScript based Android demo app and point to ExecuTorch (#153633)
Delete TorchScript demo app and point people to ExecuTorch demo app.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153633
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/atalman, https://github.com/janeyx99, https://github.com/seemethere
2025-05-15 18:43:59 +00:00
00e5cb3db3 [ez][trymerge] Edit revert message for reverted ghstack PRs (#153573)
Change comment about successful revert so it also contains info about the original PR that got the comment (if it is a ghstacked PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153573
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-15 18:23:20 +00:00
480ae2dab8 Add needs_contiguous_strides to more collective ops (#153523)
Differential Revision: D74705770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153523
Approved by: https://github.com/fmassa
2025-05-15 17:27:37 +00:00
cfee9046b6 cpu: enable gemm-bf16f32 for SDPA BF16 (#140159)
This PR enables SDPA BF16 (gemm: bf16f32) for aarch64. This enables faster inference for models with attention layers in autocast (bf16) mode.

Benchmark results from  [PyTorch CI HUD - branch](https://hud.pytorch.org/benchmark/huggingface/inductor_no_cudagraphs?dashboard=torchinductor&startTime=Fri%2C%2028%20Mar%202025%2021%3A26%3A20%20GMT&stopTime=Fri%2C%2004%20Apr%202025%2020%3A26%3A20%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=adi/gemm_bf16f32&lCommit=d5aeab452e4b1f0580a4636b15a604c77a02c57b&rBranch=main&rCommit=bc72420bcb37390af3fced885e019903e6e425bd)
Overall Geometric mean speedup in HUD dashboard  : for Huggingface: `[0.48x → 0.58x]` and for Blueberries: `[0.88x → 1.13x]`

Benchmark numbers for `torch.nn.functional.scaled_dot_product_attention`on Neoverse™ V1.

`batch_size = 1, num_attention_heads = 64, sequence_length = 512, attention_head_size = 128`
 `threads=16`
<img width="319" alt="Screenshot 2024-12-20 at 16 23 22" src="https://github.com/user-attachments/assets/c863f97d-0761-4fb8-aa6c-fc67b22ac3f9" />

Script to benchmark & profile SDPA:

    import torch
    import torch.nn as nn
    import time
    import numpy as np
    from torch.profiler import profile, record_function, ProfilerActivity
    class SimpleAttentionModel(nn.Module):
        def __init__(self, query, key, value):
            super(SimpleAttentionModel, self).__init__()
            self.query = query
            self.key = key
            self.value = value

        def forward(self, attn_mask=None):
            torch.nn.functional.scaled_dot_product_attention(
                        self.query,
                        self.key,
                        self.value,
                        attn_mask=attn_mask)

    #batch_size = 1, num_attention_heads = 64, sequence_length = 512, hidden_size = 128
    def bench_sdpa(batch_size = 1, num_attention_heads = 64, sequence_length = 512, query_sequence_length = 128 , hidden_size=128, precision=torch.float32):
        with torch.no_grad():
            attention_head_size = int(hidden_size / num_attention_heads)
            query = torch.rand(size=(batch_size, num_attention_heads, query_sequence_length, attention_head_size), dtype=precision)
            key = torch.rand(size=(batch_size, num_attention_heads, sequence_length, attention_head_size), dtype=precision)
            value = torch.rand(size=(batch_size, num_attention_heads, sequence_length, attention_head_size), dtype=precision)

            model = SimpleAttentionModel(query, key, value)
            model.eval()
            for _ in range(10):
                model()
            times = []
            n_iters = 100
            for _ in range(n_iters):
                s = time.time_ns()
                model()
                times.append((time.time_ns() - s) / 1e3)
            min_times = np.min(times)
            mean_times = np.mean(times)
            print(f"Min Times = {min_times} us")
            print(f"Mean Times = {mean_times} us")
            print("Times = ", times)

    print("BF16 mode:")
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_inference"):
            bench_sdpa(precision=torch.bfloat16)
    profile_data = prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total")
    print(profile_data)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140159
Approved by: https://github.com/jgong5, https://github.com/malfet, https://github.com/nikhil-arm, https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/cfRod, https://github.com/fadara01
2025-05-15 17:21:18 +00:00
236b08cbf8 Revert "[ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)"
This reverts commit 4863e5c843722eb2a34fb0ca1d518a33431a38c0.

Reverted https://github.com/pytorch/pytorch/pull/153300 on behalf of https://github.com/malfet due to Looks like it breaks rocm, see fa8543454a/1 ([comment](https://github.com/pytorch/pytorch/pull/153300#issuecomment-2884489459))
2025-05-15 16:58:52 +00:00
2327c9eedc Revert "[ca][dtensor] run real PG dtensor tests under CA (#152689)"
This reverts commit b297e01f4b1f43ffd1769313f077a2a68928f012.

Reverted https://github.com/pytorch/pytorch/pull/152689 on behalf of https://github.com/malfet due to Looks like it breaks rocm, see fa8543454a/1 ([comment](https://github.com/pytorch/pytorch/pull/153300#issuecomment-2884489459))
2025-05-15 16:58:51 +00:00
db26aeaec2 [MPSInductor] Support numpy scalars handling (#153598)
By default, numpy computes results in float64, but when a numpy scalar is passed as an argument to an MPS function it must be implicitly converted to float32. This pattern naturally occurs in some networks, for example in speech_transformer.
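
A minimal illustration of the pattern (assumes an MPS-capable machine; the function and constant are made up):

```py
import numpy as np
import torch

@torch.compile
def f(x):
    return x * np.exp(0.5)   # np.exp returns a numpy float64 scalar

out = f(torch.ones(4, device="mps"))   # the scalar must be narrowed to float32 for the Metal kernel
print(out.dtype)                        # torch.float32
```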

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153598
Approved by: https://github.com/cyyever, https://github.com/dcci
ghstack dependencies: #153582
2025-05-15 16:48:25 +00:00
0cb48633d9 [ez][CI] Add linux aarch64 to upload test stats, change format of trigger for upload test stats (#153505)
Change from inline list to yml list
Add linux aarch64 to the list of triggering workflows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153505
Approved by: https://github.com/Skylion007
2025-05-15 15:33:59 +00:00
fa8543454a [dynamo][torch-function] Prevent unnecessary __torch_function__ tracing (#153551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153551
Approved by: https://github.com/mlazos
2025-05-15 14:06:17 +00:00
4f4ecc583e [BE]: Enable RUFF TRY400 rule - log.exception (#153473)
Change logging.error to logging.exception to log additional information when relevant. A few logging.error calls have slipped into try/except blocks since I last did a cleanup here, and the rule is now stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack-trace handling for RPC and distributed calls, and tried to fix a few errors based on whether we immediately re-raised the exception or failed to print exception information where it could be useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-15 13:36:59 +00:00
7482eb217c [Inductor-CPU] Faster int8 WoQ GEMM for small M with explicit prefetching and different outer loops (#149373)
### Summary

Fixes #148494

Explicitly prefetch the cache lines of the next `B` block to accelerate int8 WoQ (BF16 activation, int8 statically quantized weights) GEMM for small `M` dimension.

Some of this code (outer loops of the GEMM) is being ported over from Intel Extension for PyTorch. The macro-kernel* and the micro-kernel* are essentially the same, but optionally prefetch a block of B. Templatization is used to avoid branching that could cause a slowdown due to unnecessary prefetching.

\* - in [BLIS](https://dl.acm.org/doi/10.1145/2764454) parlance

### Performance data with BS 1

Machine: 32 cores of one socket of a Intel Xeon SP Gen 5 machine

| Model | input tokens | output tokens | next-token latency before this PR | Next-token latency after this change | Speedup |
|-----------|-------------|-----------------|--------------------------------------|------------------------------------------|-----------|
|GPT-J | 128 | 128 | 42 ms | 38 ms | 9.52 % |
| GPT-J | 1024 | 1024 | 48 ms | 45 ms | 6.25 % |
|LLaMA 3.1 8B Instruct | 128 | 128 | 52 ms | 47 ms|  9.61% |
|LLaMA 3.1 8B Instruct | 1024 | 1024 | 57 ms | 53 ms|  7.01% |

While the input shapes of the GEMMs corresponding to linear layers for next-token computation remain the same across different numbers of input & output tokens, the difference in next-token latency in those cases is due to attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149373
Approved by: https://github.com/leslie-fang-intel, https://github.com/Xia-Weiwen

Co-authored-by: Xia Weiwen <xia.weiwen@hotmail.com>
2025-05-15 11:55:58 +00:00
cyy
e5e06d9cab [submodule] Update kleidiai to v1.8.0 (#153592)
And cleans up some CMake instructions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153592
Approved by: https://github.com/malfet
2025-05-15 10:14:05 +00:00
22b124335e [BE] Update .pyi stub template to use Generic TypeAlias (PEP 585) and Union Type (PEP 604) (#150728)
https://github.com/pytorch/pytorch/pull/129001#discussion_r1645126801 is the motivation for the whole stack of PRs. In `torch/__init__.py`, `torch._C.Type` shadows `from typing import Type`, and there is no type stub for `torch._C.Type` in `torch/_C/__init__.pyi`. So we need to use `from typing import Type as _Type`. After enabling [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585) in the `.pyi` type stub files, we can use `type` instead of `typing.Type` or `from typing import Type as _Type`.

------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.
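
A small before/after illustration of these two PEPs in a stub-style signature (the names are made up):

```py
# Before: typing-module generics and Union/Optional
from typing import Dict, List, Optional, Union
def lookup(keys: List[str], default: Optional[int]) -> Dict[str, Union[int, float]]: ...

# After: PEP 585 builtin generics + PEP 604 unions (no __future__ import needed in .pyi stubs)
def lookup(keys: list[str], default: int | None) -> dict[str, int | float]: ...
```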

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150728
Approved by: https://github.com/cyyever, https://github.com/aorenste
ghstack dependencies: #150726, #150727
2025-05-15 09:36:42 +00:00
f7a5aa1d8d [torchgen] Refactor and simplify gen_pyi.py to use Generic TypeAlias (PEP 585) and Union Type (PEP 604) (#150727)
https://github.com/pytorch/pytorch/pull/129001#discussion_r1645126801 is the motivation for the whole stack of PRs. In `torch/__init__.py`, `torch._C.Type` shadows `from typing import Type`, and there is no type stub for `torch._C.Type` in `torch/_C/__init__.pyi`. So we need to use `from typing import Type as _Type`. After enabling [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585) in the `.pyi` type stub files, we can use `type` instead of `typing.Type` or `from typing import Type as _Type`.

------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150727
Approved by: https://github.com/aorenste
ghstack dependencies: #150726
2025-05-15 09:36:42 +00:00
129a2976a8 [ROCm] Improvements to non-vectorized elementwise kernels (#153184)
* Unroll loops manually to hide memory access latency

Co-authors: @akadutta @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153184
Approved by: https://github.com/jeffdaily
2025-05-15 09:14:43 +00:00
6e107899da [Torch] Fix crash when comparing fp8 tensors that have more than 1 dimension (#153508)
Summary: `torch.nonzero` returns as many index values per match as the tensor has dimensions, so we shouldn't expect a single element for the indices.

Test Plan: CI

Differential Revision: D74539233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153508
Approved by: https://github.com/exclamaforte
2025-05-15 08:41:46 +00:00
b297e01f4b [ca][dtensor] run real PG dtensor tests under CA (#152689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152689
Approved by: https://github.com/bdhirsh
ghstack dependencies: #153300
2025-05-15 08:10:35 +00:00
4863e5c843 [ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)
I slap `disable` on the recomputation hook; otherwise the partitioner may save fewer/more activations and mismatch the expected eager count in checkpoint. See the code comment `Note: [compiled autograd and checkpoint unpack hook]`.

This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.

This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region, and when the backward executes the unpack hooks, dynamo tries to trace them. This messes up the internal state tracking of checkpointing, with some tests raising `_StopRecomputationError` and others raising the same count-mismatch error as CA.

FIXES https://github.com/pytorch/pytorch/issues/127115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
2025-05-15 08:10:35 +00:00
71027b13b2 Revert "[FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)"
This reverts commit 881a598a1e38ef06d4f51d1e3fd8e359fed0c3a0.

Reverted https://github.com/pytorch/pytorch/pull/153357 on behalf of https://github.com/jeanschmidt due to Might have introduced regressions in rocm testing for main: https://github.com/pytorch/pytorch/actions/runs/15035410497/job/42257000513 feel free to re-merge if this was a mistake ([comment](https://github.com/pytorch/pytorch/pull/153357#issuecomment-2882915691))
2025-05-15 07:58:27 +00:00
004dad48f7 Allow to set custom PYTHONPATH for torch.inductor (#152832)
When using Bazel, it’s common to encounter issues like [this](https://github.com/bazelbuild/bazel/issues/14640) and [this](https://github.com/bazel-contrib/rules_python/issues/792) where the `PYTHONPATH` environment variable becomes too long and results in an error such as: `OSError: [Errno 7] Argument list too long` . To work around this, users often resort to custom logic to manipulate PYTHONPATH.

Currently, PyTorch Inductor constructs the PYTHONPATH for a subprocess using sys.path, which can lead to this issue in certain environments.

This PR introduces support for a new environment variable, `TORCH_CUSTOM_PYTHONPATH`, allowing users to override the default `PYTHONPATH` passed to the subprocess. This provides a clean way to avoid an exception when using PyTorch in Bazel.

Please let me know if I need to add some documentation to support this PR. I haven't found an open issue specific to this change, but I'm confident that this change (or a similar one) would be appreciated by a few.
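
A minimal usage sketch of the override (the path below is illustrative; only the variable name comes from this change):

```python
import os

# Point inductor's subprocesses at a short, curated path instead of the
# very long PYTHONPATH that Bazel would otherwise assemble from sys.path.
os.environ["TORCH_CUSTOM_PYTHONPATH"] = "/opt/my_env/site-packages"

import torch

@torch.compile
def f(x):
    return x * 2 + 1

f(torch.randn(8))  # inductor subprocesses now receive the short PYTHONPATH above
```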

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152832
Approved by: https://github.com/masnesral
2025-05-15 06:35:41 +00:00
55784be01b [Quant][X86] add ops to compute uint8 pointwise add/add_relu (#152411)
**Summary**
This PR adds two new ops, `onednn.qadd.tensor` and `onednn.qadd_relu.tensor`, for int8 elementwise add; they accept inputs on the CPU device (instead of QuantizedCPU).
The new ops are implemented with AVX512 instructions and provide similar or better performance, depending on shape, than their counterparts for the QuantizedCPU device, `quantized.add` and `quantized.add_relu`.
The new ops also support output dtypes other than uint8 (fp32, fp16 and bf16 are supported).

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_add_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152411
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-05-15 06:23:01 +00:00
a762dd1f67 [Memento] On-demand mode using without torch api (#153171)
Summary:
CUDA Post: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/2020094788475989/

# Context
In this diff, we want to enable the on-demand mode of memory snapshot to allow users to trace any remote process via the dyno command line.

# Design decision

**How do we send on-demand signal to remote process**
We leverage the dyno-Kineto approach.
Since dyno runs on every machine at Meta, it can send a request to the remote machine to start Kineto.
Kineto will start another thread for memoryProfiler (https://fburl.com/code/dxsmmrok)

**Why we use a different approach than CUDA**

On the CUDA side, we use pybind to load the torch module and invoke the Python API to start/stop the profiling. However, this requires us to compile the whole torch binary into the predictor, which is not recommended by runtime (andruwang).

Thus, we decided to use the C++ API directly to avoid unnecessary dependencies.

**Why the snapshot is saved directly as a JSON string instead of pickle**
Pickle is primarily designed for use with Python and is not well supported in C++. It is also hard for users to download the snapshot file and open it locally.
Due to the dependency issue, it is hard to import the gzip/pickle library to decode the data, so let's use JSON for now. I will work on the visualizer to speed up rendering and support other formats later.

**Plan**:
* For now, we will encode the file as gzip for MTIA on-demand only and update the visualizer to support both types.
* Update auto-trace and CUDA side to encode in gzip as well
* Fully remove pickle dependency.

Test Plan:
# Remote cogwheel test
Servicelab: https://fburl.com/servicelab/pckux7a3
snapshot file manifold: https://fburl.com/manifold/fnotk18c
snapshot file in pastry: P1805522232

Visualization on D74399684
 {F1977786422}

# Local Predictor Test
url: https://fburl.com/pytorch_memory_visualizer/y06kskkm

 {F1977787329}

Differential Revision: D74179606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153171
Approved by: https://github.com/sraikund16
2025-05-15 06:07:04 +00:00
181bfabb9e fix set_logs for a single child log file (#153580)
Tested via

```
+        import logging
+        torch._logging.set_logs(modules={"torch._functorch._aot_autograd.autograd_cache": logging.DEBUG})
```

```
python test/dynamo/test_aot_autograd_cache.py -k test_multi_graph_specialization
```
and verifying logs are printed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153580
Approved by: https://github.com/ColinPeppler
2025-05-15 05:58:45 +00:00
9839ec1383 [dynamo][compile-time] Cache method on load builtin (#153524)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153524
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #153522
2025-05-15 05:54:15 +00:00
b47be23461 [dynamo][compile-time] Faster inspect getattr_static for torch.Tensor (#153522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153522
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-05-15 05:54:15 +00:00
910d2f96af [cutlass backend] forward fix cutlass backend A100 test (#153428)
Forward fix of https://github.com/pytorch/pytorch/pull/153006, which broke a test.

In the long run, we should get rid of CUDATemplateCaller.category.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153428
Approved by: https://github.com/ColinPeppler
2025-05-15 05:45:38 +00:00
0ca91af6b8 Define USE_C10D_XCCL and USE_XCCL in pytorch (#147593)
### Motivation:

Add `USE_XCCL` and `USE_C10D_XCCL` to enable support of XCCL backend building in stock PyTorch, similar to `USE_NCCL` and `USE_C10D_NCCL`.
 By default, `USE_XCCL` is OFF and can be set to ON explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147593
Approved by: https://github.com/guangyey, https://github.com/malfet, https://github.com/albanD, https://github.com/cyyever
2025-05-15 05:39:00 +00:00
ebd3268538 Removed duplicate patterns from gitignore (#153515)
Removed duplicate patterns from gitignore. These patterns are duplicated verbatim on lines 148-169.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153515
Approved by: https://github.com/soulitzer
2025-05-15 05:38:42 +00:00
b992a665d1 Fix AsyncMM not compiled with SM90a issue (#153519)
The CMakeLists.txt is wrong and doesn't enable SM90a for AsyncMM.cu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153519
Approved by: https://github.com/drisspg, https://github.com/ngimel, https://github.com/cyyever
2025-05-15 05:23:29 +00:00
d5ddc5ab20 [MPS] Fix float64 scalar tensor handling (#153582)
The current implementation causes a silent correctness problem with torch.compile when someone tries to `torch.compile` a function where one of the arguments is, say, `np.exp(.3)`, which will be represented as a torch.float64 scalar tensor.

Add a regression test for this behavior.
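
A minimal repro sketch of the pattern described above (assuming an MPS-capable machine):

```python
import numpy as np
import torch

@torch.compile
def f(x, scale):
    return x * scale

if torch.backends.mps.is_available():
    x = torch.ones(4, device="mps")
    # np.exp(0.3) arrives as a torch.float64 scalar tensor inside the compiled function
    print(f(x, np.exp(0.3)))
```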
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153582
Approved by: https://github.com/dcci
2025-05-15 05:15:14 +00:00
3e8bda4ad5 [pytorch][triton] flex attention fwd kernel with TMA loads (#151923) (#152460)
Summary:

Device side TMA for flex_attention fwd kernel, Q K V tensors

Test Plan:
Unit test:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:flex_attention -- test_tma_with_customer_kernel_options
```
https://www.internalfb.com/intern/testinfra/testrun/14355223891618726

Differential Revision: D71082691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152460
Approved by: https://github.com/drisspg
2025-05-15 04:49:32 +00:00
756fd80734 [BE] Improve the typing related to model input argument of torch.compile() (#153559)
Summary: Match the `overload` typing with the original typing in function definition and adjust the corresponding comments.

Test Plan: contbuild & OSS CI

Differential Revision: D74746243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153559
Approved by: https://github.com/Skylion007
2025-05-15 04:49:26 +00:00
d2f6c6df1d unbreak fb:operator_benchmark_test (#152049)
Summary: unbreak fb:operator_benchmark_test

Test Plan: works on my machine

Differential Revision: D73540912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152049
Approved by: https://github.com/hl475
2025-05-15 03:38:48 +00:00
014726d9d3 [torchgen] Refactor torchgen.utils.FileManager to accept pathlib.Path (#150726)
This PR allows `FileManager` to accept `pathlib.Path` as arguments while keeping the original `str` path support.

This allows us to simplify the code such as:

1. `os.path.join(..., ...)` with the `/` operator (`Path.__truediv__(..., ...)`).

95a5958db4/torchgen/utils.py (L155)

95a5958db4/torchgen/utils.py (L176)

2. `os.path.basename(...)` with `Path(...).name`.
 95a5958db4/torchgen/utils.py (L161)

3. Manual file extension split with `Path(...).with_stem(new_stem)`

95a5958db4/torchgen/utils.py (L241-L256)

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150726
Approved by: https://github.com/aorenste
2025-05-15 02:52:24 +00:00
881a598a1e [FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)
Fixes #147336

## Context

NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.

Bringing this to the attention of triton developer @davidberard98 he identified the memory layout of the tensor in HBM to be causing non-pipelined loads into SRAM, causing the slowdown.

To summarize:

In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation.

This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403))).

i.e., when loading 4 bytes of contiguous data from a tensor stored in row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new location in the col-major layout in SRAM. Thus the load is not a candidate for pipelining w/ cp.async and just moves data to registers then performs a series of single byte stores.
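
A minimal sketch of the layout idea (not the PR's actual code): make V column-major in its last two dimensions so the `tl.load` feeding the FP8 GEMM can be pipelined.

```python
import torch

if torch.cuda.is_available():
    B, H, S, D = 4, 8, 1024, 64
    v = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
    print(v.stride())  # row-major: stride of the last (head) dim is 1

    # Materialize V as column-major in its last two dims (stride 1 along the sequence dim).
    v_col = v.transpose(-2, -1).contiguous().transpose(-2, -1)
    print(v_col.stride())  # (..., 1, S): contiguous along the sequence dim
```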

## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the necessary memory layout to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs

## Benchmarks
Rerunning the repro we see fp8 runtime is reduced from 120% of bf16 to 76% of bf16 runtime.

Before fix:

```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```

After fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-05-15 02:41:38 +00:00
eaf2dee10e don't run triton mm for k<32 (#153550)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153550
Approved by: https://github.com/suo

Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
2025-05-15 02:36:44 +00:00
725bbb6b5f [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

[inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) now extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.
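
A hedged sketch of what a generated size/stride assertion with the op name might look like (the exact call emitted by Inductor may differ):

```python
import torch
from torch._C._dynamo.guards import assert_size_stride

buf0 = torch.empty_strided((8, 16), (16, 1))
# The trailing op name is the new optional argument; it is echoed in the
# error message when the size/stride check fails.
assert_size_stride(buf0, (8, 16), (16, 1), "aten.mm.default")
```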

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel
2025-05-15 02:33:57 +00:00
f5e0806f34 [cutlass backend] Add back descriptive names for epilogue fusion (#153405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153405
Approved by: https://github.com/mlazos
2025-05-15 01:47:52 +00:00
82dc3457e0 Add load_state_dict hint doc about invoke order work with lr_scheduler (#149942)
Fixes #119168

## Test Result

![image](https://github.com/user-attachments/assets/edb8124c-f103-475a-b903-20fbc71fdea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149942
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-05-15 01:07:36 +00:00
cyy
781ba0ac9d Update CMake to 3.27 in Windows CI (#153380)
This makes it possible to use newer CMake features.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153380
Approved by: https://github.com/albanD
2025-05-15 00:19:32 +00:00
c2bc7e2827 API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.0+ (#150536)
Change the bool to an int to express split_k_mode. Before 0.7.0 there were only two `cusparseLtSplitKMode_t` enum values, ONE_KERNEL and TWO_KERNELS, so a boolean was enough, but since 0.7.0 there are more.

For Blackwell, there has to be a minor change to the parameter split_k_one_kernel (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp#L103), since new values were introduced to the enum [cusparseLtSplitKMode_t](https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t) and a bool type is no longer enough for it (it has to be replaced with an integer).

Error we see without the change
```
RuntimeError: CUDA error: invalid value when calling `cusparseLtMatmulAlgSetAttribute( &handle, &alg_sel, CUSPARSELT_MATMUL_SPLIT_K_MODE, &splitKMode, sizeof(splitKMode))`

To execute this test, run the following from the base repo dir:
    python test/test_sparse_semi_structured.py TestSparseSemiStructuredCUSPARSELTCUDA.test_csrc_cslt_sparse_mm_search_cuda_int8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150536
Approved by: https://github.com/jcaip, https://github.com/atalman
2025-05-14 23:36:53 +00:00
72fee137dd [ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151727
Approved by: https://github.com/seemethere

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-05-14 22:34:55 +00:00
e0dece510b [Ez][BE]: Remove accidental classvar (#153540)
Untyped variables become ClassVar in dataclasses; this type alias should just be a type alias, with no need for it to be a ClassVar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153540
Approved by: https://github.com/albanD, https://github.com/aorenste
2025-05-14 21:55:56 +00:00
7412b33e91 [inductor] Use get to avoid possible keyerror at the end of precompilation (#153417)
Shameful admission: I have encountered this error 1-2 times, but don't have a repro.

torch/_inductor/select_algorithm.py", line 2022, in wait_on_futures
    elapsed_times[future],
    ~~~~~~~~~~~~~^^^^^^^^
torch._inductor.exc.InductorError: KeyError: <Future at 0x7fc4e394fb90 state=finished returned tuple>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153417
Approved by: https://github.com/Skylion007, https://github.com/ColinPeppler
2025-05-14 21:49:43 +00:00
f2e8e41855 [Easy][Inductor] Adds safety checks in get_estimated_runtime (#152821)
This PR adds checks on `gpu_memory_bandwidth` and `gpu_flops` in `get_estimated_runtime`. This will prevent division by zero and other potential incorrect values:
9210a98b92/torch/_inductor/scheduler.py (L864-L865)

9210a98b92/torch/_inductor/scheduler.py (L874)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152821
Approved by: https://github.com/eellison, https://github.com/jansel
2025-05-14 21:46:59 +00:00
f887bfffda Fix typo (#153561)
Fix typo from #153386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153561
Approved by: https://github.com/albanD
2025-05-14 21:38:51 +00:00
03d01860fd [dynamo][compile-time] Compute logging related flags once (#153426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153426
Approved by: https://github.com/jansel
2025-05-14 21:19:06 +00:00
1bd6bc7190 [BE]: Enable ruff YTT linter for Python version checks (#153547)
Adds ruff YTT checks to help future proof version checks and follow best practices here. Also makes it easier for static linters like mypy to detect python version branching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153547
Approved by: https://github.com/albanD
2025-05-14 21:09:16 +00:00
f363a3f51a Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 9386701b51aadce951bf38daf497b0257a3f2211.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see [D74729259](https://www.internalfb.com/diff/D74729259). @drisspg may you help out the author have their PR merged? ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-2881546951))
2025-05-14 20:53:49 +00:00
c92ea3bc98 [BE] Upgrade XPU support package to 2025.1 in CICD (#151899)
Addresses #151097, including the changes below:

- Add XPU support package 2025.1 build and test in CI for both Linux and Windows
- Keep XPU support package 2025.0 build in CI to ensure no break issue until PyTorch 2.8 release
- Upgrade XPU support package from 2025.0 to 2025.1 in CD for both Linux and Windows
- Enable XCCL in Linux CD wheels and oneMKL integration in both Linux and Windows
- Update XPU runtime pypi packages of CD wheels
- Remove deprecated support package version docker image build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151899
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-05-14 20:21:09 +00:00
5e6e52e7c9 [JIT] add GRAPH_DEBUG for setGraphExecutorOptimize (#153549)
Summary: Optionally log when setGraphExecutorOptimize is called, so we can get insight into the GraphExecutor behavior.

Differential Revision: D74692508

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153549
Approved by: https://github.com/PaulZhang12, https://github.com/SamGinzburg
2025-05-14 20:07:25 +00:00
dda2c7c8fc Pass inductor config for static cuda launcher to workers (#153382)
Async compile workers generally don't respect inductor configs that are changed in the middle of execution, because the workers warm up early. StaticCudaLauncher is especially susceptible to this because it affects triton compilation without being part of the inductor meta. So we'll pass it in via extra configs on each worker run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153382
Approved by: https://github.com/masnesral, https://github.com/jansel
2025-05-14 20:01:32 +00:00
6a28cc826f Add TEST_HPU flag to set device type (#153461)
MOTIVATION
This PR includes a minor change to also check for the TEST_HPU flag before falling back to CPU. Without this flag, some tests were falling back to CPU, causing them to fail.
Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66

CHANGES
- Add the TEST_HPU flag to some of the conditions checking the environment.
- Use the DEVICE_COUNT variable instead of the torch.accelerator.device_count() API, since the latter is not supported on out-of-tree devices like Intel Gaudi.
@ankurneog , @EikanWang , @cyyever , @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153461
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/albanD
2025-05-14 19:31:40 +00:00
a54bf43baa Fix support of MixtureSameFamily [bugfix]. (#151317)
Fixes https://github.com/pyro-ppl/pyro/issues/3419 which is actually a `torch` bug that can be replicated by the below code:

```
from torch import rand
from torch.distributions import MixtureSameFamily, Categorical, Binomial

max_count = 20
probs = rand(10, 5)
binom_probs = rand(10, 5)

d = MixtureSameFamily(Categorical(probs=probs), Binomial(max_count, binom_probs))
d.log_prob(d.sample())
```

which results in:

```
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    d.log_prob(d.sample())
  File "pytorch\torch\distributions\mixture_same_family.py", line 168, in log_prob
    self._validate_sample(x)
  File "pytorch\torch\distributions\distribution.py", line 315, in _validate_sample
    valid = support.check(value)
            ^^^^^^^^^^^^^^^^^^^^
  File "pytorch\torch\distributions\constraints.py", line 307, in check
    (value % 1 == 0) & (self.lower_bound <= value) & (value <= self.upper_bound)
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (10) must match the size of tensor b (5) at non-singleton dimension 1
```

### Fix explanation (only for cases when the component distribution contains parameters with batch dimensions)

- The failure is due to sample validation taking place before padding in `MixtureSameFamily.log_prob`, and hence the fix is to pad before doing sample validation.
- The fix itself does not alter the calculations at all. It only affects the sample validation process.
- The failure does not occur with the component distribution set to the `Normal` distribution, as its validation is not defined elementwise (the validation itself is elementwise).
- I've split the `test_mixture_same_family_log_prob` test into two tests based on the `Normal` and `Binomial` distributions.
- Initially, the `Binomial` version of the test did not fail, but this was due to the component distribution having equal batch dimensions of (5, 5) so I changed it to (10, 5).

### Updated fix explanation (for all cases)

- The previous fix caused a bug in sample shape validation (which is done correctly) due to the padding taking place before the sample validation.
- The updated fix corrects the support to reflect the fact that the support of `MixtureSameFamily` is equal to the support of its component distribution with the first event dimension removed.
- This issue was already anticipated in the [code](331423e5c2/torch/distributions/mixture_same_family.py (L127)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151317
Approved by: https://github.com/albanD, https://github.com/fritzo
2025-05-14 19:24:36 +00:00
clr
534b66fe30 torch.compile: Remove reference to the unused dynamo_config.dynamic_shapes from tests (#153297)

This config option is not set anywhere, and does nothing, so this should cause
no changes to tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153297
Approved by: https://github.com/Skylion007
2025-05-14 19:02:51 +00:00
bf0fe4f828 Revert "[CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)"
This reverts commit ced90d23d3dfff42379fa032fe6a83b764d12e9f.

Reverted https://github.com/pytorch/pytorch/pull/153101 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages on main, tentative revert: https://github.com/pytorch/pytorch/actions/runs/15024667248/job/42224521705 ([comment](https://github.com/pytorch/pytorch/pull/153101#issuecomment-2881208171))
2025-05-14 18:52:07 +00:00
8749fe8439 [CI][MPS] Speedup test_large_bmm (#153562)
By computing matmuls of only one random non-zero batch on CPU

This reduces test runtime from 11 minutes to 14 sec
```
 % python3 test/test_mps.py -v -k test_large_bmm_
test_large_bmm_bfloat16 (__main__.TestMPS.test_large_bmm_bfloat16) ... ok
test_large_bmm_float16 (__main__.TestMPS.test_large_bmm_float16) ... ok

----------------------------------------------------------------------
Ran 2 tests in 27.495s

```

TODO: Compute it over two slices when https://github.com/pytorch/pytorch/issues/153560 is fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153562
Approved by: https://github.com/Skylion007, https://github.com/clee2000
2025-05-14 18:49:42 +00:00
47d6feff7c [export] Support no inputs in unflattened module (#153474)
Encountered in this diff D74589491
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153474
Approved by: https://github.com/avikchaudhuri
2025-05-14 18:45:47 +00:00
6ef1cbc191 Revert "[ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)"
This reverts commit e6a90672601ad3d636145dd8a68952281a6d1199.

Reverted https://github.com/pytorch/pytorch/pull/151727 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal builds, @seemethere may you help the author? [D74729252](https://www.internalfb.com/diff/D74729252) ([comment](https://github.com/pytorch/pytorch/pull/151727#issuecomment-2881122917))
2025-05-14 18:18:17 +00:00
533fc58453 [BE]: Fix typing None override other optimizers (#153386)
Follow up to #153367 to fix other instances of it throughout the codebase

Also fully type NamedOptimizer since we were so close

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153386
Approved by: https://github.com/tsunghsienlee, https://github.com/janeyx99, https://github.com/jansel, https://github.com/cyyever
2025-05-14 17:48:47 +00:00
2362bd4a4c [Torch][NT] Fix NestedTensor contiguous check condition. (#153237) (#153529)
Fixes #153237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153529
Approved by: https://github.com/jbschlosser
2025-05-14 17:15:48 +00:00
8bb67700a3 [dynamo] Support delattr on result of torch.compile(module) (#152741)
This is essentially a follow-up to #122098, where we added support for `getattr` and `setattr` on the result of `torch.compile(module)`, but didn't add support for `delattr`.

Fixes #150711.
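
A minimal sketch of the now-supported pattern:

```python
import torch

mod = torch.nn.Linear(4, 4)
opt_mod = torch.compile(mod)

opt_mod.some_flag = True         # setattr forwarding was added in #122098
delattr(opt_mod, "some_flag")    # delattr forwarding is what this PR adds
print(hasattr(opt_mod, "some_flag"))  # False
```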

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152741
Approved by: https://github.com/anijain2305
ghstack dependencies: #152740
2025-05-14 17:03:59 +00:00
6765df052c [dynamo] Emit warning on global module hooks when calling using output of torch.compile(module) (#152740)
When we do `torch.compile(module)`, we eventually end up returning a new
`OptimizedModule` instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) for the compiled module.

`OptimizedModule` also inherits `nn.module.__call__`, and thus
has its own hook logic. This is useful for torchao, which injects module
forward hooks to run in eager for quantization purposes.

However, this might create unexpected behavior for global module hooks,
because `torch.compile(module)` causes the hook to fire one extra time
for `OptimizedModule`, when compared to eager.

To preserve BC, we simply emit a warning for this behavior, and let
users decide what to do. This is reasonable because the global module
hooks are documented to be used for debugging/profiling purposes only.

Fixes #149502
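
A minimal sketch of the double-firing scenario (with this change it emits a warning rather than silently diverging from eager):

```python
import torch

def global_hook(module, args, output):
    print("global forward hook fired for", type(module).__name__)

handle = torch.nn.modules.module.register_module_forward_hook(global_hook)

mod = torch.nn.Linear(4, 4)
opt_mod = torch.compile(mod)
opt_mod(torch.randn(2, 4))  # the hook fires for OptimizedModule and again for Linear
handle.remove()
```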

Differential Revision: [D74611716](https://our.internmc.facebook.com/intern/diff/D74611716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-05-14 17:03:59 +00:00
b3dea0c0dd Change aoti cpp tests to run serially within file (#152960)
Fixes #152674
https://github.com/pytorch/pytorch/issues/152889
https://github.com/pytorch/pytorch/issues/152888
https://github.com/pytorch/pytorch/issues/152891

`--dist=loadfile` ensures all tests in the same source file run in the same worker.

Tests like `FreeInactiveConstantBufferRuntimeConstantFoldingCuda` expect exclusive access to memory during test time to compute diffs (e.g., initMemory - updateMemory2 == DATASIZE).

With `-n 3`, tests run in separate processes, but CUDA device memory is shared — and cudaMemGetInfo() reads device-wide global state.

```
 python test/run_test.py --cpp --verbose -i cpp/test_aoti_inference -dist=loadfile
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152960
Approved by: https://github.com/desertfire, https://github.com/cyyever
2025-05-14 17:02:39 +00:00
ba70876407 Update lint_urls.sh (#153246)
Treat 403, 429 and 503 http errors as success.
Ignore non-verbal hostnames.
Kill child jobs immediately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153246
Approved by: https://github.com/malfet
2025-05-14 16:54:49 +00:00
b6b0080419 [DCP] Use multiprocess Pipes instead of Queues to improve communication contract with checkpointer process (#153488)
Summary:
### Diff Context
- The PR introduces Pipes for multiprocess comms with the checkpointer process.
- Pipes allow easier comms-contract management, thanks to the close() API and a catch-all failure signal when the background process is dead (e.g. after a segfault); see the sketch below.
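
A minimal illustration (not DCP's actual code) of the contract a Pipe provides: `close()` plus `EOFError` make a finished or dead checkpointer process observable instead of hanging on a queue.

```python
import multiprocessing as mp

def checkpointer(conn):
    req = conn.recv()
    conn.send(f"saved:{req}")
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=checkpointer, args=(child_conn,))
    proc.start()
    child_conn.close()              # drop the parent's copy of the child end
    parent_conn.send("step_100")
    print(parent_conn.recv())       # "saved:step_100"
    proc.join()
    try:
        parent_conn.recv()          # the other end is closed / the process is gone...
    except EOFError:
        print("checkpointer is gone")  # ...so we get an explicit EOF instead of hanging
```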

Test Plan: CI

Differential Revision: D74668559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153488
Approved by: https://github.com/saumishr
2025-05-14 16:47:43 +00:00
8799bffc34 [BE][Ez]: RUF200 - validate pyproject.toml metadata (#153543)
Since we have pyproject.toml metadata for [project] and [build-requires], let's turn on the linter rules which validate this optional metadata to make sure it's properly formatted and follows the correct schema for standard Python build tools.

Right now, incorrect metadata could silently error with how our CI is invoked or only provide warnings for invalid metadata. This check will help surface those errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153543
Approved by: https://github.com/albanD
2025-05-14 16:42:22 +00:00
7d39e73c57 Fix more URLs (#153277)
Or ignore them.
Found by running the lint_urls.sh script locally with https://github.com/pytorch/pytorch/pull/153246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153277
Approved by: https://github.com/malfet
2025-05-14 16:23:50 +00:00
de92296bbb [Intel GPU] undo broadcast on zero stride tensor for SDPA (#151976)
Fix https://github.com/pytorch/pytorch/issues/152290.

The model **hubert** uses aten::expand to build the attention mask by broadcasting. PyTorch uses strides[d]=0 to represent broadcast dimensions, which is not supported by oneDNN. This PR handles that scenario.
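
A minimal sketch of the zero-stride pattern in question:

```python
import torch

mask = torch.zeros(1, 1, 128, 128)
expanded = mask.expand(8, 16, 128, 128)   # aten::expand broadcasts without copying
print(expanded.stride())                  # (0, 0, 128, 1): zero strides mark broadcast dims
print(expanded.contiguous().stride())     # dense strides that oneDNN can consume
```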

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151976
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg
2025-05-14 16:09:03 +00:00
1f48bab377 Update torch-xpu-ops commit pin (#153445)
Update the torch-xpu-ops commit pin to [207105038963e5f9f012f1a0cfd3b9f57b2ab5b0](2071050389), which includes:

- Improve the accuracy of `upsample_bilinear2d_backward`
- Enhance the performance of `avg_pool2d`
- Update the implementation of scatter-gather and indexing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153445
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-05-14 15:34:47 +00:00
2e440e39a6 [nativert] Move Placement to pytorch core (#152953)
Summary:
Move Placement to pytorch core.

Using `torch::nativert::isSameDevice` explicitly in code to avoid confusion with the `isSameDevice` in torch namespace.

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:placement_test

./bin/test_nativert
```

OSS and internal CI

Differential Revision: D74190745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152953
Approved by: https://github.com/Skylion007, https://github.com/swolchok, https://github.com/zhxchen17, https://github.com/cyyever
2025-05-14 15:26:54 +00:00
eqy
ced90d23d3 [CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)
For #152816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101
Approved by: https://github.com/Skylion007
2025-05-14 15:22:47 +00:00
0ce941f994 [audio hash update] update the pinned audio hash (#153507)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153507
Approved by: https://github.com/pytorchbot
2025-05-14 15:16:35 +00:00
cd119ddd7c Add matching against hypothetical (new) ghstack pull-request trailer (#153528)
I would like to change ghstack to use a new trailer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153528
Approved by: https://github.com/malfet
2025-05-14 14:07:01 +00:00
8f3d7972ad [dynamo][compile-time] Cache the function signature to speedup inlining (#153396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153396
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #153333
2025-05-14 14:01:46 +00:00
2344eca5eb Revert "Fix skipIfXpu and skipIfHpu disables tests when used on class (#151315)"
This reverts commit ee096b89f63394b2c18826288783eef241f3959c.

Reverted https://github.com/pytorch/pytorch/pull/151315 on behalf of https://github.com/jeanschmidt due to Seems to have introduced internal regressions, see [D74668899](https://www.internalfb.com/diff/D74668899). @malfet may you help the author get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/151315#issuecomment-2880203323))
2025-05-14 13:15:03 +00:00
2c1912452d Revert "Rewrite autograd producer consumer stream sync logic (#151079)"
This reverts commit f78e4529a9d446deb77c6ac38184582f6ab9167a.

Reverted https://github.com/pytorch/pytorch/pull/151079 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in internal signals, see [D74648937](https://www.internalfb.com/diff/D74648937) ([comment](https://github.com/pytorch/pytorch/pull/151079#issuecomment-2880176879))
2025-05-14 13:07:12 +00:00
a628efd1e8 Revert "Enable accelerator to perform streaming backward (#153412)"
This reverts commit d5d26ce43641a19c3e36a751b59b7fa3825cea83.

Reverted https://github.com/pytorch/pytorch/pull/153412 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/151079 ([comment](https://github.com/pytorch/pytorch/pull/153412#issuecomment-2880169739))
2025-05-14 13:04:27 +00:00
e8f7a97e2e [Refactor] Explicilty spell out the namespace for device() function (#153248)
Summary: To prepare for the upcoming header-only file change. These files have been using a mixed style of at::device() and device(). Given that these .cpp files are not in the at namespace, it makes sense to spell out the namespace explicitly.

Differential Revision: [D74577412](https://our.internmc.facebook.com/intern/diff/D74577412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153248
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/janeyx99
2025-05-14 12:00:47 +00:00
0ef5ba43a6 Fix negative dim issue in for parallel loss context manager (#152785)
Facing a similar issue as in #152016; added a fix as per @tianyu-l's solution.
Fixes #152016

Tagging @tianyu-l @atalman for review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152785
Approved by: https://github.com/tianyu-l
2025-05-14 10:43:27 +00:00
864a5f4434 [dynamo][compile-time] Cache the cleaned instructions while inlining (#153333)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153333
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
2025-05-14 09:26:26 +00:00
0139ce9303 Add skip_dtype_check_in_meta_registrations config to torch/fx/experimental/_config (#153513)
Helion relies on torch/fx/experimental's fake-tensor tracing but does its own dtype checking, which conflicts with some meta kernels' existing dtype checks. This PR adds a config so that we can skip those dtype checks in meta kernels and rely on the calling system to do the dtype checking.

Currently it only applies to `baddbmm`, but I expect that similar changes will need to be done to other meta kernels in the future.
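
A hedged sketch of toggling the flag (the module path and flag name are as in the title; how the caller runs its tracing is out of scope here):

```python
import torch.fx.experimental._config as fx_config

# Let the caller (e.g. Helion) own dtype validation instead of the meta kernels.
fx_config.skip_dtype_check_in_meta_registrations = True
try:
    pass  # ... run fake-tensor tracing here ...
finally:
    fx_config.skip_dtype_check_in_meta_registrations = False
```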

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153513
Approved by: https://github.com/jansel
2025-05-14 09:14:11 +00:00
4015166e5d [ROCm] Maxpool backward NHWC Perf Improvement targeting Resnet scenarios (#152267)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152267
Approved by: https://github.com/jeffdaily
2025-05-14 06:59:29 +00:00
4c5cf18ee0 [device_mesh] improve device selection logic (#150897)
As titled, this PR improves the device selection logic when the user did not set the device before calling the DeviceMesh constructor. As a device manager, DeviceMesh should try to set the device for users in a sensible way.

The behavior of set_device before:

* If the user calls init_process_group to init a world process group, we assume the user already called set_device, so we don't set the device for them.
* If the user does not init a world process group themselves, we init one for them and follow a heuristic to set the device.
This is OK, but sometimes the set_device heuristic doesn't work well (i.e. if the user uses TORCH_CUDA_VISIBLE_DEVICES).

So this PR improves the device selection logic to the following (a rough sketch follows the list):

* If the default CUDA context is already initialized by the time we init DeviceMesh, we assume the user must have run some CUDA operation before and therefore has already selected the device themselves.
* If not, we check whether the env vars from the launcher (i.e. torchrun) include "LOCAL_RANK" and "WORLD_SIZE"; if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the TORCH_CUDA_VISIBLE_DEVICES issue.)
* If neither of the above applies, we warn users about the situation and fall back to the old heuristic.
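
```python
# A rough sketch (not the actual implementation) of the selection order above.
import os
import warnings

import torch

def _maybe_set_device() -> None:
    if torch.cuda.is_initialized():
        # The user already ran a CUDA op, so they have picked a device themselves.
        return
    if "LOCAL_RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Standard launcher (torchrun) convention.
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        return
    warnings.warn("Could not infer the device to use; falling back to the old heuristic.")
    # ... old heuristic ...
```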
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
2025-05-14 06:29:16 +00:00
0f891cad5a Enable ruff check for torch/utils/data/*.ipynb (#148654)
Fixes part of #146411

Enable ruff check for `torch/utils/data/*.ipynb` files

## Test Result

```bash
lintrunner -a --take RUFF torch/utils/data/*.ipynb
```

![image](https://github.com/user-attachments/assets/88fddc91-3f19-4704-9aef-2cabd2cdc96e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148654
Approved by: https://github.com/Skylion007
2025-05-14 06:21:47 +00:00
f7798d8645 Checks kv pair indexing in OrderedPreservingDictTest.test_range_insert (#148136)
`OrderedPreservingDictTest.test_range_insert` has an [unused loop variable `j`](https://github.com/pytorch/pytorch/blob/main/c10/test/util/ordered_preserving_dict_test.cpp#L186), I think taken from the [inspired project](https://github.com/pytorch/pytorch/blob/main/c10/test/util/ordered_preserving_dict_test.cpp#L165) testcase for range inserts, where it [checks kv pair indexing/order](https://github.com/Tessil/ordered-map/blob/master/tests/ordered_map_tests.cpp#L136) for the ordered dict.

This just adds in that functionality to the test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148136
Approved by: https://github.com/eellison
2025-05-14 06:05:23 +00:00
11c64b7cf8 [dynamo][compile-time] Cache whether a function is inlineable (#153192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153192
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #153458
2025-05-14 05:40:25 +00:00
e2ce17c6ef [SymmMem][a2av] Use more CTAs for intra-node case (#153509)
Previously, we launched the a2av kernel with at most 8 blocks for intra-node cases, which turns out to saturate only 57 GB/s of bandwidth.

This PR adds more blocks for the intra-node case, up to 8 per peer, pumping up data parallelism. The kernel now achieves 350 GB/s SOL on Hopper. See the figure.

It also uses simple size-based tuning to avoid jumping straight to 8 CTAs (i.e. 1, 2, 4, then 8); a sketch follows the figure.

For inter-node, we keep the cap at 8 blocks, since 57 GB/s already exceeds typical NIC bandwidths (400 Gb/s ≈ 50 GB/s).

![all_to_all_vdev Performance on 8xH100](https://github.com/user-attachments/assets/d4b841e6-4c42-4a2e-aa9f-2bc116ba9d25)
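
An illustrative sketch of the size-based CTA ramp (1, 2, 4, then 8 blocks per peer); the byte thresholds below are assumptions, not the kernel's actual values:

```python
def blocks_per_peer(nbytes_per_peer: int) -> int:
    # Grow the block count with the amount of data to move to each peer.
    for blocks, threshold in ((1, 1 << 20), (2, 4 << 20), (4, 16 << 20)):
        if nbytes_per_peer <= threshold:
            return blocks
    return 8  # intra-node cap of 8 CTAs per peer

print(blocks_per_peer(2 << 20))  # -> 2
```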

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153509
Approved by: https://github.com/ngimel
ghstack dependencies: #153483
2025-05-14 04:24:32 +00:00
20dbe644c7 [CD] Fix the libgomp twice load issue (#150084)
Fixes #149422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150084
Approved by: https://github.com/malfet, https://github.com/leslie-fang-intel, https://github.com/atalman

Co-authored-by: LifengWang <lifeng.a.wang@intel.com>
2025-05-14 04:06:18 +00:00
316c15297c [MemoryZ] Show the current and max entries rendered (#153446)
Summary: as title

Test Plan: {F1977904091}

Differential Revision: D74626081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153446
Approved by: https://github.com/sraikund16
2025-05-14 03:16:12 +00:00
c797f1285c [dynamo][copmile-time] Handle builtins first in LOAD_GLOBAL (#153458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153458
Approved by: https://github.com/jansel
2025-05-14 03:04:38 +00:00
33a5179269 [AOTI][reland2] Remove typedef for half and bfloat16 (#153467)
Summary:
Reland https://github.com/pytorch/pytorch/pull/151109 after fixing cutlass AOTI build issues.

typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the standalone AOTI codegen.

Differential Revision: D74398762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153467
Approved by: https://github.com/jingsh, https://github.com/henrylhtsang, https://github.com/cyyever
2025-05-14 02:37:18 +00:00
9ad9a04ca7 Add TensorLR variant for fused Adagrad on CPU (#153078)
This PR adds a tensor LR variant for the CPU Adagrad(fused=True).

I copied the behavior from the tensor LR variant of CPU Adam(fused=True), where `lr.item()` is cast to a double and passed into the default function.
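
A minimal usage sketch of the new path (tensor lr with the fused CPU Adagrad):

```python
import torch

model = torch.nn.Linear(8, 8)
opt = torch.optim.Adagrad(model.parameters(), lr=torch.tensor(0.01), fused=True)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
opt.step()
```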
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153078
Approved by: https://github.com/janeyx99
2025-05-14 02:23:33 +00:00
d51bc27378 [export] Make draft_export public (#153219)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153219
Approved by: https://github.com/pianpwk
2025-05-14 02:18:36 +00:00
b15b870903 [BE] remove outdated torch/README.md (#153500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153500
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-14 02:10:30 +00:00
d759a517af Update the heuristic for AArch64 bmm/baddbmm (#149122)
Updates heuristic for bmm/baddbmm and consolidates all heuristic logic in a single location

 - The goal of the consolidation is to improve maintainability and readability of the heuristic logic. Instead of different parts scattered across two files, this patch centralizes everything inside `Matmul.cpp`, where there already exists heuristic-based selection for mkldnn.
 - The logic of the check itself doesn't change (existing code is reused where possible), but a separate heuristic threshold for bmm/baddbmm is introduced based on newer benchmarking data. Use the script below to see the performance improvement for bmm from the new heuristic:
 ```
import torch
import time

# Set below to True to use cases selected by only one of the hueristics.
USE_ONLY_DIVERGENT_TEST_CASES = True
BATCH_SIZES  = [ 1, 8, 32, 64, 128, 256 ]
M_DIMS       = [ 4, 8, 16, 32, 64, 256, 512 ]
N_DIMS       = [ 4, 8, 16, 32, 64, 256, 512 ]
K_DIMS       = [ 4, 8, 16, 32, 64, 256, 512 ]
ITERS = 50

def old_heuristic(m, n, k):
    is_above_min_dims = m > 8 and n > 8 and k > 8
    is_above_min_size = m*n*k > 8_192
    return is_above_min_dims and is_above_min_size

def new_heuristic(b, m, n, k):
    return b*b*m*n*k >= 4_194_304

def generate_test_cases():
    test_cases = []
    for b in BATCH_SIZES:
        for m in M_DIMS:
            for n in N_DIMS:
                for k in K_DIMS:
                    if USE_ONLY_DIVERGENT_TEST_CASES:
                        if old_heuristic(m, n, k) != new_heuristic(b, m, n, k):
                            test_cases.append([b, m, n, k])
                    else:
                        test_cases.append([b, m, n, k])
    return test_cases

def test(x, y):
    for _ in range(5):
        torch.bmm(x, y)
    perf = 0.0
    for _ in range(ITERS):
        start = time.time()
        torch.bmm(x, y)
        end = time.time()
        perf += (end - start) / ITERS
    return perf

def main():
    print(f"{'b':<10}{'m':<10}{'n':<10}{'k':<10}{'time (s)':10}")
    cumulative_mean_time = 0.0
    for b, m, n, k in generate_test_cases():
        mean_time = test(torch.rand(b, m, n), torch.rand(b, n, k))
        cumulative_mean_time += mean_time
        print(f"{b:<10}{m:<10}{n:<10}{k:<10}{mean_time:10.3e}")
    print(f"Cumulative mean time = {cumulative_mean_time:.4f} s")

if __name__ == "__main__":
    main()
```

From the script we see that cumulative mean time from all test cases (at 16 threads) is:
 - 1.6195 s for the old heuristic
 - 0.7012 s for the new heuristic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149122
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
2025-05-14 02:03:50 +00:00
e8662e836a Remove std::is_arithmetic specialization from c10/util/strong_type.h (#153424)
Specializing std::is_arithmetic has undefined behavior (and breaks builds with -Winvalid-specialization). Should fix #150901

Differential Revision: [D74614724](https://our.internmc.facebook.com/intern/diff/D74614724/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153424
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-14 02:01:32 +00:00
clr
85f97b5a8c compile_fx: make a compile event that corresponds to the fx_compile waitcounter (#152983)
This is a pretty minor change, but by having exact correspondence, we can
easily confirm data differences between perfetto and wait counters

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152983
Approved by: https://github.com/jansel, https://github.com/masnesral
2025-05-14 01:54:42 +00:00
90001554bf [SymmMem][a2av] Fix TODO: change stride unit (#153483)
The previous kernel implementation assumed the float type. This PR makes it general by passing strides in units of bytes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153483
Approved by: https://github.com/fegin, https://github.com/ngimel
2025-05-14 01:47:54 +00:00
eqy
9386701b51 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg
2025-05-14 01:39:24 +00:00
8521a690f7 [dynamo] fix potential circular import error in decorators.py (#153217)
Differential Revision: [D74442043](https://our.internmc.facebook.com/intern/diff/D74442043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153217
Approved by: https://github.com/jansel
2025-05-14 01:01:57 +00:00
e6a9067260 [ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151727
Approved by: https://github.com/jeffdaily

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-05-14 00:58:00 +00:00
7f79222992 Upgrade to NCCL 2.26.5 for CUDA 12 (#152810)
Upgrade NCCL to latest 2.26.5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152810
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/cyyever
2025-05-14 00:52:50 +00:00
8739a8c288 elastic: do not shutdown rendezvous on leaving workers (#152525)
In #117066, a shutdown of the rendezvous was added when a worker shuts down. This is incorrect, because the rendezvous is actually shut down in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)) but should not be shut down if a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).

#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But this only triggers if the agent restarts the training, not if the rendezvous was already shut down before that.

Removing both of these changes restores the original behavior: the rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.

Fixes #150916
Fixes #147064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
2025-05-14 00:44:10 +00:00
8ac82c3e72 [export] support functools.partial forward (non-strict) (#153408)
Fixes #153086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153408
Approved by: https://github.com/tugsbayasgalan
2025-05-13 23:30:13 +00:00
40b719c97d [nativert] move executor config to torch (#153087)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff moves the executor config to torch. since it's header-only this requires some changes to the libtorch build configs

Test Plan: CI

Differential Revision: D74278789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153087
Approved by: https://github.com/zhxchen17
2025-05-13 23:26:00 +00:00
3498201e57 GPU lowering uses aoti_call_delegate (#153282)
Summary: Skip custom objects when serializing the weight nodes of `aoti_call_delegate` hop as they are not consumed by the runtime.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D73704385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153282
Approved by: https://github.com/dolpm, https://github.com/SherlockNoMad
2025-05-13 23:23:27 +00:00
81719ebde3 [caffe2] Make c10::str works with scoped enum (#152705) (#152714)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/152705

Test Plan:
```
buck2 test fbcode//caffe2/c10/test:util_base_tests --fail-fast
```

Differential Revision: D74087796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152714
Approved by: https://github.com/Skylion007
2025-05-13 21:05:36 +00:00
e8596c291b Fix misleadingly high AOT Inductor dashboard performance (#153060)
Fixes misleadingly high AOTInductor performance benchmark numbers in scenarios where a model updates internal parameters during `torch.export.export`. Since `FakeTensorMode` is enabled during export, all such parameters become `FakeTensor`s, substantially slowing down future eager-mode runs that use the model. This, in turn, produces misleading performance stats, where the slowness of eager mode makes `AOTInductor` look _very_ good.

An [example benchmark](https://hud.pytorch.org/benchmark/timm_models/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2030%20Apr%202025%2015%3A54%3A04%20GMT&stopTime=Wed%2C%2007%20May%202025%2015%3A54%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=1dd36ad2d440a4f3faf724b3a8e13925e3180c24&rBranch=main&rCommit=cc7346bf19c019255dcb4484694a75850ed74d5a&model=convit_base) with this issue. The equivalent `cpp_wrapper` benchmark run shows a 2x performance gain, not 20x.

Only two benchmarks we regularly run are affected by this, both in the TIMM set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153060
Approved by: https://github.com/desertfire
2025-05-13 20:59:59 +00:00
a13c8f2ecb [EZ/Profiler] Replace manual GIL calls with pybind GIL calls (#153415)
Summary: Use pybind11::gil_scoped_acquire instead of the old implementation, as it automatically takes care of error handling. In the original implementation we missed releasing the GIL on each possible error path, which could put the program in a deadlock.

Test Plan: Induced error manually and saw that GIL was released

Differential Revision: D74593564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153415
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-13 20:47:52 +00:00
5ff2cb8587 Add justknobs for static cuda launcher (#153400)
Summary:
This diff adds a justknobs check for static cuda launcher. In particular, it supports a fractional rollout where each mast job/version can be consistently enrolled in the config on or off.

It also adds a set_feature_use so we can track whether static cuda launcher is enabled on a given dynamo compile.

Test Plan: Existing unit tests. The justknobs in question are set to be disabled right now, so this diff does not launch the feature yet.

Differential Revision: D74599203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153400
Approved by: https://github.com/oulgen
2025-05-13 20:10:13 +00:00
clr
20ba8fe7e6 inductor: Log a pt2 compile event + waitcounter for node fusing. (#153270)
This appears to be slow in production (potentially a quadratic explosion), and
logging this explicitly in pt2_compile_events and wait_counters makes it a lot easier to see how
bad of an issue this is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153270
Approved by: https://github.com/masnesral
2025-05-13 19:02:36 +00:00
8ac82a1d20 [dynamo] Add test to ensure we don't print fx graph upon data dependent graph break (#153416)
This adds a regression test for #149831, also as part of getting it
cherry-picked into 2.7.1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153416
Approved by: https://github.com/atalman
2025-05-13 18:28:02 +00:00
9df9d9ded0 [device_mesh] replace dim_group_info with group_name (#150898)
As titled, there's no need to maintain a dim_group_info anymore; we can
simply maintain a list of group_names instead. This will simplify the
logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150898
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-05-13 17:16:45 +00:00
9c3cef437c gloo: support ibverbs in cmake (#153425)
This updates the gloo submodule in PyTorch to a version that supports the new ibverbs backend that can be used with PyTorch.

Test plan:

```
sudo dnf install rdma-core-devel
USE_GLOO_IBVERBS=ON python setup.py develop
torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
```

```py
"""
run with:

torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
"""

import os

os.environ["GLOO_DEVICE_TRANSPORT"] = "IBVERBS"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

if rank == 0:
    device = "cpu"
else:
    device = "cuda"

print(device)

t = torch.full((10, 100), fill_value=(rank+1), device=device)
target = torch.full((10, 100), fill_value=3, device=device)

dist.all_reduce(t)

torch.testing.assert_close(t, target)

t = torch.full((10, 100), fill_value=(rank+1), device=device)

if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
    torch.testing.assert_close(t, torch.full_like(t, 1))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153425
Approved by: https://github.com/fduwjj
2025-05-13 17:09:00 +00:00
dde705864a Fix test broken by D73809989 (#153413)
Summary: I forgot to remove this unused field in D73809989.

Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:fbonly -- --exact 'caffe2/test:fbonly - test_compilation_metrics_logger_in_sync (caffe2.test.fb.test_fb.TestFBOnly)'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153413
Approved by: https://github.com/c00w
2025-05-13 16:44:30 +00:00
216e28f7e9 [ca] run xfails up until their last passing backend (#153279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153279
Approved by: https://github.com/jansel
ghstack dependencies: #153193, #153222
2025-05-13 16:42:10 +00:00
a80eb84a5f [ca] support higher order gradients (create_graph=True) (#153222)
Adds create_graph support if you don't compile or compile only with torch.compile(backend="eager").

Using a backend that uses AOTDispatch produces a post-dispatch AOT backward, where its double backward will be silently incorrect if the forward trace involved any ops that are not composite implicit.
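
For illustration, a minimal sketch of the newly supported pattern, assuming the `torch._dynamo.config.compiled_autograd = True` toggle referenced in the companion commit below (the function and shapes are made up):
```python
import torch

torch._dynamo.config.compiled_autograd = True  # enable compiled autograd

@torch.compile(backend="eager")  # eager backend: create_graph=True is supported
def f(x):
    return (x.sin() ** 2).sum()

x = torch.randn(4, requires_grad=True)
y = f(x)
# first-order gradient, keeping the graph for a second backward
(g,) = torch.autograd.grad(y, x, create_graph=True)
# second-order gradient through the compiled backward
(h,) = torch.autograd.grad(g.sum(), x)
```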

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153222
Approved by: https://github.com/jansel
ghstack dependencies: #153193
2025-05-13 16:42:09 +00:00
37efaf4af9 [ca][api] config api shouldn't error with optimize_assert (#153193)
Toggling on `torch._dynamo.config.compiled_autograd = True` was erroring export (optimize_assert didn't have `rebuild_ctx` defined). Separately add a way to `rebuild_ctx` for `optimize_assert` since it is a public API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153193
Approved by: https://github.com/jansel
2025-05-13 16:42:02 +00:00
a4459cd4e3 Remove property from python_type function (#152900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152900
Approved by: https://github.com/amjames, https://github.com/anijain2305
ghstack dependencies: #153070
2025-05-13 16:26:25 +00:00
f67eb6f8c5 Fix path matching in CPythonTestCase/setUpClass (#153070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153070
Approved by: https://github.com/amjames, https://github.com/anijain2305, https://github.com/Skylion007
2025-05-13 16:26:25 +00:00
c5ebc12f7f [ROCm] unskip test_non_standard_bool except for failing ops (#152956)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152956
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-05-13 15:55:42 +00:00
445d8fd77d [MemoryZ] Sync changes to internal page (#153166)
Summary:
For MTIA on-demand mode, since we are not using the torch module, the data upload happens in C++ and doesn't support pickle.
Thus, we store the data as JSON at the end and need to update the visualizer to support it.

Test Plan: Check Test plan in D74179606

Differential Revision: D74406209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153166
Approved by: https://github.com/sraikund16
2025-05-13 15:35:10 +00:00
ea3eaf68bf Fix AOTI cpp tests (#153423)
The `Error in dlopen: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.30 not found` error was caused by the cmake migration (the conda build probably had some extra link rules), while the `C++ exception with description "CUDA error: no kernel image is available for execution on the device` failures were caused by the fact that the tests were built for Maxwell but run on SM_86.

The remaining test was failing before, but was probably disabled.
TODOs:
 - Move build to the build step

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153423
Approved by: https://github.com/huydhn, https://github.com/cyyever
2025-05-13 15:25:03 +00:00
6b02e60838 [Intel GPU] Use user-friendly err msg in mm (#151655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151655
Approved by: https://github.com/EikanWang
2025-05-13 15:13:21 +00:00
7fdd754136 [compile-time traces] Profile large missing gaps in compile time (#151256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151256
Approved by: https://github.com/bdhirsh, https://github.com/masnesral, https://github.com/zou3519, https://github.com/jansel
2025-05-13 14:44:51 +00:00
ee096b89f6 Fix skipIfXpu and skipIfHpu disables tests when used on class (#151315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151315
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-13 14:44:17 +00:00
d9ef1012db [PP] Optimize memory usage by releasing output memory earlier (#153383)
Considering `output_chunks` is only used for the last stage, we should not keep the outputs of each stage in memory; this allows the memory to be freed earlier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153383
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-05-13 14:42:38 +00:00
f1de3f9f07 Rename "output_tensor" -> "out" in autotune_process.py (#153169)
Summary: This change is to support remote autotuning. I want to use all the same benchmarking utilities in select_algorithm.py. For remote autotuning, I'll reuse the TritonBenchmarkRequest class used for subprocess autotuning because it's already serializable. That class is also used in standard, in-process autotuning, but via TritonTemplateCaller.benchmark() which sets the output_tensor param when calling the underlying TritonBenchmarkRequest. For remote, I'll be using the TritonBenchmarkRequest request directly so I want the parameter to be named 'out' to avoid "got an unexpected keyword argument 'out'".

Test Plan: Existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153169
Approved by: https://github.com/aorenste, https://github.com/eellison
2025-05-13 14:18:29 +00:00
9f98e37eb4 [Intel GPU] add tf32 support for matmul on XPU (#144240)
Support XPU tf32 matmul using torch.backends.mkldnn.allow_tf32; we will discuss in the future whether we need a new API to control matmul only.
~~Support xpu tf32 matmul using torch.set_float32_matmul_precision. For conv, check https://github.com/pytorch/pytorch/pull/137570
We decide not following torch.backends.cuda.matmul.allow_tf32 because this API actually calls setAllowTF32CuBLAS to set matmul_precison to high. We also avoid other related tf32 changes (i.e. in inductor) by not introducing new API.~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144240
Approved by: https://github.com/EikanWang
2025-05-13 14:03:01 +00:00
ff039d39ec [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-13 12:17:59 +00:00
d0faa9985d [Dynamo] Fix typing in graph_deduplication.py (#152572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152572
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570
2025-05-13 12:17:59 +00:00
a415c9831f [Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152570
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506
2025-05-13 12:17:59 +00:00
57dafb90ef [Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152506
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410
2025-05-13 12:17:59 +00:00
118192011e [Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152410
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505
2025-05-13 12:17:59 +00:00
3592cb52d9 [Hierarchical Compilation] Use universal flatten APIs (#152505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152505
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389
2025-05-13 12:17:59 +00:00
023a3dc69f [Hierarchical Compilation] Track node mutations (#152389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152389
Approved by: https://github.com/anijain2305
2025-05-13 12:17:59 +00:00
edc2d539d1 torch.tensordot: performance improvements when contracting to a scalar. (#145936)
As per title.
Fixes https://github.com/pytorch/pytorch/issues/145731

Touches only compute. The CPU overhead can potentially be further reduced.

Before:
```python
In [3]: n = 512

In [4]: A = torch.rand(n, n)

In [5]: B = torch.rand(n, n)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
2.04 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
2.85 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
2.9 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
4.07 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After
```python
In [2]: n = 512

In [3]: A = torch.rand(n, n)

In [4]: B = torch.rand(n, n)

In [5]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
30.7 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
141 µs ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
142 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
62.8 µs ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145936
Approved by: https://github.com/albanD, https://github.com/ngimel
2025-05-13 10:57:30 +00:00
8d7dec6e92 Revert "[DSD] Don't pop tensors if they are on Meta device (#153185)"
This reverts commit 7243c69421cd0b868f3fa3b552c17e9c8b3023a1.

Reverted https://github.com/pytorch/pytorch/pull/153185 on behalf of https://github.com/jeanschmidt due to Seems to break internal signals, see [D74577069](https://www.internalfb.com/diff/D74577069) ([comment](https://github.com/pytorch/pytorch/pull/153185#issuecomment-2875662357))
2025-05-13 09:13:27 +00:00
cyy
9785b32189 Remove unused typing-extensions BUCK target (#153229)
This target is unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153229
Approved by: https://github.com/colesbury
2025-05-13 04:29:59 +00:00
cyy
15e08f9571 [submodule] Update ONNX to 1.18 (#152200)
Update ONNX to 1.18.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152200
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-05-13 04:18:45 +00:00
c4fb0b6f33 refresh expected results (#150166)
@huydhn when do you think we will have the APIs to access results on oss storage available so we do not
have to worry about this racing again?
Also, is there a way to accelerate instability in this after we land it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150166
Approved by: https://github.com/bobrenjc93, https://github.com/eellison, https://github.com/anijain2305
2025-05-13 04:04:42 +00:00
483bbb639a [CI] Collect accuracy for MPS inductor benchmarks (#153443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153443
Approved by: https://github.com/atalman
2025-05-13 03:49:28 +00:00
36722c287f [cutlass backend] make compile name independent of command (#153388)
Differential Revision: D74291603

The goal is to reuse the kernels as much as possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153388
Approved by: https://github.com/ColinPeppler
2025-05-13 03:49:24 +00:00
29c8ae825f [OpenReg] Move SDPA to OpenReg from open_registration_extension.cpp (#153309)
As the title states.

**Next Changes**:
- Migrate remaining functionality to OpenReg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153309
Approved by: https://github.com/albanD
2025-05-13 03:49:19 +00:00
a6c5b59067 [MPSInductor] Fix multistage reduction suffixes (#153362)
By invalidating all variables created during the loop except for the contents of iterator_cache (as stores can happen inside the reduction loop), and by clearing the `IteratorRangeEntry` codegen cache.

This results in the following kernel for `x / x.sum()` if the size of x is 2048 and the max thread group size is 1024:
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device half* out_ptr1,
    constant half* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[32];
    float tmp_acc_1 = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp0 = static_cast<float>(in_ptr0[r0_0]);
        tmp_acc_1 += tmp0;
    }
    auto tmp1 = c10::metal::threadgroup_sum(tmp_acc_0, tmp_acc_1, r0_index * 1, 1024);
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp2 = static_cast<float>(in_ptr0[r0_0]);
        auto tmp3 = tmp2 / tmp1;
        out_ptr1[r0_0] = static_cast<half>(tmp3);
    }
}
```

Fixes the compilation errors reported while running `GPUTests.test_pattern_matcher_multi_user_mps` and `GPUTests.test_weight_norm_bwd_mps`

Fixes https://github.com/pytorch/pytorch/issues/152155

Though inductor tests are still failing, we need to keep refining the variable invalidation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153362
Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/jansel
2025-05-13 03:07:53 +00:00
27e9d9b103 [c10d][fr] Add try catch to update entry due to cuda error (#153414)
During the dump of FR, due to some unknown reasons, we see CUDA errors when querying events, and this leads to failures of whole FR dumps (when trying to get entries). So we do a try-catch instead of letting it fail the whole process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153414
Approved by: https://github.com/d4l3k
2025-05-13 01:10:00 +00:00
8b507a9809 convert guard_size_oblivious to runtime check in infer_size_impl (#148872)
It's OK to check the requirement `numel == newsize` at runtime in the unbacked case, instead of checking it at compile time and assuming that it's true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148872
Approved by: https://github.com/bobrenjc93
2025-05-13 00:32:28 +00:00
0cf61ca7e4 make use_mem_pool threadlocal (#153356)
Partial fix for #152861: makes allocation to a pool thread-local, but doesn't touch the second bug where multiple threads allocating to multiple pools error out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153356
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-13 00:16:07 +00:00
d5d26ce436 Enable accelerator to perform streaming backward (#153412)
Also see https://github.com/pytorch/pytorch/pull/142097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153412
Approved by: https://github.com/albanD
ghstack dependencies: #151079
2025-05-13 00:02:24 +00:00
71c8231742 fix bug with TORCHINDUCTOR_DUMP_LAUNCH_PARAMS (#153066)
Summary:
https://fb.workplace.com/groups/1028545332188949/posts/9503194033132340/?comment_id=9504669536318123&reply_comment_id=9506405459477864&notif_id=1746154132646897&notif_t=work_group_comment_mention

Aligns the arguments for the triton inputs

Differential Revision: D74085173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153066
Approved by: https://github.com/jansel
2025-05-12 23:56:49 +00:00
641e4bee67 Revert "[export][cond] support merging constant ints as unbacked symint (#152742)"
This reverts commit a805911d15f0da0b3b07203d5cb727e84ef40cf0.

Reverted https://github.com/pytorch/pytorch/pull/152742 on behalf of https://github.com/ydwu4 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/152742#issuecomment-2874410372))
2025-05-12 23:06:33 +00:00
a87e810980 add needs_contiguous_strides tag (#153399)
Summary:
The padding operations could lead to non-contiguous tensors, which will fail the test in `reduce_scatter_tensor`: https://fburl.com/code/5wt5xkig

The `needs_contiguous_strides` tag tells Inductor that `reduce_scatter_tensor` needs contiguous inputs, so it will not execute padding operations.

Test Plan:
W/o the tag, job failed on the check:
https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-rebase_sanity_check_256bs_8t-fc398c39d3?job_attempt=0&version=0&tab=summary&env=PRODUCTION

With this tag, previously failed job succeeded:
https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-rebase_sanity_128bs_8t_i10_tag-2ed5b05276?job_attempt=11&version=0&tab=summary&env=PRODUCTION

Differential Revision: D74598810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153399
Approved by: https://github.com/fmassa
2025-05-12 23:03:56 +00:00
f05b38aa26 [BE]: Improve decorator typing for Optimizer subclasses (#153374)
Improves typing so that all the optimizer subclasses (all of which subtype step) do not erase their type signature when this decorator is used. Now keyword-argument values and return types will propagate.

This complements @tsunghsienlee's PR #153367, as the type signature of step() was being erased on all the optimizer subclasses by this untyped decorator.
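
As a generic sketch of the technique (not the actual decorator in torch/optim), a `ParamSpec`-typed wrapper preserves the wrapped `step()` signature:
```python
from typing import Callable, TypeVar
from typing_extensions import ParamSpec  # typing.ParamSpec on Python 3.10+

P = ParamSpec("P")
R = TypeVar("R")

def preserves_signature(fn: Callable[P, R]) -> Callable[P, R]:
    # the wrapper keeps the *args/**kwargs and return type of the wrapped step()
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        return fn(*args, **kwargs)
    return wrapper
```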

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153374
Approved by: https://github.com/janeyx99, https://github.com/tsunghsienlee
2025-05-12 22:55:25 +00:00
b0f2891e43 [AOTInductor] Fix clang-tidy warnings in wrapper (#153197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153197
Approved by: https://github.com/desertfire
2025-05-12 22:35:59 +00:00
3ff22fe2df [BE]: Use shutil.which in inductor codegen (#153377)
Use shutil.which instead of subprocess. It is more secure, has better error handling, and is more cross-platform.
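
Roughly the pattern in question (a sketch; the actual call site and binary differ):
```python
import shutil

# shutil.which returns the full path or None instead of spawning a subprocess
compiler = shutil.which("g++")
if compiler is None:
    raise RuntimeError("g++ not found on PATH")
```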

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153377
Approved by: https://github.com/albanD
2025-05-12 22:11:26 +00:00
dbb4444ce3 [Memento] Add PT2 to Memory Snapshot (#152707)
Summary:
To add PT2 information to the memory snapshot, we piggyback off of the Kineto implementation using record_function, similar to how the user annotations are added. To do this we add the following:

1. A stack implementation that we instantiate to keep track of which compile context we are currently in (the top element of the stack). The stack is per-device and thread-local, since different threads of a process can be in different compile contexts at a given time. For this reason, we do not need to add mutexes to our stack impl, since no two threads will touch a given stack.
2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in that we register them lazily and DO NOT unregister them; this is done out of convenience. In the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we are registering this at the FUNCTION scope, which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op, so we anticipate the difference in performance to be negligible during and after profiling. We also hide this feature behind a flag that is off by default, so existing jobs will be unaffected.
3. Piping for the compile context into the pickle output.
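
A hedged usage sketch, assuming the existing private snapshot helpers in `torch.cuda.memory` and a CUDA device; with the new flag turned on, the compile-context annotations described above should show up in the dumped snapshot:
```python
import torch

torch.cuda.memory._record_memory_history()  # start recording allocations

@torch.compile
def f(x):
    return (x @ x).relu()

f(torch.randn(1024, 1024, device="cuda"))

# dump a pickled snapshot that the memory visualizer can load
torch.cuda.memory._dump_snapshot("snapshot.pickle")
```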

Test Plan:
In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658}

Differential Revision: D74028214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707
Approved by: https://github.com/eqy
2025-05-12 21:12:51 +00:00
f78e4529a9 Rewrite autograd producer consumer stream sync logic (#151079)
Also see previous work https://github.com/pytorch/pytorch/pull/142097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151079
Approved by: https://github.com/albanD
2025-05-12 21:07:16 +00:00
f136046919 Clean up right nav (#153090)
- Move community and language binding links to the horizontal bar
- Add an intro to the community page.
- Fix the link in the ogp_image
- Fix the link in the version switcher
- Clean up unneeded links

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153090
Approved by: https://github.com/albanD
2025-05-12 21:00:45 +00:00
a805911d15 [export][cond] support merging constant ints as unbacked symint (#152742)
@pianpwk points out that this will be helpful to address several data dependent issues in huggingface [models](e23705e557/src/diffusers/schedulers/scheduling_euler_ancestral_discrete.py (L332)) with the following pattern:
```python
idx = 0 if u0 else 1
return x[idx]
```
We could preserve the conditional with a cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152742
Approved by: https://github.com/zou3519
2025-05-12 20:26:31 +00:00
88a068f33b [2/n][Optimus][Auto-AC] Support activation quantization with scaling (#151770)
Summary:
Previously, we only supported non-scaling quantization, which may lead to overflow. Here we support scaling quantization and set it as the default version.

We quantize activation nodes based on size_in_mb; the default value is 100, i.e., as long as the node is at least 100 MB in size, we will quantize it.

Test Plan:
### how to enable

```
    torch._inductor.config.post_grad_fusion_options = {
        "activation_quantization_aten_pass": {
            "quant_type": "torch.float8_e5m2", -> default is this type to quantize, you can change the type
            "use_scaling": False,  -> default is False, if you want to use scaling verison, set it to True
            "size_in_mb": 0.0,  -> default is 100, you can tune the value.
             "exclude_primals": False, -> whether want to exclude quantize parameters, default is False
              "allowed_dtypes": "torch.float16;torch.bfloat16;torch.float32", -> dtype you consider to quant, use ";" to separate, default is torch.bfloat16
        },
    }
```

### toy model

```
buck2 run mode/opt //scripts/qyz/autoac:quantization
```

```
Epoch [80/200], Loss: 19227.2109
Epoch [100/200], Loss: 1353.5272
Epoch [120/200], Loss: 38630.6758
Epoch [140/200], Loss: 6239.9155
Epoch [160/200], Loss: 6039.1567
Epoch [180/200], Loss: 3994.3569
Epoch [200/200], Loss: 146.3966
```

Differential Revision: D73015996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151770
Approved by: https://github.com/Mingming-Ding
2025-05-12 19:43:18 +00:00
45df18dcd0 [BE]: Enable ruff rule TC007 (#153394)
Enables [TC007](https://docs.astral.sh/ruff/rules/unquoted-type-alias/#unquoted-type-alias-tc007), which finds type aliases that should be quoted if they have to interact with `if TYPE_CHECKING` blocks.

We disabled it when we updated ruff, but really should only have disabled TC006, as that is the one that is going to cause some changes codebase-wide.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153394
Approved by: https://github.com/albanD
2025-05-12 19:18:29 +00:00
fb85ebd710 [BE]: Use undocumented temp shim to restore setuptools compat (#153052)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153052
Approved by: https://github.com/albanD
2025-05-12 18:33:41 +00:00
3555ebb63d [BE]: Update ruff to 0.11.8 (#153249)
Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere
2025-05-12 18:30:52 +00:00
5c3fddb9cc Revert "[Hierarchical Compilation] Track node mutations (#152389)"
This reverts commit c2936ebfd58be7a6519f51d165dfac8407020140.

Reverted https://github.com/pytorch/pytorch/pull/152389 on behalf of https://github.com/jeanschmidt due to Humm, interesting, there seems to be a bug in stack PRs, as it should be part of the stack and be reverted with the other ones ([comment](https://github.com/pytorch/pytorch/pull/152389#issuecomment-2873540451))
2025-05-12 18:18:44 +00:00
e1d03fa251 [Inductor] Optimize grid calculation by using // instead of FloorDiv (#153230)
https://github.com/pytorch/pytorch/pull/146942 introduced an 8.3% regression on the `benchmark_torchbench_run_bert_pytorch_training:defaults-speedup-x1000` perf metric. This was flagged by internal CI testing (task T223596372).

The root cause seems to be that `FloorDiv` is now used to calculate the launch grid in certain scenarios, which is slower than the previously-used `//`. Since launch grid calculations happen at runtime, they can have a significant performance impact on some models.

The reason for switching to `FloorDiv` in https://github.com/pytorch/pytorch/pull/146942 was to allow the FX backend to generate runnable Python code. `FloorDiv(x, y)` maps to `x // y` in Python, whereas `sympy.floor(sympy.Rational(x,y))` maps to `floor(x/y)`, which crashes as FX doesn't know what `floor` is.

To get the best of both worlds, this PR reverts to using `//` to calculate launch grids, but then post-processes the resulting sympy expressions in the FX converter, converting `floor(x / y)` to `FloorDiv(x, y)`. Since this sympy manipulation happens at compile time, the perf impact should minimal, and should only affect the FX backend. This is similar to the approach previously explored in https://github.com/pytorch/pytorch/pull/151144, but the implementation is more minimal and self-contained.
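
A rough sketch of that post-processing idea, assuming the `FloorDiv` helper from `torch.utils._sympy.functions` (the exact integration point in the FX converter is omitted):
```python
import sympy
from torch.utils._sympy.functions import FloorDiv

x, y = sympy.symbols("x y", integer=True, positive=True)
grid_expr = sympy.floor(x / y)  # what `//` on plain sympy symbols produces

# rewrite floor(x / y) into FloorDiv(x, y) so FX codegen emits `x // y`
rewritten = grid_expr.replace(
    lambda e: isinstance(e, sympy.floor),
    lambda e: FloorDiv(*e.args[0].as_numer_denom()),
)
```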

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153230
Approved by: https://github.com/jansel
2025-05-12 18:08:52 +00:00
498f364518 Fix test_fused_scaled_matmul_reduce_scatter when scatter_dim is 0 (#153286)
The function signature of fused_scaled_matmul_reduce_scatter was changed. This PR fixes the function signature. However, when scatter_dim is 1, the two outputs are not close; we need a follow-up on this.

Another follow-up is to change fused_scaled_matmul_reduce_scatter to make those newly added arguments optional. Users shouldn't need to pass these arguments if they don't flatten the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153286
Approved by: https://github.com/kwen2501
2025-05-12 17:38:49 +00:00
7e1790d86b [xla hash update] update the pinned xla hash (#153368)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153368
Approved by: https://github.com/pytorchbot
2025-05-12 17:11:23 +00:00
dc47295dc5 [Inductor UT][Break XPU] Generalize newly added device-bias code in Inductor UT. (#153355)
Fixes #153123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153355
Approved by: https://github.com/desertfire, https://github.com/Skylion007
2025-05-12 15:53:05 +00:00
ea4b65ab60 Fix the type hint of step() with default value (#153367)
Summary: Because the default value of `closure` is `None`, this fixes the situation when `step()` is called with no arguments. The previous typing (https://github.com/pytorch/pytorch/pull/102593) could only be used as `step(closure=None)` and `step(None)`.
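
A minimal sketch of the corrected signature (illustrative class, mirroring the base `Optimizer.step` overload), where the `Optional[...] = None` default is what lets a bare `step()` type-check:
```python
from typing import Callable, Optional

class MyOptimizer:
    def step(self, closure: Optional[Callable[[], float]] = None) -> Optional[float]:
        # re-evaluate the loss if a closure is provided, otherwise return None
        loss = closure() if closure is not None else None
        # ... parameter update would go here ...
        return loss
```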

Test Plan: contbuild & OSS CI

Differential Revision: D74560785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153367
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/janeyx99
2025-05-12 15:52:59 +00:00
de5c5f4fb7 Opt-out LF runners from of inductor jobs (#153151)
Opt-out of inductor jobs for the lf experiment configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153151
Approved by: https://github.com/seemethere
2025-05-12 15:52:53 +00:00
89aa6eb19b Stop codegen-ing post_grad_custom_pass in repros (#153243)
When codegen'ed, it looks like:
```py
post_grad_custom_pass = <object at 0x12345678>
```
Which is not runnable at all. Some logic is also trying to deepcopy the
object, and not all of these objects are deepcopy-able.

This PR skips codegenning of these passes.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153243
Approved by: https://github.com/houseroad
2025-05-12 15:21:11 +00:00
7657d80a58 [aoti] when generating example input shapes, use unbacked replacements (#153220)
## Context
Suppose we have this graph like this :
```
a: "[s1 + u2, 200]"
b: "[u0, 32]"
cat: "[s1 + u2, 232]" = torch.cat([a, b], dim=1)
```

NOTE: torch.cat assumes "all tensors must either have the same shape (except in the concatenating dimension) or be a 1-D empty tensor with size (0,)."

So, we would expect u0 = s1 + u2, which is guarded on today, except it's a deferred runtime assertion, since unbacked symints aren't replaced today, as Pian noted.

Notice how a  has a different symbolic shape than both b and cat. Today, this will create an unexpected shape mismatch when AOTI autotunes. Here's a rough illustration where 8192 is the unbacked symint fallback value.

```
# s1 is an arbitrary integer
a = generate_example_value(size=(s1 + 8192, 200))
b = generate_example_value(size=(8192, 32))
out = generate_example_value(size=(s1 + 8192, 232))
triton_cat.run(a, b, out ...)
```

## Error
```
wrapper.py:1484: <module>: block: [443,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
wrapper.py:1484: <module>: block: [443,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.

RuntimeError: CUDA error: device-side assert triggered
```

Differential Revision: [D74485962](https://our.internmc.facebook.com/intern/diff/D74485962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153220
Approved by: https://github.com/desertfire
2025-05-12 15:20:57 +00:00
1c659b5bc0 [BE]: Use more portable shutil.which call for cpp_builder (#153325)
We should be using shutil.which instead of calling some binary subprocess here for portability and security.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153325
Approved by: https://github.com/xuhancn, https://github.com/cyyever, https://github.com/albanD
2025-05-12 15:15:21 +00:00
78d752e96a Revert "[Hierarchical Compilation] Use universal flatten APIs (#152505)"
This reverts commit f9e3a9058e80fde310e5f0919d3a21e28cd024a8.

Reverted https://github.com/pytorch/pytorch/pull/152505 on behalf of https://github.com/jeanschmidt due to [TENTATIVE] reverting to check if reverting this stack partially caused the introduction of https://github.com/pytorch/pytorch/actions/runs/14966121510/job/42049638969#step:22:875 ([comment](https://github.com/pytorch/pytorch/pull/152505#issuecomment-2872869990))
2025-05-12 14:48:08 +00:00
cb35a2b15d Add missing in-place on view check to custom autograd.Function (#153094)
Fixes https://github.com/pytorch/pytorch/issues/152773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153094
Approved by: https://github.com/albanD
ghstack dependencies: #153005
2025-05-12 14:42:46 +00:00
a67dd2083c [dynamo] Guard serialization for SHAPE_ENV (#153258)
Differential Revision: [D74483150](https://our.internmc.facebook.com/intern/diff/D74483150/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153258
Approved by: https://github.com/jansel
ghstack dependencies: #153255, #153256, #153257
2025-05-12 14:42:01 +00:00
e2f6870c98 [dynamo] Guard serialization for DEFAULT_DEVICE (#153257)
Differential Revision: [D74483147](https://our.internmc.facebook.com/intern/diff/D74483147/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153257
Approved by: https://github.com/jansel
ghstack dependencies: #153255, #153256
2025-05-12 14:42:00 +00:00
ef1dcc21ee [dynamo] Guard serialization for global state guards (GRAD_MODE, DETERMINISTIC_ALGORITHMS, TORCH_FUNCTION_STATE, FSDP_TRAINING_STATE) (#153256)
serialization for global state guards.

Differential Revision: [D74483149](https://our.internmc.facebook.com/intern/diff/D74483149/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153256
Approved by: https://github.com/jansel
ghstack dependencies: #153255
2025-05-12 14:41:53 +00:00
0210986cc4 [dynamo] Guard serialization for EMPTY_NN_MODULE_HOOKS_DICT (#153255)
EMPTY_NN_MODULE_HOOKS_DICT

Differential Revision: [D74483148](https://our.internmc.facebook.com/intern/diff/D74483148/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153255
Approved by: https://github.com/jansel
2025-05-12 14:41:44 +00:00
daca611465 Revert "[ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)"
This reverts commit 5683965f02c4091a864484917f74e3a42c9c56ae.

Reverted https://github.com/pytorch/pytorch/pull/151727 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/151727#issuecomment-2872361816))
2025-05-12 12:29:28 +00:00
8511d21081 Revert "Forward fix #151727 (#153306)"
This reverts commit 64518ca7420271562c4920c13c44221c54e534df.

Reverted https://github.com/pytorch/pytorch/pull/153306 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153306#issuecomment-2872339570))
2025-05-12 12:22:13 +00:00
23ecd35a96 Update slow tests (#151207)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151207
Approved by: https://github.com/pytorchbot
2025-05-12 12:05:58 +00:00
47df195065 Revert "[Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)"
This reverts commit bc8b305eb816106de31602f8b7fd80d4113e6ee8.

Reverted https://github.com/pytorch/pytorch/pull/152410 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
0e36887209 Revert "[Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)"
This reverts commit 779e647999645d19eebf01fa686fb792176f8940.

Reverted https://github.com/pytorch/pytorch/pull/152506 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
53ebcabb52 Revert "[Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)"
This reverts commit 50df08eb5e4d9276b72929fd859ad892880bab0f.

Reverted https://github.com/pytorch/pytorch/pull/152570 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
0071fdab9e Revert "[Dynamo] Fix typing in graph_deduplication.py (#152572)"
This reverts commit 15166be691454f8a0e626b54b6be0bea51938f86.

Reverted https://github.com/pytorch/pytorch/pull/152572 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
aa7fe6af41 Revert "[Dynamo] Optimize dedupe region ancestor tracking (#152589)"
This reverts commit b5f1345f72ec6d1b004b05284e9553e65ee03abc.

Reverted https://github.com/pytorch/pytorch/pull/152589 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
7243c69421 [DSD] Don't pop tensors if they are on Meta device (#153185)
DSD currently will pop tensors if these tensors are on the Meta device. This forbids use cases where users would like to let DCP directly initialize the tensors when loading.

This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which is based on the above feature, is not realistic, and is not used anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
2025-05-12 07:04:59 +00:00
032ef48725 [BE]: Add PEP621 project section to pyproject.toml (#153055)
Follow-up to @ezyang's PR #153020, but better uses PEP 621 to reduce redundant fields and pass metadata through to uv, setuptools, poetry, and other tooling.

* Enables modern tooling like uv sync and better support for tools like poetry.
* Also allows us to set project-wide settings that are respected by linters and IDEs (in this example we are able to centralize the minimum supported Python version).
* Currently most of the values are dynamically fetched from setuptools; eventually we can migrate all the statically defined values to pyproject.toml and they will be auto-populated in the setuptools arguments.
* This controls what additional metadata shows up on PyPI. Special URL names are listed here for rendering on PyPI: https://packaging.python.org/en/latest/specifications/well-known-project-urls/#well-known-labels

This also clearly shows us which fields will need to be migrated from setup.py to pyproject.toml over time, per #152276. Static fields will be fairly easy to migrate; the dynamically built ones like requirements are a bit more challenging.

Without this, `uv sync` complains:
```
error: No `project` table found in: `pytorch/pyproject.toml`
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153055
Approved by: https://github.com/ezyang
2025-05-12 02:16:07 +00:00
ceb009baee [map] always turn on dynamo for map (#152041)
Summary:
X-link: https://github.com/pytorch/executorch/pull/10409

Reland D72896450

Make map consistent with other control flow ops. After the change, map is able to support accessing closures in the map fn.

Test Plan: See existing tests.

Reviewed By: zou3519

Differential Revision: D73138427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152041
Approved by: https://github.com/zou3519
2025-05-12 02:10:08 +00:00
c5b4dc9898 [executorch hash update] update the pinned executorch hash (#152238)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152238
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-05-12 01:50:12 +00:00
930de01861 [Typing] Apply torch.types.Device in torch/cuda/memory.py (#153027)
Part of: #152952

Here is the definition of `torch.types.Device`:

ab997d9ff5/torch/types.py (L74)

It contains `int`, so the `int` in `Union[Device, int]` is redundant.
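
An illustrative sketch of the simplification (the function name is just an example):
```python
from torch.types import Device  # per torch/types.py, Device already includes int

# before: device: Union[Device, int] = None   (redundant `int` member)
# after:
def reset_peak_memory_stats(device: Device = None) -> None:
    ...
```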

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153027
Approved by: https://github.com/Skylion007
2025-05-11 23:32:59 +00:00
0104ac0f6f [Ez][BE]: Fix click ImportError in torch/csrc/jit (#153323)
Fixes an unnecessary import for TorchScript. Unblocks #153020, which appears to change the circular-import linter so that it imports every Python file under torch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153323
Approved by: https://github.com/ngimel, https://github.com/cyyever
2025-05-11 19:16:01 +00:00
c51bdf5acf [export] Exporter API prototype. (#153205)
Summary: see inline code comments for documentation

Test Plan:
CI

buck2 test --flagfile fbcode//mode/opt fbcode//caffe2/test:test_export -- -r TestPackage

Differential Revision: D74426900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153205
Approved by: https://github.com/tugsbayasgalan
2025-05-11 14:20:09 +00:00
909ec495b8 [audio hash update] update the pinned audio hash (#153301)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153301
Approved by: https://github.com/pytorchbot
2025-05-11 03:47:56 +00:00
1f5cf19f56 [cutlass backend] Use src code to generate cutlass gemm name (#153006)
This shaves off 40s for at least small cases, since we don't have to recompile the kernel again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153006
Approved by: https://github.com/mlazos
2025-05-11 00:57:03 +00:00
64518ca742 Forward fix #151727 (#153306)
#151727 is failing internally with the following error `error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153306
Approved by: https://github.com/eqy, https://github.com/cyyever, https://github.com/wdvr
2025-05-11 00:39:59 +00:00
fdc387ec7c Revert "refine fp32 precision api (#125888)"
This reverts commit 4c11b26158691cfd9ad48338ddebd1ca9bded788.

Reverted https://github.com/pytorch/pytorch/pull/125888 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause some failures on ROCm ([comment](https://github.com/pytorch/pytorch/pull/125888#issuecomment-2869274791))
2025-05-11 00:35:46 +00:00
e4f22822cb Revert "Cleanup VS 2019 refs in pytorch (#145863)" (#152613)
This reverts commit b45e6fa707ced2adb68eaf1a2c1ccb389a6283d7.

revert PRs:
https://github.com/pytorch/pytorch/pull/145863
https://github.com/pytorch/pytorch/pull/145319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152613
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-10 19:33:26 +00:00
4f068598c4 [BE] Delete now unused mac-mps.yml (#153263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153263
Approved by: https://github.com/Skylion007, https://github.com/cyyever
ghstack dependencies: #153013, #153057, #152719
2025-05-10 19:10:41 +00:00
d22c40373f [Ez][BE]: Fix KeyError LOGNAME (#153324)
Unblocks #153020, which accidentally improves the CircularImportLinter to check all Python files. That script doesn't set a logname, so it errors; there is another FSDP script which already defaults LOGNAME to '' if not specified, and this does the same.
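
The defaulting pattern in question, as a minimal sketch:
```python
import os

# fall back to an empty string when LOGNAME is not set, instead of raising KeyError
logname = os.environ.get("LOGNAME", "")
```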

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153324
Approved by: https://github.com/awgu
2025-05-10 18:23:38 +00:00
6a84fe65ec Fix code portability when looking for Dot (#153259)
When trying to plot a trace graph, Inductor checks if "dot" is installed. Currently, the code runs a "which dot" command.

By default, Windows doesn't have the "which" command. This patch replaces it with the more portable alternative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153259
Approved by: https://github.com/Skylion007
2025-05-10 16:12:44 +00:00
01cbf5a30a [AOTInductor] Add wrapper and kernel code to debug code logging (#153181)
This is a simple PR to make the AOTInductor wrapper and kernel code get output by `TORCH_COMPILE_DEBUG=1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153181
Approved by: https://github.com/desertfire
2025-05-10 15:31:18 +00:00
01bb249978 Revert "has_triton: Use the device interface for detecting Triton availability (#139171)"
This reverts commit 48bfe9afc70a98addd5aa738bf501c029e4a9285.

Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/masnesral due to Performance regression for huggingface ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2868939790))
2025-05-10 14:46:23 +00:00
70c8047c2d include user stacks with constraint violation error message (#152924)
Fixes #152918

Before:

```
File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 5588, in produce_guards_verbose
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.
```

After:

```
File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 5588, in produce_guards_verbose
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.

User stack:
  File "/home/bobren/local/a/pytorch/error.py", line 5, in foo
    return torch.randn(5) * x
```
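
For reference, a minimal repro sketch matching the error above (not the exact script; the shapes come from the message):
```python
import torch

x = torch.randn(5)
torch._dynamo.mark_dynamic(x, 0)  # ask for a dynamic dim 0

@torch.compile(fullgraph=True)
def foo(x):
    return torch.randn(5) * x  # specializes x.size(0) back to 5

foo(x)  # raises ConstraintViolationError, now with the user stack printed
```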

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152924
Approved by: https://github.com/pianpwk
2025-05-10 13:36:47 +00:00
4c11b26158 refine fp32 precision api (#125888)
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" for representing fp32 internal computation data types. Instead, we will directly use the algorithm name to represent it.

### Design Choice: Directly use algorithms name like "TF32", "BF16".
#### Pros
 - The names are more informative. 'tf32' is more informative than a simple "high".
 - Easier to extend new algorithm like `tf32x3`
#### Cons
 - "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them.

### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')
![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067)

### We provide 3 fp32 compute precisions that can be set:
 - **"ieee"**: Not allowed to use any other internal computation data types .
 - **"tf32"**: Allowed to use tf32 as internal computation data types.
 - **"bf16"**: Allowed to use bf16 as internal computation data types.
 - **"none"**:  Precision's are not set. Can be override by its father node.

### Overriding Precision Settings
A child node can be overridden by its parent node if it is set to the default.
For current default settings:
```
backend = generic, op = all, precision setting = none
    backend = cuda, op = all, precision setting = none
        backend = cuda, op = conv, precision setting = tf32
        backend = cuda, op = rnn, precision setting = tf32
        backend = cuda, op = matmul, precision setting = none
    backend = matmul, op = all, precision setting = none
        backend = matmul, op = conv, precision setting = none
        backend = matmul, op = rnn, precision setting = none
        backend = matmul, op = matmul, precision setting = none
```
 - If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
 - If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16".
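
A short usage sketch built only from the settings named above (note this PR was later reverted, so the exact attribute names may change):
```python
import torch

# backend-level setting: children (matmul/conv/rnn) left at "none" inherit it
torch.backends.mkldnn.fp32_precision = "bf16"

# op-level overrides under a backend
torch.backends.cudnn.conv.fp32_precision = "tf32"
torch.backends.cudnn.rnn.fp32_precision = "ieee"
```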

### Backward Compatible
Since the new API allows users to have more fine-grained control, there will be some conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` is not enough to represent the combined status of `torch.backends.cudnn.rnn.fp32_precision="ieee"` and `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
 - If the user only uses the previous APIs, they will work as previously expected.
 - If the user uses the **new** API to change the status to one that is **un-representable** by the old API, and then tries to access the status via the **old** API, we will raise a RuntimeError and point the user to the documentation.

### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-05-10 11:13:04 +00:00
b5f1345f72 [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-10 08:27:56 +00:00
15166be691 [Dynamo] Fix typing in graph_deduplication.py (#152572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152572
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570
2025-05-10 08:27:56 +00:00
50df08eb5e [Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152570
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506
2025-05-10 08:27:45 +00:00
779e647999 [Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152506
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410
2025-05-10 08:27:31 +00:00
bc8b305eb8 [Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152410
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505
2025-05-10 08:27:19 +00:00
f9e3a9058e [Hierarchical Compilation] Use universal flatten APIs (#152505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152505
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389
2025-05-10 08:27:07 +00:00
c2936ebfd5 [Hierarchical Compilation] Track node mutations (#152389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152389
Approved by: https://github.com/anijain2305
2025-05-10 08:27:01 +00:00
bc4cf1c13a [BE] fix failing test_dp_state_dict_save_load on ROCm CI where world_size=7 (#153283)
**Summary**
I saw an unrelated CI failure `distributed/_composable/fsdp/test_fully_shard_state_dict.py::TestFullyShardStateDictMultiProcess::test_dp_state_dict_save_load` in one of my PR: https://hud.pytorch.org/pr/pytorch/pytorch/153225#41930032096

This is caused by triggering uneven sharding in FSDP2 at cbb03e6971/torch/distributed/fsdp/_fully_shard/_fsdp_param.py (L353-L361)

This didn't show up before because the CUDA CI has an even number of GPUs (e.g. 2/4/8), but that's not true on ROCm CI. For the failing CI case, the device count is 7.

**Solution**
Skip the test if `self.world_size` does not divide `mlp_dim` (i.e. 16).
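
Roughly, the guard looks like this (a sketch with illustrative names, not the exact test code):
```python
import unittest

class TestDPStateDictSaveLoad(unittest.TestCase):  # illustrative name
    world_size = 7  # e.g. the ROCm CI runner

    def test_dp_state_dict_save_load(self):
        mlp_dim = 16
        if mlp_dim % self.world_size != 0:
            self.skipTest(f"world_size={self.world_size} does not divide mlp_dim={mlp_dim}")
        # ... rest of the test ...
```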

**Test**
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153283
Approved by: https://github.com/fegin, https://github.com/weifengpy
2025-05-10 04:46:32 +00:00
fc7d8c6808 [Pipelining] Fix _batch_p2p bug for non-NCCL backends (#132644) (#152938)
Fixes #132644

`_batch_p2p` incorrectly assumes that `dist.batch_isend_irecv` returns a single-element list of `dist.Work`, likely due to NCCL's coalescing behaviour.

For non-NCCL backends like Gloo, multiple `dist.Work` objects are returned, causing the code to discard some operations via `.pop()`. This leads to deadlocks during pipeline parallelism.

## Changes:

* Modified `_batch_p2p` to return `list[dist.Work]` instead of popping a single element.
* Added `_wait_batch_p2p` to call `wait()` on multiple `dist.Work` objects, consuming the result of `_batch_p2p`.
* Updated references from `dist.Work` to `list[dist.Work]`.
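
A hedged sketch of the shape of the fix (not the exact pipelining code):
```python
import torch.distributed as dist

def _batch_p2p(p2p_ops: list[dist.P2POp]) -> list[dist.Work]:
    # non-NCCL backends (e.g. Gloo) may return more than one Work object
    return dist.batch_isend_irecv(p2p_ops)

def _wait_batch_p2p(works: list[dist.Work]) -> None:
    # wait on every returned Work instead of popping a single element
    for work in works:
        work.wait()
```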

## Testing:

* `pippy_bert.py` from #132644 now works with gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152938
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2025-05-10 04:19:38 +00:00
b86d46ff21 [torch][ao] Properly strip tracking stats in _fold_conv_bn_qat for 1D (#152982)
Summary: _fold_conv_bn_qat has logic to remove the tracking stats. Currently, this relies on a check that covers only torch.nn.modules.batchnorm.BatchNorm2d. As a result, the tracking stats are not properly removed when 1D is used. This diff updates the check to fix this.
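
Conceptually, the widened check is along these lines (a sketch with illustrative names, not the exact diff):
```python
import torch.nn as nn

# accept 1D as well as 2D batch norm when stripping num_batches_tracked
def _is_batch_norm(module: nn.Module) -> bool:
    return isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d))
```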

Test Plan:
Run N7113483 without this fix.

{F1977726982}

```
bento kernel build sensorml
```

Re-run with local version of kernel, containing this diff:

{F1977727151}

Notice that now, num_batches is removed.

Differential Revision: D74269649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152982
Approved by: https://github.com/andrewor14, https://github.com/yushangdi
2025-05-10 01:20:18 +00:00
9c99ea2991 error out on negative offs or on K=0 in group gemm (#153226)
Error out if K=0 in one of the grouped gemms to avoid hangs in #152668
Also, adds meta function for _scaled_grouped_mm (TODO: do the same for _grouped_mm, unless it's done already)
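
An illustrative sketch of the kind of validation described above, assuming `offs` holds cumulative group boundaries; this is not the actual meta/lowering code:

```python
import torch

def check_group_sizes(offs: torch.Tensor) -> None:
    # Group sizes are the differences between consecutive cumulative offsets.
    sizes = torch.diff(offs, prepend=torch.zeros(1, dtype=offs.dtype, device=offs.device))
    if (sizes < 0).any():
        raise RuntimeError("offs must be non-decreasing (negative group size)")
    if (sizes == 0).any():
        raise RuntimeError("K=0 group detected; erroring out instead of hanging")

check_group_sizes(torch.tensor([4, 8, 12]))      # ok
# check_group_sizes(torch.tensor([4, 4, 12]))    # would raise: K=0 group detected
```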

One weird thing I'm seeing: when running all grouped_gemm tests, I'm erroring out with
```
  File "/data/users/ngimel/pytorch/torch/_inductor/graph.py", line 1246, in call_function
    out = lowerings[target](*args, **kwargs)  # type: ignore[index]
  File "/data/users/ngimel/pytorch/torch/_inductor/lowering.py", line 445, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 444, in tuned_scaled_grouped_mm
    if is_nonzero and can_use_triton_kernel(mat_a, mat_b, offs, bias):
  File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 375, in can_use_triton_kernel
    offs is not None
  File "/home/ngimel/.conda/envs/pytorch_monarch/lib/python3.10/site-packages/sympy/core/relational.py", line 516, in __bool__
    raise TypeError("cannot determine truth value of Relational")
torch._inductor.exc.InductorError: LoweringException: TypeError: cannot determine truth value of Relational
```
which is weird, since there's no relational that sympy has to evaluate in `offs is not None`, and when running this test separately (`test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_False_use_torch_compile_True_cuda`) it passes. I suspect some autotuning cache has to be reset between runs, but I don't know what to look for.
Edit: that error is "fixed" by setting `dynamic=False`; now, with the correct meta function, something's wrong with dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153226
Approved by: https://github.com/kwen2501
2025-05-10 01:13:18 +00:00
639793c17e [pytorch] Expose c10_retrieve_device_side_assertion_info() for use by external code (#153211)
Summary: - Expose `c10_retrieve_device_side_assertion_info()` for use by external code.  The motivating use case is FBGEMM kernel launcher utilities, which add FBGEMM-specific context to the errors coming out of Torch DSA

Test Plan: OSS CI

Differential Revision: D74432771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153211
Approved by: https://github.com/Skylion007
2025-05-10 01:08:45 +00:00
658aea980c [inductor] Rename knobs > triton_knobs in static_cuda_launcher (#153189)
Summary: A follow up from https://github.com/pytorch/pytorch/pull/152457 since I didn't address the comment then

Test Plan: CI

Differential Revision: D74421432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153189
Approved by: https://github.com/jamesjwu
2025-05-10 00:26:21 +00:00
fbb6412fdb Stop uploading sccache stats to benchmark database (#153285)
This is not used for anything at the moment and potentially bloats up the size of the database.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153285
Approved by: https://github.com/clee2000, https://github.com/malfet
2025-05-10 00:17:38 +00:00
e6dccb036e Revert "Fix fake tensor caching when output has unbacked (#153034)"
This reverts commit 4f425a0397eb0c63b8864bb9b168a519dcfbebbe.

Reverted https://github.com/pytorch/pytorch/pull/153034 on behalf of https://github.com/malfet due to Broke pr_time_benchmarks, see d07fbd41e3/1 ([comment](https://github.com/pytorch/pytorch/pull/153034#issuecomment-2868100487))
2025-05-09 23:43:56 +00:00
4e24ee7283 Move mps_linear forward to use MPS kernels directly instead of MPSGraph (#152210)
This PR moves `mps_linear` to use MPSNDArrays and call into the MPS kernel directly instead of going through MPSGraph. It also adds a caching mechanism for reusing MPS kernels as there is also a small overhead attached to creating the kernel object.

The impact of the improvement is relatively more significant for small inputs, where the MPSGraph overhead represents a larger portion of the operation's overall execution time, but the speedup shows for both small and large input sizes, as expected.
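
The Measurement objects quoted below come from `torch.utils.benchmark`; a minimal timer of that shape (an assumed setup on an MPS-enabled machine, not the author's exact harness) looks like:

```python
import torch
from torch.utils import benchmark

x = torch.rand(1, 1, 20, device="mps")
w = torch.rand(1, 20, device="mps")

timer = benchmark.Timer(
    stmt="torch.nn.functional.linear(x, w)",
    globals={"torch": torch, "x": x, "w": w},
)
print(timer.blocked_autorange(min_run_time=0.2))
```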

`mps_linear` before the changes:
```
input shapes: f32:[1,1,20], f32:[1,20]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x109d67110>
func(*args, **kwargs)
  Median: 199.29 us
  IQR:    9.56 us (196.71 to 206.27)
  979 measurements, 1 runs per measurement, 1 thread

input shapes: f32:[1,1,5120], f32:[13284,5120]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x1063b4510>
func(*args, **kwargs)
  Median: 979.29 us
  IQR:    25.29 us (964.83 to 990.13)
  205 measurements, 1 runs per measurement, 1 thread
```

`mps_linear` after the changes:
```
input shapes: f32:[1,1,20], f32:[1,20]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x10693a190>
func(*args, **kwargs)
  Median: 176.08 us
  IQR:    15.02 us (172.42 to 187.44)
  1103 measurements, 1 runs per measurement, 1 thread

input shapes: f32:[1,1,5120], f32:[13284,5120]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x10d524dd0>
func(*args, **kwargs)
  Median: 952.56 us
  IQR:    15.63 us (945.47 to 961.10)
  210 measurements, 1 runs per measurement, 1 thread
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152210
Approved by: https://github.com/kulinseth, https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-05-09 23:41:23 +00:00
d07fbd41e3 [BE][MPS] Use squeeze/unsqueeze in Linear (#153288)
Instead of views, to reshape the weight to a 2D tensor if necessary

Already tested by `test_linear_1d_weight`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153288
Approved by: https://github.com/wdvr
2025-05-09 23:34:54 +00:00
ee326137a9 [reland] Add graph module runtime asserts to AOTI (#153182)
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925

A reland of https://github.com/pytorch/pytorch/pull/152125.

added a try-except around the justknob internally. Also added more documentation

Currently, AOTI only generate runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.

Also factored out the run time assertion logic to a separate function.

We need to generate runtime asserts directly in Inductor instead of just re-using the asserts from the input graphs, because we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into no-ops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we know "oh, of course this expression is True" already.

One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, a, b, c):
                nz = torch.nonzero(a)
                ones = a.new_ones([nz.size(0), b.size(0)])
                torch._check(ones.size(0) >= 1)
                equals = torch.add(ones, c)
                return equals
        torch._dynamo.mark_dynamic(c, 0)
```
When we re-use the ShapeEnv in Inductor lowering, the check that `a` and `nonzero` have the same shape would be evaluated to True after we resolve unbacked bindings using the ShapeEnv.
See `test_unbacked_equals_input_size_runtime_assertion` in test_aot_inductor.

In addition to the Inductor generated runtime asserts, we also need the runtime asserts from the input graph, because some derived runtime asserts are not generated in Inductor. One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, x):
                y = x.reshape(100, -1).clone()
                y = y + 1
                return y

        dynamic_shapes = {
            "x": {0: torch.export.Dim.DYNAMIC},
        }
        x.shape[0] needs to be a multiple of 100.
```
See `test_aoti_runtime_asserts_backed_symint` in test_aot_inductor.

Example:

```
    def forward(self):
        arg0_1: "f32[s35]";

        arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)

         #
        mod: "Sym(Mod(s35, 100))" = sym_size_int % 100;  sym_size_int = None
        eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0;  mod = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'");  eq_2 = _assert_scalar = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]);  arg0_1 = None
        clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view);  view = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
        add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1);  clone = None
        return (add_6,)
```

Generated cpp code:

```
    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
    auto arg0_1 = std::move(inputs[0]);
    auto arg0_1_size = arg0_1.sizes();
    int64_t s35 = arg0_1_size[0];
    inputs.clear();
    auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
    if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchinductor_dynamic_shapes -- -r test_unbacked_floordiv_simplify
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1 buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_i64_input_codegen_cuda
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1  buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r  test_unbacked_equals_input_size
```

Differential Revision: D74361799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153182
Approved by: https://github.com/henrylhtsang
2025-05-09 22:56:19 +00:00
298b43792b [RFC][inductor] Refactor AlgorithmSelectorCache to spit out make_precompile_fn (#153212)
Motivation is that `AlgorithmSelectorCache.__call__` is getting very long and hard to work with. There are nested layers of local functions in it. For example, we pass `precompile_fn`, a local variable, to `do_autotuning`, a local function, which already has a pointer to choices, a local variable, and then have `do_autotuning` calls `choices` in `self.lookup`.

When I was trying to make changes to do_autotuning, I would get `UnboundLocalError: cannot access local variable 'choices' where it is not associated with a value`. But I have no idea why it was even working in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153212
Approved by: https://github.com/eellison
2025-05-09 22:35:10 +00:00
37f92bbe0a [ROCm][CI] fix nightly build after rocm 6.4 upgrade (#153253)
rocm-smi adds an include of drm.h, and the libdrm-devel package was missing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153253
Approved by: https://github.com/jeffdaily, https://github.com/atalman

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-09 22:08:15 +00:00
9ae722cdb4 allocate cuMem memory with rdma flag (#153261)
to be able to register memory with ibverbs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153261
Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/Skylion007
2025-05-09 21:48:48 +00:00
f11d7a5978 [ROCm] Update spack includes (#152569)
* Cleans up code in `caffe2/CMakeLists.txt` to remove individual ROCm library include paths and use `ROCM_INCLUDE_DIRS` CMake var instead
* `ROCM_INCLUDE_DIRS` CMake var is set in `cmake/public/LoadHIP.cmake` by adding all the ROCm packages that PyTorch depends on
* `rocm_version.h` is provided by the `rocm-core` package, so use the include directory for that component to be compliant with Spack
* Move `find_package_and_print_version(hip REQUIRED CONFIG)` earlier so that `hip_version.h` can be located in the hip package include dir for Spack
* `list(REMOVE_DUPLICATES ROCM_INCLUDE_DIRS)` to remove duplicate `/opt/rocm/include` entries in the non-Spack case
* Remove user-provided env var `ROCM_INCLUDE_DIRS` since `ROCM_PATH` already exists as a user-provided env var, which should be sufficient to locate the include directories for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152569
Approved by: https://github.com/renjithravindrankannath, https://github.com/jeffdaily

Co-authored-by: Renjith Ravindran <Renjith.RavindranKannath@amd.com>
2025-05-09 21:36:38 +00:00
4f425a0397 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output.  In this case we shouldn't cache at all because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.
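
A toy illustration of the resulting policy, with made-up names standing in for the FakeTensorMode internals:

```python
def should_cache(input_symbols: set, output_symbols: set) -> bool:
    # If the output introduces a symbol that does not come from the inputs
    # (e.g. a fresh unbacked SymInt), a cache entry would be meaningless,
    # so we bypass caching for that op.
    return output_symbols <= input_symbols

print(should_cache({"s0"}, {"s0"}))   # True  -> safe to cache
print(should_cache(set(), {"u0"}))    # False -> unbacked output, don't cache
```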

Added a test which checks for this case.

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-09 21:17:54 +00:00
cbb03e6971 [BE][DTensor] move torch.distributed._tensor import to torch.distributed.tensor in test files (#153225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153225
Approved by: https://github.com/kwen2501, https://github.com/fegin
2025-05-09 20:40:54 +00:00
3976e52264 Fix torch.isin decomposition for scalar inputs (#153216)
This patch fixes a corner case of the `torch.isin` decomposition when both
inputs are scalars. This pattern showed up from #141196.

Fixes #141196.
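
A hypothetical repro mirroring the graph shown in the stack below (0-dim "scalar" tensors on both sides of `torch.isin`):

```python
import torch

@torch.compile(backend="inductor")
def f():
    x = torch.tensor(0)        # 0-dim tensor, i.e. a "scalar" input
    return torch.isin(x, x)    # both elements and test_elements are scalars

print(f())  # expected: tensor(True) once the decomposition handles 0-dim inputs
```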

Error stack before this patch:
```
  File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12503, in test_scalar_isin_decomposition
    res = opt_f()
          ^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 691, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1618, in _call_user_compiler
    raise BackendCompilerFailed(
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1593, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/__init__.py", line 2365, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_inductor/compile_fx.py", line 2317, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/backends/common.py", line 106, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1179, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 923, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1164, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 576, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 826, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 180, in aot_dispatch_base
    fw_module, updated_flat_args, maybe_subclass_meta = aot_dispatch_base_graph(  # type: ignore[misc]

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2199, in _trace_inner
    t = dispatch_trace(
        ^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_compile.py", line 51, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1223, in dispatch_trace
    graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/_symbolic_trace.py", line 850, in trace
    (self.create_arg(fn(*args)),),
                     ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1278, in wrapped
    out = f(*tensors)  # type:ignore[call-arg]
          ^^^^^^^^^^^
  File "<string>", line 1, in <lambda>
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 720, in inner_fn
    outs = fn(*args)
           ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 419, in _functionalized_f_helper
    f_outs = fn(*f_args)
             ^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 81, in inner_fn
    outs = fn(*args)
           ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 902, in functional_call
    out = PropagateUnbackedSymInts(mod).run(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 171, in run
    self.env[node] = self.run_node(node)
                     ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7387, in run_node
    result = super().run_node(n)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 240, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 320, in call_function
    return target(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1326, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_subclasses/functional_tensor.py", line 511, in __torch_dispatch__
    outs_unwrapped = func._op_dk(
                     ^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1428, in __torch_dispatch__
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 797, in proxy_call
    r = maybe_handle_decomp(proxy_mode, func, args, kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2358, in maybe_handle_decomp
    out = CURRENT_DECOMPOSITION_TABLE[op](*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5108, in isin
    return isin_default(elements, test_elements, invert=invert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5137, in isin_default
    x = elements.view(*elements.shape, *((1,) * test_elements.ndim))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: view() received an invalid combination of arguments - got (), but expected one of:
 * (torch.dtype dtype)
 * (tuple of ints size)

While executing %isin : [num_users=1] = call_function[target=torch.isin](args = (%x, %x), kwargs = {})
GraphModule: class GraphModule(torch.nn.Module):
    def forward(self):
         # File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12498 in f, code: x = torch.tensor(0)
        x: "i64[][]" = torch.tensor(0)

         # File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12499 in f, code: return torch.isin(x, x)
        isin: "b8[][]" = torch.isin(x, x);  x = None
        return (isin,)

Original traceback:
  File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12499, in f
    return torch.isin(x, x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153216
Approved by: https://github.com/williamwen42, https://github.com/peterbell10
2025-05-09 20:26:25 +00:00
180cbf46f2 Fix 'TensorBox' object has no attribute 'is_input_buffer' (#152980)
Summary: Fix for https://fb.workplace.com/groups/1075192433118967/permalink/1664491270855744/

Test Plan: Used reproducer from D74262030

Differential Revision: D74270090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152980
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-09 19:58:32 +00:00
d808a3e203 [dynamic shapes] guard_or_false for computeStorageNbytes (#150483)
removes fast path for computing storage, fixes some adjacent tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150483
Approved by: https://github.com/laithsakka
2025-05-09 19:31:19 +00:00
fe11d300ac [nativert] Improve MPMCQueue tests. (#153154)
Summary:
- Use std::this_thread::yield and stop busy waiting.
- Sort test file orders.

Following up @swolchok's comment from https://github.com/pytorch/pytorch/pull/152837
Test Plan: CI

Differential Revision: D74402536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153154
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-09 19:25:42 +00:00
287b1ca30c [Ez][BE]: Ensure matplotlib remains optional dependency via fake_quantize (#153244)
Unblocks #153055 and ensure that matplotlib should always be optional in PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153244
Approved by: https://github.com/albanD
2025-05-09 19:19:30 +00:00
90fde0dc09 [ONNX] Support sym_float (#153200)
Fixes #153115

Note: torch.sym_int is not supported in this PR because it does not appear in the exported program; instead, it appears as `torch.ops.aten.sym_size.int()`.

```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[s35, s16]"):
             #
            sym_size_int_1: "Sym(s35)" = torch.ops.aten.sym_size.int(x, 0);  x = None
            return (sym_size_int_1,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153200
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-05-09 19:10:17 +00:00
da0b89bcbf Scheduler Flops refactor (#152708)
This refactors `estimate_flops` and `get_estimated_runtime` on scheduler nodes:
1. New function on BaseSchedulerNode: `estimate_flops`. Works with all types of ir nodes now, not just `ExternalKernels`.
1. Extends `get_estimated_runtime` to work with non-`ExternalKernels`.

Prelude to: https://github.com/pytorch/pytorch/pull/149697

Testing:
New unit tests cover functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152708
Approved by: https://github.com/xmfan, https://github.com/eellison
2025-05-09 19:01:43 +00:00
073b0257ba [Graph Partition] Maintain relative order within partition during reordering (#153111)
PR #151968 adds `reorder_for_minimizing_partition` for the minimal number of partitions. If reordering two nodes cannot reduce the number of partitions, `reorder_for_minimizing_partition` should maintain the relative order of these two nodes and rely on other reorder passes for some nice features, such as shorter liveness duration or less peak memory. In an extreme case, when all nodes are on gpu and can be cudagraphed, `reorder_for_minimizing_partition` should not reorder any nodes.

This PR improves `reorder_for_minimizing_partition` for the invariant: the relative order of nodes within the same graph partition is maintained. To do so, we record the index of each node in the input `nodes: list[BaseSchedulerNode]` and use a heap to pop the node with the smallest index. So we always schedule the node with the smallest index in the same graph partition, respecting the invariant. A previous implementation tried to use a queue to achieve this but failed, because node_N at the end may rely on node_1 at the start, such that node_N is added to the queue as soon as node_1 is scheduled.
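
An illustrative, self-contained sketch of that heap-based ordering (not the Inductor implementation; the dependency representation here is made up):

```python
import heapq

def reorder(nodes: list[str], deps: dict[str, set[str]]) -> list[str]:
    """Topological order that breaks ties by original index, preserving relative order."""
    index = {n: i for i, n in enumerate(nodes)}
    indegree = {n: len(deps.get(n, ())) for n in nodes}
    ready = [index[n] for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        n = nodes[heapq.heappop(ready)]   # always pick the smallest original index
        order.append(n)
        for m in nodes:
            if n in deps.get(m, ()):
                indegree[m] -= 1
                if indegree[m] == 0:
                    heapq.heappush(ready, index[m])
    return order

print(reorder(["a", "b", "c"], {"c": {"a"}}))  # ['a', 'b', 'c'] (relative order preserved)
```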

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153111
Approved by: https://github.com/eellison
2025-05-09 18:49:53 +00:00
ec24f8f58a Format all headers under ATen/cpu/vec, not just top-level (#152364)
not formatting these seems like an oversight. Had to add a few clang-format suppressions to keep includes in the same order to avoid breaking builds.

This PR was generated using `lintrunner --paths-cmd "rg --files -g '*.h' aten/src/ATen/cpu/vec/" format`

Differential Revision: [D73802128](https://our.internmc.facebook.com/intern/diff/D73802128/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152364
Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/CaoE
2025-05-09 18:46:07 +00:00
76e34e3850 [Kineto] Upgrade the kineto commit to fb36cce (#152007)
XPU intends to upgrade the oneAPI version (https://github.com/pytorch/pytorch/issues/151097) to support torch Distributed. However, the PTI in the oneAPI version to be upgraded introduces breaking changes: it changed the signatures of the following APIs.
- ptiViewEnableRuntimeApi
- ptiViewGetApiIdName

To avoid breakage due to PTI's upcoming non-backward-compatible changes, we refined the XPU PTI integration with kineto: we check the PTI version and then invoke the PTI API accordingly. This means the kineto in this PR can handle the non-backward-compatible change in the upcoming oneAPI 2025.1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152007
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/sraikund16, https://github.com/malfet
2025-05-09 18:38:41 +00:00
192f7140d1 [fbgemm_gpu] Replace C10_CUDA_KERNEL_LAUNCH_CHECK() in the KernelLauncher (#153178)
Summary:
- Replace `C10_CUDA_KERNEL_LAUNCH_CHECK()` in the `KernelLauncher`, as the
  latter does not print __FILE__ and __LINE__

The existing `C10_CUDA_KERNEL_LAUNCH_CHECK()` implementation does not print the source file and line number when a CUDA kernel launch throws an error, leaving users confused with a context-less message like `CUDA error: invalid arguments`.  This new check is a slimmed re-implementation of the macro with extra context information added to the error (beyond just file and line number) so that we can at least locate the FBGEMM source file or template where the error first surfaces.

Test Plan:
```
buck2 run 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher

buck2 run 'fbcode//mode/opt-amd-gpu' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
```

Reviewed By: sryap

Differential Revision: D74364031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153178
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-05-09 17:43:16 +00:00
595e21a9dd [cutlass-3] Add cutlass key for fbcode and OSS (#153081)
Differential Revision: [D74337959](https://our.internmc.facebook.com/intern/diff/D74337959/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153081
Approved by: https://github.com/drisspg
2025-05-09 17:38:31 +00:00
ffda46e3be [Graph Partition] remove weak dep from partition_input_names (#152863)
Graph partition analyzes read_writes to get partition input names. However, a weak dep is a fake dependency and is not actually read or written, so we should not include weak deps in graph partition input names.

The following test failure is fixed by removing weak dependency from partition_input_names:
`PYTORCH_TEST_WITH_INDUCTOR=1 python test/test_torch.py TestTorchDeviceTypeCUDA.test_params_invalidated_with_grads_invalidated_between_unscale_and_step_Adam_cuda_float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152863
Approved by: https://github.com/eellison
2025-05-09 17:20:04 +00:00
286de0d601 [CI] Enable XCCL in XPU CI build (#150927)
As XCCL has been enabled for torch xpu, enable it in CI build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150927
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/atalman
2025-05-09 17:12:34 +00:00
e73a4c3643 [BE][CI] Merge regular and MPS test config shards (#152719)
Unsure why these were separate to begin with
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152719
Approved by: https://github.com/seemethere, https://github.com/atalman
ghstack dependencies: #153013, #153057
2025-05-09 17:01:35 +00:00
309ecb2277 [CI] Add opt-in h100 tests (#153170)
So far only run:
 - inductor/test_fp8.py
 - test_matmul_cuda.py
 - inductor/test_max_autotune.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153170
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-05-09 17:01:05 +00:00
8ea95d2e73 [inductor] dtype promotion error in cat decomp (#152995)
Cloning a single tensor wasn't following dtype promotion rules;
seen in the SAM model: https://github.com/pytorch/pytorch/issues/152606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152995
Approved by: https://github.com/yushangdi, https://github.com/eellison
2025-05-09 16:58:58 +00:00
e21ff9c3be Add logging for guard miss failure (#153125)
Differential Revision: [D74371381](https://our.internmc.facebook.com/intern/diff/D74371381/)

This PR adds some logging for guard misses to tlparse, so that we know when AOTAutogradCache and FxGraphCache miss due to guards.

Example tlparse result:
https://gist.github.com/jamesjwu/afa19335c0aee85b24546b13c1cf6427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153125
Approved by: https://github.com/oulgen, https://github.com/jingsh
2025-05-09 16:51:04 +00:00
9d00f2b375 [autograd][docs] Add more details on why save_for_backward is important in extending autograd note (#153005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153005
Approved by: https://github.com/albanD
2025-05-09 16:36:57 +00:00
50657120a0 Allow workflows to opt-out of experiments (#153085)
This change adds support to allow workflows to opt-out of experiments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153085
Approved by: https://github.com/ZainRizvi

Co-authored-by: Zain Rizvi <ZainRizvi@users.noreply.github.com>
2025-05-09 16:34:46 +00:00
18e13a67ce [dynamo] Harden torch function dispatchability check for attributes and methods access (#153082)
See more details in
https://github.com/pytorch/pytorch/issues/151771#issuecomment-2836372110.

Fixes #151771.

Differential Revision: [D74342291](https://our.internmc.facebook.com/intern/diff/D74342291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153082
Approved by: https://github.com/mlazos
2025-05-09 16:14:23 +00:00
c227865720 [AOTInductor] Fix state of ConstantFolding (#153152)
Summary:
Bug fix for constant folding states. We are not setting the correct state for each update.
One race condition would be:
(1) All threads obtain the model_exec_lock from the main run.
(2) In the second round of updating the constant buffer, we should have set secondary as INITIALIZED, but primary is mistakenly set instead.
(3) run_const_fold gets called and a model_exec_lock is requested, waiting to become available at this time.
(4) The main run enters INITIALIZED, waiting for a unique_lock (while a shared_lock is being held by (3) at this moment).

Test Plan:
TBD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153152
Approved by: https://github.com/jingsh, https://github.com/chenyang78
2025-05-09 16:03:05 +00:00
f2ea63658f Refactor nested benchmarking functions in select_algorithm.py (#153084)
Summary: I'll need some of the benchmark-related functions surfaced so I can use them for remote autotuning. This PR just lifts the main in-process benchmarking helpers to classmethods. It wasn't strictly necessary to also move the sub-process benchmarking helper, but I think it improves readability. Also added some missing types.

Test Plan: Existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153084
Approved by: https://github.com/aorenste, https://github.com/eellison
2025-05-09 15:09:51 +00:00
916f6bafe7 Fix HF loading when there's no metadata file to work with fsspec (#152856)
Summary: HF loading when there is no metadata file is an edge case for some users. We were previously calling safe_open(filename) to get the keys in the safetensors file, but this doesn't work with fsspec when models have a backend other than the local fs (i.e. hf, s3, etc.). This diff updates the code to open the file with fsspec.open() and then safetensors.deserialize() to get the keys.
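
A hedged sketch of that flow; the helper below is illustrative and assumes `safetensors.deserialize` yields `(name, metadata)` pairs as used in the diff:

```python
import fsspec
from safetensors import deserialize

def list_safetensors_keys(path: str) -> list[str]:
    # Works for non-local backends (e.g. hf://, s3://) because fsspec resolves the URL,
    # unlike safe_open(filename), which expects a local file.
    with fsspec.open(path, "rb") as f:
        data = f.read()
    return [name for name, _ in deserialize(data)]
```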

Test Plan: unit test and e2e test reading from hf

Differential Revision: D74181513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152856
Approved by: https://github.com/joecummings
2025-05-09 13:32:01 +00:00
e06a08059a Add device guard for xpu conv on multi device (#153067)
# Motivation
fixes https://github.com/pytorch/pytorch/issues/153022
The root cause is that the XPU backend registers the convolution op using `m.impl`, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.

# Additional Context
run the following script
```python
import torch
import torchvision.models as models

torch.manual_seed(0)

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

device = torch.device('xpu:1')  # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)

with torch.no_grad():
    ret = model(data)
    print(ret)

print("Execution finished")
```
The output is
```bash
         -9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01,  6.4551e-01,
         -6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01,  3.2715e-02,
         -4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
         -8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01,  2.0312e+00]],
       device='xpu:1', dtype=torch.float16)
Execution finished

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153067
Approved by: https://github.com/albanD, https://github.com/EikanWang
2025-05-09 09:41:51 +00:00
aca2c99a65 xpu: get xpu arch flags at runtime in cpp_extensions (#152192)
This commit moves the query for xpu arch flags to runtime when building SYCL extensions, which allows adjusting `TORCH_XPU_ARCH_LIST` at the Python script level. That's handy, for example, in a CI test that tries a few variants of the list.
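
A hedged illustration of what adjusting the list at the Python level could look like in such a CI test; the arch values are assumptions and the actual extension build step is elided:

```python
import os

for arch_list in ("pvc", "pvc,dg2"):     # hypothetical variants a CI test might try
    os.environ["TORCH_XPU_ARCH_LIST"] = arch_list
    # ...build and test the SYCL extension here (e.g. via torch.utils.cpp_extension)...
    print(f"building with TORCH_XPU_ARCH_LIST={arch_list}")
```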

CC: @malfet, @jingxu10, @EikanWang, @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152192
Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/albanD
2025-05-09 05:43:50 +00:00
9fa07340fd [Cutlass] Implement memory planning for EVT (#153177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153177
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #153196, #150907
2025-05-09 05:39:05 +00:00
a3154ca34a [Cutlass] Changes to gemm template for EVT (#150907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150907
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #153196
2025-05-09 05:39:05 +00:00
c54aa0da01 [Cutlass] Fix tests (#153196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153196
Approved by: https://github.com/BoyuanFeng
2025-05-09 05:39:05 +00:00
34196301d5 Revert "[CI] Add opt-in h100 tests (#153170)"
This reverts commit f87a0fe2cae5be82ffd845fa7e6053396c8222d1.

Reverted https://github.com/pytorch/pytorch/pull/153170 on behalf of https://github.com/clee2000 due to workflow doesnt have right concurrency group? ([comment](https://github.com/pytorch/pytorch/pull/153170#issuecomment-2864951319))
2025-05-09 03:04:50 +00:00
b30d276abc [CUDA][cuBLASLt] Fix scale setting for allowFP16AccumulationCuBLAS true case (#153083)
Also add some missing `@onlyCUDA` / support check decorators in `test_matmul_cuda.py`
Should help resolve #151890

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153083
Approved by: https://github.com/janeyx99
2025-05-09 02:27:17 +00:00
10234ccefe xpu: rely on sycl/sycl.hpp to include bfloat16.hpp (#152562)
Fixes: https://github.com/intel/torch-xpu-ops/issues/1503

The `sycl/ext/oneapi/bfloat16.hpp` header file is a DPC++ compiler internal header. It's not documented for use (see the extension specification linked below) and is not guaranteed to exist. Instead, the documented usage of the extension suggests relying on including `sycl/sycl.hpp`, which in turn includes the `bfloat16.hpp` header (an implementation detail).

We ran into issues by explicitly including the `bfloat16.hpp` sycl header within a user-facing production environment where the `intel-sycl-rt` wheel is installed (which is a dependency of the `torch` wheel package built and publicly available for xpu). The compiler includes this file from `intel-sycl-rt`, and due to the `#pragma once` usage its content is included as well, giving redefinitions of symbols in this file (the previous inclusion comes from `sycl/sycl.hpp`):
```
In file included from /workspace/lib/python3.12/site-packages/torch/include/c10/util/BFloat16.h:23:
/opt/intel/oneapi/compiler/2025.0/bin/compiler/../../include/sycl/ext/oneapi/bfloat16.hpp:60:23: error: redefinition of 'BF16VecToFloatVec'
   60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
      |                       ^
/workspace/include/sycl/ext/oneapi/bfloat16.hpp:60:23: note: previous definition is here
   60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
      |
```
While the SYCL header files themselves can be improved (`#pragma once` dropped), we still must correct the usage of the sycl `bfloat16.hpp` header in pytorch, i.e. drop it. This fortunately helps address the reported redefinition issue, though a follow-up on the compiler side is still required.

Also, using `SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS` to guard the inclusion of `sycl/sycl.hpp` does not make sense since it's defined in this very header. Thus, we should use `SYCL_LANGUAGE_VERSION` instead, which is defined at the compiler level.

See: f958dce280/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc

CC: @EikanWang, @guangyey, @gujinghui

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152562
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
2025-05-09 02:25:44 +00:00
faff387bfd Mini tutorial for provenance tracking (#152211)
as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152211
Approved by: https://github.com/svekars, https://github.com/eellison, https://github.com/desertfire
2025-05-09 01:41:04 +00:00
f87a0fe2ca [CI] Add opt-in h100 tests (#153170)
So far only run:
 - inductor/test_fp8.py
 - test_matmul_cuda.py
 - inductor/test_max_autotune.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153170
Approved by: https://github.com/drisspg
2025-05-09 01:03:12 +00:00
ab829ec629 [dynamo][pr_time_benchmark] Add dynamo benchmark to stress test inlining (#153159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153159
Approved by: https://github.com/laithsakka
ghstack dependencies: #152883, #153105
2025-05-09 00:09:19 +00:00
cbcb57d09d [CI] Use sccache installed in docker image in xla build (#153002)
The edited comment should have the info. The code change looks large, but it's copied from the install_cache script that our docker images use 6a8006472e/.ci/docker/common/install_cache.sh (L42)

Sccache stopped working on xla at some point near Dec 17, 2023. I am not sure which commit caused it; I think it was having trouble writing to the cache.

Either way, there is an sccache already installed in the docker image, so we should use that instead of a binary from s3 whose origin (and the commit it was built from) we're no longer sure of.

The one in the docker image is installed here 69d438ee65/.github/upstream/Dockerfile (L61) and is also very old, so I have https://github.com/pytorch/xla/pull/9102 to update it

sccache is still not writing properly and I will investigate, but the xla build is currently broken after the above xla PR, and this should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153002
Approved by: https://github.com/malfet
2025-05-08 23:22:20 +00:00
0203f89cc1 Revert "[BE]: Add PEP621 project section to pyproject.toml (#153055)"
This reverts commit 5976419c6939207834492a1f5fba4a62f2c91b0d.

Reverted https://github.com/pytorch/pytorch/pull/153055 on behalf of https://github.com/malfet due to And failures seems related to this change, but I don't know how, see for example 7cb5c751c3/1 ([comment](https://github.com/pytorch/pytorch/pull/153055#issuecomment-2864664725))
2025-05-08 23:17:58 +00:00
7cb5c751c3 Fix the basic description of torch.min(), torch.max(), torch.all(), torch.any() (#152658)
Fixes #152176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152658
Approved by: https://github.com/malfet
2025-05-08 22:59:14 +00:00
5683965f02 [ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151727
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/eqy
2025-05-08 22:38:23 +00:00
5dd746b4b5 [c10d] Reduce test verbosity (#153116)
We have been seeing a lot of `Starting event listener thread for rank` messages recently in test print-outs. Moving them to `logger.debug`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153116
Approved by: https://github.com/fduwjj
2025-05-08 22:22:22 +00:00
5a8c9c3ab0 [FSDP2][Doc] add pointer to torchtitan (#153079)
<img width="838" alt="Screenshot 2025-05-08 at 10 51 05 AM" src="https://github.com/user-attachments/assets/4cf43a16-3801-424b-a74f-ede1d41ff052" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153079
Approved by: https://github.com/mori360
2025-05-08 22:22:07 +00:00
88b56774bd At least one of ROCM_HOME or CUDA_HOME must be None (#152236)
Copied description by @hj-wei from
https://github.com/ROCm/pytorch/pull/1809

> Hi all, I am manually generating nvcc to bypass NVIDIA component
checks (Megatron-LM); see
2da43ef4c1/megatron/legacy/fused_kernels/__init__.py (L57)

> but it can lead to incorrect CUDA_HOME configurations. This can cause
initialization anomalies in downstream libraries like DeepSpeed
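
An illustrative check matching the commit title (a sketch only; the real logic lives in torch.utils.cpp_extension and derives these paths differently):

```python
import os

rocm_home = os.environ.get("ROCM_HOME")
cuda_home = os.environ.get("CUDA_HOME")
if rocm_home and cuda_home:
    # Having both set at once usually indicates a misconfigured environment
    # (e.g. a hand-crafted nvcc on a ROCm system), so fail loudly.
    raise RuntimeError("Both ROCM_HOME and CUDA_HOME are set; at least one must be unset")
```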

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152236
Approved by: https://github.com/jeffdaily
2025-05-08 22:20:25 +00:00
4064062e18 [c10d] Test multiple CUDA Graph captures (#150040)
1. Do multiple captures
2. Perform multiple collectives in one capture
3. Multiple replays (existing)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150040
Approved by: https://github.com/fduwjj
2025-05-08 22:14:03 +00:00
d9dc6b56ec Support using SymInt shapes for torch.baddbmm no-broadcast case (#153112)
A typical `bmm` kernel in Helion needs to pass in symint shapes to `torch.baddbmm`. Currently `self.expand((dim1, dim2, dim3))` in baddbmm runs unconditionally and it doesn't work with symint shapes (it raises the following error):
```
Traceback (most recent call last):
  File "/home/willfeng/local/helion_yf225/helion/_compiler/type_propagation.py", line 699, in propagate_call
    CheckForIndexCalls.retry_call(self.value, proxy_args, proxy_kwargs),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/helion_yf225/helion/_compiler/tile_index_proxy.py", line 104, in retry_call
    return fn(*proxy_args, **proxy_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1338, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1986, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1450, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 2645, in _dispatch_impl
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_ops.py", line 806, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_meta_registrations.py", line 2172, in meta_baddbmm
    self = self.expand((dim1, dim2, dim3))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: /home/willfeng/local/pytorch/build/aten/src/ATen/RegisterCompositeExplicitAutograd_0.cpp:5025: SymIntArrayRef expected to contain only concrete integers
```
This PR changes it so that we don't run `expand()` when not necessary, which makes the Helion use case (i.e. no broadcasting) work.
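
A toy version of that "only expand when broadcasting is needed" idea (the real fix is in meta_baddbmm; this helper is illustrative):

```python
import torch

def maybe_expand(t: torch.Tensor, shape: tuple) -> torch.Tensor:
    # In the no-broadcast case, skip expand() entirely so symbolic (SymInt)
    # shapes never hit the concrete-integer requirement shown in the trace above.
    if tuple(t.shape) == tuple(shape):
        return t
    return t.expand(shape)

x = torch.randn(2, 3, 4)
print(maybe_expand(x, (2, 3, 4)).shape)  # torch.Size([2, 3, 4]), no expand performed
```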

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153112
Approved by: https://github.com/jansel
2025-05-08 21:34:24 +00:00
4166373908 [dynamic shapes] guard_or_false for infer_size (#152146)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152146
Approved by: https://github.com/laithsakka
2025-05-08 21:27:22 +00:00
5976419c69 [BE]: Add PEP621 project section to pyproject.toml (#153055)
Follow-up to @ezyang's PR #153020, but makes better use of PEP 621 to reduce redundant fields and pass metadata through to uv, setuptools, poetry, and other tooling.

* Enables modern tooling like uv sync and better support for tools like poetry.
* Also allows us to set project-wide settings that are respected by linters and IDEs (in this example we are able to centralize the minimum supported python version).
* Currently most of the values are dynamically fetched from setuptools; eventually we can migrate all the statically defined values to pyproject.toml and they will be autopopulated into the setuptools arguments.
* This controls what additional metadata shows up on PyPI. Special URL names are listed here for rendering on PyPI: https://packaging.python.org/en/latest/specifications/well-known-project-urls/#well-known-labels

This also clearly shows us which fields will need to be migrated from setup.py to pyproject.toml over time, per #152276. Static fields will be fairly easy to migrate; the dynamically built ones like requirements are a bit more challenging.

Without this, `uv sync` complains:
```
error: No `project` table found in: `pytorch/pyproject.toml`
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153055
Approved by: https://github.com/ezyang
2025-05-08 21:27:19 +00:00
9608e7fee9 [nativert] Address tooling setup for torch/nativert/ (#153164)
Summary:
As discussed with @malfet , we're porting nativert code to torch/nativert/.
Following up some concerns over the new directory, I'm trying to setup the tooling on OSS so various things (like linters) can run on torch/nativert/ properly.

Test Plan: CI

Differential Revision: D74407808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153164
Approved by: https://github.com/dolpm, https://github.com/Skylion007
2025-05-08 21:11:33 +00:00
e820b05cab [inductor] Generate synthetic offsets appropriately for autotuning _scaled_grouped_mm (#152968)
Summary: The autotuner is using zero-filled tensors to autotune
_scaled_grouped_mm and that's not appropriate for the offsets tensor, since it
essentially corresponds to "no input" and thus yields invalid perf results.

We can't really use the actual input tensors, since we might be compiling this
op in the context of an entire graph.

So instead, I decided to create a synthetic offsets tensor assuming that each
group is (roughly) the same size.  I don't have data but I'd guess this
approach is OK for MoE since we're generally hoping to load-balance the
experts; I'm not sure how well it applies to other scenarios that might be more
heavy-tailed.
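
A small sketch of that "roughly equal groups" construction (illustrative only, assuming the offsets tensor holds cumulative group boundaries):

```python
import torch

def synthetic_offsets(total_rows: int, num_groups: int, device: str = "cpu") -> torch.Tensor:
    # Split total_rows into num_groups groups of (roughly) equal size and
    # return cumulative boundaries, e.g. 10 rows / 3 groups -> [4, 7, 10].
    sizes = torch.full((num_groups,), total_rows // num_groups, dtype=torch.int64)
    sizes[: total_rows % num_groups] += 1
    return torch.cumsum(sizes, dim=0).to(device)

print(synthetic_offsets(10, 3))  # tensor([ 4,  7, 10])
```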

Test Plan:
```
pytest test_matmul_cuda.py -k test_scaled_grouped_gemm_
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152968
Approved by: https://github.com/ngimel
2025-05-08 21:07:04 +00:00
590965f92f [Graph Partition][Flex Attention] analyze symints from subgraph inputs and outputs (#152878)
Flex Attention may have symints in subgraph inputs and outputs. Existing code implicitly captures these symints but does not explicitly store them in TritonTemplateBuffer. This leads to an error when analyzing symints used in Flex Attention as a TritonTemplateBuffer. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152878
Approved by: https://github.com/drisspg
2025-05-08 20:25:35 +00:00
6ae7730eeb Use gcc13 in Manylinux 2.28 images (#152825)
Related to: https://github.com/pytorch/pytorch/issues/152426
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152825
Approved by: https://github.com/malfet
2025-05-08 20:04:48 +00:00
8b8051f6ed [Minimizer] Fix the path naming (#153130)
Summary:
Added some logging and captured the indexing. See below image.
{F1977773416}

This is why the saved module path is called `/tmp/jimwan/minimizer_a_acc.pt`

Now the updated module paths are `/tmp/jimwan/minimizer_addmm_default_103_acc.pt`.

Test Plan:
```
MTIAC_USE_DIST_REF_KERNELS=all  buck2 run @//mode/opt mtia/accuracy/minimizer:mtia_minimizer_runner --  --mode sequential  --compare_fn allclose  --pt_save_dir  /tmp/debug3  --atol 1e-4 --rtol 1e-4 --all_outputs --start_idx native_layer_norm_default_80 --end_idx getitem_272 2>&1 | tee ~/test.log
```
{F1977773610}

Reviewed By: qcyuan

Differential Revision: D74369107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153130
Approved by: https://github.com/Skylion007
2025-05-08 19:59:52 +00:00
086e2c2399 [TEST][ATen][CUDA] Skip row-wise scaled matrix multiplication tests on sm_120+ (#152814)
The float8 row-wise scaled matmuls are not supported on Blackwell yet. This PR adds skips to those tests to decrease the noise on `sm_120+` machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152814
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-05-08 19:34:20 +00:00
4b8b7c7fb9 [CI] Use cmake from pip instead of conda in CI docker images (#152537)
As in title

I don't know how the install_cmake script is used, because I see it being called with 3.18, but when I look at the build jobs some say 3.18 and others 3.31.

Just make everything install cmake via the requirements-ci.txt.  I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm just defaulting to installing through pip

Also defaulting to 4.0.0 everywhere except the executorch docker build because executorch reinstalls 3.31.something
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
2025-05-08 18:58:10 +00:00
b3524080dc [AOTInductor] Generate kernels separately for const graph and main graph (#153040)
Summary:
We should generate the kernel for const graph and main graph separately.
The reason is that when we run autotuning, we would create separate
kernel calls and we should make sure that main graph also contains the
runner.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_autotune_with_constant_folding


Differential Revision: [D74347765](https://our.internmc.facebook.com/intern/diff/D74347765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153040
Approved by: https://github.com/angelayi
2025-05-08 18:45:45 +00:00
e5f869999c [inductor] Fix ModularIndexing assumptions (#152993)
Fixes https://github.com/pytorch/pytorch/issues/151198.

Since the result of ModularIndexing can be zero due to the modulo operation, we should not make any assumption about ModularIndexing being positive.
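
For a concrete sense of why, assuming `ModularIndexing(x, div, mod)` corresponds to `(x // div) % mod`:

```python
# The modulo can make the whole expression zero even for positive x,
# so "nonnegative" is the most we can assume, not "positive".
x, div, mod = 8, 2, 4
print((x // div) % mod)  # 0
```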

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152993
Approved by: https://github.com/yf225
2025-05-08 18:26:45 +00:00
d900c68ea6 c10d/gloo: add ibverbs backend (#153015)
Summary:
X-link: https://github.com/pytorch/gloo/pull/437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* recv-from-any is not yet supported
* AllreduceBcubeBase2 is failing

Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:

GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```

We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run.

Differential Revision: D73291471

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153015
Approved by: https://github.com/fduwjj
2025-05-08 18:26:29 +00:00
7cdf5048ea Fix evaluate_expr to include suppress_guards_tls in cache key (#152661)
ShapeEnv.evaluate_expr() behaves differently based on the (tls) global "suppress_guards" - so its cache key needs to include that value.

This came up because #152662 triggered it in the test `test/dynamo/test_exc.py::ExcTests::test_trigger_bisect_on_error` - fixing this caused that test to work again.
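
A hedged sketch of the idea (the attribute and cache names below are hypothetical, not the real `ShapeEnv` internals): fold the thread-local suppress_guards state into the memoization key so a result cached under suppressed guards is never reused when guards are active.
```
# Illustrative only: fold the TLS flag into the cache key.
def evaluate_expr(self, expr, size_oblivious=False):
    key = (expr, size_oblivious, self._suppress_guards_tls())
    if key in self._eval_cache:
        return self._eval_cache[key]
    result = self._evaluate_expr_impl(expr, size_oblivious=size_oblivious)
    self._eval_cache[key] = result
    return result
```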

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152661
Approved by: https://github.com/laithsakka
2025-05-08 18:25:34 +00:00
30a3c5d970 Skip lintchecks for now (#153156)
As devs have been complaining that it's failing, completely remove these checks from lint.yml, since https://github.com/pytorch/pytorch/pull/153157 moved them to nightly

See https://github.com/pytorch/pytorch/issues/152439  as well as https://github.com/pytorch/pytorch/issues/152884 and https://github.com/pytorch/pytorch/issues/152489 for more details

Was introduced in https://github.com/pytorch/pytorch/pull/152377
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153156
Approved by: https://github.com/albanD, https://github.com/ZainRizvi
2025-05-08 17:58:05 +00:00
e86b6b2a19 Add tests to check pretty print when padding is a string in C++ API (#153126)
Currently there are no tests to verify the behaviour of pretty print when padding is `torch::kSame` or `torch::kValid`. This PR just adds this tests to check for future regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153126
Approved by: https://github.com/Skylion007
2025-05-08 17:55:25 +00:00
d36261d2e6 Revert "[dynamo] Avoid running torch.nn.Module.__call__ twice under torch.compile(mod) (#152740)"
This reverts commit 0886d402f155e0b34760a2906f4bd71c878fd98f.

Reverted https://github.com/pytorch/pytorch/pull/152740 on behalf of https://github.com/huydhn due to Discuss with the author to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/152740#issuecomment-2863779028))
2025-05-08 17:31:21 +00:00
34d424d813 Revert "[dynamo] Support delattr on result of torch.compile(module) (#152741)"
This reverts commit 6c025b5a8270e456405eccc26db1344ddd016d7b.

Reverted https://github.com/pytorch/pytorch/pull/152741 on behalf of https://github.com/huydhn due to Discuss with the author to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/152740#issuecomment-2863779028))
2025-05-08 17:31:21 +00:00
6a8006472e Fix doc cosineannealinglr 152081 (#152936)
## Summary

This PR updates the docstring for `CosineAnnealingLR` to accurately reflect its recursive learning rate schedule. The previous docstring displayed only the SGDR closed-form expression, which doesn't match the actual recursive implementation in code.

Changes:

- Added the recursive update formula used in `get_lr()`
- Retained the original closed-form SGDR expression for reference
- Clarified that warm restarts are not implemented in this scheduler

This addresses confusion raised in issue #152081.

## Related issue

[#152081](https://github.com/pytorch/pytorch/issues/152081)

## Testing

Doc-only change. Ran pre-commit to verify formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152936
Approved by: https://github.com/janeyx99
2025-05-08 17:25:30 +00:00
3cd69350ed [export] Unflatten None (#153000)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153000
Approved by: https://github.com/pianpwk
2025-05-08 16:40:13 +00:00
7b806a8cb1 Revert "[inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)"
This reverts commit 93576351270383ca37deaec6b2417a33dc045a93.

Reverted https://github.com/pytorch/pytorch/pull/152353 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail an inductor test in trunk ([comment](https://github.com/pytorch/pytorch/pull/152353#issuecomment-2863657185))
2025-05-08 16:39:28 +00:00
cyy
d291fa8ecc Avoid std::chrono::system_clock (#153135)
This PR replaces most `std::chrono::system_clock` with `std::chrono::steady_clock` if the duration is used in condition variables. Ideally system clocks should be used only to log wall-clock times.

Some `high_resolution_clock` are also changed to `steady_clock` because its resolution is not required in the context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153135
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/malfet
2025-05-08 16:30:29 +00:00
fe8ebacee4 [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-08 16:12:16 +00:00
05326b7e49 Revert "Add runtime asserts to AOTI (#152125)"
This reverts commit 834bc5e4148538b7544aafdf5b090d007600fbd6.

Reverted https://github.com/pytorch/pytorch/pull/152125 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/152125#issuecomment-2863554139))
2025-05-08 15:58:18 +00:00
1d3e8f326a [CI] Increase shards number for XPU ci UT tests (#149113)
The XPU CI test hit a timeout issue (see https://github.com/pytorch/pytorch/actions/runs/14897047392/job/41842336828), and this PR will reduce the CI time cost
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149113
Approved by: https://github.com/etaf, https://github.com/EikanWang
2025-05-08 15:42:33 +00:00
8141b146ca Run URL linter on nightly only (#153157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153157
Approved by: https://github.com/malfet
2025-05-08 15:32:42 +00:00
efa07df257 [c10d] Remove unordered PG destroy test (#153110)
torch.distributed does not support unordered ProcessGroup destroy. Removing the test.

Resolves #137507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153110
Approved by: https://github.com/fduwjj, https://github.com/fegin
2025-05-08 15:29:44 +00:00
500cbeee4e [dynamo][ca] support dynamic annotations on tensors in ListVariables/TupleVariables (#152119)
Together with https://github.com/pytorch/pytorch/pull/151962, FIXES https://github.com/pytorch/pytorch/issues/133575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152119
Approved by: https://github.com/jansel
ghstack dependencies: #151731, #151962
2025-05-08 15:12:16 +00:00
6dea8ef555 [ca] hide unused scalar int sizes from dynamo (#151962)
together with https://github.com/pytorch/pytorch/pull/151731, FIXES https://github.com/pytorch/pytorch/issues/113129 https://github.com/pytorch/pytorch/issues/146168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151962
Approved by: https://github.com/jansel
ghstack dependencies: #151731
2025-05-08 15:12:16 +00:00
8f380b239f [ca] mark scalar int sizes as dynamic via tensor wrapping (#151731)
This is the only way to support dynamic shapes on scalars right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151731
Approved by: https://github.com/jansel
2025-05-08 15:12:08 +00:00
a7ea115494 Revert "[CI] Use cmake from pip instead of conda in CI docker images (#152537)"
This reverts commit 941062894a1accfd472d0acd2716493e1f173bd7.

Reverted https://github.com/pytorch/pytorch/pull/152537 on behalf of https://github.com/malfet due to Sorry to revert this PR, but it broke doc builds, see 4976b1a3a8/1 ([comment](https://github.com/pytorch/pytorch/pull/152537#issuecomment-2863337268))
2025-05-08 14:53:34 +00:00
4976b1a3a8 Keep raw cubin file around in case it gets deleted underneath us (#153064)
This diff hardens StaticCudaLauncher in the event a cubin file gets deleted under us. We store the raw cubin on the static cuda launcher, and reload it as needed. On cold start, this can happen if the cubin file is created by triton, and gets deleted before we can load the kernel on the parent process.

We don't want to store the entire cubin both in file format and in memory for caching purposes, so we delete it before caching the data. In the unfortunate/unlikely event where we can't load/find the necessary file on warm start, skip the stored triton launcher, falling back to regular triton.

This comes at a cost to worker memory, but it's not more memory than regular triton workers already take, so it should be okay.
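
A simplified sketch of the reload-on-miss idea (class and attribute names are illustrative, not the actual StaticCudaLauncher code):
```
import os

class CubinHolder:
    def __init__(self, cubin_path):
        self.cubin_path = cubin_path
        with open(cubin_path, "rb") as f:
            self.cubin_raw = f.read()  # keep the raw bytes in memory

    def ensure_on_disk(self):
        # Triton may have cleaned up its temp dir; rewrite the cubin from memory.
        if not os.path.exists(self.cubin_path):
            with open(self.cubin_path, "wb") as f:
                f.write(self.cubin_raw)
        return self.cubin_path
```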

Tests:
- Make test_static_cuda_launcher always delete the cubin path and reload it

Fixes #153030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153064
Approved by: https://github.com/oulgen, https://github.com/jansel
2025-05-08 14:29:19 +00:00
13bdfe6577 get right function declaration on windows inductor (#152939)
Fixes #152251

`get_export_declaration` introduced an extra ')' on the Windows platform, which makes this pattern of function declaration differ from Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152939
Approved by: https://github.com/xuhancn, https://github.com/jansel
2025-05-08 14:28:33 +00:00
0f9821d0e3 [BE][lint] fix PYFMT for PT-D code under torch.testing._internal, add them to the lint list (#153114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153114
Approved by: https://github.com/cyyever, https://github.com/fegin, https://github.com/H-Huang, https://github.com/Skylion007
2025-05-08 14:01:49 +00:00
2926dd4d8e Stop proxy-ing autograd.Function.ctx into the graph (#152621)
The reason why we did this before is because that's how our older
autograd.Function x Dynamo interaction work, but we've since adopted
newer designs that don't actually need the autograd.Function.ctx proxied
into the graph.

We still need a fx.Proxy for the autograd.Function.ctx object, so
whenever we do I create one via discard_graph_changes.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152621
Approved by: https://github.com/oulgen
2025-05-08 13:32:54 +00:00
22c31046d4 Fixed rerr computation in lobpcg (#152789)
Fixes #101075

This PR fixes an issue with the computation of residuals in the LOBPCG algorithm.

**Bug**: [Line 788](8f54e56e62/torch/_lobpcg.py (L788)) is supposed to compute the denominator in Equation 9 of [Duersch et al., 2018](https://arxiv.org/abs/1704.07458), as also suggested in [line 776](8f54e56e62/torch/_lobpcg.py (L776)), but it uses the raw eigenvalue-estimates instead of their absolute values.

**Consequence**: This made the algorithm's success sensitive to initialization of eigenvectors.
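
A hedged sketch of the corrected scaling (variable names are illustrative): the denominator of Eq. 9 in Duersch et al. should use the absolute eigenvalue estimates.
```
import torch

E = torch.tensor([-2.0, 0.5, 3.0])         # eigenvalue estimates (may be negative)
R_norm = torch.tensor([1e-3, 2e-3, 5e-4])  # per-column residual norms
X_norm = torch.ones(3)                     # per-column eigenvector norms
rerr = R_norm / (X_norm * E.abs())         # abs() keeps the relative error positive
```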

**Tests**:
- I have tested @jtorde's [script](https://github.com/pytorch/pytorch/issues/101075#issuecomment-1545349559), and I did NOT run into any assertion errors for a few minutes (as opposed to the original implementation, which fails after a few seconds).
- I have also tried @pearu's specific [test case](https://github.com/pytorch/pytorch/issues/101075#issuecomment-1548483685), which also executes successfully - the residuals remain positive, and the final output is the same as one returned by SciPy (with and without enforcing the use of LOBPCG).
- I extracted the relevant test cases from [test/test_autograd.py](https://github.com/pytorch/pytorch/blob/main/test/test_autograd.py) and [test/test_linalg.py](https://github.com/pytorch/pytorch/blob/main/test/test_linalg.py), and they ran successfully.

Let me know if further test cases or benchmarks are needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152789
Approved by: https://github.com/pearu, https://github.com/lezcano
2025-05-08 12:22:31 +00:00
34d4363e6d [dynamo] Fix super and classmethod binding of cls object (#153105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153105
Approved by: https://github.com/jansel
ghstack dependencies: #152883
2025-05-08 12:07:08 +00:00
941062894a [CI] Use cmake from pip instead of conda in CI docker images (#152537)
As in title

idk how the install_cmake script is used because I see it being called with 3.18 but when I look at the build jobs some say 3.18 and others 3.31

Just make everything install cmake via the requirements-ci.txt.  I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm just defaulting to installing through pip

Also defaulting to 4.0.0 everywhere except the executorch docker build because executorch reinstalls 3.31.something
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
2025-05-08 10:10:27 +00:00
bfc0920d95 [C10D] Move getNcclDataType into NCCLUtils (#153113)
Differential Revision: D74365214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153113
Approved by: https://github.com/ngimel
2025-05-08 08:54:05 +00:00
dfb91a627f Clean up of CUTLASS_VERSION (#152947)
Fixes #152847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152947
Approved by: https://github.com/eqy, https://github.com/cyyever
2025-05-08 08:32:34 +00:00
9357635127 [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

[inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) now extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.
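
A hedged illustration of the kind of assert the generated wrapper code can now emit (the buffer, sizes, and op string are made up; the new optional op-name argument is the part being illustrated):
```
import torch
from torch._C._dynamo.guards import assert_size_stride

buf0 = torch.empty(8, 16)
# On a size/stride mismatch, the error message can now also name the op that produced buf0.
assert_size_stride(buf0, (8, 16), (16, 1), "aten.addmm.default")
```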

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-05-08 08:28:05 +00:00
4f9dd3c3e5 [cutlass backend] Fix EVT test for fbcode post cutlass 3.9.2 upgrade (#153106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153106
Approved by: https://github.com/mlazos
2025-05-08 08:20:40 +00:00
f9df09da08 [mm sampling] extract more triton information (#153099)
Summary:
# Why

capture more triton config information that was not being captured

# What

capture and extract

- group_m
- allow_tf32
- acc_type
- matrix_instr_nonkdim
- waves_per_eu
- kpack

to achieve this, add

- matrix_instr_nonkdim
- waves_per_eu
- kpack

to the info_dict of the TritonTemplateCaller

Test Plan:
with D74342290

```
buck2 run -c fbcode.rocm_arch=mi300 -m rocm621 mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0 2>&1 | tee /tmp/tmp.52Igj8lthj/15.txt
```

(edited for clarity and brevity)

```
AutotuneMetrics03LogEntry(
    backend='Triton',
    exectime_ms=0.007449999917298555,
    perf_model_name='scripts.vandrei.pytorch_experiments.matmul_estimator_lib.estimate_matmul_time_new',
    perf_model_exectime_ms=0.009558684365573179,
    config_triton_block_m=16,
    config_triton_block_n=256,
    config_triton_block_k=128,
    config_triton_num_stages=2,
    config_triton_num_warps=8,
    config_triton_group_m=16,
    config_triton_allow_tf32='False',
    config_triton_acc_type='tl.float32',
    config_triton_matrix_instr_nonkdim=16,
    config_triton_waves_per_eu=1,
    config_triton_kpack=2,
    x_batch_dim=0,
    x_row_dim=8,
    x_col_dim=96,
    x_batch_stride=0,
    x_row_stride=96,
    x_col_stride=1,
    x_dtype='torch.float16',
    x_dtype_size=16,
    w_batch_dim=0,
    w_row_dim=96,
    w_col_dim=512,
    w_batch_stride=0,
    w_row_stride=512,
    w_col_stride=1,
    w_dtype='torch.float16',
    w_dtype_size=16,
    vendor='AMD',
    model='gfx942:sramecc+:xnack-',
    major=9,
    minor=4,
    sms=304,
    l2_cache=4194304,
    warp_size=64,
    regs_per_sm=65536,
    max_threads_per_sm=2048,
    total_mem=206141652992,
    hip_version='6.2.41134',
    triton_upstream_hash='3889f3f3b97b817741e308c173409927b7c4536f',
    environment='experiment-xzy-default',
    session_id='8a7001bd-652c-440c-bc56-4cb1e25146ea',
    [...]
)
```

Reviewed By: exclamaforte

Differential Revision: D74342286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153099
Approved by: https://github.com/exclamaforte, https://github.com/eellison
2025-05-08 07:24:28 +00:00
3c87529d23 Make device check error message more descriptive (#150750)
Fixes #122757

## Test Result

```python
import torch

model_output = torch.randn(10, 5).cuda()
labels = torch.randint(0, 5, (10,)).cuda()
weights = torch.randn(5)

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
loss = loss_fn(input=model_output, target=labels)
print(loss)

Traceback (most recent call last):
  File "/home/zong/code/pytorch/../loss2.py", line 17, in <module>
    loss = loss_fn(input=model_output, target=labels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
    return F.cross_entropy(
           ^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3494, in cross_entropy
    return torch._C._nn.cross_entropy_loss(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150750
Approved by: https://github.com/malfet
2025-05-08 06:19:44 +00:00
c73bd990cf fix shard tensor gather when a local tensor on certain ranks has zero elements (#150914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150914
Approved by: https://github.com/fduwjj
2025-05-08 05:06:22 +00:00
94ca3a4666 Add torch._C.Tag.needs_contiguous_strides (#152859)
This tag forces Inductor to make the inputs contiguous.
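
A hedged sketch of how the tag might be attached to a custom op (the namespace and op are hypothetical; it assumes the tag is exposed as `torch.Tag.needs_contiguous_strides`):
```
import torch

lib = torch.library.Library("mylib", "FRAGMENT")
lib.define(
    "my_op(Tensor x) -> Tensor",
    tags=(torch.Tag.needs_contiguous_strides,),
)
lib.impl("my_op", lambda x: x.clone(), "CompositeExplicitAutograd")
```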

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152859
Approved by: https://github.com/eellison
2025-05-08 04:49:59 +00:00
2d25e4d478 [1/n][Optimus][Auto-AC] Support activation quantization without scaling (#148380)
Summary: We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB  Down: 42MiB  (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining     0/4                                                                                 6.7s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### How to enable (you can override the dtype; if nothing is given, the default is fp8)

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
        },
```

Differential Revision: D70522237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148380
Approved by: https://github.com/Mingming-Ding, https://github.com/Hahu803
2025-05-08 04:44:15 +00:00
6f6fac6a41 [dynamo] Fix bug in hasattr(tensor, "size") (#152883)
Fixes https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152883
Approved by: https://github.com/StrongerXi
2025-05-08 01:16:01 +00:00
834bc5e414 Add runtime asserts to AOTI (#152125)
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925

Currently, AOTI only generate runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.

Also factored out the run time assertion logic to a separate function.

        We need to generate runtime asserts directly in Inductor instead
        of just re-using the asserts from input graphs because we reuse the
        same ShapeEnv as before. In particular, on subsequent graph passes,
        we would immediately turn all of these assertions into noops,
        because when we evaluated their expressions, we would see that
        because we had a deferred runtime assert in the ShapeEnv, we
        know "oh, of course this expression is True" already.
        One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, a, b, c):
                nz = torch.nonzero(a)
                ones = a.new_ones([nz.size(0), b.size(0)])
                torch._check(ones.size(0) >= 1)
                equals = torch.add(ones, c)
                return equals
        torch._dynamo.mark_dynamic(c, 0)
```
        When we re-use the ShapeEnv in Inductor lowering, the check that `a` and
        `nonzero` have the same shape would be evaluated to True after we resolve
        unbacked bindings using the ShapeEnv.
        See test_unbacked_equals_input_size_runtime_assertion in test_aot_inductor.

        In addition to the Inductor generated runtime asserts, we also
        need the runtime asserts from the input graph, because some derived
        runtime asserts are not generated in Inductor. One example is
        below:
```
        class Model(torch.nn.Module):
            def forward(self, x):
                y = x.reshape(100, -1).clone()
                y = y + 1
                return y

        dynamic_shapes = {
            "x": {0: torch.export.Dim.DYNAMIC},
        }
        x.shape[0] needs to be a multiple of 100.
```
        See test_aoti_runtime_asserts_backed_symint in test_aot_inductor.

Example:

```
    def forward(self):
        arg0_1: "f32[s35]";

        arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)

         #
        mod: "Sym(Mod(s35, 100))" = sym_size_int % 100;  sym_size_int = None
        eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0;  mod = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'");  eq_2 = _assert_scalar = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]);  arg0_1 = None
        clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view);  view = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
        add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1);  clone = None
        return (add_6,)
```

Generated cpp code:

```
    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
    auto arg0_1 = std::move(inputs[0]);
    auto arg0_1_size = arg0_1.sizes();
    int64_t s35 = arg0_1_size[0];
    inputs.clear();
    auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
    if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
```

Differential Revision: D73596786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152125
Approved by: https://github.com/henrylhtsang, https://github.com/jingsh
2025-05-08 00:27:24 +00:00
20e2ca3e29 [Dynamo] Allow inlining into AO quantization modules (#152934)
This adds dynamo inlining into `torch.ao.quantization.fake_quantize`.

This is needed for QAT compatibility w/ an RL training model.
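
A minimal sketch of the pattern this enables (module shapes are arbitrary): a `FakeQuantize` module in the forward path can now be inlined by Dynamo instead of causing a graph break.
```
import torch
from torch.ao.quantization import FakeQuantize

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fq = FakeQuantize()           # fake-quant observer/quantizer module
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(self.fq(x))

out = torch.compile(M())(torch.randn(4, 8))
```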

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152934
Approved by: https://github.com/williamwen42
2025-05-07 23:58:11 +00:00
5bf0c3518c Detect NVSHMEM location (#153010)
### Changes
- Detect NVSHMEM install location via `sysconfig.get_path("purelib")`, which typically resolves to `<conda_env>/lib/python/site-packages`, and NVSHMEM include and lib live under `nvidia/nvshmem`
- Added link dir via `target_link_directories`
- Removed direct dependency on mlx5
- Added preload rule (following other NVIDIA libs)
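
A small sketch of the lookup described in the first bullet above (the `nvidia/nvshmem` layout under site-packages is the assumed wheel layout):
```
import os
import sysconfig

purelib = sysconfig.get_path("purelib")  # e.g. <env>/lib/python3.x/site-packages
nvshmem_root = os.path.join(purelib, "nvidia", "nvshmem")
include_dir = os.path.join(nvshmem_root, "include")
lib_dir = os.path.join(nvshmem_root, "lib")
print(nvshmem_root, os.path.isdir(nvshmem_root))
```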

### Plan of Record
1. End user experience: link against NVSHMEM dynamically (NVSHMEM lib size is 100M, similar to NCCL, thus we'd like users to `pip install nvshmem` than torch carrying the bits)
2. Developer experience: at compile time, prefers wheel dependency than using Git submodule
General rule: submodule for small lib that torch can statically link with
If user pip install a lib, our CI build process should do the same, rather than building from Git submodule (just for its header, for example)
3. Keep `USE_NVSHMEM` to gate non-Linux platforms, like Windows, Mac
4. At configuration time, we should be able to detect whether nvshmem is available, if not, we don't build `NVSHMEMSymmetricMemory` at all.

For now, we have symbol dependency on two particular libs from NVSHMEM:
- libnvshmem_host.so: contains host side APIs;
- libnvshmem_device.a: contains device-side global variables AND device function impls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153010
Approved by: https://github.com/ngimel, https://github.com/fduwjj, https://github.com/Skylion007
2025-05-07 23:35:04 +00:00
df1ec045b5 [Cutlass] Add epilogue inputs/outputs to def_kernel (#151406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151406
Approved by: https://github.com/eellison
ghstack dependencies: #152733, #150906
2025-05-07 23:09:02 +00:00
d483aefafa [Cutlass] Integrate EVT into CUDACPPScheduling (#150906)
Previously merged:
* #151713
* #151405
* #150905
* #152306
* #152305

Allow epilogue nodes in cuda combined scheduling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150906
Approved by: https://github.com/eellison
ghstack dependencies: #152733
2025-05-07 23:09:02 +00:00
6b9d741e1c [Cutlass] Handle broadcasting in EVT python codegen (#152733)
Previously merged:
* #151713
* #151405
* #150905
* #152306
* #152305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152733
Approved by: https://github.com/eellison
2025-05-07 23:09:02 +00:00
4270517cbf Fix test/test_optim.py error message. (#153076)
Fixes an error message in test/test_optim.py

Current behavior: If running the test with Adagrad, the error message reads: "SGD does not currently support capturable".

Fix: The error message now says correctly: "Adagrad does not currently support capturable".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153076
Approved by: https://github.com/janeyx99
2025-05-07 22:46:05 +00:00
7706074ece Fix TORCH_CHECK error message in FusedSgdKernel (#153074)
This fixes an issue in the TORCH_CHECK error message in the FusedSgdKernel.

Current behavior: If the LR tensor is not on the same device as the parameters, the error message reads: "found_inf must be on the same GPU device as the params".

Fix: The error message now correctly points out "lr must be on the same GPU device as the params".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153074
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-05-07 22:10:09 +00:00
cecfc7dc53 [CUDA][cuDNN] Fix handling of CPU side input and target length tensors in CTCLoss (#152745)
https://github.com/pytorch/pytorch/pull/128271 migrated to cuDNN V8 CTCLoss which expects input and target length tensors to be on `CUDA` rather than `CPU` without adding the logic to account for the edge case of them being on `CPU`

see also #152421
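
A hedged illustration of the edge case (requires a CUDA build with cuDNN to actually exercise the cuDNN path): the inputs live on CUDA while the length tensors stay on CPU.
```
import torch
import torch.nn.functional as F

if torch.cuda.is_available():
    log_probs = torch.randn(50, 4, 20, device="cuda").log_softmax(-1)
    targets = torch.randint(1, 20, (4, 10), device="cuda")
    input_lengths = torch.full((4,), 50, dtype=torch.long)   # CPU tensor
    target_lengths = torch.full((4,), 10, dtype=torch.long)  # CPU tensor
    loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
```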

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152745
Approved by: https://github.com/Skylion007
2025-05-07 22:01:18 +00:00
773a91c775 [ONNX] dynamic_shapes uses DYNAMIC (#153065)
Although Dim.AUTO covers the cases that a user sets more axes to be dynamic than the model actually needs, it silently falls back to STATIC when DYNAMIC fails. This increases the difficulty of debugging.
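
A hedged sketch of the resulting usage (the model and shapes are arbitrary): request the dynamic axis explicitly with `Dim.DYNAMIC` so a failure surfaces instead of silently exporting a static dimension.
```
import torch
from torch.export import Dim

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

onnx_program = torch.onnx.export(
    M(),
    (torch.randn(2, 3),),
    dynamic_shapes={"x": {0: Dim.DYNAMIC}},
    dynamo=True,
)
```
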
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153065
Approved by: https://github.com/justinchuby
2025-05-07 21:48:41 +00:00
a2891cba2f [cutlass backend] Skip cuda lib path if it is torch/lib (#153003)
Differential Revision: [D74284808](https://our.internmc.facebook.com/intern/diff/D74284808/)

This is a bit risky for cutlass backend, so decided to separate it out. Tested offline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153003
Approved by: https://github.com/chenyang78
2025-05-07 21:28:15 +00:00
5bb154e6fd [nativert] Move MPMCQueue to torch/nativert. (#152837)
Summary:
Torch Native Runtime RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff adds a small library implementing a multi-producer multi-consumer queue which will be used to synchronize tasks for Torch Native Runtime.

Differential Revision: D74184245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152837
Approved by: https://github.com/albanD, https://github.com/dolpm, https://github.com/swolchok
2025-05-07 21:17:42 +00:00
d2ee606e9b [Inductor] Set correct baseline for decomposek test (#152897)
Differential Revision: D74218923

Running on A100 seems to result in precision loss from decompose_k. This was root caused to the fp16/bf16 reduction setting, which establishes a less precise baseline than decompose_k, as decompose_k uses the bmm.dtype overload for fp32 output.
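
A hedged sketch of the baseline adjustment implied above: disable reduced-precision fp16/bf16 reductions so the eager matmul baseline also accumulates in float32.
```
import torch

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False
```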

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152897
Approved by: https://github.com/eellison
2025-05-07 21:02:47 +00:00
1ff3c223d2 [c10d][fr] Make FR vendor neutral so that other backends can use it (#152563)
Current FR code is built with `USE_C10D_NCCL` we should remove it to make it generic. And we keep existing API used by NCCL so that we can have some bc compatibility because lots of use cases are around FR with NCCL. The generic version with C10::Event can then be used for other backend like Gloo, etc.

The current Unit test should cover the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152563
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
ghstack dependencies: #152585
2025-05-07 20:37:40 +00:00
642e9305eb Fixes detection of ArmPL on Linux platform (#150031)
On Linux it failed to detect that there is a bin directory because it wasn't looking for armpl-info, which is the only file in that directory on Linux. This change also adds a link to the math library, as it is required to link against when checking for LAPACK functions.

Fixes #149610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150031
Approved by: https://github.com/fadara01, https://github.com/malfet
2025-05-07 19:47:21 +00:00
f5f8f637a5 [Typing] Improve device typing for torch.set_default_device() (#153028)
Part of: #152952

Here is the definition of `torch.types.Device`:

ab997d9ff5/torch/types.py (L74)

So `_Optional[_Union["torch.device", str, builtins.int]]` is equivalent to it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153028
Approved by: https://github.com/Skylion007
2025-05-07 19:31:43 +00:00
dd7d231ed3 [cutlass backend][test] re-enable test_cuda_compile_command for fbcode (#153001)
Differential Revision: [D74284047](https://our.internmc.facebook.com/intern/diff/D74284047/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153001
Approved by: https://github.com/ColinPeppler
2025-05-07 19:06:24 +00:00
62b7ef06cc [Dynamo] Remove unused guard PYMODULE_MATCH (#152961)
Not used anywhere: https://www.internalfb.com/code/search?q=repo%3Afbcode%20PYMODULE_MATCH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152961
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730, #152865, #152872
2025-05-07 18:58:18 +00:00
d9b8473b59 [Dynamo] Guard serialization for RANGE_ITERATOR_MATCH (#152872)
Tests serialization for RANGE_ITERATOR_MATCH; includes no non-test changes.

This PR handles iterator exhaustion issues by utilizing the janky solution from #152865; it passes a function to generate kwargs and `frame_state.f_locals` is updated with fresh iterators through a second kwarg generation pass after initial tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152872
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730, #152865
2025-05-07 18:58:18 +00:00
52f7106c00 [Dynamo] Guard serialization for TUPLE_ITERATOR_LEN (#152865)
Tests serialization for TUPLE_ITERATOR_LEN; includes no non-test changes.

Passing a tuple iterator as input results in the iterator being exhausted during testing. I threw together a super janky workaround via accepting a func for kwarg generation and replacing `frame_state.f_locals` with newly-generated kwargs to get fresh iterators, but insights into a better approach are welcome!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152865
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730
2025-05-07 18:58:18 +00:00
fb500d0b1c [Dynamo] Guard serialization for SEQUENCE_LENGTH (#152730)
Tests only; no other changes needed. Test logic uses a tuple function input to trigger installation of a SEQUENCE_LENGTH guard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152730
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728
2025-05-07 18:58:18 +00:00
42954ab28e [Dynamo] Guard serialization for CLOSURE_MATCH (#152728)
Unsupported because it uses unsupported FUNCTION_MATCH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152728
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727
2025-05-07 18:58:18 +00:00
a9186ec723 [Dynamo] Guard serialization for FUNCTION_MATCH (#152727)
Unsupported because it uses unsupported ID_MATCH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152727
Approved by: https://github.com/jansel
ghstack dependencies: #152725
2025-05-07 18:58:18 +00:00
a6f51be2fd [Dynamo] Guard serialization for NN_MODULE (#152725)
Throws an error when attempting to serialize an NN_MODULE guard. It is not supported because it uses the unsupported ID_MATCH guard (#152330):

a6dd1c2208/torch/_dynamo/guards.py (L1738-L1739)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152725
Approved by: https://github.com/jansel
2025-05-07 18:58:17 +00:00
2cf7fd0d2b Update docs of saved_tensors_hooks to avoid ref cycle (#153049)
Fixes #115255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153049
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2025-05-07 18:54:56 +00:00
7cf8049d63 [BE] Update ruamel to 0.18.10 (#153057)
To address the feedback from https://github.com/pytorch/pytorch/pull/153013
Previously it was pinned to 0.17.4, that was released in 2021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153057
Approved by: https://github.com/Skylion007
ghstack dependencies: #153013
2025-05-07 18:11:14 +00:00
d042ec856b Use gather in index_select (#151715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151715
Approved by: https://github.com/ngimel
2025-05-07 17:55:34 +00:00
eqy
172e641529 [CUDA] Reset peak memory stats before running test_set_per_process_memory_fraction (#152540)
Otherwise previous tests can cause `application = int(total_memory * 0.499) - torch.cuda.max_memory_reserved()` to go negative

Hopefully abates current flakiness (see also https://github.com/pytorch/pytorch/issues/135115#:~:text=TestCuda.test_set_per_process_memory_fraction)
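
A hedged sketch of the reset performed before the fraction computation (variable names follow the snippet quoted above; requires a CUDA device):
```
import torch

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
total_memory = torch.cuda.get_device_properties(0).total_memory
application = int(total_memory * 0.499) - torch.cuda.max_memory_reserved()
assert application >= 0
```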

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152540
Approved by: https://github.com/Skylion007
2025-05-07 17:02:39 +00:00
8b9c9a327f [cutlass backend] cache filtered ops based on layouts (#152580)
Differential Revision: [D73972687](https://our.internmc.facebook.com/intern/diff/D73972687/)

Add cache to store the list of filtered ops for a specific shape + layout + dtype (aka hash on input_nodes).
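
A generic, hedged sketch of the memoization idea (the function and key construction are hypothetical, not the actual cutlass-backend code):
```
_filtered_ops_cache: dict[str, list] = {}

def get_filtered_ops(input_nodes, generate_ops):
    # Key on the layouts (shape + stride + dtype) of the inputs.
    key = ";".join(str(n.get_layout()) for n in input_nodes)
    if key not in _filtered_ops_cache:
        _filtered_ops_cache[key] = generate_ops(input_nodes)
    return _filtered_ops_cache[key]
```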

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152580
Approved by: https://github.com/eellison
2025-05-07 16:38:22 +00:00
61dd2a0cc3 Revert "[BE] Update numba versions (#152557)"
This reverts commit 80d2116405367e1dd11648ab4225d4207d5e6132.

Reverted https://github.com/pytorch/pytorch/pull/152557 on behalf of https://github.com/malfet due to This time it breaks torchbench tests, see 9c114934f7/1(inductor_torc&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/152557#issuecomment-2858945427))
2025-05-07 15:03:41 +00:00
9c114934f7 [Lint] Add install command for GHA step (#153013)
Otherwise, it fails to run the script
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153013
Approved by: https://github.com/wdvr, https://github.com/cyyever
2025-05-07 14:55:00 +00:00
42b3e560ee Thread through options so GraphPickler can allow all ops (#152801)
Fixes #151904

In #151904 we discussed the feasibility of including all ops in the GraphPickler. This PR changes it so we can filter which ops are allowed and which are blocked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152801
Approved by: https://github.com/masnesral
2025-05-07 14:36:50 +00:00
f393ee5ab5 Use torch.types.Device in device_interface.py (#152935)
This is just a clean-up change that I noticed was possible; it removes the duplicate `_device_t` type which had the same semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152935
Approved by: https://github.com/Skylion007
2025-05-07 13:20:10 +00:00
cyy
2f09e79142 Fix Codegen.cmake warning (#153023)
Fix
```
CMake Warning (dev) in cmake/Codegen.cmake:
  A logical block opening on the line

    /var/lib/jenkins/workspace/cmake/Codegen.cmake:393 (if)

  closes on the line

    /var/lib/jenkins/workspace/cmake/Codegen.cmake:401 (endif)

  with mis-matching arguments.
```
by removing the condition in `endif`.

We could instead fix it; however, that is not best practice. For example, cmake_lint warns about it, and CMake says
```
The optional <condition> argument is supported for backward compatibility only.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153023
Approved by: https://github.com/aditew01, https://github.com/Skylion007
2025-05-07 12:45:20 +00:00
48bfe9afc7 has_triton: Use the device interface for detecting Triton availability (#139171)
This PR replaces the `has_triton()` global method which was previously used for this task.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel, https://github.com/shink
2025-05-07 12:23:10 +00:00
56879f64a8 [Break XPU] Fix XPU UT failures introduced by community. (#152945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152945
Approved by: https://github.com/Skylion007, https://github.com/EikanWang
2025-05-07 08:01:31 +00:00
5c878d4b04 [c10d][fr] Decouple the core logic of FR with the entry and event type (#152585)
We want to make FR generic enough, so the first step is to make FR a template struct so that most of the common code logic can be reused. The reason for this is that CudaEvent does not inherit from c10::Event, and we just want to swap the event part so that NCCL uses CudaEvent while the rest of the backends use c10::Event.

Differential Revision: [D74262695](https://our.internmc.facebook.com/intern/diff/D74262695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152585
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
2025-05-07 06:21:33 +00:00
93a0a7a0bf Fix bug visualizing 1D Tensor using rich (#152871)
Fixes https://github.com/pytorch/pytorch/issues/152848

I didn't fix the bug earlier because the example script didn't exhaustively present all combinations of 1D/2D tensor, 1D/2D mesh, and all possible sharding specs. Therefore, in this PR, I enriched the example script to cover all possible combinations.

<img width="1008" alt="f" src="https://github.com/user-attachments/assets/1745a804-a004-4f98-8332-d7498453f397" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152871
Approved by: https://github.com/wanchaol
2025-05-07 06:04:22 +00:00
bb9fbb294a [Testing] Add logic for running MPS tests (#153012)
Prep change for getting rid of `_mac-test-mps.yml`
A complete no-op for now, but will be used by PR above the stack, but they should be landed few days apart to avoid forcing lots of people to rebase their PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153012
Approved by: https://github.com/wdvr
2025-05-07 04:27:31 +00:00
ae1e51b6ad Add infra to run CPython tests under Dynamo (#150787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150787
Approved by: https://github.com/zou3519
2025-05-07 04:03:14 +00:00
13fbf21a76 [nativert] Port string join and split to c10/util (#152873)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
Port string utils functions join and split to c10/util

Test Plan:
Added tests in `string_util_test.cpp`
buck2 run mode/opt caffe2/c10/test:util_base_tests

Differential Revision: D74202473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152873
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-07 03:58:11 +00:00
5796212d48 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/misc.py [1/2] (#152274)
Part of #147913

Replace `unimplemented` with`unimplemented_v2` in `torch/_dynamo/variables/misc.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152274
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-05-07 03:37:24 +00:00
cyy
ab997d9ff5 Pass UNINSTALL_DILL to docker build (#152792)
`UNINSTALL_DILL` was not really passed to docker before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152792
Approved by: https://github.com/wdvr
2025-05-07 03:17:45 +00:00
dfcfad2112 [c10d] Fix unused group input argument in new_subgroups() (#152765)
Summary: This diff fixes an unused input argument [`group`](8faa225695/torch/distributed/distributed_c10d.py (L5341)) in the `new_subgroups()` function.

Test Plan: contbuild & OSS CI, see

Differential Revision: D74132537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152765
Approved by: https://github.com/wz337
2025-05-07 02:37:51 +00:00
ecd74c953f [dynamo] Recursively realize the stack_values (#152853)
Might also fix - https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/jansel
2025-05-07 02:36:44 +00:00
1965a2ca1e [dynamo][ez] Remove unused guard OBJECT_MUTATION. (#152855)
Summary: seems not used anywhere https://www.internalfb.com/code/search?q=case%3Ayes%20filepath%3Acaffe2%20OBJECT_MUTATION

Test Plan: CI

Differential Revision: D74196559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152855
Approved by: https://github.com/jansel, https://github.com/jbschlosser
2025-05-07 02:32:32 +00:00
81b6920c68 [aoti] skip input symbol codegen for sympy expr w/ many symbols (#152579)
The issue was that:
- symbol-ids appeared out-of-order w.r.t. the order of the forward inputs
```
def forward(arg0 # [(s3 - 1) + s4, 32], arg1 #[(s3 - 1)] ..)
```
- this causes codegen to fail because it expects all the base symbols `s4,s3` to have been codegen-ed already.
- well, we can skip codegen-ing sympy expr with many symbols e.g. `(s3 - 1) + s4` because `s3` and `s4` will be codegen-ed by other inputs.

```
# for example
s3 = arg1.size(0) + 1
s4 = argN.size(0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152579
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-05-07 01:18:09 +00:00
60ecc560af [export] Add draft-export docs (#152637)
Sample page: https://docs-preview.pytorch.org/pytorch/pytorch/152637/draft_export.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152637
Approved by: https://github.com/zou3519, https://github.com/svekars
2025-05-07 01:12:45 +00:00
a28dcdba2c Revert "[aot][ca] save bw_module in AOTAutogradCache (#151860)"
This reverts commit 613bd462721f3246888030de0a3f6932d52f515a.

Reverted https://github.com/pytorch/pytorch/pull/151860 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:54 +00:00
f6db749e60 Revert "[ca] mark scalar int sizes as dynamic via tensor wrapping (#151731)"
This reverts commit 18229a5300a61b2d76ca95bee8ae8d4f4d5fa938.

Reverted https://github.com/pytorch/pytorch/pull/151731 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:54 +00:00
8f208dc75a Revert "[ca] hide unused scalar int sizes from dynamo (#151962)"
This reverts commit 4555ed8c83b47c450e31f1192e1f0fc4147d435f.

Reverted https://github.com/pytorch/pytorch/pull/151962 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:53 +00:00
64bbf58fb4 Revert "[dynamo][ca] support dynamic annotations on tensors in ListVariables/TupleVariables (#152119)"
This reverts commit 7aebb127bf309658770be93b264d4009c20a7f40.

Reverted https://github.com/pytorch/pytorch/pull/152119 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:53 +00:00
56492bfcb9 [MPS] SDPA specialized kernels (#152781)
Partially fixes #139668 and #152550

Still a work in progress. The following needs to be addressed:
- [x] Some tests are failing and need to check why and bugfix
- [x] Benchmark the new kernels and  add to this PR for varying sequence lengths head dimensions(the ones that get dispatched to kernels)
- [x] Add tests to cover the specialized paths(if applicable)
- [x] Code cleanup

**Tested on Macbook M1 Pro**
### Vector Fast Path (q_len=1, k_len=256)
- Old: 0.378 ms
- New: 0.260 ms
- **31.2% speed improvement**

### Vector 2-pass (q_len=1, k_len=4096)
- Old: 0.627 ms
- New: 0.370 ms
- **41.0% speed improvement**

### Vector Fast Path (q_len=8, k_len=256)
- Old: 0.545 ms
- New: 0.322 ms
- **40.9% speed improvement**

### Vector 2-pass (q_len=8, k_len=4096)
- Old: 1.318 ms
- New: 1.057 ms
- **19.8% speed improvement**

Script to get perf:
```
import torch
import time

def benchmark_sdpa(config, iterations=100):
    device = config.get("device", "cpu")
    batch = config["batch"]
    heads = config["heads"]
    q_len = config["q_len"]
    k_len = config["k_len"]
    head_dim = config["head_dim"]

    q = torch.randn(batch, heads, q_len, head_dim, device=device, dtype=torch.float32)
    k = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)
    v = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)

    for _ in range(5):
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()

    total_time = 0.0
    for i in range(iterations):
        start = time.perf_counter()
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()
        end = time.perf_counter()
        total_time += end - start

    avg_time = total_time / iterations
    print(f"[{config['name']}] Avg time per run: {avg_time * 1000:.3f} ms over {iterations} iterations")
    return avg_time

def main():
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    print(f"Running benchmarks on device: {device}")

    benchmarks = [
        {
            "name": "Vector Fast - Small q_len & moderate k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length triggers vector fast path
            "k_len": 256,    # moderate key length
            "head_dim": 64,
            "device": device,
        },
        {
            "name": "Vector 2-pass - Small q_len & long k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length
            "k_len": 4096,   # long key length triggers the 2-pass variant
            "head_dim": 64,
            "device": device,
        },
        # {
        #     "name": "Full Attention - Moderate q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # longer query sequence length
        #     "k_len": 8192,    # matching key length for full attention paths
        #     "head_dim": 64,
        #     "device": device,
        # },
        # {
        #     "name": "Full Attention - Longer q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # very long sequence length
        #     "k_len": 8192,
        #     "head_dim": 64,
        #     "device": device,
        # },
    ]

    iterations = 100
    for config in benchmarks:
        benchmark_sdpa(config, iterations=iterations)

if __name__ == "__main__":
    main()

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152781
Approved by: https://github.com/malfet
2025-05-07 00:40:11 +00:00
2b2b790908 [Dynamo] Guard serialization for CONSTANT_MATCH (#152724)
This PR adds testing only; no non-test changes were needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152724
Approved by: https://github.com/jansel
ghstack dependencies: #152704
2025-05-07 00:36:39 +00:00
d2935a9f85 [CI] Upgrade sccache to 0.10.0 (#152957)
The newest release handles CUDA better, and I think this fixes the cases I saw where some CUDA-related builds weren't being cached correctly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152957
Approved by: https://github.com/malfet
2025-05-07 00:33:43 +00:00
6d1e8994d3 [Dynamo] Guard serialization for EQUALS_MATCH (#152704)
This PR:
* Makes no changes to non-test code to support serialization for EQUALS_MATCH
* Adds test logic involving a custom-defined constant type to trigger the guard installation here:

72337bdcf2/torch/_dynamo/variables/user_defined.py (L792)

Q: Is there a better way to trigger installation of this guard or is this sufficient?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152704
Approved by: https://github.com/jansel
2025-05-07 00:28:31 +00:00
9919d6b872 [Testing] Add copysign from scalar regression test (#152997)
But instead of adding it just for MPS backend, add it to OpInfo

Fixes https://github.com/pytorch/pytorch/issues/152582
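
A minimal sketch of the regression case being covered: `copysign` with a Python scalar as the second argument.
```
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(4, device=device)
y = torch.copysign(x, -1.0)  # scalar "other" argument
print(y)
```
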
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152997
Approved by: https://github.com/wdvr
2025-05-07 00:19:42 +00:00
327d1b6ef0 Move additional MPS Unary ops to Iterator (#152876)
Noticed some of these ops were contributing to a big chunk of the runtime for OpenLLama as well as a few other benchmarks

At the op level, moving to a TensorIterator-based Metal kernel gives a 20x speedup. Will migrate the inverse trigonometric functions & log ops in a follow-up PR, as this one is already a bit large
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152876
Approved by: https://github.com/malfet
2025-05-07 00:06:54 +00:00
61aa77e216 [cutlass backend][BE][clean-up] refactor to remove use of autotune_fallback_to_aten=True in cutlass backend tests (#152850)
Differential Revision: [D74192001](https://our.internmc.facebook.com/intern/diff/D74192001/)

Motivation: clean up post https://github.com/pytorch/pytorch/issues/147479. I plan to leave the rest of the clean-up as an first time issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152850
Approved by: https://github.com/chenyang78
2025-05-06 23:48:57 +00:00
5fa5017479 [ONNX] Suggest users setting dynamo=True when exporting (#152478)
Fixes #152025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152478
Approved by: https://github.com/justinchuby
2025-05-06 23:18:11 +00:00
80d2116405 [BE] Update numba versions (#152557)
Let's see if PyTorch is compatible with latest
`test_unary_funcs` are no longer failing due to https://github.com/pytorch/pytorch/pull/148024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152557
Approved by: https://github.com/Skylion007
2025-05-06 23:15:21 +00:00
911b838aae [Memory Viz] Add Compile Context to Visualizer (#152862)
Summary: Adds PT2 info to the visualizer. Also makes sure we handle the case when the compile context is not in the pickle file.

Test Plan: {F1977637362}

Differential Revision: D74202811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152862
Approved by: https://github.com/aaronenyeshi
2025-05-06 23:09:59 +00:00
6c025b5a82 [dynamo] Support delattr on result of torch.compile(module) (#152741)
This is essentially a follow-up on #122098, where we added support of
`getattr` and `setattr` on result of `torch.compile(module)`, but didn't
add support for `delattr`.

Fixes #150711.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152741
Approved by: https://github.com/anijain2305
ghstack dependencies: #152740
2025-05-06 22:30:37 +00:00
0886d402f1 [dynamo] Avoid running torch.nn.Module.__call__ twice under torch.compile(mod) (#152740)
When we do `torch.compile(mod)`, we eventually end up returning a new
module instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) from the default `torch.nn.Module.__call__`.
As a result we can't reuse the inherited default `__call__` as is,
because we'd end up running the logic twice.

This patch makes the returned `OptimizedModule` override the default
`__call__`, and directly calls into its compiled `forward` method.
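
A simplified, hedged sketch of the idea (this is not the actual `OptimizedModule` code): `forward` already wraps `mod.__call__`, so `__call__` must not run the default `nn.Module.__call__` hook machinery a second time.
```
import torch

class OptimizedModuleSketch(torch.nn.Module):
    def __init__(self, mod):
        super().__init__()
        self._orig_mod = mod
        # Compiles mod.__call__, which already includes hook firing, etc.
        self.forward = torch.compile(mod.__call__)

    def __call__(self, *args, **kwargs):
        # Call the compiled forward directly instead of nn.Module.__call__,
        # otherwise the hook logic would run twice.
        return self.forward(*args, **kwargs)
```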

Fixes #149502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305
2025-05-06 22:30:37 +00:00
1c30862d8f Partially revert https://github.com/pytorch/pytorch/pull/152288 (#152909)
Summary: As it results in build failures for some internal targets that are stuck on an older compiler. Platform update is tracked in [T223408150](https://www.internalfb.com/tasks?t=223408150)

Test Plan: CI

Differential Revision: D74220384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152909
Approved by: https://github.com/cyyever, https://github.com/wdvr
2025-05-06 22:02:42 +00:00
5fe58ab5bd Devcontainer: Optimize apt-get commands to reduce Docker image size (#152882)
## Summary
- Added --no-install-recommends flag to all apt-get install commands to reduce unnecessary dependencies
- Added apt-get clean after package installations to remove package cache and reduce image size
- Combined multiple apt commands into single instructions to reduce Docker image layers

## Test plan
Test by building the devcontainer and verifying functionality while ensuring reduced image size
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152882
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/Skylion007
2025-05-06 20:33:02 +00:00
ed63cb20ec [ROCm] Fix SymmetricMemory build error on NAVI arch (#152838)
The NAVI arch doesn't support `__builtin_amdgcn_s_memtime()`, so use `clock64()` instead, which works for both NAVI and MI archs.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152838
Approved by: https://github.com/jeffdaily
2025-05-06 19:37:58 +00:00
8faa0b18c3 [ROCm] opportunistic fastatomics - fix build error with newer compilers (#152841)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152841
Approved by: https://github.com/jeffdaily
2025-05-06 19:37:48 +00:00
1f4f4a61c2 Devcontainer: Replace conda with apt-based setup (#152881)
## Summary
- Replaced miniconda base image with base Ubuntu 22.04 image
- Installed Python and required dependencies using apt
- Replaced conda-based CUDA installation with apt-based version
- Updated paths in install-dev-tools.sh to reflect the new non-conda environment
- Removed conda-specific files and added requirements.txt for Python dependencies

## Test plan
Test by building and running the devcontainer in VS Code with both CPU and CUDA configurations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152881
Approved by: https://github.com/atalman
2025-05-06 19:23:58 +00:00
200df50c05 Devcontainer: Fix context path and workspace mount (#152880)
## Summary
- Changed the devcontainer context path from '../..' to './' for both CPU and CUDA configurations
- Added workspace mount configuration to properly mount the repository in the container
- Added containerEnv to disable implicit --user pip flag

## Test plan
Test by building and running the devcontainer in VS Code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152880
Approved by: https://github.com/atalman
2025-05-06 19:22:29 +00:00
08f5371571 [float16]: Fix the accumulation type for dot and gemv (#152676)
Fixes #147860

Also, partially address: https://github.com/pytorch/pytorch/issues/125438

Use float32 for accumulation with float16 and bfloat16 types.
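
A hedged illustration, not from the PR, of why accumulation precision matters for these ops, comparing a bfloat16 gemv against a float64 reference (sizes are arbitrary):

```python
import torch

m = torch.randn(256, 4096, dtype=torch.float64)
v = torch.randn(4096, dtype=torch.float64)
ref = m @ v                                    # high-precision reference
approx = torch.mv(m.bfloat16(), v.bfloat16())  # gemv path this PR touches
err = (approx.double() - ref).abs().max()
print(err)  # accumulating in float32 keeps this error small
```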

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152676
Approved by: https://github.com/malfet
2025-05-06 18:10:08 +00:00
7a0781eaad Improve cache key graph printing performance (#151928)
Teach the graph printer to allow overriding how SymTypes (`SymInt`, `SymFloat`, `SymBool`) are printed, and then use that to reuse the fast SymNode printing from `torch._inductor.utils.sympy_str()` to make computing the cache key faster.

On my computer the repro from #151823 goes from 480s -> 80s (still terrible... but better).

Fixes #151823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151928
Approved by: https://github.com/laithsakka
2025-05-06 17:39:53 +00:00
7dd9d514d2 [Graph Partition] remove PRECOMPUTED_SIZE from partition symbol inputs (#152864)
PRECOMPUTED_SIZE is computed during runtime and should not be included in graph_partition_inputs. See the following example for a PRECOMPUTED_SIZE `ps0`.

![image](https://github.com/user-attachments/assets/5aa949a9-b8e0-4b77-8702-95b96b58694e)

full output code: [P1803820480](https://www.internalfb.com/phabricator/paste/view/P1803820480)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152864
Approved by: https://github.com/eellison
2025-05-06 17:35:29 +00:00
5d36485b4a Log aot and idx waitcounters. (#152444)
Summary:
Added for create_aot_dispatcher_function and compile_fx_inner.

Note:
Log wait counters flag is already set for:
1. async_compile.precompile
2. remote_fx_graph_cache_get
3. remote_fx_graph_cache_put

Test Plan: contbuild

Differential Revision: D73866124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152444
Approved by: https://github.com/ppanchalia, https://github.com/masnesral
2025-05-06 16:21:58 +00:00
07a29dbe81 [BE]: Update cutlass submodule to 3.9.2 (#152779)
A lot of last-minute bugfixes for CUTLASS Blackwell that we should upstream. It's a header-only library and a minor release, so this should strictly improve compiler support and fix some bugs. Needed to update some instruction numbers in torch.compile baselines for the new kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152779
Approved by: https://github.com/henrylhtsang
2025-05-06 16:08:24 +00:00
f56bcd2408 [precompile] [easy] Refactor FxGraphCache to add cache_hit_post_compile function (#152839)
This PR refactors CompiledFxGraph by adding a new post_compile step that only runs on a cache hit. It moves a bunch of code from _lookup_graph into its own function so that we can use it in BundledAOTAutogradCacheEntry. No difference in behavior here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152839
Approved by: https://github.com/oulgen
ghstack dependencies: #152836
2025-05-06 15:33:24 +00:00
a8f727c439 [c10d] Fix extra CUDA context created by barrier (#149144)
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.
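
A minimal sketch of the user-side pattern this relies on, not from the PR (environment-variable handling and backend choice are illustrative):

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
dist.init_process_group("nccl", device_id=torch.device("cuda", local_rank))
dist.barrier()  # dispatches on the given device instead of creating an extra CUDA context
dist.destroy_process_group()
```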

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144
Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever
2025-05-06 15:27:30 +00:00
12a8b70247 [precompile] Refactor AOTAutogradCacheEntry to be generic (#152836)
The purpose of this stack is to create a new BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that is self contained, i.e. it contains all of the CompiledFxGraph directly in the entry, instead of relying on FxGraphCache._lookup_graph.

Because this would balloon the size of the actual cache entry, our goal is not to use BundledAOTAutogradCacheEntry in cache scenarios: only for precompile use cases. Thus, it's important we make this whole setup generic, to be able to support these two workflows clearly.

This PR genericizes AOTAutogradCacheEntry considerably, so that it can take in different types of Forwards and Backwards.

Each GenericAOTAutogradCacheEntry is composed of two parts, a TForward and a TBackward. The forward and backward can be loaded in multiple ways, either via FxGraphCache._lookup_graph, or by saving the entire CompiledFxGraph.

For simplicity, this PR only implements the generic code refactors needed, but does not fully implement BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that takes a full CompiledForward. We'll handle and implement BundledAOTAutogradCacheEntry in the PR above this, for easier review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152836
Approved by: https://github.com/oulgen
2025-05-06 15:19:17 +00:00
fcd5e49138 Revert "[dynamo] Recursively realize the stack_values (#152853)"
This reverts commit 460888f908ea4b634ecc863a6da6b2132108bc79.

Reverted https://github.com/pytorch/pytorch/pull/152853 on behalf of https://github.com/malfet due to Looks like it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/152853#issuecomment-2854897485))
2025-05-06 15:02:57 +00:00
f47bf38e30 [float16]: Fast path for torch.dot with float16/bfloat16 (#152799)
Fixes #152798

Add the fast path for dot with contiguous tensors for float16/bfloat16 types.
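
A hedged sketch, not from the PR, of the op in question with contiguous half-precision tensors (sizes are arbitrary):

```python
import torch

a = torch.randn(1_000_000, dtype=torch.float16)
b = torch.randn(1_000_000, dtype=torch.float16)
out = torch.dot(a, b)  # contiguous float16 inputs take the new fast path
print(out)
```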

Performance with patch (see issue for benchmark and current performance):

![Improved dot performance](https://github.com/user-attachments/assets/57f64e90-8191-4710-adb0-f430644827de)

**We see up to 10x+ improvement in performance.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152799
Approved by: https://github.com/malfet
2025-05-06 14:59:27 +00:00
b06cbd49f1 [Dynamo] Guard serialization for TENSOR_SUBCLASS_METADATA_MATCH (#152626)
This PR updates `GuardsStatePickler.reducer_override()` in `torch/_dynamo/guards.py` to handle reconstruction of traceable wrapper subclasses. It's intended to work recursively and handle any level of subclass instance nesting (e.g. subclass instances that contain subclass instances, etc.)

This PR tests the guard on several traceable wrapper tensor subclasses:
* `LocalSubclass`: used to ensure the correct error message is thrown when the subclass is not defined globally
* `torch.testing._internal.two_tensor.TwoTensor`: defines None for its extra metadata
* `SubclassWithMeta`: stores non-trivial extra metadata
* `SubclassWithCustomMetadataGuard`: stores non-trivial extra metadata and defines a custom `__metadata_guard__` classmethod
* `SubclassWithSubclassInnerTensors`: used to test recursiveness; this subclass contains subclass inner tensor components

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152626
Approved by: https://github.com/jansel
2025-05-06 14:06:36 +00:00
199d5a408a [partitioner] Fix argument to _broadcast_on_rank0 (#152846)
Summary:
There was a bug when I refactored my original implementation.

This should fix it

Test Plan: Run on some internal workloads

Differential Revision: D74190485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152846
Approved by: https://github.com/danthe3rd
2025-05-06 13:45:59 +00:00
bc11afd41f [Inductor] FX backend via Wrapper IR (#146942)
# Sub-PRs

These PRs contain refactors from the main one. They should be reviewed and merged first.

- https://github.com/pytorch/pytorch/pull/150458
- https://github.com/pytorch/pytorch/pull/152391
- https://github.com/pytorch/pytorch/pull/152587

# Feature

The goals of this PR are twofold.

## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen.

In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components.

This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR.

## Goal 2: Convert Wrapper IR into FX IR.

One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc.

It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes.

The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source.

Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa.

# Current status

Things that seem to work:

- Converted a lot of the most common Python codegen lines to Wrapper IR lines.
     - Handled the following cases, in addition to what was already in the Memory Planning IR:
         - Comments
         - Triton kernels
         - Extern/fallback kernels
         - Freeing tensors (`del buf0`)
         - MultiOutput
         - Graph outputs
         - ReinterpretView / StorageBox, for both call args and outputs.
     - FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code.
- Prototype FX converter which can handle some of the most common use cases.
   - Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565).
   - Calling wrapped Triton kernels.
   - Calling extern kernels and certain types of fallback kernels.
       - Support both `extern_kernels.*` and `aten.*`.
       - Support multi-output kernels like `torch.topk`.
   - Graphs with multiple inputs/outputs.
   - Training i.e. calling `Tensor.backward()` in a compiled function.
   - Graph breaks (training).
- Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen.

Things that don't work:
- Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections.
         - Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX.
         - Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR.
         - CommBuffer's and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. However, building and testing distributed support seems non-trivial, so it's probably better to defer this.

# Out-of-tree compilers

With this PR, out of tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as torch.compile's [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below.

```
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen

class MyCustomBackend(WrapperFxCodegen):
     def compile_graph(self, gm):
         # Add 1 to the graph's outputs
         def compiled_fn(*args):
             return [x + 1 for x in gm.graph.forward(*args)]
         return compiled_fn
```

# Example FX graphs

This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`.

Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's triton side table.
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    return (buf0,)
```

Here's a more complicated graph that calls a `torch.addmm` extern kernel.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}})
    %buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    return (buf2,)
```

Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {})
    %buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {})
    %buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}})
    %triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}})
    return (buf1, buf2)
```

Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    %buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {})
    return (buf0_view_buf0_0,)
```

Here's a graph with dynamic shapes. This one is a little bit funky. Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]` where `node` is the placeholder defining it. This works out in the generated python code because the placeholder defines a Python variable with the name `s6`.
```
graph():
    %s6 : [num_users=0] = placeholder[target=s6]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}})
    return buf0
```

Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions.
```
graph():
    %s10 : [num_users=0] = placeholder[target=s10]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}})
    return buf0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942
Approved by: https://github.com/jansel
2025-05-06 10:06:39 +00:00
e32a16a9da Correct torch.xpu.is_bf16_supported return False if no XPU detected (#152317)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/152301
When XPU is not available, calling `torch.xpu.is_bf16_supported()` still returns `True`, which is inconsistent with the expected behavior (should be False).

# Solution
Align with other backends by adding `including_emulation` to `torch.xpu.is_bf16_supported` (see the sketch below) so that it:
- returns `False` if XPU is not available
- returns `True` if `including_emulation` is True
- returns `torch.xpu.get_device_properties().has_bfloat16_conversions` if `including_emulation` is False, i.e. whether the device can generate SPIR-V code for bf16.
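
A hedged sketch of the resulting behavior, assuming the `including_emulation` keyword described above:

```python
import torch

# On a machine without XPU this now returns False instead of True.
print(torch.xpu.is_bf16_supported())

# With emulation excluded, the answer reflects native bf16 conversion support.
if torch.xpu.is_available():
    print(torch.xpu.is_bf16_supported(including_emulation=False))
```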

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152317
Approved by: https://github.com/EikanWang
2025-05-06 10:03:17 +00:00
8904ba6387 Forward fix D74196435 (#152926)
Summary: Forward fix a misplaced declaration from D74196435

Test Plan: Random check with a failed build `buck2 build --config fbcode.enable_gpu_sections=true --flagfile fbcode//mode/opt fbcode//accelerators/workloads/models/emu_flash/tests:test_compile_eager`

Reviewed By: wdvr

Differential Revision: D74225582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152926
Approved by: https://github.com/cyyever, https://github.com/wdvr
2025-05-06 07:33:38 +00:00
689e14ae00 [NFC] [inductor] [compile async] Warn exception if pickler failed (#152401)
An NFC change to help us find issues.

See https://github.com/pytorch/pytorch/issues/151904

CC @aorenste

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152401
Approved by: https://github.com/Skylion007
2025-05-06 07:12:35 +00:00
1dd36ad2d4 Fix conditional git diff in _link_check.yml (#152919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152919
Approved by: https://github.com/huydhn
2025-05-06 07:01:45 +00:00
0e2b948256 Revert "cleanup, refactor and add missing self._dde_suppressed checks (#152657)"
This reverts commit 784c666cae00f85ecf675298ddb056bebaf32f55.

Reverted https://github.com/pytorch/pytorch/pull/152657 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a test to fail in trunk ([comment](https://github.com/pytorch/pytorch/pull/152657#issuecomment-2853442594))
2025-05-06 06:45:07 +00:00
451d652873 Revert "Make device check error message more descriptive (#150750)"
This reverts commit 8253970a1f90a5b0b1fe0d4febd949470f6fa265.

Reverted https://github.com/pytorch/pytorch/pull/150750 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a test to fail in trunk ([comment](https://github.com/pytorch/pytorch/pull/150750#issuecomment-2853438985))
2025-05-06 06:42:08 +00:00
460888f908 [dynamo] Recursively realize the stack_values (#152853)
Might also fix - https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/jansel
2025-05-06 06:30:31 +00:00
dd766e1dc5 [audio hash update] update the pinned audio hash (#152885)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152885
Approved by: https://github.com/pytorchbot
2025-05-06 05:29:25 +00:00
784c666cae cleanup, refactor and add missing self._dde_suppressed checks (#152657)
so two things other than cleanups and refactoring
1) do not use propagate_real_tensors to resolve eval under guard_or_true/guard_or_false .
2) do not guard for dimensions of type  DimDynamic.OBLIVIOUS_SIZE under guard_or_true/guard_or_false .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152657
Approved by: https://github.com/pianpwk
2025-05-06 05:24:09 +00:00
e2eb845313 [ez] fix a bunch of typos in dynamo (#152886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152886
Approved by: https://github.com/williamwen42
2025-05-06 05:13:56 +00:00
37c71820f3 Fix nn.LazyModuleMixin examples (#150596)
Fixes #150404

## Test Result

![image](https://github.com/user-attachments/assets/e546339f-c1cb-47db-ab0e-276a42c167b8)

![image](https://github.com/user-attachments/assets/298db7ad-6512-4b17-9453-170ff843c4fd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150596
Approved by: https://github.com/mikaylagawarecki
2025-05-06 05:11:22 +00:00
337895eaaf Run url and xref linters independently (#152899)
Also introduce `skip-xref-lint` label

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152899
Approved by: https://github.com/huydhn
2025-05-06 05:02:32 +00:00
ee0cd1d8b5 Only do shallow clone when checkout nccl (#152826)
Note: `--depth` implies `--single-branch` since git 2.7.6

```sh
git clone https://github.com/NVIDIA/nccl.git
Cloning into 'nccl'...
remote: Enumerating objects: 4205, done.
remote: Counting objects: 100% (238/238), done.
remote: Compressing objects: 100% (122/122), done.
remote: Total 4205 (delta 144), reused 126 (delta 116), pack-reused 3967 (from 3)
Receiving objects: 100% (4205/4205), 4.22 MiB | 7.01 MiB/s, done.
Resolving deltas: 100% (2858/2858), done.
```
```sh
git clone --depth 1 --branch v2.25.1-1 https://github.com/NVIDIA/nccl.git
Cloning into 'nccl'...
remote: Enumerating objects: 249, done.
remote: Counting objects: 100% (249/249), done.
remote: Compressing objects: 100% (227/227), done.
remote: Total 249 (delta 31), reused 111 (delta 15), pack-reused 0 (from 0)
Receiving objects: 100% (249/249), 657.44 KiB | 2.14 MiB/s, done.
Resolving deltas: 100% (31/31), done.
Note: switching to '80f6bda4378b99d99e82b4d76a633791cc45fef0'.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152826
Approved by: https://github.com/albanD
2025-05-06 04:56:19 +00:00
97dfd8dd53 [invoke_subgraph] Run missing graph passes recursively (#152675)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152675
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #152772, #152770
2025-05-06 02:55:34 +00:00
cc254eaa7c [inductor][refactor] Refactor the fetching of subgraph names (#152770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152770
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #152772
2025-05-06 02:55:34 +00:00
b1d34acac5 [fx] Recursive DCE on subgraphs (#152772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152772
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-05-06 02:55:34 +00:00
35c727e7ff Fix typo on test_multi_device_context_manager for XPU (#152812)
# Motivation
Align with https://github.com/pytorch/pytorch/pull/152474 and fix the typo in the XPU UT introduced by https://github.com/pytorch/pytorch/issues/148864
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152812
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
2025-05-06 02:51:19 +00:00
470cd3a995 [aotinductor] Don't alloc weights if they don't exist (#152692)
Fixes https://github.com/pytorch/pytorch/issues/152356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152692
Approved by: https://github.com/henrylhtsang
2025-05-06 02:50:21 +00:00
8253970a1f Make device check error message more descriptive (#150750)
Fixes #122757

## Test Result

```python
import torch

model_output = torch.randn(10, 5).cuda()
labels = torch.randint(0, 5, (10,)).cuda()
weights = torch.randn(5)

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
loss = loss_fn(input=model_output, target=labels)
print(loss)

Traceback (most recent call last):
  File "/home/zong/code/pytorch/../loss2.py", line 17, in <module>
    loss = loss_fn(input=model_output, target=labels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
    return F.cross_entropy(
           ^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3494, in cross_entropy
    return torch._C._nn.cross_entropy_loss(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150750
Approved by: https://github.com/mikaylagawarecki
2025-05-06 02:33:20 +00:00
1d7728056b [nativert] Move TensorMeta to pytorch core (#152475)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

This diff moves `TensorMeta.cpp` and `TensorMeta.h` to PyTorch core under `torch/nativert/graph/`

Existing `torch::_export::TensorMeta` in `torch/csrc/utils/generated_serialization_types.h` is auto-generated from the export serde schema and therefore only containing the most basic serializable types. We need the newly added `TensorMeta.cpp` to deserialize the metadata into a in-memory class with c10 types so that it can be consumed by the runtime later.

Test Plan:

Added test under `test/cpp/nativert/test_tensor_meta.cpp`

Differential Revision: D73820548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152475
Approved by: https://github.com/albanD
2025-05-06 01:50:46 +00:00
1798b0db25 Use three-dot diffs in URL and xref lint workflows (#152895)
Only run on the files actually modified in a PR, not every file touched on main since the branch point

Fixes #152884

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152895
Approved by: https://github.com/huydhn
2025-05-06 01:33:52 +00:00
f097e83369 [inductor][retry] Realize bucketize/searchsorted output (#152858)
**Context**:
bucketize is computationally relatively expensive, so it's not always profitable to fuse it if that means doing extra computation. For example, this repro:

https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554

shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 * 384 - one for each of the broadcasted outputs.

**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.

After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).
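
A rough sketch of the shape of the problem, not from the PR (sizes follow the description above; the broadcasted consumer op is made up):

```python
import torch

boundaries = torch.linspace(0, 1, steps=1024)
x = torch.rand(2**15, 1)
idx = torch.bucketize(x, boundaries)  # 2**15 binary searches
out = idx.expand(2**15, 384) * 2      # broadcasted consumer of the result
# Realizing `idx` keeps the search at 2**15 lookups instead of 2**15 * 384.
```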

**Retry**
Original PR (https://github.com/pytorch/pytorch/pull/152644) was reverted due to internal bisect blaming this change, but the bisect was a false positive (and is marked as such)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152858
Approved by: https://github.com/aakhundov
2025-05-06 01:32:26 +00:00
14f8066910 Ensure mxfp8 scaled_mm works w/ max-autotune (#152744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152744
Approved by: https://github.com/Skylion007
2025-05-06 01:16:57 +00:00
ac792a0dca [submodule] Bump ITTAPI to 3.25.5 (#150263)
It hasn't been updated for 3 years. This also lets us remove the CMake 4 workaround.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150263
Approved by: https://github.com/sraikund16
2025-05-06 01:02:18 +00:00
721fdfa32d [ez] Fsspec Filesystem ls details should be false (#152693)
Summary: The default for ls on the local filesystem is details=False, but this isn't the case for all filesystems (e.g. huggingface). Set details=False explicitly so that the return type of ls is a list of strings, and not the list of dictionaries it would be with details=True.

Test Plan: tested in notebook

Differential Revision: D74080572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152693
Approved by: https://github.com/joecummings
2025-05-06 01:02:13 +00:00
4979ca5ffa Synchronize in foreach tests after profiling (#152857)
After the CI change from 12.4 -> 12.6 around mid-March, the foreach tests have been flaky and hard to repro due to nondeterminism. Per @davidberard98's suggestion, let's try to add a synchronize before checking profiler results to see whether this fixes the flake! The hope is that the 48 currently open foreach flaky issues will close from this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152857
Approved by: https://github.com/davidberard98
2025-05-06 00:56:48 +00:00
13dcf80a53 [dynamic shapes] use try-catch instead of guard_or_true for reshape_view_helper (#152638)
Test Plan: test_export

Differential Revision: D74033649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152638
Approved by: https://github.com/laithsakka
2025-05-06 00:54:24 +00:00
d197228d43 Revert "[CI] Use cmake from pip instead of conda in CI docker images (#152537)"
This reverts commit 3196a3aca0f16792820158cfd451cb977f99ac7e.

Reverted https://github.com/pytorch/pytorch/pull/152537 on behalf of https://github.com/huydhn due to We need signals from inductor, cmake version from pip is too old? ([comment](https://github.com/pytorch/pytorch/pull/152537#issuecomment-2852820175))
2025-05-06 00:22:23 +00:00
103fe856e1 Revert "Add infra to run CPython tests under Dynamo (#150787)"
This reverts commit 7c96dd8f0c9a7e17f598612405f002441c7f07ae.

Reverted https://github.com/pytorch/pytorch/pull/150787 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a failed test is showing up in trunk ([comment](https://github.com/pytorch/pytorch/pull/150787#issuecomment-2852818113))
2025-05-06 00:20:02 +00:00
0e9874849f [BE]: Update torch core lazy helpers with micropts (#152778)
Some minor nits I noticed. Use reserve when possible
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152778
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-05-06 00:03:51 +00:00
fd57c16285 Avoid triggering ignored requires_grad warning in our code (#152686)
This one is ok to silence as we're just doing formatting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152686
Approved by: https://github.com/Skylion007
2025-05-05 23:56:40 +00:00
125a3eee5c [ez] Use pip instead of conda in run_tests.sh (#152860)
Part 1 of https://github.com/pytorch/pytorch/issues/148336.  The rest depends on https://github.com/pytorch/pytorch/issues/148335 to remove conda from Docker build process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152860
Approved by: https://github.com/atalman
2025-05-05 23:06:55 +00:00
e3064bf0e3 [inductor] Allow num_program specification for TMA workspace (#152844)
Summary:
Allow TMA workspace creation to accept a `num_programs` specification, which defaults to `num_sms` when not specified.

We need a total of `num_programs * num_tma_descriptors` descriptors for a kernel.

Test Plan: CI.

Differential Revision: D74189599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152844
Approved by: https://github.com/drisspg
2025-05-05 23:02:55 +00:00
cc954848d4 Revert "[c10d] Fix extra CUDA context created by barrier (#149144)"
This reverts commit 457fa820ad538c7aeadb68f0ec418d63972ba1ee.

Reverted https://github.com/pytorch/pytorch/pull/149144 on behalf of https://github.com/huydhn due to Internal failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/149144#issuecomment-2852564660))
2025-05-05 22:56:50 +00:00
2ce6d169fc [IR] Input Adapter refactor prototype (#152459) (#152575)
Summary:

1. Adding `input` field to `_adapt_flat_args` function
2. In `process_forward_inputs`, `reorder_kwargs` will now do nothing if no kwargs are provided (previously would error)
3. Pass `args` as input to `_adapt_flat_args`

These changes are made to update the InputAdapter

see more context in D73811508

Test Plan: see D73811508

Differential Revision: D73945419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152575
Approved by: https://github.com/angelayi
2025-05-05 22:51:58 +00:00
a2ccda3c60 [pytorch][PR][inductor] Fix one instance of launch_enter_hook (#152831)
Summary: One usage seems missed in https://github.com/pytorch/pytorch/pull/152457

Test Plan: EMS local benchmark

Differential Revision: D74159749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152831
Approved by: https://github.com/danzimm
2025-05-05 22:15:47 +00:00
2b4fe9fa14 [Autotune Cache] Fix the bug of using the wrong key for recording artifacts in CacheArtifactManager (#152678)
Summary: Replace the key (path) from `<hash>.best_config` to `<parent_dir>/<hash>.best_config` to ensure that Autotune artifacts in MegaCache are loaded to the correct location locally.

Test Plan: NA

Differential Revision: D74052400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152678
Approved by: https://github.com/oulgen
2025-05-05 21:03:10 +00:00
d547c7e10d [fbgemm] Implement __obj_flatten__ for LinearPackedParamsBase (#152619)
Differential Revision: D73991241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152619
Approved by: https://github.com/jerryzh168, https://github.com/houseroad
2025-05-05 20:58:25 +00:00
22d1359bc6 Move warning from item to specific number conversions (#152709)
Follow-up to https://github.com/pytorch/pytorch/pull/143261 so that a plain .item() call does not warn.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152709
Approved by: https://github.com/malfet, https://github.com/ngimel
2025-05-05 20:46:05 +00:00
3bc69cc08d Document that dampening is skipped in SGD momentum first step (#152833)
Pointed out by https://x.com/hi_tysam/status/1917318692276174977/photo/2.

It would be BC-breaking to change this behavior 7 years after it was decided, so at the very least we are documenting it first.
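
A hedged sketch, not from the PR, of the documented behavior: the momentum buffer is initialized to the raw gradient on the first step, so dampening only applies from the second step onward (values chosen for easy arithmetic):

```python
import torch

p = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([p], lr=1.0, momentum=0.9, dampening=0.5)

p.grad = torch.ones(1)
opt.step()       # buf = grad = 1.0 (dampening skipped), p -> -1.0

p.grad = torch.ones(1)
opt.step()       # buf = 0.9 * 1.0 + (1 - 0.5) * 1.0 = 1.4, p -> -2.4
print(p.item())  # approximately -2.4
```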

<img width="642" alt="image" src="https://github.com/user-attachments/assets/3febcb07-e0ed-44a1-bd3b-a8e685711cb4" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152833
Approved by: https://github.com/albanD
2025-05-05 20:07:23 +00:00
99dac7005f Revert "[Inductor] FX backend via Wrapper IR (#146942)"
This reverts commit a7691140a0fed33a838dda11e28ff7da393d9180.

Reverted https://github.com/pytorch/pytorch/pull/146942 on behalf of https://github.com/malfet due to Looks like it indeed breaks lint, see a7691140a0/1 ([comment](https://github.com/pytorch/pytorch/pull/146942#issuecomment-2852192778))
2025-05-05 20:01:29 +00:00
a7691140a0 [Inductor] FX backend via Wrapper IR (#146942)
# Sub-PRs

These PRs contain refactors from the main one. They should be reviewed and merged first.

- https://github.com/pytorch/pytorch/pull/150458
- https://github.com/pytorch/pytorch/pull/152391
- https://github.com/pytorch/pytorch/pull/152587

# Feature

The goals of this PR are twofold.

## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen.

In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components.

This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR.

## Goal 2: Convert Wrapper IR into FX IR.

One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc.

It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes.

The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source.

Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa.

# Current status

Things that seem to work:

- Converted a lot of the most common Python codegen lines to Wrapper IR lines.
     - Handled the following cases, in addition to what was already in the Memory Planning IR:
         - Comments
         - Triton kernels
         - Extern/fallback kernels
         - Freeing tensors (`del buf0`)
         - MultiOutput
         - Graph outputs
         - ReinterpretView / StorageBox, for both call args and outputs.
     - FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code.
- Prototype FX converter which can handle some of the most common use cases.
   - Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565).
   - Calling wrapped Triton kernels.
   - Calling extern kernels and certain types of fallback kernels.
       - Support both `extern_kernels.*` and `aten.*`.
       - Support multi-output kernels like `torch.topk`.
   - Graphs with multiple inputs/outputs.
   - Training i.e. calling `Tensor.backward()` in a compiled function.
   - Graph breaks (training).
- Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen.

Things that don't work:
- Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections.
         - Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX.
         - Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR.
         - CommBuffer's and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. However, building and testing distributed support seems non-trivial, so it's probably better to defer this.

# Out-of-tree compilers

With this PR, out of tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as torch.compile's [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below.

```
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen

class MyCustomBackend(WrapperFxCodegen):
     def compile_graph(self, gm):
         # Add 1 to the graph's outputs
         def compiled_fn(*args):
             return [x + 1 for x in gm.graph.forward(*args)]
         return compiled_fn
```

# Example FX graphs

This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`.

Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's triton side table.
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    return (buf0,)
```

Here's a more complicated graph that calls a `torch.addmm` extern kernel.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}})
    %buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    return (buf2,)
```

Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {})
    %buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {})
    %buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}})
    %triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}})
    return (buf1, buf2)
```

Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    %buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {})
    return (buf0_view_buf0_0,)
```

Here's a graph with dynamic shapes. This one is a little bit funky. Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]` where `node` is the placeholder defining it. This works out in the generated python code because the placeholder defines a Python variable with the name `s6`.
```
graph():
    %s6 : [num_users=0] = placeholder[target=s6]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}})
    return buf0
```

Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions.
```
graph():
    %s10 : [num_users=0] = placeholder[target=s10]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}})
    return buf0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942
Approved by: https://github.com/jansel
2025-05-05 19:34:49 +00:00
fdadda21b6 Revert "[float16]: Fast path for torch.dot with float16/bfloat16 (#152799)"
This reverts commit d57bf53225004a684952222722a4f7322a21a596.

Reverted https://github.com/pytorch/pytorch/pull/152799 on behalf of https://github.com/malfet due to This broke C10_MOBILE builds, not sure why it was not surfaced on pull, see a766c1d117/1 ([comment](https://github.com/pytorch/pytorch/pull/152799#issuecomment-2852084433))
2025-05-05 19:17:59 +00:00
a766c1d117 [nativert] move intrusive list to c10/util (#152754)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff moves intrusive list to c10/util

Test Plan: CI

Differential Revision: D74104595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152754
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-05 18:49:56 +00:00
51e77f3b30 [dynamo] replace unimplemented with unimplemented_v2 in variables/torch_functions.py (#151278)
This addresses part of #147913.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151278
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
ghstack dependencies: #151277
2025-05-05 18:45:40 +00:00
9e24f9b523 [dynamo] replace unimplemented with unimplemented_v2 in variables/functions.py (#151277)
This addresses part of #147913.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151277
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-05-05 18:45:40 +00:00
d57bf53225 [float16]: Fast path for torch.dot with float16/bfloat16 (#152799)
Fixes #152798

Add the fast path for dot with contiguous tensors for float16/bfloat16 types.

Performance with patch (see issue for benchmark and current performance):

![Improved dot performance](https://github.com/user-attachments/assets/57f64e90-8191-4710-adb0-f430644827de)

**We see up to 10x+ improvement in performance.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152799
Approved by: https://github.com/malfet
2025-05-05 18:29:39 +00:00
172a7c942e Revert "Log aot and idx waitcounters. (#152444)"
This reverts commit ea9ea029595a5f628fdd368a6e1dd76e95707161.

Reverted https://github.com/pytorch/pytorch/pull/152444 on behalf of https://github.com/jovianjaison due to needs a fix ([comment](https://github.com/pytorch/pytorch/pull/152444#issuecomment-2851905261))
2025-05-05 18:11:37 +00:00
136ee4c81b Make assertion about pass callable print the bad pass (#152654)
If you pass an invalid string, you can now easily see what it is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152654
Approved by: https://github.com/eellison
2025-05-05 18:07:43 +00:00
fd6d4a6a24 [dynamo] Guard serialization for DICT_KEYS_MATCH (#152723)
DICT_KEYS_MATCH

Differential Revision: [D74091886](https://our.internmc.facebook.com/intern/diff/D74091886/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152723
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687, #152716, #152721
2025-05-05 18:05:56 +00:00
2da9ab4b1c [dynamo] Guard serialization for MAPPING_KEYS_CHECK (#152721)
MappingProxyType

Differential Revision: [D74091363](https://our.internmc.facebook.com/intern/diff/D74091363/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152721
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687, #152716
2025-05-05 18:05:56 +00:00
24e1666b3a [dynamo] Guard serialization for WEAKREF_ALIVE (#152716)
Punt on WEAKREF_ALIVE, as a weakref won't live across processes and users might need to drop them up front.

Differential Revision: [D74088735](https://our.internmc.facebook.com/intern/diff/D74088735/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152716
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687
2025-05-05 18:05:56 +00:00
2cb16df6e2 [dynamo] Guard serialization for DUPLICATE_INPUT. (#152687)
Seems this guard is not very active. Adding a test to detect error handling at least.

Differential Revision: [D74074837](https://our.internmc.facebook.com/intern/diff/D74074837/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152687
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616
2025-05-05 18:05:56 +00:00
ffd58293f7 [dynamo] Guard serialization for FUNCTORCH_STACK_MATCH (#152616)
Make Functorch interpreters serializable most of the time, so that we can save the guards on functorch states.

## Test Cases:

0. torch.compile() without functorch layers present. Guard should fail with any layer being pushed.
1. torch.compile() nested in vmap.
2. torch.compile() nested in grad.
3. torch.compile() nested in jvp + vmap
4. torch.compile() nested functionalize
5. torch.compile() nested in vmap + grad

Differential Revision: [D74008787](https://our.internmc.facebook.com/intern/diff/D74008787/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152616
Approved by: https://github.com/zou3519
ghstack dependencies: #152615
2025-05-05 18:05:56 +00:00
1d1cbcd8a3 [dynamo] Guard serialization for DUAL LEVEL. (#152615)
Seems the dual level counter should be stored in OutputGraph so that the value can be preserved through roundtripping.

Differential Revision: [D74008786](https://our.internmc.facebook.com/intern/diff/D74008786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152615
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-05-05 18:05:56 +00:00
0145f9e29e [CI] docker images use tags instead of image name (#152209)
Change CI docker images to be `ci-image:<image name>-<folder sha>` instead of `<image name>:<folder sha>` so we never have to make a new ecr repo ever again

Pros:
never have to make a new ecr repo ever again
Cons:
if it aint broken, dont fix it?

Don't need to change linux-test images since they use the "full name" of the image with the docker registry and the tag

In order to prevent others from needing to rebase past this PR, also push the image to the "old name". This can be removed after this PR has been in main for a while.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152209
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-05-05 18:02:29 +00:00
cyy
45efa1aaa8 [3/N] Use internal linkage in C++ files (#151297)
Follows #151070.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151297
Approved by: https://github.com/Skylion007
2025-05-05 17:48:39 +00:00
99287b170b Generate test reports for pytest when option is given (#152170)
The argument needs to be appended when test reports should be generated. IS_CI is not necessarily set, so check TEST_SAVE_XML instead, as in other places where test reports are conditionally enabled.

See also https://github.com/pytorch/pytorch/issues/126523
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152170
Approved by: https://github.com/Skylion007
2025-05-05 17:46:40 +00:00
kyo
a21090a38c Fix incorrect citation of authors in documentation (#145209)
This PR corrects the citation of Adafactor authors "Noam Shazeer" and "Mitchell Stern" in the documentation.
The current text incorrectly lists them as "Shazeer, Noam, and Mitchell Stern," which seems to be a result of a data parsing issue of some reference manager(s) [as you can find many papers with the same issue](https://www.google.com/search?q=%22Shazeer%2C+Noam%2C+and+Mitchell+Stern%22).
The updated citation follows standard conventions for author names.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145209
Approved by: https://github.com/janeyx99
2025-05-05 17:45:05 +00:00
ea9ea02959 Log aot and idx waitcounters. (#152444)
Summary:
Added for create_aot_dispatcher_function and compile_fx_inner.

Note:
Log wait counters flag is already set for:
1. async_compile.precompile
2. remote_fx_graph_cache_get
3. remote_fx_graph_cache_put

Test Plan: contbuild

Differential Revision: D73866124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152444
Approved by: https://github.com/ppanchalia, https://github.com/masnesral
2025-05-05 17:35:29 +00:00
35475a3e07 Disable SLEEF implementation of vec::maximum in vec128_float_neon.h | Accelerate aten::hardtanh_ by 21x (#152538)
The `has_inf_nan` implementation in `vec::maximum` is scalar, and it slows down certain activations like `tanh` by almost 20 times. Additionally, the `vec::minimum` function simply uses NEON intrinsics and not SLEEF. This PR makes the two functions similar in implementation.

Besides, the SLEEF function `Sleef_fmaxf4` ultimately invokes the `vmaxq_f32` NEON intrinsic through [vmax_vf_vf_vf](d28232a309/src/arch/helperadvsimd.h (L253)).

From a single threaded profile of mobilenet on an Arm Neoverse-V2 machine (code below), the `aten::hardtanh_` takes **5.653ms** per function call while using the current PyTorch 2.7 wheel, whereas it takes **266.096us** per function call while simply using `vmaxq_f32` - a 21x speedup, and overall inference is 1.8x faster.
___

Run the below script: `OMP_NUM_THREADS=1 python profile_mobilenet.py --iterations 10`
<details >
<summary>profile_mobilenet.py</summary>

```
import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity
import argparse

torch.manual_seed(42)

def load_mobilenet():
    model = models.mobilenet_v2(pretrained=True)
    model.eval()
    return model

def generate_sample_input(batch_size=8):
    return torch.randn(batch_size, 3, 224, 224)

def warmup(model, sample_input, num_warmup=10):
    with torch.inference_mode():
        for _ in range(num_warmup):
            _ = model(sample_input)

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch_size', type=int, default=8)
    parser.add_argument('--iterations', type=int, default=100)
    return parser.parse_args()

def main():
    args = parse_args()
    model = load_mobilenet()

    sample_input = generate_sample_input(args.batch_size)
    print("Warming up...")
    warmup(model, sample_input)
    print("Warmup complete.")
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with torch.inference_mode():
            for i in range(args.iterations):
                with record_function("model_inference"):
                    outputs = model(sample_input)

    print(prof.key_averages().table(sort_by="cpu_time_total"))
    print(f"Throughput: {(args.iterations * args.batch_size / (prof.profiler.self_cpu_time_total / 1e6)):.3f} images/s")

if __name__ == "__main__":
    main()
```

</details>

<details>
<summary>Profiler output using the current Pytorch 2.7 wheel </summary>

```
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                 model_inference         2.39%     101.839ms       100.00%        4.254s     425.437ms            10
                 aten::hardtanh_         0.02%     905.454us        46.50%        1.978s       5.653ms           350
                  aten::hardtanh         0.03%       1.239ms        46.48%        1.977s       5.650ms           350
                     aten::clamp        46.45%        1.976s        46.45%        1.976s       5.646ms           350
                    aten::conv2d         0.06%       2.468ms        43.89%        1.867s       3.591ms           520
               aten::convolution         0.06%       2.491ms        43.83%        1.865s       3.586ms           520
              aten::_convolution         0.13%       5.546ms        43.77%        1.862s       3.581ms           520
               aten::thnn_conv2d         0.04%       1.658ms        24.13%        1.027s       3.019ms           340
      aten::_slow_conv2d_forward        23.99%        1.021s        24.09%        1.025s       3.014ms           340
        aten::mkldnn_convolution        14.42%     613.285ms        19.51%     829.885ms       4.610ms           180
                aten::batch_norm         0.06%       2.368ms         6.89%     292.928ms     563.323us           520
    aten::_batch_norm_impl_index         0.11%       4.600ms         6.83%     290.560ms     558.769us           520
         aten::native_batch_norm         6.60%     280.762ms         6.69%     284.567ms     547.244us           520
                aten::contiguous         0.01%     623.099us         5.01%     213.152ms       1.184ms           180
                     aten::clone         0.02%     988.729us         5.00%     212.529ms       1.181ms           180
                     aten::copy_         4.94%     210.315ms         4.94%     210.315ms       1.052ms           200
                    aten::linear         0.00%      58.347us         0.18%       7.659ms     765.905us            10
                     aten::addmm         0.17%       7.373ms         0.18%       7.483ms     748.309us            10
                     aten::empty         0.17%       7.161ms         0.17%       7.161ms       1.790us          4000
                       aten::add         0.11%       4.742ms         0.11%       4.742ms      47.419us           100
                aten::empty_like         0.03%       1.315ms         0.09%       3.890ms       5.557us           700
                      aten::view         0.05%       1.933ms         0.05%       1.933ms       2.801us           690
               aten::as_strided_         0.04%       1.599ms         0.04%       1.599ms       8.885us           180
                   aten::resize_         0.04%       1.493ms         0.04%       1.493ms       2.871us           520
       aten::adaptive_avg_pool2d         0.00%      55.360us         0.04%       1.491ms     149.051us            10
                      aten::mean         0.00%     116.997us         0.03%       1.435ms     143.515us            10
                       aten::sum         0.02%     935.980us         0.02%     992.121us      99.212us            10
                    aten::detach         0.02%     707.217us         0.02%     707.217us       2.080us           340
                      aten::div_         0.00%     161.473us         0.01%     326.035us      32.604us            10
                        aten::to         0.00%     178.193us         0.01%     321.253us       0.892us           360
         aten::_nnpack_available         0.01%     302.835us         0.01%     302.835us       0.891us           340
                  aten::_to_copy         0.00%      63.170us         0.00%     143.060us      14.306us            10
                         aten::t         0.00%      49.759us         0.00%     117.621us      11.762us            10
                 aten::transpose         0.00%      40.637us         0.00%      67.862us       6.786us            10
                   aten::flatten         0.00%      42.634us         0.00%      58.867us       5.887us            10
                     aten::fill_         0.00%      56.141us         0.00%      56.141us       5.614us            10
                    aten::expand         0.00%      42.687us         0.00%      48.930us       4.893us            10
             aten::empty_strided         0.00%      40.589us         0.00%      40.589us       4.059us            10
                aten::as_strided         0.00%      33.468us         0.00%      33.468us       1.673us            20
              aten::resolve_conj         0.00%       9.066us         0.00%       9.066us       0.453us            20
                   aten::dropout         0.00%       5.782us         0.00%       5.782us       0.578us            10
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 4.254s

Throughput: 18.804 images/s
```

</details>

<details>
<summary>Profiler output after this PR's changes </summary>

```
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                 model_inference         4.43%     104.484ms       100.00%        2.359s     235.883ms            10
                    aten::conv2d         0.10%       2.313ms        79.19%        1.868s       3.592ms           520
               aten::convolution         0.10%       2.293ms        79.09%        1.866s       3.588ms           520
              aten::_convolution         0.23%       5.436ms        78.99%        1.863s       3.583ms           520
               aten::thnn_conv2d         0.08%       1.799ms        44.29%        1.045s       3.072ms           340
      aten::_slow_conv2d_forward        44.03%        1.039s        44.21%        1.043s       3.067ms           340
        aten::mkldnn_convolution        24.91%     587.584ms        34.47%     812.992ms       4.517ms           180
                aten::batch_norm         0.10%       2.350ms        11.83%     279.113ms     536.757us           520
    aten::_batch_norm_impl_index         0.20%       4.788ms        11.73%     276.764ms     532.238us           520
         aten::native_batch_norm        11.30%     266.660ms        11.46%     270.420ms     520.038us           520
                aten::contiguous         0.02%     575.723us         9.41%     222.080ms       1.234ms           180
                     aten::clone         0.04%       1.061ms         9.39%     221.504ms       1.231ms           180
                     aten::copy_         9.29%     219.131ms         9.29%     219.131ms       1.096ms           200
                 aten::hardtanh_         0.04%     917.669us         3.95%      93.133ms     266.096us           350
                  aten::hardtanh         0.05%       1.130ms         3.91%      92.216ms     263.474us           350
                     aten::clamp         3.85%      90.894ms         3.86%      91.086ms     260.246us           350
                    aten::linear         0.00%      68.681us         0.33%       7.899ms     789.945us            10
                     aten::addmm         0.32%       7.598ms         0.33%       7.707ms     770.673us            10
                     aten::empty         0.30%       7.176ms         0.30%       7.176ms       1.794us          4000
                       aten::add         0.20%       4.627ms         0.20%       4.627ms      46.268us           100
                aten::empty_like         0.06%       1.316ms         0.17%       3.973ms       5.676us           700
                      aten::view         0.08%       2.001ms         0.08%       2.001ms       2.899us           690
       aten::adaptive_avg_pool2d         0.00%      53.745us         0.07%       1.548ms     154.791us            10
                   aten::resize_         0.06%       1.533ms         0.06%       1.533ms       2.948us           520
               aten::as_strided_         0.06%       1.521ms         0.06%       1.521ms       8.450us           180
                      aten::mean         0.00%     117.637us         0.06%       1.494ms     149.417us            10
                       aten::sum         0.04%     973.291us         0.04%       1.013ms     101.342us            10
                    aten::detach         0.03%     652.224us         0.03%     652.224us       1.918us           340
                      aten::div_         0.01%     195.077us         0.02%     363.103us      36.310us            10
                        aten::to         0.01%     212.758us         0.02%     359.655us       0.999us           360
         aten::_nnpack_available         0.01%     295.235us         0.01%     295.235us       0.868us           340
                  aten::_to_copy         0.00%      68.726us         0.01%     146.897us      14.690us            10
                         aten::t         0.00%      53.873us         0.01%     124.033us      12.403us            10
                 aten::transpose         0.00%      42.512us         0.00%      70.160us       7.016us            10
                   aten::flatten         0.00%      44.040us         0.00%      66.631us       6.663us            10
                    aten::expand         0.00%      44.632us         0.00%      51.177us       5.118us            10
                     aten::fill_         0.00%      40.134us         0.00%      40.134us       4.013us            10
             aten::empty_strided         0.00%      35.291us         0.00%      35.291us       3.529us            10
                aten::as_strided         0.00%      34.193us         0.00%      34.193us       1.710us            20
              aten::resolve_conj         0.00%       8.594us         0.00%       8.594us       0.430us            20
                   aten::dropout         0.00%       6.758us         0.00%       6.758us       0.676us            10
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.359s

Throughput: 33.915 images/s
```

</details>

___

Using torchbench, the models `mobilenet_v2` and `mobilenet_v3_large` showed improvements as expected too.

Before -> After (latency in ms)
```
"mobilenet_v3_large-eval_latency": 1207.212 -> 844.902
"mobilenet_v2-eval_latency": 1029.834 -> 662.476
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152538
Approved by: https://github.com/Skylion007
2025-05-05 17:21:11 +00:00
131da0a982 Add a test for AsyncCollectiveTensor handling for maybe-view ops (#152688)
We never added a proper test for the fix from https://github.com/pytorch/pytorch/pull/134661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152688
Approved by: https://github.com/kwen2501
ghstack dependencies: #152195
2025-05-05 17:21:00 +00:00
5abe74857a SAC: fix recompute tag propagation for ops with list[tensor] inputs (#152195)
There's an "are we compiling" check in SAC, which we rely on to know when to propagate recompute tags during tracing.

This check was a bit brittle, and missed cases where input ops accept list of tensors - I updated it to check if a `FunctionalTensorMode` is active, which should be a 100% reliable way to know if AOTDispatcher is in the middle of running.

There is a long-standing followup here around unifying `torch.compiler.is_compiling()` to work in all cases. We should probably just update it to always check if FakeMode/FunctionalMode are active and use it there. This has a bit of BC risk though so I opted for the more local fix to SAC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152195
Approved by: https://github.com/soulitzer
2025-05-05 17:21:00 +00:00
7c96dd8f0c Add infra to run CPython tests under Dynamo (#150787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150787
Approved by: https://github.com/zou3519
2025-05-05 17:20:14 +00:00
50fe1b2349 Implement async manifold cache write (#152452)
Summary: This diff implements an AsyncManifoldCache class that performs cache writes and TTL updates asynchronously. Essentially we are OK with a fire-and-forget approach where we don't guarantee that we can observe our writes; this gives us better runtime latency.
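
A generic sketch of the fire-and-forget pattern described above (hypothetical names, not the actual AsyncManifoldCache code):
```
from concurrent.futures import ThreadPoolExecutor

class FireAndForgetCache:
    """Cache writes are submitted to a background thread and never awaited."""

    def __init__(self, backend):
        self._backend = backend
        self._pool = ThreadPoolExecutor(max_workers=1)

    def put(self, key, value):
        # Return immediately; the write may or may not be observable later.
        self._pool.submit(self._safe_put, key, value)

    def _safe_put(self, key, value):
        try:
            self._backend.put(key, value)
        except Exception:
            pass  # by design: trade write guarantees for lower runtime latency
```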

Test Plan: added new unit test

Reviewed By: jamesjwu

Differential Revision: D73867797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152452
Approved by: https://github.com/jamesjwu
2025-05-05 16:45:48 +00:00
3196a3aca0 [CI] Use cmake from pip instead of conda in CI docker images (#152537)
As in title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-05-05 16:32:40 +00:00
d119481717 [cutlass backend] Minor lru_cache to slightly speed up filtering ops (#152577)
For the default level, filtering went from taking 0.11332 seconds to 0.10064 seconds.

You can't really apply lru_cache too aggressively. For example, hashing a cutlass op takes a long time.

Removing a log further brings it down to 0.07202 seconds.

Differential Revision: [D73971021](https://our.internmc.facebook.com/intern/diff/D73971021/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152577
Approved by: https://github.com/chenyang78
2025-05-05 16:27:16 +00:00
9a9cc48c65 Update SGD documentation to match implementation (#149884)
Fixes #149476

This PR updates the pseudocode description of the SGD optimizer to better match the implementation.

Updated pseudocode:

![image](https://github.com/user-attachments/assets/2d7bc618-0408-4909-b835-af6465736918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149884
Approved by: https://github.com/janeyx99
2025-05-05 16:06:17 +00:00
7a2df6a00b [PGNCCL] Add FP8 support (#152706)
NCCL added support for `Float8e4m3` and `Float8e5m2` in 2.24.

NVIDIA GPUs do not seem to support the following "no negative zero" versions: `Float8_e4m3fnuz` and `Float8_e5m2fnuz` (see https://onnx.ai/onnx/technical/float8.html), so we continue to error out for these two upon a reduction op.
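
A hedged sketch of the kind of collective this enables (assumes an already-initialized NCCL process group, NCCL >= 2.24, and a CUDA build with float8 dtypes):
```
import torch
import torch.distributed as dist

def allreduce_fp8(local_rank: int) -> torch.Tensor:
    t = torch.ones(8, device=f"cuda:{local_rank}").to(torch.float8_e4m3fn)
    dist.all_reduce(t)  # SUM across ranks; the *_fnuz variants still error out
    return t.to(torch.float32)
```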

Test plan:
- test_allreduce_float8
- test_reduce_scatter_float8

Resolves #148344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152706
Approved by: https://github.com/d4l3k, https://github.com/eqy, https://github.com/fduwjj, https://github.com/cyyever
2025-05-05 16:02:27 +00:00
a1516d9e6e Add "#pragma once" to CachingHostAllocator.h (#152800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152800
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-05 15:21:14 +00:00
fe36d7dc44 [MPSInductor] Fix truncdiv implementation (#152788)
For integral dtypes it should be just an alias for division

Fixes `GPUTests.test_div7_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152788
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #152663, #152515, #152737, #152743, #152758
2025-05-05 13:31:51 +00:00
87f2bd2439 Remove conda usage in windows binary builds (#151035)
This is related to : https://github.com/pytorch/pytorch/issues/146048
Removing conda from windows binary builds. At this point we are only removing conda and replacing it with python builds. Not rewriting all batch files as python or bash.

Additionally cleanup unused files:
```
.ci/pytorch/windows/internal/static_lib_test.bat
.ci/pytorch/windows/internal/env_fix.bat
.ci/pytorch/windows/internal/vs_install.bat
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151035
Approved by: https://github.com/cyyever, https://github.com/clee2000, https://github.com/malfet
2025-05-05 13:09:05 +00:00
0a470dc7c1 [inductor] fix lowering for cummin, cummax for one element tensors (#151931)
Fixes https://github.com/pytorch/pytorch/issues/151738
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151931
Approved by: https://github.com/eellison
2025-05-05 13:05:59 +00:00
2825a28bf1 Exempt overriding methods from docstring_linter (fix #151692) (#151906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151906
Approved by: https://github.com/Skylion007
2025-05-05 12:39:42 +00:00
9210a98b92 [xla hash update] update the pinned xla hash (#152809)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152809
Approved by: https://github.com/pytorchbot
2025-05-05 11:21:11 +00:00
ac9fcd6346 [Inductor][CPU] bug fix for int8 GEMM compensation epilogue (#152408)
Fixes #152398

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152408
Approved by: https://github.com/leslie-fang-intel
2025-05-05 08:26:47 +00:00
7e637de9cb [Flight Recorder] Added logging after FR dump completed (#152648)
Summary: TSIA

Test Plan: eyes

Differential Revision: D74041147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152648
Approved by: https://github.com/fduwjj, https://github.com/wdvr
2025-05-05 06:17:47 +00:00
0ffd31dc8a [MPS] Migrate div rounding modes (#152758)
By implementing `div_floor` and `div_trunc`. Do not mark `div_trunc` as OPMATH, to align the following output with CPU (if the division is performed in fp32, the result will be truncated to 25):
```
import torch
print(torch.tensor([[-7.4688, -3.1289]], dtype=torch.float16,device="cpu").div(torch.tensor([-0.2988, -0.8789], dtype=torch.bfloat16,device="cpu"), rounding_mode="trunc"))
tensor([[24.,  3.]])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152758
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515, #152737, #152743
2025-05-05 03:02:29 +00:00
93d8f6ee32 [reland] Detailed triton kernel logging (#152694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152694
Approved by: https://github.com/Skylion007
2025-05-05 02:46:57 +00:00
a78eec88b8 Implement util function compute_global_tensor_shape for 1D device mesh (#152751)
### Summary

Recreating #151990 to mitigate easyCLA failure

The compute_global_tensor_shape util function takes in the local tensor shape, device mesh,
and placements. We all-gather the shapes from the shards and, according to the placement
type, construct the global shape.

Note: currently only implemented for placement types Shard and Replicate, TODO for StridedShared
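
A rough sketch of the idea for a 1D mesh (illustrative only, not the DTensor implementation):
```
import torch.distributed as dist

def compute_global_shape_1d(local_shape, is_shard: bool, shard_dim: int = 0, group=None):
    world_size = dist.get_world_size(group)
    shapes = [None] * world_size
    dist.all_gather_object(shapes, list(local_shape), group=group)
    if not is_shard:  # Replicate: every rank already holds the full tensor
        return tuple(shapes[0])
    global_shape = list(shapes[0])
    global_shape[shard_dim] = sum(s[shard_dim] for s in shapes)  # Shard: sum along shard_dim
    return tuple(global_shape)
```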

### Test

`pytest test/distributed/tensor/test_utils.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152751
Approved by: https://github.com/XilunWu
2025-05-05 02:44:31 +00:00
30453d60dd Add methods for checking Triton availability to the device interface (#152529)
Adds the `is_triton_capable` and `raise_if_triton_unavailable` class methods to the device interface, to allow device types to run their own checks for Triton _capability_ (which means a device can actually support Triton in the first place) and _availability_ (if the correct backend of Triton is installed and is functional for the device).

Using the device interface allows us to do these checks in a device-agnostic way, allowing external backends to attest their Triton support by simply implementing those methods. The intention is for this to back things like the `has_triton` utility method.
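
A sketch of how an out-of-tree backend might implement the new hooks (the exact signatures are an assumption based on the description above):
```
from torch._dynamo.device_interface import DeviceInterface

class MyAcceleratorInterface(DeviceInterface):
    @classmethod
    def is_triton_capable(cls) -> bool:
        # Capability: the device can, in principle, run Triton kernels.
        return True

    @classmethod
    def raise_if_triton_unavailable(cls) -> None:
        # Availability: the correct Triton backend is installed and functional.
        try:
            import triton  # noqa: F401
        except ImportError as exc:
            raise RuntimeError("Triton backend for this device is not installed") from exc
```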

This has been split from #139171.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152529
Approved by: https://github.com/jansel
2025-05-05 00:55:53 +00:00
8dbe1ff34b Revert "Avoid triggering ignored requires_grad warning in our code (#152686)"
This reverts commit f51bee137518cde82e88ec655988e7eb1b94a3f3.

Reverted https://github.com/pytorch/pytorch/pull/152686 on behalf of https://github.com/wdvr due to failinginternal test, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/152686#issuecomment-2849497208))
2025-05-04 23:34:34 +00:00
49b9efdf1f [BE]: Cleanup traceutils with fmtlib (#152265)
Simplify code and make it faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152265
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-04 22:27:19 +00:00
82cb202de7 [Inductor][NCU] Add kernel name filtering, and allow custom metrics (#150872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150872
Approved by: https://github.com/FindHao

Co-authored-by: Yueming Hao <yhao@meta.com>
2025-05-04 20:49:19 +00:00
b117a6c47b Fix two error messages involving Tensor.dense() (#152631)
Two error messages in the codebase instruct the user to use `Tendor.dense()`. This method doesn't exist, but `Tensor.to_dense()` does, and this is what the user should be using instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152631
Approved by: https://github.com/jansel
2025-05-04 20:44:08 +00:00
220870ce9e [caffe2] Support building for armv8.1 (#152766)
Summary:
- Remove explicit `-march=` compiler flags, as they're already implied by
   the toolchain:
https://www.internalfb.com/code/fbsource/[7f85b0565073]/fbcode/tools/build/buck/wrappers/defs.bzl?lines=819
- Gate non-8.1 compliant opcodes with `__ARM_FEATURE_*`.

Test Plan: CI

Reviewed By: rahulg

Differential Revision: D74023601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152766
Approved by: https://github.com/Skylion007
2025-05-04 19:09:21 +00:00
a69da90a9f Add pad limit of avg_poolnd and AvgPoolnd (#152680)
Fixes #152156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152680
Approved by: https://github.com/mikaylagawarecki
2025-05-04 17:25:22 +00:00
cyy
370e23388d Set CMake 3.5 as minimum version in pytorch_android (#152769)
I saw pytorch_android failure in docker image builds. This fix attempts to bypass CMake 4 limitations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152769
Approved by: https://github.com/Skylion007
2025-05-04 16:57:22 +00:00
8f54e56e62 Add optional device index to AOTIModelPackageLoader (#152093)
This is my suggestion for resolving #152087

This PR extends the constructor of `AOTIModelPackageLoader` with an (optional) device index. The device type is still determined by `metadata_["AOTI_DEVICE_KEY"]`, but the `device_index` argument can be used to move an AOTI model package to different devices like `cuda:0`, `cuda:1`, ... in a convenient way. AFAIK, this is not possible so far using `AOTIModelPackageLoader` alone. The default case (no device index specified) with `metadata_["AOTI_DEVICE_KEY"] == "cuda"` would lead to the current behavior, i.e., the model is loaded to device `cuda`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152093
Approved by: https://github.com/desertfire
2025-05-04 11:40:12 +00:00
fd8fd01d25 [OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914)
As the title stated.

**Changes**:
- Add get_rng_state & set_rng_state support for OpenReg
- Add _lazy_init support for OpenReg
- Remove redundant code for cuda/Module.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151914
Approved by: https://github.com/albanD
2025-05-04 09:42:08 +00:00
c8bac51ec1 Remove the unnecessary cuda/Tensor.cpp (#152522)
As the title stated.

**Question:**

I have carefully looked through all the .h files in Tensor.cpp and from my perspective this file does not make sense. Does anyone know what the background is for doing this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152522
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/eqy
ghstack dependencies: #152512, #152513, #152521
2025-05-04 07:15:11 +00:00
8562457cba Make torch/csrc/utils.h to be device-agnostic (#152521)
`torch/csrc/utils.h` should be device-independent. Currently, it contains CUDA-related implementations, which indirectly causes the [failure of ROCm testing](https://github.com/pytorch/pytorch/pull/151914#issuecomment-2839691038) (the reason is that the ROCm test environment shouldn't expose HIP-related header files, which causes the JIT compilation to fail during testing).

Therefore, move CUDA-related implementations to `torch/csrc/cuda/utils.h`.

**Question:**
This change may introduce a BC break.
I searched for this function globally on GitHub and I think the impact is very small.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152521
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #152512, #152513
2025-05-04 07:15:11 +00:00
e889937850 [MPS] Migrate div to Metal (#152743)
TODOs:
 - Verify accuracy of  `metal::dot` vs `x.x*x.x + y.y*y.y`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152743
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #152663, #152515, #152737
2025-05-04 00:56:19 +00:00
8faa225695 Revert "[inductor] Realize bucketize/searchsorted output (#152644)"
This reverts commit 9ae4906b21cbd186a493a9564e22a42da2184e3a.

Reverted https://github.com/pytorch/pytorch/pull/152644 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/152644#issuecomment-2848743442))
2025-05-03 18:16:39 +00:00
6ae690f8f0 add support for 0 size shardedTensor and recalculate metadata from all_gather (#152583)
Summary:
Change set:
1. A ShardedTensor could have 0 size initially; the current check won't pass if the size is 0, so that case is added here.
2. When we call ShardedTensor._init_from_local_shards, it assumes all the metadata is correct and all_gathers to double check. In the new case, the metadata could be all 0 size while the tensor has an actual size, so we need to provide the capability to recalculate the local/global metadata from the local tensor by all_gathering the information.

Test Plan: I don't see an associated UT; I have tested this with the diff stack D73274786.

Differential Revision: D73903933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152583
Approved by: https://github.com/q10, https://github.com/fduwjj
2025-05-03 17:26:29 +00:00
762844355e Make DispatchKeySet serializable; add __eq__ (#152732)
These seem like reasonable things to add. Also fixes a bug in vLLM for
me.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152732
Approved by: https://github.com/bdhirsh
2025-05-03 14:40:06 +00:00
792736f9ac [BE][MPS] Pass alpha by reference (#152737)
As it's always a scalar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152737
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515
2025-05-03 08:31:45 +00:00
cc28b43950 Revert "[ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)"
This reverts commit 844842dfbf937c43b41c528e461d3f3931bca6e9.

Reverted https://github.com/pytorch/pytorch/pull/151368 on behalf of https://github.com/malfet due to This broke inductor cpp wrapper ([comment](https://github.com/pytorch/pytorch/pull/151368#issuecomment-2848519706))
2025-05-03 08:31:31 +00:00
457fa820ad [c10d] Fix extra CUDA context created by barrier (#149144)
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.
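
A sketch of the usage pattern this relies on (assuming the usual one-process-per-GPU launch with LOCAL_RANK set):
```
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),  # dummy dispatch tensors land on this GPU
)
dist.barrier()  # no stray CUDA context gets created on cuda:0
```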

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144
Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever
2025-05-03 03:13:34 +00:00
34e9f0b5c6 [MPS] Migrate mul to TensorIterator (#152515)
What was initially supposed to be a very straightforward change resulted in a small refactor of the binary op tensor generators when invoked for mixed dtypes, which surfaced via a `test_output_grad_match_sinc_mps_float16` test failure.

If the operands are of different dtypes (in particular a float16 tensor and a float32 scalar), one must perform the operation at `opmath_t` (or `TensorIterator::common_dtype()`) precision, rather than casting both operands to the output dtype and then performing it, as demonstrated by the following example:
```
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926,  8.5938,  5.9766], dtype=torch.half).mul(torch.pi)
tensor([ -5.8555,  19.4844,  -7.0703, -10.6562,  27.0000,  18.7812],
       dtype=torch.float16)
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926,  8.5938,  5.9766], dtype=torch.half).mul(torch.tensor(torch.pi, dtype=torch.float16))
tensor([ -5.8516,  19.4844,  -7.0664, -10.6562,  26.9844,  18.7656],
       dtype=torch.float16)
```

Solve this problem for now by introducing `REGISTER_OPMATH_BINARY_OP`, which indicates that operands must be cast to opmath_t before performing the computation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152515
Approved by: https://github.com/Skylion007, https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #152663
2025-05-03 02:35:03 +00:00
1cd68c59dd Remove incorrect assertion (#152653)
It's only aspirational that the 'improvement' value is positive. In fact,
the pass could make a collective more exposed, and we shouldn't assert
here in that case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152653
Approved by: https://github.com/eellison
ghstack dependencies: #152565
2025-05-03 02:33:58 +00:00
84aa0985fb [Inductor] Add decomposeK as an autotuning choice for mm (#150654)
As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.
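
A minimal sketch of the kind of skinny, deep-K matmul where decompose_k tends to be selected (GPU assumed; the choice only participates under `torch.compile`):
```
import torch

def mm(a, b):
    return a @ b

a = torch.randn(128, 32768, device="cuda", dtype=torch.bfloat16)
b = torch.randn(32768, 128, device="cuda", dtype=torch.bfloat16)

# max-autotune lets inductor benchmark its mm choices, now including decompose_k.
compiled_mm = torch.compile(mm, mode="max-autotune")
out = compiled_mm(a, b)
```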

Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton Templates as well without requiring tons of more compile time, async compilation. Anecdotal evidence shows that Triton BMM performs better usually than aten BMM
* Add for addmm
* Enable for Inference and AOTI

Below are the results of running TritonBench for Split-K shapes, comparing the aten performance versus pt2_triton, which now autotunes on decompose_k, seeing >10% speedup compared to aten on average, and for some shapes over 3x the performance of the best Triton mm previously:

<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />

TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />

We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.

Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
2025-05-03 02:23:54 +00:00
5e9682719f [Inductor UT] Generalize device-bias code in test_flex_attention.py (#151937)
@EikanWang @etaf @guangyey please take a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151937
Approved by: https://github.com/drisspg
2025-05-03 01:12:49 +00:00
73b6b1ded4 [inductor][invoke_subgraph] Free the buffers before the subgraph call (#152494)
Before
![image](https://github.com/user-attachments/assets/62b24c14-69e6-40fb-94e3-223930132ef6)

After
![image](https://github.com/user-attachments/assets/9f340d4e-80a9-45aa-9400-626fff5b5ecd)

tlparse - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmph5dwWt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152494
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-03 00:38:08 +00:00
36140e01fd Rename "startup-tracing-compile" to "compile-time" in label_to_label.yml (#152711)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152711
Approved by: https://github.com/oulgen
2025-05-03 00:35:05 +00:00
3d777bae10 Inductor respects exact strides on custom ops by default (#150511)
If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.
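
For illustration, a custom op registered with no layout tag at all; under the new default, inductor will preserve the exact eager strides of its inputs (a sketch, not the PR's tests):
```
import torch

@torch.library.custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    return x * factor

@scale.register_fake
def _(x: torch.Tensor, factor: float) -> torch.Tensor:
    return torch.empty_like(x)

@torch.compile
def f(x):
    # The transposed (non-contiguous) input reaches the op with its eager strides.
    return scale(x.transpose(0, 1), 2.0)

print(f(torch.randn(4, 8)))
```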

Test Plan:
- tests + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #148104
2025-05-03 00:02:24 +00:00
2b37a726e0 Refactor layout constraint selection logic (#148104)
This PR:

- cleans up some existing comments that don't make sense anymore
- hooks up the "custom_op_default_layout_constraint" back (that seems to
have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides

Test Plan:
- tests + CI

disable padding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-05-03 00:02:24 +00:00
0e59b594ee [SymmMem] Use cub's BlockScan instead of in-house impl for offset calculation (#151993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151993
Approved by: https://github.com/ngimel
ghstack dependencies: #151261, #151498, #151819
2025-05-02 23:40:47 +00:00
2107d87dc9 [BE] remove outdated warning about TORCH_CUDA_ARCH_LIST (#152715)
I saw this warning when compiling a 3rd party lib and did not agree with it. I'm not sure of the original reason why we would want to force people to pass TORCH_CUDA_ARCH_LIST to cmake rather than set it as an env var. As a developer, it's much easier to set it as an env var or have it be autodetected. I also realized this warning was from before 2018!!! 7 years ago! And there are no plans to actually enforce this (nor should there be), so let's remove this misleading warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152715
Approved by: https://github.com/malfet, https://github.com/zou3519
2025-05-02 23:00:51 +00:00
a6ea63a841 [FlexAttention] explicilty create grad_q w/ strides (#152641)
Fixes: #147463

Inductor's lowering for empty_like does not match the behavior of eager: the strides do not respect preserve_format.

https://github.com/pytorch/pytorch/issues/144699
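
In a nutshell, the eager behavior the fix matches (illustrative, not the FlexAttention code):
```
import torch

q = torch.randn(2, 8, 128, 64).transpose(1, 2)  # dense but non-contiguous strides
g = torch.empty_like(q)                          # memory_format=torch.preserve_format by default
assert g.stride() == q.stride()                  # eager preserves the input strides
```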

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152641
Approved by: https://github.com/xmfan
2025-05-02 22:57:26 +00:00
54f29b04d6 Improve error wording in _link_check.yml (#152726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152726
Approved by: https://github.com/huydhn
2025-05-02 22:43:05 +00:00
730a077d48 [ROCm] Unskipped test_rnn_dropout_state for ROCm (#152339)
Unskipping the test, should work fine now.

Related PR: https://github.com/pytorch/pytorch/pull/144572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152339
Approved by: https://github.com/jeffdaily
2025-05-02 22:02:30 +00:00
ea12a38668 [associative_scan] Refactoring of input checking and dynamo invocation (#148657)
This PR is the counterpart of https://github.com/pytorch/pytorch/pull/142125 for the associative_scan operation. The way the input checks are performed and the combine_fn is not invoked in the frontend to check the output trees, but rather dynamo is used for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148657
Approved by: https://github.com/ydwu4
2025-05-02 21:39:28 +00:00
8afe40bc5e [Inductor] Fix kernel argument ordering when using dynamic shapes with workspace (#152660)
Summary:
This PR fixes a bug in the Triton kernel invocation path where the `workspace_tensor` was inserted before the unpacked `extra_args` list in the final kernel argument list. This broke the expected ordering of arguments when dynamic shape size hints are emitted.

When dynamic shapes are used, `extra_args` contains both size hint arguments and grid arguments. The kernel expects the argument list to follow the order: **size hints → workspace tensor → grid args**. But previously, the `workspace_tensor` was inserted before unpacking `extra_args`, resulting in: **workspace tensor → size hints → grid args**, which is incorrect.

This fix constructs the workspace tensor earlier, allowing it to be slotted in after the size hints and before the grid arguments, restoring the expected argument layout.

Test Plan:
contbuild and OSS CI

Reviewers: paulzhan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152660
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg
2025-05-02 21:32:07 +00:00
add4702ebc [Inductor] Introduce Wrapper IR line for symbolic call args (#152587)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.

This PR introduces a new wrapper IR line to represent symbolic call args. This deletes a little bit of duplicated code between the Python and C++ backends. In the main PR, having a Wrapper IR line for this also tells the FX backend what this part of the wrapper code is doing. Before this PR, symbolic call args generated raw Python lines, which confuse the FX converter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152587
Approved by: https://github.com/jansel
2025-05-02 20:37:00 +00:00
9ae4906b21 [inductor] Realize bucketize/searchsorted output (#152644)
**Context**:
bucketize is relatively expensive, computationally. So it's not always profitable to fuse it if it means doing extra computation. For example, this repro:

https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554

shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 * 384 - one for each of the broadcasted outputs.
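
A hedged reconstruction of the shape pattern (hypothetical sizes in the spirit of the linked repro, not the repro itself):
```
import torch

boundaries = torch.linspace(0.0, 1.0, steps=1024)
queries = torch.rand(2**15, 1)
buckets = torch.bucketize(queries, boundaries)  # 2**15 binary searches
out = buckets + torch.arange(384)               # a consumer broadcasts to (2**15, 384)
```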

**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.

After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).

Differential Revision: [D74036850](https://our.internmc.facebook.com/intern/diff/D74036850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152644
Approved by: https://github.com/benjaminglass1, https://github.com/eellison
2025-05-02 20:31:17 +00:00
74b496e54c Cleanup DeviceInterface in triton test (#152409)
- Remove inherited functions
- Return valid device_count (1 device: idx=0)
- Remove unused function `triton_supported`

Followup to #144399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152409
Approved by: https://github.com/jansel
2025-05-02 20:25:32 +00:00
44f29a3669 Add parameters for monitor (#152541)
Add log interval and log-data-collect interval to all test yml

Add upload step for all test yml files

next step:
enable perf test with utilization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152541
Approved by: https://github.com/huydhn
2025-05-02 20:24:11 +00:00
ec68d082a1 [CUDA][TF32] Account for TF32 in test_conv2d_same_padding (#152618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152618
Approved by: https://github.com/msaroufim, https://github.com/Skylion007
2025-05-02 20:19:00 +00:00
39c0b01970 [ez] Disable failing test in periodic no gpu no avx (#152698)
Failing on periodic after it was added in #152542
Example:
inductor/test_cpu_repro.py::CPUReproTests::test_tanh_atan2_use_decompose_tanh [GH job link](https://github.com/pytorch/pytorch/actions/runs/14775755628/job/41485185829) [HUD commit link](6f6acb4128)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152698
Approved by: https://github.com/huydhn, https://github.com/hl475
2025-05-02 20:02:48 +00:00
a6dd1c2208 [DCP] Add 30min timeout for IPC communications in async checkpointing (#152629)
Summary:
### Diff Context
- Sometimes the background process can get stuck processing an async checkpoint request, and trainer shutdown can occur before the background process completes.
- Fix: time out the thread while reading the IPC queue for a response from the background process (a generic sketch of the pattern follows below).
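
A generic sketch of that pattern (hypothetical names, not the DCP implementation):
```
import queue

RESPONSE_TIMEOUT_S = 30 * 60  # 30 minutes

def wait_for_async_checkpoint_ack(response_queue: "queue.Queue"):
    try:
        return response_queue.get(timeout=RESPONSE_TIMEOUT_S)
    except queue.Empty:
        raise TimeoutError(
            "No response from the async checkpoint background process within 30 minutes"
        )
```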

Differential Revision: D74017700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152629
Approved by: https://github.com/saumishr
2025-05-02 19:36:22 +00:00
5d860c1e54 [ROCm][CI] Enabled fp8 distributed tests in test_micro_pipeline_tp.py for MI300 (#151977)
This PR enables fp8 distributed tests on MI300.
To test the added feature, we ran the distributed.tensor.parallel.test_micro_pipeline_tp test; all tests passed successfully and none were skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151977
Approved by: https://github.com/jeffdaily
2025-05-02 19:22:18 +00:00
d457b4492d Optimize Sequential methods description (#147304)
Fixes #146892

Add methods description and examples for [`Sequential` document](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html)

## Test Result

### Before

![image](https://github.com/user-attachments/assets/3121a06f-02ed-4362-ad0a-f055bb43d469)

### After

![image](https://github.com/user-attachments/assets/66f6bb55-5298-4062-8f7f-7a7f4c1e16d9)
![image](https://github.com/user-attachments/assets/a5275a4c-4214-4518-b7a2-dff21954f368)
![image](https://github.com/user-attachments/assets/9c40d1fb-114a-4d14-a3c4-1143a131660e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147304
Approved by: https://github.com/mikaylagawarecki
2025-05-02 19:18:58 +00:00
eqy
216d81da81 [CUDA][complex] skip test_reference_numerics_large_jiterator_unary_cuda_complex64 on CUDA (#148024)
already skipped on ROCM for a similar reason, recent numpy versions changed convention from `nan+infj` to `-inf+infj`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148024
Approved by: https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2025-05-02 19:11:11 +00:00
16153a0f27 [AOTAutogradCache][Easy] Move "einops.einops.rearrange" to SAFE_NON_TORCH_FUNCTIONS (#152640)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152640
Approved by: https://github.com/oulgen, https://github.com/zou3519, https://github.com/bdhirsh
2025-05-02 19:09:30 +00:00
0488883d6e [cuDNN][SDPA] Fix head-dim 256 condition for SM 10.0 (#152076)
turns out the backward is not supported yet, whoops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152076
Approved by: https://github.com/drisspg
2025-05-02 18:43:33 +00:00
07290bdcdc Skip search for MKL on ARM cpus (#145850)
It will not find it anyway, and this makes it a bit easier to parse through the CMake log on non-x86 systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145850
Approved by: https://github.com/atalman
2025-05-02 18:39:49 +00:00
1ea2731e26 [ROCm] Add support for SymmetricMemory (#150580)
This is an attempt to re-land the initial PR https://github.com/pytorch/pytorch/pull/134817 with recent design changes from upstream.

**NOTE:**
ROCm does NOT have multicast/multimem hardware support at the moment, so those features are disabled in symmetric memory for ROCm. This also means that we currently do not have a way of lowering add + all_reduce + wait_tensor into a one_shot_all_reduce op in inductor, as that depends on multicast buffer support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150580
Approved by: https://github.com/jeffdaily, https://github.com/kwen2501, https://github.com/yoyoyocmu

Co-authored-by: Xiaodong Wang <xdwang@fb.com>
2025-05-02 18:35:14 +00:00
376529c78b consolidate guard_or_x and definitely_x (#152463)
definitely_true is almost the same as guard_or_false; the potential differences are not meaningful enough to justify the
existence of both. The same goes for definitely_false, which can be expressed with guard_or_true and guard_or_false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463
Approved by: https://github.com/bobrenjc93
2025-05-02 18:08:11 +00:00
72337bdcf2 [ATen][CUDA] Optimize 128 bit vectorization (#148320)
Fixes #147376.
As per request: https://github.com/pytorch/pytorch/pull/145746#pullrequestreview-2642118301
This PR omits vec8 kernels on sm80 or older due to long compilation times and large binary size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148320
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman
2025-05-02 17:35:44 +00:00
3baa85cfad [StaticCudaLauncher] Ensure cuda context exists before launching kernels (#152667)
Triton does this already due to  https://github.com/triton-lang/triton/pull/3731/files, in order to fix https://github.com/pytorch/pytorch/issues/124565. We need to do the same thing as triton here, so that in cases with no compilation we still have a cuda context in the backward autograd thread.

Fixes https://github.com/pytorch/pytorch/issues/152639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152667
Approved by: https://github.com/oulgen
2025-05-02 17:29:57 +00:00
f51bee1375 Avoid triggering ignored requires_grad warning in our code (#152686)
This one is ok to silence as we're just doing formatting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152686
Approved by: https://github.com/Skylion007
2025-05-02 17:27:47 +00:00
844842dfbf [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-02 17:21:18 +00:00
f65fb0a23d Make PGO code state not sensitive to file path by hashing file content when the file is available. (#152628)
In some internal frameworks, on second attempts the actual code is copied to a different path than on previous attempts, but it is still the same code. PGO will not work in those cases because state entries before this PR were identified by (filepath, function name, line number).

After this PR they are identified by (hash(file content), function name, line number). This way PGO will work for those jobs on future attempts, and re-compilations of static versions will be avoided.

Sometimes we do not have access to the source code (the file does not exist).
This seems to happen mostly when we re-trace a compiled function, but generally it can happen.
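
An illustrative sketch of the keying change (names here are hypothetical, not the actual PGO code):
```
import hashlib

def code_state_key(filename: str, func_name: str, firstlineno: int) -> tuple:
    try:
        with open(filename, "rb") as f:
            file_id = hashlib.sha256(f.read()).hexdigest()  # stable across path changes
    except OSError:
        file_id = filename  # source unavailable (e.g. when re-tracing); fall back to the path
    return (file_id, func_name, firstlineno)
```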

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152628
Approved by: https://github.com/oulgen
2025-05-02 17:11:21 +00:00
ea4b7e0e1d [invoke_subgraph] Simplify output code for subgraph output node (#152490)
Before - [manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)
![image](https://github.com/user-attachments/assets/8fecdc23-eb78-4e15-9d03-c4bae4b49434)

After fix - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp9a5EM0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/8e98120c-d82e-42dc-bc50-a6bfd4f9923c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152490
Approved by: https://github.com/eellison
ghstack dependencies: #152383
2025-05-02 16:31:25 +00:00
5c0f474dac Do not check out nccl when not building it (#152533)
Add additional conditions to `build_pytorch_libs.py` to avoid fetching NCCL when `USE_CUDA` or `USE_NCCL` are disabled. While at it, adjust the existing condition for `USE_SYSTEM_NCCL` to use the utility function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152533
Approved by: https://github.com/albanD
2025-05-02 16:31:03 +00:00
f6761f2968 [inductor][subgraph] Simplify the resulting output code for subgraph (#152383)
Check out output code

Before this PR -  - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp3iXDVs/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/ef86eb8f-e8b9-47dd-8609-f90481f018b8)

After this PR - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpRgUJvq/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/10e22c60-7fb9-4519-9d54-019beff5333b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152383
Approved by: https://github.com/eellison
2025-05-02 15:52:34 +00:00
cb0cf7e5c7 [MPS][BE] Do not dispatch empty kernels (#152663)
If `iter.numel()` is zero, there is no need to dispatch the kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152663
Approved by: https://github.com/kulinseth
2025-05-02 14:34:53 +00:00
50d4698ac8 Revert "[cutlass backend] Minor lru_cache to slightly speed up filtering ops (#152577)"
This reverts commit 1fef3cdabc3f79fd0cbf9273052057ef6122710f.

Reverted https://github.com/pytorch/pytorch/pull/152577 on behalf of https://github.com/wdvr due to failing test_unary_ufuncs.py::TestUnaryUfuncsCUDA::test_reference_numerics_large_jiterator_unary_cuda_complex64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/14787347116/job/41519095088) [HUD commit link](1fef3cdabc) ([comment](https://github.com/pytorch/pytorch/pull/152577#issuecomment-2846544603))
2025-05-02 07:25:25 +00:00
cyy
e9e1aacef8 Enable -Wunused on torch targets (#150077)
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non-inline static function is unused.

-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.

-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.

-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519, https://github.com/wdvr
2025-05-02 07:14:19 +00:00
38a9a8b7f7 Fix: Consider input defined unbacked during inductor codegen for runtime asserts (#152231)
When we use mark_unbacked, the graph has an unbacked input SymInt. Right now,
deferred runtime assertions that use those inputs are never generated.

This PR changes that: in the forward graph we now consider those symbols and generate the corresponding
runtime assertions for them. We still ignore them for backward, which is not ideal.

The way we generate runtime assertions is by emitting an assertion once all the unbacked symbols used
in it have been seen.

We previously skipped placeholders, because for backward we have a wacky approach where we
ignore input-defined unbacked symbols, assume assertions that use them were already emitted
in forward, and try to emit all other runtime assertions again; see [Note [Backwards runtime asserts]].

Doing that, we end up only emitting the runtime assertions that depend on things defined solely in backward, but we could miss checks that span inputs defined in both forward and backward (i.e. one symbol defined in forward and passed as input to backward, and another defined in backward). This is not ideal; an ideal approach could be something like https://github.com/pytorch/pytorch/pull/151919, but it requires more work.
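
A small repro-style sketch of the setup (assuming `mark_unbacked` from `torch._dynamo.decorators`; whether the assertion actually gets emitted is what this PR changes):

```python
import torch
import torch._dynamo

def f(x):
    # With an unbacked input size, this check should become a deferred
    # runtime assertion in the generated forward code.
    torch._check(x.size(0) >= 2)
    return x * 2

x = torch.randn(8)
torch._dynamo.decorators.mark_unbacked(x, 0)  # size(0) becomes an unbacked SymInt input
print(torch.compile(f, fullgraph=True)(x))
```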

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152231
Approved by: https://github.com/aorenste
2025-05-02 07:01:48 +00:00
829752ba37 [SymmMem] Add all_to_all_vdev (#151819)
- Merge in/out splits into one tensor
- Multi-block
- Use sync instead of barrier
- Use nvshmemx_collective_launch
- Rotate blocks among peers
- Write back input splits
- Parallel scan works
- Use scan for output offsets
- Use at most 16 blocks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151819
Approved by: https://github.com/ngimel, https://github.com/fduwjj
ghstack dependencies: #151261, #151498
2025-05-02 06:59:21 +00:00
6dadfc4457 Revert "Enable -Wunused on torch targets (#150077)"
This reverts commit 688adc9941f855e78dd4d595682eea16317b7f54.

Reverted https://github.com/pytorch/pytorch/pull/150077 on behalf of https://github.com/wdvr due to failing internally with use of undeclared identifier ([comment](https://github.com/pytorch/pytorch/pull/150077#issuecomment-2846499828))
2025-05-02 06:53:20 +00:00
3731b70b40 [inductor][invoke_subgraph] Remove assertion checks for outputs of invoke_subgraph (#152384)
For invoke_subgraph, input assertions are good. We don't need output assertions. This is the tlparse

Before
![image](https://github.com/user-attachments/assets/4ae14530-3314-4dfa-9297-58f9e3ee4b9c)

After
![image](https://github.com/user-attachments/assets/c1457687-2396-49a7-986b-ef6145fcbf46)

https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152384
Approved by: https://github.com/eellison, https://github.com/zou3519
ghstack dependencies: #152547, #152581
2025-05-02 06:46:05 +00:00
9e3fc41060 [invoke_subgraph] rename identifiers to prevent python mangling (#152581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152581
Approved by: https://github.com/BoyuanFeng, https://github.com/zou3519
ghstack dependencies: #152547
2025-05-02 06:46:05 +00:00
4f9f1abd6d Revert "Use swap_tensors path in nn.Module.to for all subclasses that override __torch_dispatch__ (#152539)"
This reverts commit 037343657edceb345001e4c0ff226a34ca4c6063.

Reverted https://github.com/pytorch/pytorch/pull/152539 on behalf of https://github.com/wdvr due to failing internal tests - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/152539#issuecomment-2846484924))
2025-05-02 06:43:35 +00:00
d7961a1086 [SymmMem] Add all-to-all (#151498)
Add an all-to-all impl based on NVSHMEM's on-stream API `nvshmemx_alltoallmem_on_stream`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151498
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #151261
2025-05-02 06:40:43 +00:00
7c3e679ddd Revert "[Inductor] Add decomposeK as an autotuning choice for mm (#150654)"
This reverts commit fdcfc6a61a2146c7c961073e029ead633113eb9a.

Reverted https://github.com/pytorch/pytorch/pull/150654 on behalf of https://github.com/wdvr due to Failing ROCM tests: inductor/test_subgraph_choice.py::TestSubgraphChoice::test_subgraph_decompose_k [GH job link](https://github.com/pytorch/pytorch/actions/runs/14786111108/job/41515742446) [HUD commit link](3c54e0c216) ([comment](https://github.com/pytorch/pytorch/pull/150654#issuecomment-2846470409))
2025-05-02 06:31:38 +00:00
4649fd17b0 [invoke_subgraph] Unpacked operands (#152547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152547
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-05-02 05:44:46 +00:00
e6989ceea9 Revert "[BE] Update numba versions (#152557)"
This reverts commit b5995cb67f8543f148b9216e140980e6844aadff.

Reverted https://github.com/pytorch/pytorch/pull/152557 on behalf of https://github.com/clee2000 due to test_unary_funcs failure seems real? [GH job link](https://github.com/pytorch/pytorch/actions/runs/14787082066/job/41518415014) [HUD commit link](b5995cb67f) ([comment](https://github.com/pytorch/pytorch/pull/152557#issuecomment-2846336004))
2025-05-02 05:22:17 +00:00
ac5de6d55a Remove unnecessary __STDC_FORMAT_MACROS macro (#152513)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152513
Approved by: https://github.com/cyyever, https://github.com/albanD
ghstack dependencies: #152512
2025-05-02 05:06:44 +00:00
d969e2ec33 [CUDAGraph Trees] support memory allocation on side stream (#152472)
I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool`, and the error in #151199 no longer happens.

However, this approach is unsafe for multithreading. When multiple run_eager calls happen concurrently, we expect their memory allocations to go to different mem_pools. Since beginAllocateToPool does not check the stream, these allocations may land in the same mem_pool.

So I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocation on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id and, at runtime, checks whether the current thread id matches it.
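
A toy illustration of the thread-id bookkeeping (pure-Python analogy, not the allocator code):

```python
import threading

class ThreadScopedPool:
    """Remember which thread started routing allocations to this pool and
    only honor requests coming from that same thread at runtime."""

    def __init__(self) -> None:
        self.launching_tid: int | None = None

    def begin_allocate_current_thread_to_pool(self) -> None:
        self.launching_tid = threading.get_ident()

    def should_route_to_pool(self) -> bool:
        return threading.get_ident() == self.launching_tid
```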

Fixes #151199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472
Approved by: https://github.com/eellison, https://github.com/ngimel
2025-05-02 04:26:35 +00:00
1f898657e6 [ez] fix grammar mistakes in StatefulSymbolicContext comment (#152598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152598
Approved by: https://github.com/malfet
ghstack dependencies: #151407
2025-05-02 04:21:16 +00:00
36e5ff6bc4 [CP] Fix the offsets to KV in backward (#152625)
This is more semantically correct, even though we currently assume K and V have the same lengths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152625
Approved by: https://github.com/XilunWu
2025-05-02 03:30:11 +00:00
1fef3cdabc [cutlass backend] Minor lru_cache to slightly speed up filtering ops (#152577)
At the default level, filtering went from 0.11332 seconds to 0.10064 seconds.

You can't really apply lru_cache too aggressively. For example, hashing a cutlass op takes a long time.

Removing a log further brings it down to 0.07202 seconds.
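
A sketch of the trade-off (hypothetical predicate; the point is to cache on cheap hashable keys rather than on the op object itself):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def alignment_ok(dtype_size: int, alignment: int) -> bool:
    # Cheap, hashable arguments: safe and profitable to cache.
    return alignment % dtype_size == 0

# (name, dtype_size, alignment) triples standing in for cutlass op descriptions
ops = [("op_a", 2, 8), ("op_b", 4, 6), ("op_a", 2, 8)]
filtered = [op for op in ops if alignment_ok(op[1], op[2])]
print(filtered)
```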

Differential Revision: [D73971021](https://our.internmc.facebook.com/intern/diff/D73971021/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152577
Approved by: https://github.com/chenyang78
2025-05-02 02:17:50 +00:00
5b5938929f [refactor] refactor dense implementation of auto_functionalized_v2 for better clarity (#152248)
Abstracts away two helper functions (get_mutable_args_from_schema and _generate_new_op_kwargs_from_bases) to make the code better organized and more re-usable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152248
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244, #152245, #152246, #152247
2025-05-02 02:08:06 +00:00
380327c663 [hop] make materialize_as_graph's include and exclude dispatch key set optional (#152247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152247
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244, #152245, #152246
2025-05-02 02:08:06 +00:00
a776a566db [hop][schema] allow adding kw_only info to schema argument (#152246)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152246
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244, #152245
2025-05-02 02:08:06 +00:00
7e7b9ca18f [hop][be] make check_input_alias_and_mutation_return_ouputs create new fake mode (#152245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152245
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244
2025-05-02 02:08:06 +00:00
b5995cb67f [BE] Update numba versions (#152557)
Let's see if PyTorch is compatible with latest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152557
Approved by: https://github.com/Skylion007
2025-05-02 01:51:30 +00:00
cyy
ce94b212c7 [Environment Variable][Rebase] Use thread-safe getenv functions (#140200)
Use our thread-safe getenv wrappers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140200
Approved by: https://github.com/kwen2501, https://github.com/eqy
2025-05-02 00:41:49 +00:00
a5dd7011a0 [ONNX] Delete JitTraceConvertStrategy (#152556)
Fixes #151703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152556
Approved by: https://github.com/justinchuby
2025-05-02 00:26:43 +00:00
3c54e0c216 [inductor] if unbacked symint in old-size or new-size skip mark_reuse check (#152379)
We could probably make the `mark_reuse` check work with unbacked sizes under certain conditions, e.g. `x.repeat(u0, 2).repeat(2, u0)`.

But I think cases like those are rare, so we skip the check for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152379
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/jingsh
2025-05-02 00:24:58 +00:00
fdcfc6a61a [Inductor] Add decomposeK as an autotuning choice for mm (#150654)
As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.

Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton templates as well, without requiring lots more compile time (async compilation). Anecdotal evidence shows that Triton BMM usually performs better than aten BMM
* Add for addmm
* Enable for Inference and AOTI

Below are the results of running TritonBench for Split-K shapes, comparing the aten performance versus pt2_triton, which now autotunes on decompose_k, seeing >10% speedup compared to aten on average, and for some shapes over 3x the performance of the best Triton mm previously:

<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />

TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />

We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.
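
For readers unfamiliar with the technique, here is a small numerical sketch of the decompose-k idea (not Inductor's implementation): split the reduction dimension K into chunks, compute the partial products as a batched matmul, and sum the partials.

```python
import torch

def decompose_k_matmul(a: torch.Tensor, b: torch.Tensor, k_splits: int) -> torch.Tensor:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % k_splits == 0
    a_parts = a.reshape(m, k_splits, k // k_splits).permute(1, 0, 2)  # (k_splits, M, K/k_splits)
    b_parts = b.reshape(k_splits, k // k_splits, n)                   # (k_splits, K/k_splits, N)
    return torch.bmm(a_parts, b_parts).sum(dim=0)                     # sum of partial products

a, b = torch.randn(64, 4096), torch.randn(4096, 64)
torch.testing.assert_close(decompose_k_matmul(a, b, 8), a @ b, rtol=1e-4, atol=1e-3)
```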

Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
2025-05-01 23:01:30 +00:00
64957db6c9 Fix some inductor periodic benchmarks (#152605)
Some were reporting "pass" consistently on https://hud.pytorch.org/
Those are fine to flip.

I filed a separate issue for the now-regressions for AOTI:
https://github.com/pytorch/pytorch/issues/152606. These should be looked
at.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152605
Approved by: https://github.com/eellison, https://github.com/huydhn
2025-05-01 22:18:30 +00:00
7aebb127bf [dynamo][ca] support dynamic annotations on tensors in ListVariables/TupleVariables (#152119)
Together with https://github.com/pytorch/pytorch/pull/151962, FIXES https://github.com/pytorch/pytorch/issues/133575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152119
Approved by: https://github.com/jansel
ghstack dependencies: #149707, #151860, #151731, #151962
2025-05-01 21:59:55 +00:00
4555ed8c83 [ca] hide unused scalar int sizes from dynamo (#151962)
together with https://github.com/pytorch/pytorch/pull/151731, FIXES https://github.com/pytorch/pytorch/issues/113129 https://github.com/pytorch/pytorch/issues/146168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151962
Approved by: https://github.com/jansel
ghstack dependencies: #149707, #151860, #151731
2025-05-01 21:59:55 +00:00
18229a5300 [ca] mark scalar int sizes as dynamic via tensor wrapping (#151731)
This is the only way to support dynamic shapes on scalars right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151731
Approved by: https://github.com/jansel
ghstack dependencies: #149707, #151860
2025-05-01 21:59:49 +00:00
613bd46272 [aot][ca] save bw_module in AOTAutogradCache (#151860)
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, first stripping it of unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts at AOT compilation with a restored bw_module (which would probably crash).

Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #149707
2025-05-01 21:59:43 +00:00
c461ba6522 [aot] mark dynamic activations as maybe dynamic (#149707)
Today, we mark graph outputs as maybe-dynamic; this lets a compilation communicate to future compilations whether certain graph inputs are dynamic. Similarly, we can do this for saved activations, which may be used in future compilations as well. This is especially relevant in compiled autograd, where tensor activations always become graph inputs.

Changes to the tests were mainly cosmetic, with the exception of tests that relied on duck shaping. By annotating tensor dims, we prevent them from reusing pre-existing symbols, so this change will make graphs use duck shapes less than before, which affects some of the caching tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149707
Approved by: https://github.com/bdhirsh
2025-05-01 21:59:36 +00:00
b6c5886d09 BE: Swap functorch --> torch._higher_order_ops (#152620)
Summary: Discovered while attempting to resolve arvr builds; this should resolve issues around using functorch through export.

Test Plan:
```
buck2 test arvr/mode/linux/opt //arvr/libraries/xrrp/ml/python/test:convert_to_etvk_test
```

Differential Revision: D74013898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152620
Approved by: https://github.com/zou3519
2025-05-01 21:53:23 +00:00
1c04ea4e59 Revert "[torchgen] Refactor torchgen.utils.FileManager to accept pathlib.Path (#150726)"
This reverts commit 4b5b1adb21f5d7d66945d78a1f89d2f9d86f15bb.

Reverted https://github.com/pytorch/pytorch/pull/150726 on behalf of https://github.com/malfet due to This breaks Windows builds, see a765e2ddda/1 ([comment](https://github.com/pytorch/pytorch/pull/150726#issuecomment-2845858846))
2025-05-01 21:52:35 +00:00
a765e2ddda [nativert] port enumerate from folly to c10::util (#152481)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff ports an enumeration util from folly into c10.

Test Plan: CI

Differential Revision: D73881042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152481
Approved by: https://github.com/Skylion007, https://github.com/zhxchen17, https://github.com/cyyever
2025-05-01 21:41:05 +00:00
24b315676d [MPS][BE] Migrate lerp.Scalar.out to tensor iterator (#152514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152514
Approved by: https://github.com/kulinseth, https://github.com/Skylion007, https://github.com/dcci
2025-05-01 20:11:55 +00:00
f1d636f85b [BE] detect CXX pytree requirement with TorchVersion (#151102)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151102
Approved by: https://github.com/zou3519
2025-05-01 18:55:57 +00:00
8cb6957e01 [export] Ignore None buffers (#152571)
Fixes https://github.com/pytorch/pytorch/issues/152467
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152571
Approved by: https://github.com/yiming0416, https://github.com/yushangdi
2025-05-01 18:18:16 +00:00
037343657e Use swap_tensors path in nn.Module.to for all subclasses that override __torch_dispatch__ (#152539)
Fixes https://github.com/pytorch/pytorch/issues/148977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152539
Approved by: https://github.com/albanD
2025-05-01 18:04:33 +00:00
4b5b1adb21 [torchgen] Refactor torchgen.utils.FileManager to accept pathlib.Path (#150726)
This PR allows `FileManager` to accept `pathlib.Path` as arguments while keeping the original `str` path support.

This allows us to simplify the code, for example (a brief sketch follows this list):

1. `os.path.join(..., ...)` with `Path.__truediv__(..., ...)` (the `/` operator).

95a5958db4/torchgen/utils.py (L155)

95a5958db4/torchgen/utils.py (L176)

2. `os.path.basename(...)` with `Path(...).name`.
 95a5958db4/torchgen/utils.py (L161)

3. Manual file extension split with `Path(...).with_stem(new_stem)`

95a5958db4/torchgen/utils.py (L241-L256)
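
A quick sketch of the equivalents (illustrative paths only):

```python
from pathlib import Path

p = Path("torchgen") / "utils.py"        # os.path.join("torchgen", "utils.py")
name = p.name                            # os.path.basename(p)
renamed = p.with_stem(p.stem + "_gen")   # manual split/replace of the file stem
print(p, name, renamed)
```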

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150726
Approved by: https://github.com/zou3519
2025-05-01 17:43:16 +00:00
83acb688bb Fix constant folding cloning constants (#152273)
Summary:
Bug fix for #135060
Simple review:
https://github.com/pytorch/pytorch/pull/135060/files#diff-f23386709ff7e1235b15e18f835a48e5124e0ddd596aeb33c201daad1abbedd7R357
We mistakenly typed `get_attr` as `getattr`.

This causes constants to never get untagged, and forces all constants to be
cloned twice, which greatly increases memory consumption.
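
A tiny illustration of the distinction (FX's `"get_attr"` opcode vs Python's builtin `getattr`):

```python
import torch
from torch.fx import symbolic_trace

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(2))

    def forward(self, x):
        return x + self.w

gm = symbolic_trace(M())
# Constant/parameter accesses are nodes whose op is the string "get_attr";
# the builtin getattr() is unrelated, which is what the typo confused.
print([n.target for n in gm.graph.nodes if n.op == "get_attr"])  # ['w']
```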

Test Plan:
python test/inductor/test_aot_inductor.py -k test_empty_constant_folding

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152273
Approved by: https://github.com/trieuat, https://github.com/zhxchen17
2025-05-01 17:34:39 +00:00
563a91b144 [cutlass backend] Move cutlass compiled cache to cache_dir (#151825)
Moved "compiled_cache.db" to cache folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151825
Approved by: https://github.com/mlazos
2025-05-01 17:26:01 +00:00
1845df05c6 [inductor][BE] Add more debug logs for why fx graph cache doesn't happen (#152487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152487
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-01 17:25:28 +00:00
f0c9b3385d Support more dtypes for input, indices in gather (#151822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151822
Approved by: https://github.com/ngimel
2025-05-01 16:35:23 +00:00
4c8dee7986 Revert "[inductor][invoke_subgraph] Remove assertion checks for outputs of invoke_subgraph (#152384)"
This reverts commit c87c823de43b7815c523160778b682973e151794.

Reverted https://github.com/pytorch/pytorch/pull/152384 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:08 +00:00
f7b60456cc Revert "[inductor][subgraph] Simplify the resulting output code for subgraph (#152383)"
This reverts commit 98eb7c8cb1abafaff4e28b07ed91cababc2ce54a.

Reverted https://github.com/pytorch/pytorch/pull/152383 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:08 +00:00
2f1800bc3d Revert "[invoke_subgraph] Simplify output code for subgraph output node (#152490)"
This reverts commit 5fe335810af0df48f473387b6f9efcd5dbff4d4a.

Reverted https://github.com/pytorch/pytorch/pull/152490 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:07 +00:00
2fa39e60ed Revert "[inductor][invoke_subgraph] Free the buffers before the subgraph call (#152494)"
This reverts commit 5236a8506c4f2fcce6d8a7f945808d84e6c46784.

Reverted https://github.com/pytorch/pytorch/pull/152494 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:07 +00:00
52cbcac640 [BE] Migrate all add/sub ops to Metal kernels (#152510)
As the typecasting harness should take care of all permutations.
Also fix a bug in `exec_binary_kernel` where it was not properly downcasting CPU double/complexDouble scalars to floats.

Fixes https://github.com/pytorch/pytorch/issues/152582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152510
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #152443, #152466, #152479, #152504, #152485
2025-05-01 15:35:57 +00:00
e82dc0769c Respect checkpointed boundaries when using knapsack formulation in the partitioner (#141684)
When multiple checkpoint regions are back-to-back with no operations in-between, we enforce the operation at the boundary to be force-saved, see 7ea0da2d57/torch/_functorch/partitioners.py (L772-L807)

When using the `memory_budget` formulation on a graph which already has AC inside, we should respect the boundaries of the AC decision (which is set to `MUST_SAVE`), and thus ban those nodes from possible recomputation.

Adding tests would be nice, but I'm not sure what the best way to test this is right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141684
Approved by: https://github.com/bdhirsh
2025-05-01 15:28:41 +00:00
41de0f2eaf removing short-perf-test-cpu.sh and short-perf-test-gpu.sh (#152551)
When working on #148342 I realised that there are no references to those files, so it seems they are stale and can be safely removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152551
Approved by: https://github.com/atalman, https://github.com/xuzhao9
2025-05-01 15:09:55 +00:00
6f6acb4128 [AOTI][CPU] Introduce config.cpp.use_decompose_tanh (#152542)
Summary: Previously, D70489427 changed the tanh implementation to `.tanh()`, and this is causing a perf regression in some Meta-internal workloads. This diff introduces a config so we can set it based on need.

Differential Revision: D73909371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152542
Approved by: https://github.com/desertfire
2025-05-01 10:25:31 +00:00
7c63ddd817 [Inductor] Wrapper code refactors to prepare for FX codegen (#152391)
This PR contains some refactors from https://github.com/pytorch/pytorch/pull/146942, which help to enable Wrapper FX codegen:
1. Remove `OutputLine`, which is unused.
2. Add an attribute to the backend classes specifying whether they support caching.
3. Before compiling a graph, query the registered backends and check whether caching is supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152391
Approved by: https://github.com/jansel
2025-05-01 09:14:55 +00:00
701c0848b8 [dynamic shapes] aten.constant_pad_nd meta impl (#152129)
We know the output shape, and we know this always produces a clone. Avoids data-dependent errors from the decomposition.

along with https://github.com/pytorch/pytorch/pull/150483, should fix https://github.com/pytorch/pytorch/issues/123855
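
A shape-only sketch of what the meta implementation needs to compute (pad ordering as in `torch.nn.functional.pad`):

```python
import torch

def constant_pad_nd_output_shape(shape, pad):
    # pad = (last_left, last_right, second_last_left, second_last_right, ...)
    out = list(shape)
    for i in range(len(pad) // 2):
        dim = len(shape) - 1 - i
        out[dim] += pad[2 * i] + pad[2 * i + 1]
    return tuple(out)

x = torch.randn(2, 3, 4)
pad = (1, 2, 0, 1)
assert constant_pad_nd_output_shape(x.shape, pad) == tuple(torch.constant_pad_nd(x, pad).shape)
```
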
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152129
Approved by: https://github.com/laithsakka
2025-05-01 08:32:10 +00:00
53bf174626 Fix assertion in reorder_communication_preserving_peak_memory (#152565)
>=0 is practically correct because we do model the runtime of some ops as 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152565
Approved by: https://github.com/eellison
2025-05-01 06:40:04 +00:00
47972f9092 [export] warn when Dim.AUTO 0/1 specializes (#151827)
Fixes #151582

example warning for Dim.AUTO:
```
torch/_export/non_strict_utils.py:499] dimension inputs['x'].shape[1] 0/1 specialized; Dim.AUTO was specified along with a sample input with hint = 1.
```

example error when Dim.DYNAMIC specializes:
```
- Received user-specified dim hint Dim.DYNAMIC(min=None, max=None), but export 0/1 specialized due to hint of 0 for dimension inputs['x'].shape[0].
```
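
A minimal sketch that hits the Dim.AUTO path above (a sample input whose dimension has hint 1):

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

x = torch.randn(4, 1)  # shape[1] has hint 1, so Dim.AUTO 0/1 specializes (with a warning)
ep = export(M(), (x,), dynamic_shapes={"x": (Dim.AUTO, Dim.AUTO)})
print(ep)
```
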
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151827
Approved by: https://github.com/angelayi
2025-05-01 06:00:51 +00:00
a7f1ddc184 [SymmMem] Experimental NVSHMEM integration (#151261)
Adding NVSHMEM as a backend for `SymmetricMemory`, implementation of which is in `NVSHMEMSymmetricMemory.cu`.

Moving some helper functions in `CUDASymmetricMemory.cu` to `CUDASymmetricMemoryUtils.cpp`, so that they can be shared by `NVSHMEMSymmetricMemory`. These functions are mostly side-band exchange helpers (`store_all_gather`, `IpcChannel`, etc).

Adding `TORCH_SYMMEM` to control which implementation to use for CUDA tensors, currently support: `CUDA` (in-house impl), `NVSHMEM`.

The NVSHMEM feature is gated by build-time flag: `USE_NVSHMEM=1`. And `NVSHMEM_HOME` setting is required (TODO).

Ported most code from #146593.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151261
Approved by: https://github.com/fegin, https://github.com/fduwjj
2025-05-01 05:24:50 +00:00
13add553b2 [HOP][be] make supports_input_mutation and aliasing a class field (#152244)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152244
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073
2025-05-01 05:22:02 +00:00
447f8241f5 [export][function schema] support exporting hop with function schema argument (#152073)
We need to make function schemas proxyable in order to trace the auto_functionalized hop, which takes a function schema as an input. The implementation basically follows how we support torchbind objects:

1. Upon seeing an untracked function schema arg, we create a constant get_attr node.
2. We track the function schema argument in export to support lift/unlift.
3. We need to support serde for function schemas. We'll add support for this in follow-up PRs.

However, compared with torchbind objects:

1. We don't need a dynamo implementation, because the function schema is added as an argument of auto_functionalized only when we auto_functionalize a hop. One potential use case is a user re-tracing an exported program with strict mode; since non-strict is the default now, we don't see a use case yet.
2. We don't need an inductor implementation, because the function schema goes away after the auto_functionalized re-inplacing pass.

Edit: we greatly simplified (and generalized) the implementation following @zou3519's suggestion of using pytree.register_constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152073
Approved by: https://github.com/zou3519
ghstack dependencies: #152072
2025-05-01 05:22:02 +00:00
500bf50129 [export][be] better type annotation for lift_constants_pass (#152072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152072
Approved by: https://github.com/zou3519
2025-05-01 05:22:02 +00:00
d96193f622 [Inductor] Fix int check again (#152576)
Made an oss change to a diff train diff

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152576
Approved by: https://github.com/wdvr
2025-05-01 05:19:40 +00:00
18588fe2fc Fix GuardOnDataDependentSymNode in the normalize operator (#152039)
Test Plan:
Dumped the local net torch.package to local

Ran
```
buck2 run scripts/shengqin:test_model_export -- /tmp/mtia_local_torch_package {\"local\":null}
```
succeeded

Reviewed By: hongyang-zhao

Differential Revision: D73405271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152039
Approved by: https://github.com/houseroad
2025-05-01 04:34:49 +00:00
cyy
688adc9941 Enable -Wunused on torch targets (#150077)
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non-inline static function is unused.

-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.

-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.

-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519
2025-05-01 04:09:06 +00:00
15a3f58f91 Return ConstantVariable(None) from WithExitFunctionVariable.exit to prevent NoneType crash inside autocast exception path (#152503)
Copy of #152013 with PR time benchmarks updated (regressions seem unrelated)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152503
Approved by: https://github.com/anijain2305, https://github.com/Skylion007

Co-authored-by: Witold Dziurdz <wdziurdz@habana.ai>
2025-05-01 04:01:24 +00:00
632b89af43 [dynamic shapes] support SymInt inputs for kthvalue (#152151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152151
Approved by: https://github.com/tugsbayasgalan, https://github.com/malfet
2025-05-01 03:47:23 +00:00
56d6d4dafe [PT2] Port replace_lce_with_matmul / replace_first_lce_with_fused_matmul_lce to PT2 pre_grad passes (#152450) (#152536)
Summary:

Same as D71358949, but removes the newly added log to avoid test failures.

Port over replace_lce_with_matmul and replace_first_lce_with_fused_matmul_lce to PT2 pre_grad pass.
Original dper pass diffs: D67884534, D68123479, D68384238

Test Plan:
Test 1. Covers replace_lce_with_matmul and case 1 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=6 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/669809193/0/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_669809193_0_diable_acc.log
```
Log: P1798246938

Test 2. Covers replace_lce_with_matmul and case 2 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=7 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/677734158/9/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_matmul_lce_replace_normal_LCE" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_677734158_9_diable_acc.log
```
Log: P1798246675

Seeing logs like
`[Pre grad(predispatch IR)] Apply use_matmul_fuse_lce_replace_first_LCE pass, save before/after graph to /tmp/tmp8lyzoh79, graph before/after are the same = False`

Differential Revision: D73934142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152536
Approved by: https://github.com/wdvr
2025-05-01 03:14:04 +00:00
5236a8506c [inductor][invoke_subgraph] Free the buffers before the subgraph call (#152494)
Before
![image](https://github.com/user-attachments/assets/62b24c14-69e6-40fb-94e3-223930132ef6)

After
![image](https://github.com/user-attachments/assets/9f340d4e-80a9-45aa-9400-626fff5b5ecd)

tlparse - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmph5dwWt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152494
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #152357, #152384, #152383, #152490
2025-05-01 02:04:10 +00:00
5fe335810a [invoke_subgraph] Simplify output code for subgraph output node (#152490)
Before - [manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)
![image](https://github.com/user-attachments/assets/8fecdc23-eb78-4e15-9d03-c4bae4b49434)

After fix - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp9a5EM0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/8e98120c-d82e-42dc-bc50-a6bfd4f9923c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152490
Approved by: https://github.com/eellison
ghstack dependencies: #152357, #152384, #152383
2025-05-01 02:04:10 +00:00
98eb7c8cb1 [inductor][subgraph] Simplify the resulting output code for subgraph (#152383)
Check out output code

Before this PR -  - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp3iXDVs/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/ef86eb8f-e8b9-47dd-8609-f90481f018b8)

After this PR - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpRgUJvq/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/10e22c60-7fb9-4519-9d54-019beff5333b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152383
Approved by: https://github.com/eellison
ghstack dependencies: #152357, #152384
2025-05-01 02:04:10 +00:00
c87c823de4 [inductor][invoke_subgraph] Remove assertion checks for outputs of invoke_subgraph (#152384)
For invoke_subgraph, input assertions are good. We don't need output assertions. This is the tlparse

Before
![image](https://github.com/user-attachments/assets/4ae14530-3314-4dfa-9297-58f9e3ee4b9c)

After
![image](https://github.com/user-attachments/assets/c1457687-2396-49a7-986b-ef6145fcbf46)

https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152384
Approved by: https://github.com/eellison, https://github.com/zou3519
ghstack dependencies: #152357
2025-05-01 02:04:10 +00:00
3849fd13de 🐛 Add ciflow/pull🦋 (#152567)
To make it easier to work around GitHub reliability issues, when it sometimes fails to schedule `on: pull_request` workflows

See https://github.com/pytorch/pytorch/issues/151322

But alas, it does not fix the problem at hand...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152567
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/Camyll, https://github.com/atalman
2025-05-01 02:00:51 +00:00
0b8822e70b [export] set is_exporting() for strict (#151833)
Helpful for upcoming work on figuring out when to use the stack trace in prettifying dynamic shapes errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151833
Approved by: https://github.com/angelayi
2025-05-01 02:00:19 +00:00
f2cc07d202 [cutlass backend] Add addmm dynamic support (#152498)
Differential Revision: [D73893133](https://our.internmc.facebook.com/intern/diff/D73893133/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152498
Approved by: https://github.com/ColinPeppler
2025-05-01 01:40:08 +00:00
fe1deeb701 [BE] Replace func_name with __func__ (#152553)
Summary: Not sure why one needs to preserve the name by hand

Test Plan: CI

Differential Revision: D73941209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152553
Approved by: https://github.com/wdvr
2025-05-01 01:26:49 +00:00
0d2746092b [ez][export] suggest torch._checks only for booleans (#152499)
We were doing this when the error was coming from int/float casts, suggesting fixes like `torch._check(zuf0), torch._check(~zuf0)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152499
Approved by: https://github.com/angelayi
2025-05-01 01:24:46 +00:00
be1adcae32 add split sizes info dump for uneven all2all bw calculation (#151438)
Add split sizes info to the dumped execution trace and kineto trace for bandwidth calculation of uneven all2all.

Take the input data from the case below as an example: although we know the input size of Rank-0 is 50 elements, the actual data size that Rank-0 sends out is (12+13+14)=39 elements, because Rank-0 doesn't send the 1st chunk of 11 elements to peers. We don't know this information today, because the "in split size" field is empty.
![image](https://github.com/user-attachments/assets/7240f334-2081-409b-bbe0-a8396ffa2d30)
![image](https://github.com/user-attachments/assets/679fc49f-e34f-4a74-bad0-fb6fa9d18239)
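
The arithmetic for Rank-0 from the example above, spelled out (numbers taken from the screenshots):

```python
in_splits = [11, 12, 13, 14]  # Rank-0's input split sizes, 50 elements total
rank = 0
sent = sum(s for peer, s in enumerate(in_splits) if peer != rank)
print(sent)  # 39 == 12 + 13 + 14; the 11-element chunk stays local
```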

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151438
Approved by: https://github.com/shengfukevin, https://github.com/kwen2501
2025-05-01 01:19:20 +00:00
eqy
7abca8ceba Decorate test_host_memory_stats with @serialTest (#152454)
Seems to need it as it is expecting only its allocation behavior to be visible, to address #152422
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152454
Approved by: https://github.com/Skylion007
2025-05-01 00:53:20 +00:00
5521e6b671 [export] support SymInt minlength for torch.bincount() (#152497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152497
Approved by: https://github.com/angelayi
2025-05-01 00:45:58 +00:00
ad9e209ea3 Change test/inductor/test_standalone_compile to test/inductor/test_compile (#152103)
These are the tests for torch._inductor.compile, so I renamed the file
test_compile. This is to avoid confusion with
torch._inductor.standalone_compile, which is now a lot more standalone
than torch._inductor.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152103
Approved by: https://github.com/oulgen
2025-05-01 00:44:02 +00:00
8136e0d3b7 Expose NCCL communicator from ProcessGroupNCCL via an unsafe API (#152496)
Differential Revision: D73892691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152496
Approved by: https://github.com/ngimel
2025-04-30 23:51:34 +00:00
f2a89b802d [invoke_subgraph] Cache on tangent metadata and retrace if needed (#152357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152357
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2025-04-30 23:49:17 +00:00
b6f8209f54 Remove redundant line in partitioner (#152517)
Summary: This is a cleanup from https://github.com/pytorch/pytorch/pull/152264, which contained a line which was a vestige from a previous implementation.

Test Plan: Let CI run

Differential Revision: D73904636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152517
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
2025-04-30 23:17:30 +00:00
56039b5778 Revert "[CUDAGraph Trees] support memory allocation on side stream (#152472)"
This reverts commit c620763ec2be83e37f9b31ad6663c6e82d6c0ab0.

Reverted https://github.com/pytorch/pytorch/pull/152472 on behalf of https://github.com/BoyuanFeng due to should use tid instead pid ([comment](https://github.com/pytorch/pytorch/pull/152472#issuecomment-2843491656))
2025-04-30 22:18:10 +00:00
361bf056a7 [nativert] Add moodycamel/concurrentqueue as third-party dependency (#152033)
nativert RFC:  https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

moodycamel/concurrentqueue is a high-performance MPMC queue implementation and is single-header only. We want to add this to third_party to be used with the upcoming Torch Native Runtime.

The source code is imported from commit hash 2f09da73d22a47dc8a89cdd4fc4c3bfae07f4284 from https://github.com/cameron314/concurrentqueue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152033
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-04-30 21:37:20 +00:00
49a72011cc Revert "[inductor][BE] Add more debug logs for why fx graph cache doesn't happen (#152487)"
This reverts commit 76331657d21e4bebd8f3c00ceed5369ae8b64112.

Reverted https://github.com/pytorch/pytorch/pull/152487 on behalf of https://github.com/malfet due to And it broke those tests, not sure why signal was ignored ([comment](https://github.com/pytorch/pytorch/pull/152487#issuecomment-2843333471))
2025-04-30 21:35:17 +00:00
3f10091d3c Clean up conda usage in benchmark scripts (#152552)
Fixes https://github.com/pytorch/pytorch/issues/152123.

* Switch `benchmarks/dynamo/Makefile` to use uv.  Note that these scripts are only used locally, so it's kind of ok to keep conda here IMO.  But switching to uv is probably nicer to most folks.
* Delete some files that are outdated and not used anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152552
Approved by: https://github.com/atalman, https://github.com/albanD
2025-04-30 21:27:29 +00:00
5a66c1d921 [nativert] Add utility function to convert strings into numbers. (#151467)
Summary:

nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff adds a small library to convert strings into numbers which will later be used for parsing graph IR.

Differential Revision: D73133034

## Test Plan

c10 unittests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151467
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-04-30 21:20:52 +00:00
22ecaeb145 [standalone_compile] fix dynamic shapes with config_patches (#152462)
compile_fx with config_patches goes down another path where we need to
propagate the kwarg...

Test Plan:
- updated test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152462
Approved by: https://github.com/oulgen
2025-04-30 21:02:14 +00:00
eqy
ce317cd5a8 [CUDA][SDPA] bump fudge factor in test_sdpa in test_nestedtensor (#152235)
Small mismatches on e.g., 4090, A6000/A40

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152235
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/jbschlosser
2025-04-30 20:24:49 +00:00
55c539428f [inductor][BE] cleanup and improve precompilation loggings (#152483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152483
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-04-30 20:21:55 +00:00
76331657d2 [inductor][BE] Add more debug logs for why fx graph cache doesn't happen (#152487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152487
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-04-30 20:05:21 +00:00
adebb8b112 set thread_work_size to 4 for unrolled kernel (#152396)
Previous PRs enabling 8-vectorization inadvertently regressed unrolled kernel perf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152396
Approved by: https://github.com/BoyuanFeng, https://github.com/msaroufim, https://github.com/malfet, https://github.com/Aidyn-A, https://github.com/atalman
2025-04-30 19:53:58 +00:00
c4a0b31c1d Update CODEOWNERS (torch/utils/data/) (#152482)
Updating codeowners for dataloading

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152482
Approved by: https://github.com/ramanishsingh, https://github.com/janeyx99
2025-04-30 19:24:56 +00:00
eqy
1bb13a16bb [CUDA][SDPA] Bump python fused_attention_vs_math_ref_grads fudge_factor for sm120 (#152491)
🍦

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152491
Approved by: https://github.com/Skylion007
2025-04-30 19:22:21 +00:00
7a3cae4b20 Configurable logging for cpp_extensions.py (#152260)
Today `cpp_extensions` makes heavy use of printing to stderr. This makes our life harder in KernelBot, where we typically rely on stderr to surface only real errors, but today cpp_extensions uses stderr for updates that would better be classified as INFO, WARNING, or ERROR.

Now instead we'll recommend users of our cpp extension system to do something like

```python
import logging
cpp_ext_logger = logging.getLogger("torch.utils.cpp_extension")
cpp_ext_logger.setLevel(logging.WARNING)
```

While this dramatically reduces log spew, it can be viewed as a BC breaking change if people were relying on certain strings being present in stdout or stderr

Considering different teams might want to silence errors differently, this PR proposes replacing all `print()` statements with `logging` statements with the same heuristics that the python logging module recommends
1. DEBUG: For things like detailed compilation steps or reading filepaths - by default gets logged on stdout
2. INFO: Build progress - by default gets logged on stdout
3. WARNING: Surfacing issues that might cause bad performance or slow compilation times - by default gets logged on stdout
4. ERROR: Problems that prevent proper functioning - by default gets logged on stdout

Note that warnings.warn is a different library and is not hooked up to the python logging module by default

So the goal of this PR is to make it possible for teams to set the logging that is most appropriate for them. One annoying thing is that ruff complains if you use the logger with f-strings or .format, so we have to use old-school %s formatting.

An unrelated improvement I'd be happy to push to a separate PR is adding support for "native" in `TORCH_CUDA_ARCH_LIST`, which would just pick the arch of the current device.

An example of what's in stderr today

```
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/grayscale/build.ninja...
/usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module grayscale...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module grayscale...
/usr/local/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:679: UserWarning: Graph break due to unsupported builtin grayscale.PyCapsule.grayscale. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
```

Whereas after this PR users can do

`python benchmark_load_inline.py > >(tee stdout.txt) 2> >(tee stderr.txt >&2)`

```python
import os
import sys
from pathlib import Path
import shutil
import tempfile

import torch
from torch.utils.cpp_extension import load_inline

import logging
cpp_ext_logger = logging.getLogger("torch.utils.cpp_extension")
cpp_ext_logger.setLevel(logging.WARNING)

os.environ["TORCH_CUDA_ARCH_LIST"] = "native"

cpp_code = """
torch::Tensor to_gray(torch::Tensor input);
"""

cuda_kernel_code = """
torch::Tensor to_gray(torch::Tensor input) {
  auto output = torch::epty({input.size(0), input.size(1)}, input.options());
  return output ;
}
"""

# Avoid caching results
with tempfile.TemporaryDirectory() as build_dir:
    cuda_module = load_inline(
        name="to_gray_cuda",
        cpp_sources=cpp_code,
        cuda_sources=cuda_kernel_code,
        functions=["to_gray"],
        with_cuda=True,
        verbose=True,
        extra_cflags=["-std=c++17"], # "-ftime-report", "-H"],
        extra_cuda_cflags=["-arch=sm_89"],
        build_directory=build_dir,
    )

```

## New logs

### On failure

Which gives a much more reasonable stdout

```
[1/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpbg_xzv0r/cuda.cu -o cuda.cuda.o
FAILED: cuda.cuda.o
/usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpbg_xzv0r/cuda.cu -o cuda.cuda.o
/tmp/tmpbg_xzv0r/cuda.cu(6): error: namespace "torch" has no member "epty"
    auto output = torch::epty({input.size(0), input.size(1)}, input.options());
                         ^

1 error detected in the compilation of "/tmp/tmpbg_xzv0r/cuda.cu".
[2/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -fPIC -std=c++17 -std=c++17 -c /tmp/tmpbg_xzv0r/main.cpp -o main.o
ninja: build stopped: subcommand failed.

```

And stderr

```
Traceback (most recent call last):
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2874, in _run_ninja_build
    subprocess.run(
  File "/home/marksaroufim/.conda/envs/nv/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/marksaroufim/load_inline_slow/benchmark_load_inline.py", line 30, in <module>
    cuda_module = load_inline(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2261, in load_inline
    return _jit_compile(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2367, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2528, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2892, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'to_gray_cuda'

```

### On success

stdout

```
[1/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -fPIC -std=c++17 -std=c++17 -c /tmp/tmpxv_ovlrf/main.cpp -o main.o
[2/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpxv_ovlrf/cuda.cu -o cuda.cuda.o
[3/3] c++ main.o cuda.cuda.o -shared -L/home/marksaroufim/pytorch/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda-12.8/lib64 -lcudart -o to_gray_cuda.so

```

And an empty stderr as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152260
Approved by: https://github.com/albanD
2025-04-30 18:30:28 +00:00
05933e08ca [ATen][CUDA][SDPA] Enable SDPA on sm_121 (#152314)
This PR adds support for `sm_121` of the DGX Spark. `sm_121` is binary compatible with `sm_120` (just like `sm_89` is with `sm_86`), so a separate compilation target for `sm_121` is not required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152314
Approved by: https://github.com/eqy
2025-04-30 18:04:50 +00:00
b027cb8f9e [Docs] Add Description of validate_args for torch.distributions (#152173)
Fixes #152165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152173
Approved by: https://github.com/soulitzer
2025-04-30 18:01:20 +00:00
cyy
256c96332c [1/N] Use std::filesystem (#152288)
Maybe it is time to use std::filesystem because CXX11 ABI is now the default. The changes are for jit and distributed code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152288
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-30 17:54:16 +00:00
62ab6a5bb1 [ROCm] Use almalinux docker files for building Magma (#152488)
Fixes #151707 for ROCm Magma builds.  See also #152358.  Depends on #152492.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152488
Approved by: https://github.com/atalman
2025-04-30 17:53:30 +00:00
c620763ec2 [CUDAGraph Trees] support memory allocation on side stream (#152472)
I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool` and the error in #151199 does not happen any more.

However, this approach is unsafe for multithreading. When multiple run_eager calls happen concurrently, we expect memory allocations to go to different mem_pools. Since beginAllocateToPool does not check the stream, these allocations may end up in the same mem_pool.

So, I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocation on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id, and during runtime checks if the current thread id matches the launching thread id.
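Conceptually, the thread check works like the minimal sketch below. The class and method names are made up for illustration; the real logic lives in the C++ caching allocator, not in Python.

```python
import threading

class ThreadScopedPoolRedirect:
    """Conceptual sketch only (not the actual caching-allocator API): redirect
    allocations to a given pool only when they happen on the thread that
    installed the redirect."""

    def __init__(self, pool_id):
        self.pool_id = pool_id
        self.launch_thread = threading.get_ident()  # record the launching thread

    def pool_for_allocation(self, default_pool):
        # Only allocations made on the launching thread go to the target pool;
        # other threads keep using their own (default) pools.
        if threading.get_ident() == self.launch_thread:
            return self.pool_id
        return default_pool
```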

Fixes #151199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472
Approved by: https://github.com/eellison
2025-04-30 17:45:07 +00:00
0904a182c2 [dynamo] Relax guard introduced when tracing __call__ on user defined object (#152395)
This relaxes the guard introduced in #100444 (which aggressively guarded on the object id, even though Dynamo is just tracing its `__call__` method).

This allows users to bypass the high compilation time issue in #150706 by compiling transformer blocks only. Without this patch, we'd get lots of unnecessary recompilations, as the blocks have different attention processor instances.

Compiling blocks only _significantly_ speeds up the compilation process (from ~310s to ~32s), and even speeds up e2e performance for some reason (7.83s to 7.67s).
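For reference, the block-level compilation workaround looks roughly like the self-contained sketch below. The model definition here is a toy stand-in, not the diffusers model from the linked issue; the point is only that each block is compiled separately instead of compiling the whole model.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.proj(x))

class Model(nn.Module):
    def __init__(self, n_blocks=4, dim=16):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_blocks))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

model = Model()
# Compile each block instead of the whole model, so guards stay local to the
# block and attribute changes elsewhere don't force a full recompile.
for block in model.blocks:
    block.compile()

out = model(torch.randn(2, 16))
```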

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152395
Approved by: https://github.com/anijain2305
ghstack dependencies: #152369
2025-04-30 17:34:21 +00:00
e4994e2f73 [AOTAutogradCache] Allow torch.Tensor and a non-torch op from einops (#152369)
This addresses part of #150706.

Specifically, it reduces the warm start `torch.compile` overhead by
40~50% for GGUF models on
1. HuggingFace diffusers: [tlparse before, 224s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpqgbdva/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000) v.s. [tlparse after, 126s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp950PFy/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)
2. ComfyUI: [tlparse before, 93s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp7SeJb4/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000) v.s. [tlparse after, 51s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpRwGNqA/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)

The improvements should generalize to all other GGUF models on these
platforms, because the cache miss was induced by framework code, which
will be hit by every GGUF model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152369
Approved by: https://github.com/jamesjwu
2025-04-30 17:34:21 +00:00
ce2cf31623 Remove dead binary_ios_build, test, upload scripts (#152461)
Can't find any mentions of them in the codebase, presumably no longer used?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152461
Approved by: https://github.com/seemethere, https://github.com/janeyx99, https://github.com/malfet
2025-04-30 17:10:27 +00:00
702264dad4 Revert "Change test/inductor/test_standalone_compile to test/inductor/test_compile (#152103)"
This reverts commit ff1099562d261315ac7bbf43f3795872099a1c31.

Reverted https://github.com/pytorch/pytorch/pull/152103 on behalf of https://github.com/clee2000 due to failure is real but log classifier is pointing at an unrelated line, actual failure is just that the old name is mentioned somewhere and needs to be changed, see the bottom of the test step of the job https://github.com/pytorch/pytorch/actions/runs/14740884246/job/41379127184#step:22:705 [GH job link](https://github.com/pytorch/pytorch/actions/runs/14758321324/job/41434697413) [HUD commit link](ff1099562d) ([comment](https://github.com/pytorch/pytorch/pull/152103#issuecomment-2842638551))
2025-04-30 16:57:58 +00:00
8aa65780f4 [CUDA] Fix test_multi_device_context_manager on CUDA (#152474)
Seems there was a typo where `set_device` was called when the intent was to use `current_device`

As-is the test will fail on multigpu systems with

`TypeError: set_device() missing 1 required positional argument: 'device'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152474
Approved by: https://github.com/Skylion007
2025-04-30 16:53:10 +00:00
1e4bcd3ba3 Remove unnecessary condition compilation macro (#152512)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152512
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-04-30 16:48:25 +00:00
3b105ccc04 [AOTI] Fix a memory leak in model_package_loader (#152334)
Summary: There was a char array allocated but never freed. It was found by valgrind and verified fixed with this PR, although it's not easy to write a unit test for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152334
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-04-30 16:21:50 +00:00
c7484805ca Add two missing JIT tests to CMake (#152440)
Looks like I forgot to add these.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152440
Approved by: https://github.com/Skylion007
2025-04-30 16:18:55 +00:00
ff1099562d Change test/inductor/test_standalone_compile to test/inductor/test_compile (#152103)
These are the tests for torch._inductor.compile, so I renamed the file
test_compile. This is to avoid confusion with
torch._inductor.standalone_compile, which is now a lot more standalone
than torch._inductor.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152103
Approved by: https://github.com/oulgen
2025-04-30 15:27:44 +00:00
3c2bf24786 [ROCm] add almalinux images (#152492)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152492
Approved by: https://github.com/atalman
2025-04-30 15:14:01 +00:00
d88e0ceb64 Cast to unsigned char to avoid UB (#152360)
The standard requires that the argument to functions like `isdigit`, `isalpha`, and similar must be either `EOF` or an `unsigned char`; otherwise, the behavior is undefined (UB).
To avoid out-of-bounds reads, modern implementations of some libraries (such as glibc) deliberately pad their internal tables to guarantee valid memory access even for negative values. However, this is implementation-specific, and other libraries may not do this.

Properly casting the argument to `unsigned char` is good practice to avoid potential issues on some platforms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152360
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-04-30 15:09:13 +00:00
4408701fed [CI][CD] Unify install_cuda and install_cuda_aarch64 scripts (#152140)
Generalize install_cuda so it can also handle aarch64
Remove install_cuda_aarch64 since install_cuda can now handle it
Make install_cuda and install_cudnn functions in the install_cuda script because most of the code is the same

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152140
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-04-30 15:09:06 +00:00
371999782a Revert "Fix flaky test in test_custom_ops (#152484)"
This reverts commit 5a52e050248c71dd6e84f51d25cbd17a88555800.

Reverted https://github.com/pytorch/pytorch/pull/152484 on behalf of https://github.com/malfet due to It broke test_save to file with TypeError: get_sample_op_profile() missing 1 required argument ([comment](https://github.com/pytorch/pytorch/pull/152484#issuecomment-2842254907))
2025-04-30 14:53:15 +00:00
d620fefb2c [invoke_subgraph] Use backward identifier for min-cut parititioning (#152207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152207
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2025-04-30 14:34:56 +00:00
cf894b3f1f [MPS][BE] Remove exec_binary_alpha_kernel (#152485)
Which was almost a complete copy-n-paste from exec_binary_kernel anyway
Just add `Scalar` as an optional argument and figure out kernel name during the invocation rather than in executor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152485
Approved by: https://github.com/Skylion007
ghstack dependencies: #152443, #152466, #152479, #152504
2025-04-30 14:09:14 +00:00
c90e23eb73 [inductor] Fix usage of launch_enter_hook/launch_exit_hook (#152457)
In https://github.com/triton-lang/triton/pull/6467 I moved where `launch_enter_hook`/`launch_exit_hook` are specified (from the kernel class to a config). This PR updates the usages to use the config module if it exists to support tip of main triton.

In https://github.com/triton-lang/triton/pull/6641 I renamed `triton.config` to `triton.knobs`, hence the second commit in this PR.

Test Plan: Setup OSS PT with tip of main triton (namely including https://github.com/triton-lang/triton/pull/6641) and run `python test/inductor/test_pad_mm.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152457
Approved by: https://github.com/jamesjwu
2025-04-30 13:22:16 +00:00
36acaaae3f [CUDA] Add new architectures (#152414)
CUDA 12.9 will introduce a couple of new architectures, `sm_103` and `sm_121`. We do not need to build for them, because they are going to be compatible with `sm_100` and `sm_120` respectively (similar to `sm_86` and `sm_89`), but PyTorch must be "aware" of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152414
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet
2025-04-30 09:55:27 +00:00
ece1658418 [ROCm][TunableOp] Fix ScaledGEMM rowwise (#152403)
Fixes TunableOp ScaledGEMM regression for rowwise scaling caused by this https://github.com/pytorch/pytorch/pull/147548

Credit goes to @mawong-amd for fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152403
Approved by: https://github.com/jeffdaily
2025-04-30 08:33:03 +00:00
7a9d0d2451 Revert "[PT2] Port replace_lce_with_matmul / replace_first_lce_with_fused_matmul_lce to PT2 pre_grad passes (#152450)"
This reverts commit c8f48eb18531e4e348fcfa718b2e52d3c2497197.

Reverted https://github.com/pytorch/pytorch/pull/152450 on behalf of https://github.com/wdvr due to still failing after https://github.com/pytorch/pytorch/pull/152493 - needs further investigation ([comment](https://github.com/pytorch/pytorch/pull/152450#issuecomment-2841212970))
2025-04-30 08:30:57 +00:00
424e21ae82 Revert "fix tests broken after #152450 (#152493)"
This reverts commit d8fe6fa280c3e5bd21b3e84b3e25d9204ccdedf7.

Reverted https://github.com/pytorch/pytorch/pull/152493 on behalf of https://github.com/wdvr due to still failing ([comment](https://github.com/pytorch/pytorch/pull/152493#issuecomment-2841207942))
2025-04-30 08:27:58 +00:00
fa6f9eb2be [CUDA][TF32] Account for TF32 in compile_kernel_advanced (#152468)
Also cleanup some uses of `assert_close` in favor of `self.assertEqual`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152468
Approved by: https://github.com/msaroufim
2025-04-30 07:54:38 +00:00
d8fe6fa280 fix tests broken after #152450 (#152493)
Updating test expected value after #152450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152493
Approved by: https://github.com/huydhn, https://github.com/malfet

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-04-30 07:16:10 +00:00
5a52e05024 Fix flaky test in test_custom_ops (#152484)
Hopefully fixes https://github.com/pytorch/pytorch/issues/151301, https://github.com/pytorch/pytorch/issues/151281 by making the ops have different names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152484
Approved by: https://github.com/zou3519
2025-04-30 07:07:27 +00:00
cc7346bf19 Revert "fix tests broken after #152450 (#152493)"
This reverts commit 4df97a883949564aa4ed20b6912c3eb664d2624c.

Reverted https://github.com/pytorch/pytorch/pull/152493 on behalf of https://github.com/huydhn due to Another tweak is needed https://github.com/pytorch/pytorch/actions/runs/14748144909/job/41399954902, seem easier to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/152493#issuecomment-2841010528))
2025-04-30 07:05:58 +00:00
59a8aa1489 Fix instantiate_device_type_tests() for 3rd-party devices (#152177)
For 3rd-party devices, calling ``instantiate_device_type_tests()`` with a ``str`` (rather than a ``List[str]``/``Tuple[str]``) for the ``only_for`` or ``except_for`` argument currently causes unexpected results.

For example, calling ``instantiate_device_type_tests(TestXXX, globals(), only_for="cpu")`` goes into [filter_desired_device_types()](f38dae76ee/torch/testing/_internal/common_device_type.py (L729)) and results in ``only_for=['c', 'p', 'u']``, because the ``only_for`` we passed is the string "cpu".
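A minimal reproduction of the underlying Python behavior (independent of PyTorch):

```python
only_for = "cpu"
print([d for d in only_for])    # ['c', 'p', 'u'] -- a str is iterated character by character

only_for = ["cpu"]
print([d for d in only_for])    # ['cpu']
```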

This PR fixes the above unexpected behavior for ``str`` case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152177
Approved by: https://github.com/albanD
2025-04-30 06:25:59 +00:00
a2c553cac6 [Metal] Extend typecasted op support to complex dtypes (#152504)
First of all, by extending `c10::metal::cast_to` to work correctly with complex dtypes, by introducing two more specializations: one that casts complex to scalar, and another that casts scalar to complex (as the default Metal typecast would turn `float x` into `float2(x, x)`).

Add ComplexHalf and ComplexFloat enum values to `c10::metal::ScalarTypes` and handle them in `val_at_offs(ptr, offs, type)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152504
Approved by: https://github.com/dcci
ghstack dependencies: #152443, #152466, #152479
2025-04-30 05:32:07 +00:00
4df97a8839 fix tests broken after #152450 (#152493)
Updating test expected value after #152450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152493
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-30 04:55:55 +00:00
fcfa6e36c9 [MPS] Fix lerp for complex numbers (#152479)
As well as `.add`/`.sub` with complex alpha

Before this change `python3 -c "import torch;print(torch.rand(10, device='mps', dtype=torch.complex64).add(torch.rand(10, device='mps', dtype=torch.complex64), alpha=.5j))"` used to fail with
```
RuntimeError: value cannot be converted to type double without overflow
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152479
Approved by: https://github.com/dcci
ghstack dependencies: #152443, #152466
2025-04-30 04:46:19 +00:00
9bfdf57572 [MPS][BE] Introduce c10::metal::mul (#152466)
Which multiplies two arguments for either scalar or complex data types

This allows one to get rid of bunch of complex specialization in BinaryOps
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152466
Approved by: https://github.com/dcci
ghstack dependencies: #152443
2025-04-30 04:45:47 +00:00
ee2d104c05 [cutlass backend] Add (limited) bmm dynamic shape support (#152393)
Differential Revision: D73626732

In this PR, we add support for bmm dynamic shapes, provided that the batch stride is the largest of the strides for A, B, and D. For example, for A of size `(B, M, K)`, we support strides `(M*K, K, 1)` and `(M*K, 1, M)`. With this assumption, we can infer the batch stride from the existing arguments.

The reason is we don't want to add 2-3 more runtime params. The concerns are complexity and possible perf regression, though we didn't verify the latter.

We can revisit this if there is a need for that.

We also remove `B = 1` for normal mm and addmm. We tested it and didn't see perf regression. But open to revisiting this as well.
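To make the stride assumption concrete, here is a small sanity check; the shapes are chosen arbitrarily for illustration and are not taken from the PR:

```python
import torch

B, M, K = 4, 8, 16

# Row-major per batch: shape (B, M, K), strides (M*K, K, 1)
A_row = torch.randn(B, M, K)
# Column-major per batch: shape (B, M, K), strides (M*K, 1, M)
A_col = torch.randn(B, K, M).transpose(1, 2)

for A in (A_row, A_col):
    # When the batch stride is the largest stride, it can be recovered without
    # passing it as an extra runtime parameter.
    assert max(A.stride()) == A.stride(0) == M * K
```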

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152393
Approved by: https://github.com/ColinPeppler
2025-04-30 04:36:24 +00:00
e5ea7911ea [ez] Make relaxed constraint error message more user friendly (#151407)
Fixes #151356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151407
Approved by: https://github.com/Skylion007
2025-04-30 03:51:50 +00:00
c01bcc5efb [MPS][BE] Delete unused lerp functors (#152443)
For `lerp.Scalar_out`, weight (aka alpha) is not an optional argument, so there is no point in having those specializations.
But handle `alpha=1.0` ahead of dispatching to Metal shaders, as a plain copy of the tensor should still be faster: a1a4fee3b8/aten/src/ATen/native/mps/operations/BinaryOps.mm (L285-L290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152443
Approved by: https://github.com/Skylion007
2025-04-30 03:32:52 +00:00
4a63cab624 [cudagraphs] Fix issue in collecting static_input_idxs (#152287)
related to https://github.com/pytorch/pytorch/issues/152275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152287
Approved by: https://github.com/bdhirsh, https://github.com/eellison

Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
2025-04-30 03:24:05 +00:00
bce7f0a216 Fix additional inputs to error on inconsistent constants (#151970)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151970
Approved by: https://github.com/pianpwk
2025-04-30 01:38:17 +00:00
4bead7b85e use cutlass native BroadcastPtrArray in scaled group gemm (#152404)
After cutlass update to 3.9 we can use BroadcastPtrArray instead of a local copy with small changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152404
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-30 01:17:28 +00:00
eqy
cc072af74a [CUDA][MXFP8] bump tolerances for test_blockwise_mxfp8_nvfp4_numerics (#151811)
got a slightly lower sqnr on a smaller GPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151811
Approved by: https://github.com/albanD
2025-04-30 01:12:51 +00:00
bea7d428bc [export] Preserve custom metadata for tensor constants (#152241)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/151476
The `custom_meta` collected from `mod` has keys that follow the names of nodes in `mod`, which are inconsistent with the node names after the naming pass. For example, a constant `b` will become `c_b`.

Test Plan: buck2 run caffe2/test:test_export -- -r test_run_decompositions_keep_tensor_constant_metadata

Differential Revision: D73703068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152241
Approved by: https://github.com/angelayi
2025-04-30 00:30:35 +00:00
d36b09ca58 [aten] Enable vectorized 8byte copy for fp16/bf16 for index select kernel (#152380)
## Summary

Enable aligned vector loading for 2-byte data types for index_select. Specifically:

- **4-element fp16/bf16 packing**: added 8-byte vector load/store to move 4 half values at once.
- **Warp-wide predicate (`__all_sync`)**: decide fast vs. fallback path per warp, eliminating lane-level divergence.
- **Alignment guard**: the fast/vectorized path only executes when src and dst are 8-byte aligned, preventing misaligned-address faults.
- **Safe for-loop fallback**: for misaligned, stride > 1, or tail elements we recompute offsets per element to avoid memory corruption.
- **Bounds checks**: the fast/vectorized path is skipped when fewer than 4 elements remain, guaranteeing bounded access.
- **Stride remapping**: redirect calls to the inner contiguous dim, which has stride = 1, so copies occur along memory-coalesced axes.
- **AMD support**: ensured portability and correctness across CUDA and HIP platforms.

## Perf testing
We note a 2.5x improvement in memory bandwidth after this change when the tensor dim is a multiple of 4 for 2-byte data types (fp16/bf16).

<img width="625" alt="image" src="https://github.com/user-attachments/assets/909b04a3-98f2-4c30-8c29-c36e1beeea0f" />

With an input tensor dimension that is not a multiple of 4, we see a smaller improvement (~1.2x) due to warp divergence.
<img width="624" alt="image" src="https://github.com/user-attachments/assets/f3ed16f4-b091-48bd-9889-093f6a90688d" />

## Perf testing code
```
# pyre-strict
from typing import List, Optional, Tuple

import click
import pandas as pd

import torch

# @manual=//triton:triton
import triton

@click.command()
@click.option("--data-type", type=str, default="bf16")
@click.option("--return-result", type=bool, default=False)
def main(
    data_type: str,
    return_result: bool,
) -> Optional[Tuple[List[triton.testing.Benchmark], List[pd.DataFrame]]]:
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cuda.matmul.allow_tf32 = True
    data_types = {"fp32", "fp16", "bf16"}
    if data_type not in data_types:
        raise ValueError(f"Unsupported data type: {data_type}.")

    dtype = {
        "fp32": torch.float32,
        "fp16": torch.float16,
        "bf16": torch.bfloat16
    }[data_type]

    D1 = 192
    D2 = 156
    configs: List[triton.testing.Benchmark] = [
        triton.testing.Benchmark(
            x_names=["B"],
            x_vals=[24],
            line_arg="provider",
            line_vals=[
                "repeat_interleave",
                "repeat_interleave_int32",
            ],
            line_names=["repeat_interleave", "repeat_interleave_int32"],
            styles=[("red", "-"), ("purple", "-")],
            ylabel="ms",
            plot_name=f"torch-repeat_interleave-D1-{D1}-D2-{D2}-dtype-{dtype}",
            args={
                "D1": D1,
                "D2": D2,
                "dtype": dtype,
            },
        )
    ]

    @triton.testing.perf_report(configs)
    def bench_repeat_interleave(
        B: int,
        D1: int,
        D2: int,
        dtype: torch.dtype,
        provider: str,
    ) -> float:
        warmup = 20
        rep = 100
        torch.manual_seed(42)
        torch.cuda.manual_seed(42)

        a = torch.randn(24, D1, D2)
        a = a.to(dtype).to("cuda")

        input_bytes = a.numel() * a.element_size()

        repeats = torch.randint(low=100, high=1600, size=(24,), device="cuda")
        output_bytes = (
            repeats.sum() * a.shape[1] * a.shape[2] * repeats.element_size()
        )
        total_bytes = input_bytes + output_bytes

        def torch_repeat_interleave(
            input_tensor: torch.Tensor, repeats: torch.Tensor
        ) -> torch.Tensor:
            res = input_tensor.repeat_interleave(repeats, dim=0)
            return res

        def torch_repeat_interleave_int32(
            input_tensor: torch.Tensor, repeats: torch.Tensor
        ) -> torch.Tensor:
            dim = 0
            if torch.is_tensor(repeats):
                idx64 = torch.repeat_interleave(
                    torch.arange(
                        0,
                        input_tensor.shape[dim or 0],
                        device=input_tensor.device,
                    ),
                    repeats,
                    dim=0,
                )
            else:
                idx64 = (
                    torch.arange(
                        input_tensor.shape[dim or 0] * repeats,
                        device=input_tensor.device,
                    )
                    .reshape(-1, repeats)
                    .flatten()
                )

            idx32 = idx64.to(torch.int32)
            res = torch.index_select(input_tensor, 0, idx32)
            return res

        def expand_flatten(input_tensor: torch.Tensor) -> torch.Tensor:
            return input_tensor[:, None].expand(-1, 4, -1).flatten(0, 1)

        if provider == "repeat_interleave":
            fn = lambda: torch_repeat_interleave(a, repeats)  # noqa E731
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            bw = total_bytes / (ms * 1e6)
            # print("Bandwidth[GB/s]: ", total_bytes / (ms * 1e6))
            return bw.item()
        if provider == "repeat_interleave_int32":
            fn = lambda: torch_repeat_interleave_int32(a, repeats)
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            bw = total_bytes / (ms * 1e6)
            # print("Bandwidth[GB/s]: ", total_bytes / (ms * 1e6))
            return bw.item()
        elif provider == "expand_flatten":
            fn = lambda: expand_flatten(a)
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            bw = total_bytes / (ms * 1e6)
            # print("Bandwidth[GB/s]: ", total_bytes / (ms * 1e6))
            return bw.item()
        else:
            raise ValueError(f"unsupported provider: {provider}")

    df = bench_repeat_interleave.run(print_data=True, return_df=True)

    if return_result:
        return configs, df

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152380
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-04-29 23:54:52 +00:00
c6d3b8f861 add xfail for distributed tests on Jetson (#152224)
We are hitting distributed import failures on Jetson in test/export/test_export.py tests in NVIDIA internal testing with the recent additions of https://github.com/pytorch/pytorch/pull/146050 and https://github.com/pytorch/pytorch/pull/147417. Instead of simply skipping these tests for Jetson, we are introducing an xfailIfDistributedNotSupported to get better signaling for this kind of failure in the long run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152224
Approved by: https://github.com/nWEIdia, https://github.com/eqy
2025-04-29 23:48:40 +00:00
6f8023a35f [PowerPC] Fix vec256 for complex float and double in Power system (#152402)
The Power system build is failing with the error below. It started failing after this commit:
912102b4ec

This PR fixes the build error along with the test cases that were failing for the complex double and complex float data types.

Build Failure Logs:
```
vec_base.h:790:6: error: use of deleted function ‘at::vec::DEFAULT::ComplexDbl& at::vec::DEFAULT::Vectorized<c10::complex<double>>::operator[]’
790 | c[i] = a[i] * b[i];
    | ~^
error: use of deleted function ‘at::vec::DEFAULT::ComplexDbl& at::vec::DEFAULT::Vectorized<c10::complex<double>>::operator[]’
802 | c[i] = a[i] / b[i];
    | ~^

error: use of deleted function ‘at::vec::DEFAULT::ComplexFlt& at::vec::DEFAULT::Vectorized<c10::complex<float>>::operator[]’
790 | c[i] = a[i] * b[i];
    | ~^

error: use of deleted function ‘at::vec::DEFAULT::ComplexFlt& at::vec::DEFAULT::Vectorized<c10::complex<float>>::operator[]’
802 | c[i] = a[i] / b[i];
    | ~^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152402
Approved by: https://github.com/malfet
2025-04-29 23:45:49 +00:00
c8f48eb185 [PT2] Port replace_lce_with_matmul / replace_first_lce_with_fused_matmul_lce to PT2 pre_grad passes (#152450)
Summary:
Port over replace_lce_with_matmul and replace_first_lce_with_fused_matmul_lce to PT2 pre_grad pass.
Original dper pass diffs: D67884534, D68123479, D68384238

Test Plan:
Test 1. Covers replace_lce_with_matmul and case 1 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=6 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/669809193/0/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_669809193_0_diable_acc.log
```
Log: P1798246938

Test 2. Covers replace_lce_with_matmul and case 2 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=7 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/677734158/9/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_matmul_lce_replace_normal_LCE" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_677734158_9_diable_acc.log
```
Log: P1798246675

Seeing logs like
`[Pre grad(predispatch IR)] Apply use_matmul_fuse_lce_replace_first_LCE pass, save before/after graph to /tmp/tmp8lyzoh79, graph before/after are the same = False`

Reviewed By: huxintong

Differential Revision: D71358949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152450
Approved by: https://github.com/huxintong
2025-04-29 23:45:20 +00:00
e872bf8f88 Avoid linking multiple OMP runtimes in libtorch_cpu.so if BLAS used is OpenBLAS. (#147725)
When PyTorch is built with OpenBLAS support and libopenblas is directly linked with libgomp.so, libtorch_cpu.so ends up with multiple OMP runtimes linked against it. This may result in unexpected runtime behaviour/regressions. This patch fixes that by avoiding linking against libomp.so if OpenBLAS is linked against libgomp.so.

Fixes #146603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147725
Approved by: https://github.com/albanD
2025-04-29 23:39:48 +00:00
a1a4fee3b8 Native channel shuffle floating point exception (#144010)
Fixes #142453

Added TORCH_CHECKS to prevent the user from using the native_channel_shuffle function incorrectly and getting a "Floating point exception (core dumped)"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144010
Approved by: https://github.com/albanD
2025-04-29 23:38:54 +00:00
8f420a500a Save/load op profiles (#151817)
Add ability to save/load op profiles into a yaml file:
```python
op_profile = self.get_sample_op_profile()

# Save
save_op_profiles(op_profile, "op_profile.yaml")
# Load
loaded = load_op_profiles("op_profile.yaml")

assert op_profile == loaded
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151817
Approved by: https://github.com/zou3519
2025-04-29 23:11:32 +00:00
8358eca2ce [Cutlass] Only run EVT tests on sm90 (#151713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151713
Approved by: https://github.com/masnesral
ghstack dependencies: #152305, #152306, #150905, #151405
2025-04-29 23:06:01 +00:00
a1f6d85b36 [Cutlass] Fixes for e2e compilation in arg rendering (#151405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151405
Approved by: https://github.com/eellison
ghstack dependencies: #152305, #152306, #150905
2025-04-29 23:06:01 +00:00
a0ce5ce6e4 [Cutlass] Implement cutlass epilogue visitor python codegen (#150905)
This PR implements the second codegen task of CUTLASS EVT: translating inductor epilogue nodes into python code that will be traced by the EVT infra.

Details:
The implementation uses a simple ops wrapper which only supports add and mul pointwise ops today (to be extended in the future). This ops wrapper generates python code from inner_fn of the epilogue nodes in the format EVT expects. The main caveat is that one of the outputs needs to be named "D" and the accumulator input needs to be named "acc". Reads/writes are named according to the inductor buffer names otherwise.

Previously merged:
* #150904
* #150903
* #150346
* #150345
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150905
Approved by: https://github.com/eellison
ghstack dependencies: #152305, #152306
2025-04-29 23:05:55 +00:00
72273bef9e [Cutlass] Fix int check in example tensor creation (#152306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152306
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #152305
2025-04-29 23:05:47 +00:00
4293a6095d [Cutlass] Remove unused dtype conversion map (#152305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152305
Approved by: https://github.com/Skylion007
2025-04-29 23:05:41 +00:00
a4a771648a [pt2d] Add reorder_comms_preserving_peak_memory pass (#146562)
This is a new pass to replace the pre-existing passes.  It has the same
basic goal, to achieve communication overlap (latency hiding), but also
constrains the solution to not increase peak memory.

The principles of operation are detailed in code comments, but
summarized here:
- never reorder collectives relative to each other (TBD if we should
  relax this later)
- before performing reordering, push all comm and wait nodes as late as possible, respecting data dependencies
- estimate peak memory and current memory at each scheduler node
- move collective nodes forward one position at a time, if the move does
  not increase current memory beyond peak memory

The pass logs a summary table for each graph to TORCH_LOGS=overlap.

e.g. (exact format may have been tweaked but this shows the idea).

```
rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] Collective node                                                                                                                                                initial exposed    final exposed    improvement  limiting factor        moves
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] -----------------------------------------------------------------------------------------------------------------------------------------------------------  -----------------  ---------------  -------------  -------------------  -------
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op2')  (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[2256, 256], stride=[256, 1]) (buf2) (12142 ns)               12141.6          6514.53       5627.08   prefetch limit            75
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op6')  (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[282, 256], stride=[256, 1]) (buf7) (32266 ns)                 32265.8         28429.2        3836.61   data dependency           78
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op9')  (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[256], stride=[1]) (buf11) (10801 ns)                         10800.6         10732.3          68.254  peak memory                1
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op14')  (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[32], stride=[1]) (buf17) (10810 ns)                          10809.5         10809.5           0      data dependency            4
[rank
```
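A minimal Python sketch of the per-collective move loop described in the principles above; the function names and signatures are placeholders, not the real scheduler API, and the "never reorder collectives relative to each other" constraint is omitted for brevity:

```python
def reorder_preserving_peak_memory(nodes, deps, is_collective, mem_after):
    """Placeholder sketch: move each collective earlier one slot at a time,
    stopping at a data dependency or when the move would raise memory at any
    point beyond the original peak."""

    def peak(order):
        return max(mem_after(order, i) for i in range(len(order)))

    order = list(nodes)
    baseline_peak = peak(order)
    for node in [n for n in order if is_collective(n)]:
        i = order.index(node)
        # Move earlier while the previous node is not a dependency...
        while i > 0 and order[i - 1] not in deps(node):
            candidate = order[:i - 1] + [node, order[i - 1]] + order[i + 1:]
            # ...and the swap does not exceed the original peak memory.
            if peak(candidate) > baseline_peak:
                break
            order = candidate
            i -= 1
    return order
```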

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146562
Approved by: https://github.com/eellison
ghstack dependencies: #152060, #146561
2025-04-29 22:51:31 +00:00
e35e31697e Revert "[MPS][BE] Delete unused lerp functors (#152443)"
This reverts commit 0a2d3206a82c4a5c923938cf0a0ebc0f47aa17dd.

Reverted https://github.com/pytorch/pytorch/pull/152443 on behalf of https://github.com/wdvr due to failing MPS test: test/test_optim.py::TestOptimRenewedMPS::test_can_load_from_to_named_state_dict_is_named_optim0_False_is_named_optim1_False_Adafactor_mps_float32 ([comment](https://github.com/pytorch/pytorch/pull/152443#issuecomment-2840405966))
2025-04-29 22:50:23 +00:00
fecaa60c3c Revert "Add detailed triton kernel logging to tlparse (#152197)"
This reverts commit 8303860de779da840316dd95ce3051e0a4119174.

Reverted https://github.com/pytorch/pytorch/pull/152197 on behalf of https://github.com/wdvr due to failing     python test/dynamo/test_structured_trace.py StructuredTraceTest.test_cudagraphs on trunk ([comment](https://github.com/pytorch/pytorch/pull/152197#issuecomment-2840400839))
2025-04-29 22:47:48 +00:00
471025c489 Revert "[AOTI][reland] Remove typedef for half and bfloat16 (#151109)"
This reverts commit a0d440a26a555c34e87b90bef3bff960b34bb180.

Reverted https://github.com/pytorch/pytorch/pull/151109 on behalf of https://github.com/wdvr due to causing AOTI test failures - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/151109#issuecomment-2840386483))
2025-04-29 22:37:16 +00:00
accffef504 Run link checks on modified files on push too (#152464)
https://github.com/pytorch/pytorch/issues/152439
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152464
Approved by: https://github.com/huydhn
2025-04-29 22:08:40 +00:00
89c0c3ca80 Add private config to broadcast rank0 decision from the partitioner to all ranks (#152264)
Summary: This PR adds a private configuration to the partitioner that ensures that the decision taken is the same across all ranks. This is a temporary workaround, as when size_hints are also taken into account in compiler collectives this workaround will not be needed anymore.

Test Plan:
This has been tested on some internal models, but I haven't added any tests in PyTorch (yet?)

Differential Revision: D73666017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152264
Approved by: https://github.com/bdhirsh
2025-04-29 21:27:57 +00:00
28efeb1522 Remove unused Manylinux2014 Docker files and builds (#152428)
Related to Manylinux 2.28 migration: https://github.com/pytorch/pytorch/issues/123649
Cleanup old Docker files and `manylinuxaarch64-builder:cpu-aarch64` image which has been replaced by `manylinux2_28_aarch64-builder:cpu-aarch64`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152428
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-04-29 20:57:29 +00:00
c039cb1a06 submodules: point gloo to new home in pytorch/ (#152438)
Gloo moved to the PyTorch GitHub org. This updates PyTorch to point to the new location.

https://github.com/pytorch/gloo

Test plan:

CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152438
Approved by: https://github.com/fduwjj
2025-04-29 20:42:24 +00:00
0a2d3206a8 [MPS][BE] Delete unused lerp functors (#152443)
For `lerp.Scalar_out` weight (aka alpha) is not an optional argument, so no point in having those specializations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152443
Approved by: https://github.com/Skylion007
2025-04-29 20:42:21 +00:00
1d8cdf373b [dynamo] Guard serialization for NAME_MATCH (#152332)
Differential Revision: [D73780430](https://our.internmc.facebook.com/intern/diff/D73780430/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152332
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328, #152329, #152330, #152331
2025-04-29 20:16:00 +00:00
5c297b2846 [dynamo] Guard serialization for DISPATCH_KEY_SET_MATCH (#152331)
Differential Revision: [D73780433](https://our.internmc.facebook.com/intern/diff/D73780433/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152331
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328, #152329, #152330
2025-04-29 20:16:00 +00:00
4cb75d7afc [dynamo] Guard serialization for ID_MATCH (#152330)
Differential Revision: [D73780431](https://our.internmc.facebook.com/intern/diff/D73780431/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152330
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328, #152329
2025-04-29 20:16:00 +00:00
0b39124ea3 [dynamo] Guard serialization for NONE_MATCH. (#152329)
Differential Revision: [D73780435](https://our.internmc.facebook.com/intern/diff/D73780435/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152329
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328
2025-04-29 20:16:00 +00:00
ab4091a9fa [dynamo] Guard serialization for BOOL_MATCH. (#152328)
Differential Revision: [D73780434](https://our.internmc.facebook.com/intern/diff/D73780434/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152328
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327
2025-04-29 20:16:00 +00:00
c521c45a8a [dynamo] Guard serialization for DICT_CONTAINS (#152327)
Adding serialization for DICT_CONTAINS

Differential Revision: [D73780432](https://our.internmc.facebook.com/intern/diff/D73780432/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152327
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326
2025-04-29 20:16:00 +00:00
52202525b9 [dynamo] Guard serialization for DICT_VERSION (#152326)
I think we shouldn't support DICT_VERSION for 2 reasons:
1. dict version is not well defined across processes
2. they are pretty rare (only with pytree calls)

Differential Revision: [D73780437](https://our.internmc.facebook.com/intern/diff/D73780437/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152326
Approved by: https://github.com/jansel
ghstack dependencies: #152325
2025-04-29 20:16:00 +00:00
df663b9e72 [dynamo] Guard serialization for TYPE_MATCH (#152325)
Adding guard serialization for TYPE_MATCH

Differential Revision: [D73780438](https://our.internmc.facebook.com/intern/diff/D73780438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152325
Approved by: https://github.com/jansel
2025-04-29 20:16:00 +00:00
a04f4622e1 [conda] Remove conda from lint-autoformat.yml (#152433)
Installs setuptools since I get
https://github.com/pytorch/pytorch/actions/runs/14736804186/job/41364832984#step:5:60
```
+ python3 -m tools.generate_torch_version --is_debug=false
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/tools/generate_torch_version.py", line 9, in <module>
    from setuptools import distutils  # type: ignore[import]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'setuptools'
```
It should be a no op in the normal lint workflow since setuptools is in the docker image

Switched from using python3.10 to system python, which should be python3.9

Use venv to put deps not in the base?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152433
Approved by: https://github.com/huydhn
2025-04-29 20:14:21 +00:00
2cfc1faa27 [PT2]: fix add_passes and remove_passes naming issue (#152386)
Summary:
When defining pre_grad passes, they are initially defined as empty functions, then overridden in [customized_triton_kernel_passes.py](https://www.internalfb.com/code/fbsource/[b4eea3dcd7f22421e68a3c1533fd09a4281bc291]/fbcode/caffe2/torch/_inductor/fx_passes/fb/customized_triton_kernel_passes.py?lines=71-73). This causes issues for add_passes and remove_passes because `p.__name__` may now be prefixed by an underscore.

This diff removes the leading _ to match the pass name.

Test Plan: Tested together with the next diff in the stack.

Reviewed By: oniononion36

Differential Revision: D73809937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152386
Approved by: https://github.com/huxintong
2025-04-29 20:07:15 +00:00
e58c73be44 Add latex settings (#152350)
- Fixes #147027
- Only lualatex can build our 3K-page PDF with reasonable quality; xelatex runs out of memory and pdflatex just fails.
- Move notes under the same toctree as python-api, which is needed for the PDF but doesn't change how the HTML is generated.

This is the produced PDF:
[pytorch.pdf](https://github.com/user-attachments/files/19945450/pytorch.pdf)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152350
Approved by: https://github.com/albanD
2025-04-29 19:28:43 +00:00
e6e1ca1996 [easy] Fix test_dynamo_timed (#152387)
Summary: I'm just trying to fix the test again. It's out of date because it's disabled and some dynamo_timed-related fields are gone now.

Test Plan: `python test/dynamo/test_utils.py -k dynamo_timed`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152387
Approved by: https://github.com/anijain2305
2025-04-29 19:22:56 +00:00
8e2e06b7ea Fix shadow local variables (#152429)
Summary: Fixing shadow local variables error: P1798875650

Test Plan: CI

Differential Revision: D73853605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152429
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-29 18:50:18 +00:00
a3123dd3ab Run link linters on modified files only or on everything when scheduled (#152377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152377
Approved by: https://github.com/huydhn
2025-04-29 18:30:40 +00:00
8303860de7 Add detailed triton kernel logging to tlparse (#152197)
This PR adds detailed logging of each Triton kernel we compile, and its autotune result, for every kernel compiled with Triton. We add these results to a global variable that we then clear after each Triton kernel compile.

We can't keep these objects around after compile time, so we can't record the autotune cache save or coordinate descent tuning, unfortunately, but we can log at least:
- The duration of compilation
- Whether or not autotune cache hit
- The best autotuning config, if there's only one.

Example triton kernel info: https://gist.github.com/jamesjwu/493bdd0f36b0b7e3ca327f87bd6c2c75

See internal diff for an example log for internal model.

Differential Revision: [D73674443](https://our.internmc.facebook.com/intern/diff/D73674443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152197
Approved by: https://github.com/oulgen, https://github.com/eellison
2025-04-29 18:16:56 +00:00
d35e900c74 [MPSInductor] Make sure sizevars are computed (#152436)
Before calling the kernel

This fixes `GPUTests.test_float_repr_dynamic_shapes_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152436
Approved by: https://github.com/dcci
ghstack dependencies: #152363, #152430
2025-04-29 17:53:29 +00:00
835f95490f [MPSInductor] Fix type promotion in _print_Max (#152430)
Ran into this problem while re-enabling `test_float_repr_dynamic_shapes`, where `_print_Max` was called with an integer and a long argument, which resulted in the following compilation error
```
error: call to 'max' is ambiguous
        out_ptr0[x0 + x1*metal::max(1, ks0)] = static_cast<float>(tmp26);
                         ^~~~~~~~~~
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/32023/Libraries/lib/clang/32023.619/include/metal/metal_integer:2477:16: note: candidate function
METAL_FUNC int max(int x, int y)
               ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/32023/Libraries/lib/clang/32023.619/include/metal/metal_integer:3686:17: note: candidate function
METAL_FUNC long max(long x, long y)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152430
Approved by: https://github.com/dcci
ghstack dependencies: #152363
2025-04-29 17:53:29 +00:00
cce8b5d8d7 Refactor TritonTemplate.generate and move codgen part to generate_and_load (#151764)
Splitting https://github.com/pytorch/pytorch/pull/149267/.
This first PR just refactors the code without adding any caching functionality.
The logic of generating the code and loading it is moved to generate_and_load(), plus some typing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151764
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-04-29 17:44:46 +00:00
3962b8f1e0 Revert "[OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914)"
This reverts commit 64a55b531f4f4ae2b35175ab5d9a30a856b0d6ef.

Reverted https://github.com/pytorch/pytorch/pull/151914 on behalf of https://github.com/malfet due to Looks like breaks number of ROCM jobs, see 797768cd90/1 ([comment](https://github.com/pytorch/pytorch/pull/151914#issuecomment-2839691038))
2025-04-29 17:36:12 +00:00
797768cd90 [Graph Partition] reorder for minimal number of partitions (#151968)
This PR adds an optimal reordering for minimizing #partitions.

## Optimal reordering for minimizing #partitions

A BFS can minimize #partitions (ignoring peak memory for now):
1. For each node, compute node_to_indegree: dict[node, int].
2. Maintain 2 queues: cudagraphable_nodes, and non_cudagraphable_nodes. Iterate through all nodes and add nodes to one of these 2 queues if node_to_indegree[node] == 0.
3. While non_cudagraphable_nodes is not empty: Pop 1 node, schedule it, update the indegree of all its successors, and add its successor nodes to one of the queues if node_to_indegree[successor] == 0.
4. While cudagraphable_nodes is not empty: Pop 1 node, schedule it, update the indegree of all its successors, and add its successor nodes to one of the queues if node_to_indegree[successor] == 0.
5. Repeat step 3 & 4 until all nodes have been scheduled.

We call this strategy `reorder_for_minimizing_partition`.
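A minimal Python sketch of this two-queue BFS; `nodes`, `successors`, and `is_cudagraphable` are placeholder data structures, not the actual inductor scheduler types:

```python
from collections import deque

def reorder_for_minimizing_partition(nodes, successors, is_cudagraphable):
    # Compute in-degrees from the dependency edges.
    indegree = {n: 0 for n in nodes}
    for n in nodes:
        for s in successors[n]:
            indegree[s] += 1

    cg, non_cg = deque(), deque()
    for n in nodes:
        if indegree[n] == 0:
            (cg if is_cudagraphable(n) else non_cg).append(n)

    order = []
    while cg or non_cg:
        # Drain one queue fully before switching, so nodes of the same kind
        # stay adjacent and the number of partitions is minimized.
        queue = non_cg if non_cg else cg
        while queue:
            n = queue.popleft()
            order.append(n)
            for s in successors[n]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    (cg if is_cudagraphable(s) else non_cg).append(s)
    return order
```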

**Q: Why is this optimal?**

Suppose this is not optimal; then we would have a counterexample with 2 non_cudagraphable regions:

```
[non_cudagraphable1, cudagraphable2, non_cudagraphable3]
```

where we can reorder to only 1 non_cudagraphable region:

```
[non_cudagraphable1, non_cudagraphable3, cudagraphable2]
```

This reorder means non_cudagraphable3 does not depend on cudagraphable2. So after we scheduled non_cudagraphable1, both non_cudagraphable3 and cudagraphable2 have an in-degree of 0. If this is true, Step 3 would already have scheduled non_cudagraphable3 before cudagraphable2, so the counterexample cannot exist.

This shows we cannot find such a counterexample, and the BFS is optimal at minimizing #partitions.

## Minimize peak memory

`reorder_for_peak_memory` currently uses topological_sort_dfs, topological_sort_lpmf, and topological_sort_bfs, where the later 2 are bfs. ILP brings small benefits and it can hardly scale to more than 100 nodes, according to @xuanzhang816. So ILP is not used for peak memory reorder in the inductor.

Heuristics strategy:
- Conduct reorder_for_peak_memory as the default order
- Conduct reorder_for_minimal_partitions and get results as list[tuple[partition, bool]], where partition: list[BaseSchedulerNode] and bool for cudagraphable.
- If the reorder increases peak memory too much, we use the default order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151968
Approved by: https://github.com/eellison
2025-04-29 17:17:16 +00:00
a77a44761b [BE] Remove dangling # in contributing.md (#152259)
I frequently come to CONTRIBUTING.md to copy-paste the snippet below to rebuild PyTorch, which in zsh gives this error because the trailing # comments are passed through to the command. These comments add nothing, so just removing them.

```
error: pathspec 'sync' did not match any file(s) known to git
error: pathspec 'the' did not match any file(s) known to git
error: pathspec 'submodules' did not match any file(s) known to git
Building wheel torch-2.8.0a0+git9c01c87
invalid command name '#'
```

```
git submodule update --init --recursive # very important to sync the submodules
python setup.py develop                 # then try running the command again
git submodule update --init --recursive
python setup.py develop
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152259
Approved by: https://github.com/janeyx99
2025-04-29 17:07:19 +00:00
de20d76622 [conda] Remove conda usage from upload test stats while running workflow (#152431)
The original uses python 3.10 and the base is 3.9 but I think that's ok
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152431
Approved by: https://github.com/atalman
2025-04-29 16:16:54 +00:00
f84062f78d [conda] Remove conda usage from TD llm retriever job (#152338)
Remove conda usage from TD llm retriever job

python3 in the base is python3.9 right now. I'm not sure what the best way to deal with a potentially different Python version would be; dnf install, perhaps?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152338
Approved by: https://github.com/huydhn
2025-04-29 15:17:50 +00:00
663bcb68ba Implement metal kernel for basic MPS arithmetic ops using TensorIterator (#147644)
Add metal kernels for add, subtract, & lerp ops using TensorIterator. Should help resolve: https://github.com/pytorch/pytorch/issues/143874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147644
Approved by: https://github.com/malfet
2025-04-29 14:24:49 +00:00
2fb62f8288 [Dynamo][Typing] Enable typing hints for tx in misc.py (#152412)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152412
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-04-29 13:54:35 +00:00
49cbe0ffe9 [AOTInductor] Propagate ConstantType for main graph. (#152272)
Summary:
We need to make sure all named_parameters and named_buffers be
propagated if we use runtime constant folding.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_constant_type_propagation

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152272
Approved by: https://github.com/22quinn

Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-29 12:42:17 +00:00
64a55b531f [OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914)
As the title stated.

**Changes**:
- Add get_rng_state & set_rng_state support for OpenReg
- Add _lazy_init support for OpenReg
- Remove redundant code for cuda/Module.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151914
Approved by: https://github.com/albanD
2025-04-29 11:18:12 +00:00
5c01302cc8 Remove 3.13 hack when installing TIMM (#152399)
A Docker build failure showed up at this step, triggered by the landing of https://github.com/pytorch/pytorch/pull/152362. Here are example logs: https://github.com/pytorch/pytorch/actions/runs/14718029881/job/41305891896:

```
#37 29.72 + as_jenkins conda run -n py_3.13 pip install --progress-bar off --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
#37 29.72 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/conda/envs/py_3.13/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 conda run -n py_3.13 pip install --progress-bar off --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
#37 49.50 ERROR: Cannot install torch and torchvision==0.22.0.dev20250226+cu124 because these package versions have conflicting dependencies.
```

This happens because we stopped building the 12.4 nightly some time ago. The hack doesn't apply anymore, so let's just remove it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152399
Approved by: https://github.com/cyyever, https://github.com/wdvr, https://github.com/malfet
2025-04-29 08:22:37 +00:00
eb69f4e609 Add lr_lambda type check in MultiplicativeLR (#151973)
Fixes #81554

## TestResult

### Before

```python
In [3]: import torch
   ...: class SimpleLinearModel(torch.nn.Module):
   ...:     def __init__(self):
   ...:         super(SimpleLinearModel, self).__init__()
   ...:         self.linear = torch.nn.Linear(10, 1)
   ...:
   ...:     def forward(self, x):
   ...:         return self.linear(x)
   ...:
   ...: net = SimpleLinearModel()
   ...: optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
   ...: scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, 0.95)
   ...: for i in range(10):
   ...:     print(i, scheduler.get_last_lr())
   ...:     scheduler.step()
TypeError: 'float' object is not callable
```

### After

```python
   ...: scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, 0.95)
TypeError: lr_lambda should be a function, but got float
```
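
For illustration, a minimal sketch of the kind of up-front validation that produces the new error message (assumed helper, not the exact scheduler code):

```python
def _validate_lr_lambda(lr_lambda):
    # Accept either a single callable or a list/tuple of callables (one per param group).
    lambdas = lr_lambda if isinstance(lr_lambda, (list, tuple)) else [lr_lambda]
    for fn in lambdas:
        if not callable(fn):
            raise TypeError(
                f"lr_lambda should be a function, but got {type(fn).__name__}"
            )
```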

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151973
Approved by: https://github.com/janeyx99
2025-04-29 08:21:41 +00:00
dcd9a444b3 Add pack support and use micro gemm for Half flex attention on CPU (#151530)
Add pack support and use micro gemm for the second gemm to improve the performance for Half flex attention on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151530
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-04-29 07:24:00 +00:00
cyy
41bd0c900a [1/N] Deprecate c10::string_view and at::string (#151972)
The calls of `c10::string_view` in the code base are replaced by `std::string_view`. The calls of `at::string` are replaced by `std::string`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151972
Approved by: https://github.com/malfet
2025-04-29 07:23:52 +00:00
a6d19fcfac Revert "[cudagraphs] Fix issue in collecting static_input_idxs (#152287)"
This reverts commit 75a564608ab289edd5ba0e30a3acf544b90b5769.

Reverted https://github.com/pytorch/pytorch/pull/152287 on behalf of https://github.com/wdvr due to causing ao failures - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/152287#issuecomment-2837686127))
2025-04-29 06:57:06 +00:00
62f1d0ea78 Log information about suppressed data dependent errors (#151041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151041
Approved by: https://github.com/bobrenjc93
2025-04-29 06:08:07 +00:00
520366e102 Fix StringCoordView::substr after D73379178 / #151810 (#152304)
Received complaint that we broke something. After a bunch of debugging, landed on this test + fix.

Differential Revision: [D73754877](https://our.internmc.facebook.com/intern/diff/D73754877/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D73754877/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152304
Approved by: https://github.com/Skylion007
2025-04-29 06:00:38 +00:00
ad11d6378c Don't run NCCL/gloo distributed test without GPUs (#150764)
If there aren't any GPUs the WORLD_SIZE would be zero which does not work.
So skip those backends completely in that case.

Fix after https://github.com/pytorch/pytorch/pull/137161

It might make sense to still run the (CPU-) part of the tests by using something like `world_size = max(3, gpu_count)` or `num_gpus if num_gpus else 3` instead of skipping them all

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150764
Approved by: https://github.com/kwen2501
2025-04-29 05:27:23 +00:00
99c42722f6 [MPS] fix memory leak in sdpa float32 (#152371)
Fixes #152344

The leak seems to be on the MPS Graph side: even though there is an identity tensor, it is apparently no longer enough to bypass the SDPA sequence, which seems to leak memory.

Even adding 0.0f seems to be optimized away and still takes the SDPA sequence (that's the reason for adding 1e-20).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152371
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-29 04:51:10 +00:00
46419c7899 Revert "[Relandx2] Rewrite the guts of torch::jit::Lexer to speed it up (#152372)"
This reverts commit 7ce6f632142b65849fa33f325c90a24bace2c130.

Reverted https://github.com/pytorch/pytorch/pull/152372 on behalf of https://github.com/malfet due to Looks like it broke distributed this time around, see f05d3e5019/1 ([comment](https://github.com/pytorch/pytorch/pull/152372#issuecomment-2837426497))
2025-04-29 04:37:40 +00:00
f05d3e5019 [torch-xpu-ops] Update torch-xpu-ops commit pin. (#152321)
Update the torch-xpu-ops commit to [655fa9bc7f88ab5bd3766b5f2fd5b43989c2caca](655fa9bc7f), including:

- Fixes batch_norm numeric error by adding additional boundary check
- Enable two operators: fft & jagged_to_padded_dense
- XCCL relevant changes:
  - Cache cclStream to improve performance.
  - Add support for complex datatypes in allgather and broadcast.
  - Support coalescing operations and batch_isend_irecv.
  - Introduce additional logging; use export TORCH_CPP_LOG_LEVEL=INFO.
- Fix #152296
- Fix #152020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152321
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
2025-04-29 04:00:09 +00:00
119cdcc926 Add rich support to torch.distributed.tensor.debug.visualize_sharding (#152027)
Fixes https://github.com/pytorch/pytorch/issues/151857

Please verify this PR by running the following command on a computer with at least 4 GPUs.

```shell
torchrun --nproc_per_node=4 /w/pytorch/torch/distributed/tensor/examples/visualize_sharding_example.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152027
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2025-04-29 03:51:32 +00:00
9c7b902cb2 [MPSInductor][BE] Make all reductions cacheable (#152363)
By moving the actual implementation to `_reduction_nocache` and making `reduction` a caching wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152363
Approved by: https://github.com/dcci
2025-04-29 02:49:22 +00:00
5a9868b78c Do not log exception when recording is disabled or already recording (#151038)
I am not sure why we log all exceptions here and re-raise them, but at least when recording is disabled this should be transparent; in particular, logging DDEs could be spammy.

before:
<img width="995" alt="Screenshot 2025-04-10 at 12 47 31 PM" src="https://github.com/user-attachments/assets/f90d4557-d958-4558-a917-0d687366cad1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151038
Approved by: https://github.com/bobrenjc93
2025-04-29 02:48:20 +00:00
b22fda9e1c Remove conda refs in tools (#152368)
Fixes #152126

Did not find references in the two .ipynb files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152368
Approved by: https://github.com/atalman
2025-04-29 02:45:47 +00:00
c8b4a39d73 Add precedence to the infix printing done by sympy_str. (#151920)
Add precedence to the infix printing done by sympy_str.

Without this change sympy_str will print the same string for both `a+b*(c+d)` and `(a+b)*(c+d)`.

While there I also cleaned up the printing for `-a` and `a - b`.

Added some tests.
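
For illustration only, here is a self-contained sketch of precedence-aware infix printing; this is not PyTorch's `sympy_str`, and the tiny expression encoding is made up for the example:

```python
# Parenthesize a child only when its operator binds less tightly than its parent's.
PREC = {"+": 1, "*": 2}

def to_str(node):
    if isinstance(node, str):   # leaf symbol
        return node
    op, args = node             # node is ("+" or "*", [children])
    parts = []
    for child in args:
        s = to_str(child)
        if isinstance(child, tuple) and PREC[child[0]] < PREC[op]:
            s = f"({s})"
        parts.append(s)
    return f" {op} ".join(parts)

expr1 = ("+", ["a", ("*", ["b", ("+", ["c", "d"])])])    # a + b*(c+d)
expr2 = ("*", [("+", ["a", "b"]), ("+", ["c", "d"])])    # (a+b)*(c+d)
print(to_str(expr1))  # a + b * (c + d)
print(to_str(expr2))  # (a + b) * (c + d)
```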

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151920
Approved by: https://github.com/jansel
2025-04-29 00:58:58 +00:00
4b61564252 Include CollectiveKernel in inductor debug visualization (#146561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146561
Approved by: https://github.com/eellison
ghstack dependencies: #152060
2025-04-29 00:53:38 +00:00
22f179d77d Use almalinux docker files for building Magma (#152358)
Resolves https://github.com/pytorch/pytorch/issues/151707 for CUDA Nvidia Magma builds.
Removes deprecated cuda 12.4 build.

Using the `pytorch/manylinux2_28-builder` image for the magma build creates a circular dependency.

For a while, magma builds used the `conda-builder` image since it does not have this circular dependency:
https://github.com/pytorch/builder/blob/release/2.4/magma/Makefile#L13
However, during the migration to pytorch/pytorch (https://github.com/pytorch/pytorch/pull/139888) we introduced a circular dependency by using the Manylinux 2.28 docker image.

Hence we use the almalinux image, which is supposed to be a general-purpose image.

Please note: Magma builds currently use Docker build (https://github.com/pytorch/pytorch/blob/main/.ci/magma/README.md); we can look into migrating them to Docker images as a follow-up BE change if needed.

TODO: Make the same change for rocm builds. I believe some more work for rocm is required, since magma-rocm requires rocm dev, utils and lib to be installed: https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_rocm.sh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152358
Approved by: https://github.com/nWEIdia, https://github.com/huydhn
2025-04-29 00:45:01 +00:00
7ce6f63214 [Relandx2] Rewrite the guts of torch::jit::Lexer to speed it up (#152372)
Reapplying with fix for linux-manylinux-2_28-py3-cpu-s390x / build
failure
(https://github.com/pytorch/pytorch/actions/runs/14716285820/job/41300304223#logs),
which is to just update a pair of static_assert constants I got wrong.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152372
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-04-28 23:55:48 +00:00
e5f4356a25 [inductor][fix] enable dtype promotion for bucketize (#150634)
Summary:
bucketization involves comparing an input with border values. Without careful consideration of dtypes, this can cause dangerous implicit casting.

aten.bucketize resolves this via dtype promotion. We enable dtype promotion for the inductor bucketization pass so as to maintain alignment with the aten op.
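
As a rough analogy using public `torch` APIs (this is not the inductor pass itself), the fix amounts to promoting both sides to a common dtype before the comparison:

```python
import torch

values = torch.tensor([0.5, 1.5, 2.5], dtype=torch.float32)
boundaries = torch.tensor([1, 2], dtype=torch.int64)

# Promote both sides to a common dtype before comparing, mirroring aten.bucketize,
# instead of letting the comparison implicitly cast one side.
common = torch.promote_types(values.dtype, boundaries.dtype)  # torch.float32 here
buckets = torch.bucketize(values.to(common), boundaries.to(common))
print(buckets)  # tensor([0, 1, 2])
```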

Test Plan:
```
python3 test/inductor/test_torchinductor.py -k "bucketize"
```

Fixes #145929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150634
Approved by: https://github.com/davidberard98, https://github.com/eellison
2025-04-28 23:44:26 +00:00
119f64d0eb Add 'step' counter to visualize_overlap log (#152060)
Example of log after the change:

```
[rank0]:V0227 15:07:20.704000 1594243 torch/_inductor/comms.py:621] [0/0] [__overlap] ==== Visualize overlap after reordering pass <function group_copy_collective at 0x7f41c1922050> (ran in 0.026380538940429688 sec)====
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      0: GroupedSchedulerNode(name='op6_op7')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      1: GroupedSchedulerNode(name='op55_op56')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      2: GroupedSchedulerNode(name='op75_op76')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      3: GroupedSchedulerNode(name='op121_op122')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      4: GroupedSchedulerNode(name='op141_op142')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      5: GroupedSchedulerNode(name='op187_op188')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      6: GroupedSchedulerNode(name='op207_op208')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      7: GroupedSchedulerNode(name='op253_op254')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      8: GroupedSchedulerNode(name='op273_op274')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      9: GroupedSchedulerNode(name='op319_op320')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152060
Approved by: https://github.com/eellison
2025-04-28 23:23:21 +00:00
a6d38051ee [CUDA][CUTLASS] CUTLASS 3.9 submodule upgrade (#151253)
Originally authored by Jack Kosaian, likely needs #ifdefs if we want to preserve compat with 3.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151253
Approved by: https://github.com/Skylion007, https://github.com/henrylhtsang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-28 23:10:14 +00:00
75a564608a [cudagraphs] Fix issue in collecting static_input_idxs (#152287)
related to https://github.com/pytorch/pytorch/issues/152275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152287
Approved by: https://github.com/bdhirsh, https://github.com/eellison
2025-04-28 23:07:52 +00:00
63790a0c43 Speed-up time spent in generating shaped str keys (#152202)
Replaces the janky approach of using the IntArrayRef to create an NSArray and then asking it for its contents in string format with a simple stringstream.

This speeds up getting the key string for caching (or reading from the cache) for shaped inputs by ~5x. While the actual wall time, depending on the number of input tensors, is only a few microseconds, this time represents a non-negligible chunk of the overall time spent preparing to dispatch work to the GPU. And since this function gets called every time a (cacheable) MPS operation is used, it should be a small but broadly impacting time saver.

Using mps_linear as an example. Note this is before PR https://github.com/pytorch/pytorch/pull/152199 so it only captures the CPU time spent in the op call:

Before the change:
```
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x1108f07d0>
func(*args, **kwargs)
  Median: 22.75 us
  IQR:    0.87 us (22.50 to 23.38)
  8361 measurements, 1 runs per measurement, 1 thread
```

After the change:
```
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x108875350>
func(*args, **kwargs)
  Median: 18.67 us
  IQR:    0.46 us (18.50 to 18.96)
  10342 measurements, 1 runs per measurement, 1 thread
```

This aligns with the observed change of getTensorStringKeys() taking ~1us instead of ~5us in the mps_linear op, which I measured by sandwiching the function call with `std::chrono::high_resolution_clock`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152202
Approved by: https://github.com/Skylion007
2025-04-28 23:06:10 +00:00
c81d8c231c Fix CosineAnnealingWarmRestarts reset T_cur (#151289)
Fixes #88791

## Test Result

```python
pytest test/optim/test_lrscheduler.py -k test_CosineAnnealingWarmRestarts
```

![image](https://github.com/user-attachments/assets/75ad238c-f319-47dc-bf2d-da05b0879b84)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151289
Approved by: https://github.com/janeyx99
2025-04-28 23:02:55 +00:00
0d99b4e9e2 ROCm: Enable tf32 testing on test_nn (#148945)
Add tf32 support for ROCm tests.
test command: python test/test_nn.py -v

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148945
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-28 23:01:04 +00:00
f3ef46e5fa [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/iter.py (#151789)
Part of #147913

Replace `unimplemented` with`unimplemented_v2` in `torch/_dynamo/variables/iter.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151789
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-04-28 22:56:39 +00:00
d79e06723d Provide list of files to link linters if desired (#152352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152352
Approved by: https://github.com/huydhn
2025-04-28 22:48:34 +00:00
c8540984a2 [inductor] set correct precompile start time (#152284)
Fixes #148777

With num_worker set to 1, ran script in #148777

before:
```
Precompiling benchmark choice TritonTemplateCaller took 0.19s
Precompiling benchmark choice TritonTemplateCaller took 0.38s
Precompiling benchmark choice TritonTemplateCaller took 0.53s
Precompiling benchmark choice TritonTemplateCaller took 0.90s
Precompiling benchmark choice TritonTemplateCaller took 1.29s
Precompiling benchmark choice TritonTemplateCaller took 20.78s
Precompiling benchmark choice TritonTemplateCaller took 25.42s
Precompiling benchmark choice TritonTemplateCaller took 25.92s
Precompiling benchmark choice TritonTemplateCaller took 27.21s
Precompiling benchmark choice TritonTemplateCaller took 48.76s
Precompiling benchmark choice TritonTemplateCaller took 53.66s
Precompiling benchmark choice TritonTemplateCaller took 63.12s
Precompiling benchmark choice TritonTemplateCaller took 69.53s
Precompiling benchmark choice TritonTemplateCaller took 71.24s
Precompiling benchmark choice TritonTemplateCaller took 75.57s
Precompiling benchmark choice TritonTemplateCaller took 97.58s
Precompiling benchmark choice TritonTemplateCaller took 107.71s
Precompiling benchmark choice TritonTemplateCaller took 117.27s
Precompiling benchmark choice TritonTemplateCaller took 126.30s
FX codegen and compilation took 133.733s
```

after:
```
Precompiling benchmark choice TritonTemplateCaller took 0.18s
Precompiling benchmark choice TritonTemplateCaller took 0.18s
Precompiling benchmark choice TritonTemplateCaller took 0.14s
Precompiling benchmark choice TritonTemplateCaller took 0.35s
Precompiling benchmark choice TritonTemplateCaller took 0.39s
Precompiling benchmark choice TritonTemplateCaller took 19.54s
Precompiling benchmark choice TritonTemplateCaller took 4.69s
Precompiling benchmark choice TritonTemplateCaller took 0.52s
Precompiling benchmark choice TritonTemplateCaller took 1.28s
Precompiling benchmark choice TritonTemplateCaller took 20.96s
Precompiling benchmark choice TritonTemplateCaller took 4.81s
Precompiling benchmark choice TritonTemplateCaller took 9.40s
Precompiling benchmark choice TritonTemplateCaller took 6.34s
Precompiling benchmark choice TritonTemplateCaller took 1.93s
Precompiling benchmark choice TritonTemplateCaller took 4.39s
Precompiling benchmark choice TritonTemplateCaller took 21.91s
Precompiling benchmark choice TritonTemplateCaller took 10.10s
Precompiling benchmark choice TritonTemplateCaller took 9.55s
Precompiling benchmark choice TritonTemplateCaller took 9.15s
FX codegen and compilation took 133.246s
```

Also tested async triton compile path by setting num_workers > 1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152284
Approved by: https://github.com/Skylion007, https://github.com/henrylhtsang
2025-04-28 22:30:35 +00:00
e7c19f4f69 Revert "Reapply "Rewrite the guts of torch::jit::Lexer to speed it up (#151850)" (#152250)"
This reverts commit e407ea1e5e22a41d14ce141295bf391cd46f2677.

Reverted https://github.com/pytorch/pytorch/pull/152250 on behalf of https://github.com/malfet due to Breaks s390, may be time to move build back to opt-in 2667cb69d9/1 ([comment](https://github.com/pytorch/pytorch/pull/152250#issuecomment-2836833030))
2025-04-28 22:05:12 +00:00
2667cb69d9 [inductor] align replicationpad on processing bool dtype with eager (#147666)
Fixes #143779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147666
Approved by: https://github.com/jansel
2025-04-28 21:54:31 +00:00
86b0271b00 Add CUDA 12.8 almalinux image, remove CUDA 12.4 almalinux (#152362)
This is general purpose image located in: https://hub.docker.com/r/pytorch/almalinux-builder
Updating it to match our supported CUDA matrix

Adding this build to use as general purpose image and use for Magma build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152362
Approved by: https://github.com/malfet
2025-04-28 21:15:05 +00:00
eqy
34b0de50a3 [TF32][CUDA] account for TF32 in test_linear_autograd (#152216)
Abate some more noise seen on blackwell

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152216
Approved by: https://github.com/Skylion007
2025-04-28 21:00:17 +00:00
ddff3d4f6b [inductor][invoke_subgraph] Run joint graph passes for inference (#152062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152062
Approved by: https://github.com/eellison
ghstack dependencies: #151409, #151633, #151477, #151957, #151961
2025-04-28 20:42:55 +00:00
99b6c426a9 [Graph Partition] fix extra reference in runner.partitions to cudagraphify functions (#152066)
When CompiledFxGraph is deallocated, its cudagraphified fn (i.e., `current_callable`) is expected to also be deallocated.
Without graph partition, this is true since the cudagraphified fn is only referred to by compiled_fx_graph.current_callable.

However, with graph partition, runner.partitions holds the cudagraphified fns while compiled_fx_graph.current_callable holds runner.call. Thus the cudagraphified fns may not be deallocated when CompiledFxGraph is deallocated. This leads to errors in several unit tests (e.g., test_unaligned_static_input_no_cudagraphs and test_unaligned_static_input_non_trees).

In this PR, we also clean up runner.partitions when CompiledFxGraph is deallocated. This fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152066
Approved by: https://github.com/eellison
2025-04-28 20:38:26 +00:00
728a6dd51c [Graph Partition] support ForeachKernelSchedulerNode (#152148)
ForeachKernelSchedulerNode misses outputs_by_name when created with previous nodes. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152148
Approved by: https://github.com/eellison
2025-04-28 20:38:22 +00:00
8e65310d49 [caffe2/c10/util/TypeIndex] Add '__CUDA_ARCH_LIST__' check (#152030)
Summary:
We suspect that switching the NVCC host compiler from GCC to Clang, while targeting multiple architectures, is causing issues because only `__CUDA_ARCH_LIST__` is being passed, without `__CUDA_ARCH__`.

To resolve this c10 compilation error, we should first fix the problem and then switch the NVCC host compiler from GCC to Clang. Once this is done, the errors no longer occur.

Test Plan: CI

Reviewed By: zhuhan0

Differential Revision: D73383236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152030
Approved by: https://github.com/cyyever, https://github.com/ZainRizvi
2025-04-28 20:31:23 +00:00
fcebaedebc Add a label to skip URL lint if needed (#152340)
Some URLs may be down due to server side issues we can't control
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152340
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-28 20:29:40 +00:00
33766de2d3 [Security] Advise against loading untrusted TorchScripts (#152336)
As a torchscripted model is a Turing-complete program
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152336
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-04-28 20:18:56 +00:00
00ebbbb701 [cutlass backend] add addmm and bmm for cutlass backend benchmark (#152163)
Copying what @kadeng did.

```
FINAL results...

Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 44.454172253608704 |  3.0991086587309837  |         NA          |
|        triton         | 44.06978189945221  | 0.07496077567338943  | -0.8646890374284049 |
| triton_persistent_tma | 43.598245829343796 | 0.06154991965740919  | -1.9254130284597197 |
|  cutlass_lvl_default  | 39.91834074258804  | 0.056073310784995556 | -10.20338762612423  |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
|         name          | forward_time (us) | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+-------------------+----------------------+---------------------+
|         aten          | 49.05610531568527 |  0.160279156640172   |         NA          |
|        triton         | 43.97720843553543 |  0.0660805031657219  | -10.353241145961718 |
| triton_persistent_tma | 43.94153505563736 | 0.061738294549286366 | -10.425960697724962 |
|  cutlass_lvl_default  | 40.2066633105278  | 0.034127906896173954 | -18.039430460713596 |
+-----------------------+-------------------+----------------------+---------------------+

Average edge over aten (max(-edge, 0), higher is better):
triton: 5.608965091695062 (from 2 valid values)
triton_persistent_tma: 6.175686863092341 (from 2 valid values)
cutlass_lvl_default: 14.121409043418913 (from 2 valid values)
```

Differential Revision: [D73625766](https://our.internmc.facebook.com/intern/diff/D73625766/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152163
Approved by: https://github.com/jingsh
2025-04-28 20:16:17 +00:00
5f4c8e4c89 [inductor][tests] don't test for cpu if you want to use triton backend (#152227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152227
Approved by: https://github.com/clee2000
2025-04-28 19:43:56 +00:00
e407ea1e5e Reapply "Rewrite the guts of torch::jit::Lexer to speed it up (#151850)" (#152250)
Almost-exact reapply of #151850 (adding minor reviewer nits) . AFAICT it was reverted unnecessarily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152250
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-04-28 19:33:40 +00:00
6b1acfa41b Fix redistribute new_local_tensor be None case (#152303)
As titled, we can just set new_local_tensor to be the local tensor and remove the None check, since there are cases where no transformation is needed (i.e., src_placements and dst_placements are the same, and we still want to return the original local_tensor).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152303
Approved by: https://github.com/awgu
2025-04-28 19:00:17 +00:00
d3f8aa4378 [ez] Don't always pass HF token to fsspec (#151464)
Summary: The HF storage reader/writer component can in theory work with any back-end, so we shouldn't force the token to be passed into the fsspec reader/writer, because the specific fsspec implementation may not handle tokens. Specifically, manifold doesn't accept a token arg, but we're always passing one in, which throws an error.

Test Plan: signals

Differential Revision: D73130679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151464
Approved by: https://github.com/Skylion007
2025-04-28 18:52:20 +00:00
41a0c23c7c Skip test requiring MKL (#152322)
`test_reproduce_121253_issue_addmm_fusion_check` checks for "mkl._mkl_linear" being found in the generated source which cannot be there when MKL isn't available.
Add skip marker similar to other tests in this file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152322
Approved by: https://github.com/Skylion007
2025-04-28 18:29:24 +00:00
686dff0098 Fix an incorrect link markup (#152239)
Remove extra whitespace so the link works correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152239
Approved by: https://github.com/soulitzer
2025-04-28 18:28:08 +00:00
fcbbb03d48 Extend vec backend with BF16 SVE intrinsics (#143666)
- Following the work in https://github.com/pytorch/pytorch/pull/119571, BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake
- Added a guard for native NEON code to prevent compilation errors

@aditew01 @maajidkhann please have a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143666
Approved by: https://github.com/malfet, https://github.com/aditew01, https://github.com/nikhil-arm

Co-authored-by: Aditya Tewari <aditya.tewari@arm.com>
2025-04-28 18:25:44 +00:00
0c52ee1b35 [DTensor] Error on illegal view op during sharding prop (#149764)
Adds explicit error checking during sharding propagation for view ops
rather than relying on runtime errors during local op execution.

Before:
An error is thrown by aten.view op called by DTensor dispatch, because
the local shard size is incompatible with the (incorrectly calculated)
args to the view op.

`RuntimeError: shape '[384]' is invalid for input of size 512`

After:
We raise more specific errors for cases of incompatible view operations
during sharding propagation, before getting to runtime dispatch.

`RuntimeError: Attempted to flatten an unevenly sharded dimension, which would require resharding the input. Please explicitly redistribute the tensor instead.`

Change Summary:

- add 'strict_view' kwarg to the helper methods that implement view/reshape op shard prop rules, so it can be decided op-by-op whether to raise these new errors
- enabled errors just for the 'view' op in this PR
- added two specific checks/errors that can occur during view ops.

Details:

- View ops are never allowed to flatten a dimension that is unevenly
  sharded, since that would likely change the size/content of the
  local_tensor and require redistribute
- View ops are also never allowed to flatten two dims if the rightmost
  dim is a Shard() placement, because it would cause contiguity errors
  without redistribution

Notes:

- Disables support for several ops in test_dtensor_ops.py test, which
  decompose to an illegal view that only works by performing a
  redistribution: cartesian_prod, flatten, ravel, reshape, reshape_as, view, view_as, take_along_dim, kron

Follow Ups:
- triage other view-like ops (besides aten::view) for using strict_view
- look for other gaps where view-like ops could still perform
  redistribution (ban them all, and document this)

Fixes #143372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149764
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #152045
2025-04-28 18:21:49 +00:00
efeed720a6 [DTensor] make test_dtensor_ops report dtensor_args (#152045)
Before:
Does not report DTensor args, and you can't tell which combination of
sharding/replication is used for that particular iteration

```
RuntimeError: failed to run: torch.flatten, with (*[tensor([[[-6.1074e-01,  1.1260e+00,  1.7686e+00, -7.8216e+
         [ 8.8558e-01, -3.0949e+00, -5.4584e+00, -8.5322e+00],
         [-2.9770e-01, -3.2814e+00, -7.5875e+00, -8.1269e+00],
         [-6.0136e+00, -5.1712e+00, -4.2667e+00, -4.2142e+00]],
        [[-7.5171e+00,  5.3900e+00, -7.9208e+00,  6.1000e+00],
         [-1.7350e+00, -3.6188e-03, -7.1592e+00,  9.2951e-02],
         [ 5.7143e+00, -3.0805e+00,  7.6227e+00, -7.4862e+00],
         [ 4.3167e-01, -4.9678e+00, -1.2441e+00, -2.3042e+00]],
        [[-7.4280e+00, -2.7754e+00, -5.2989e+00, -6.1920e+00],
         [-2.5225e+00, -5.2520e+00,  6.5686e+00, -6.0350e+00],
         [-5.1740e+00, -1.6405e+00, -4.4463e+00, -5.1884e+00],
         [ 3.9581e+00, -6.3151e-01, -3.3223e+00,  4.0546e+00]],
        [[-2.8112e+00,  3.8742e+00, -4.4612e+00, -5.0016e+00],
         [ 7.0568e+00, -2.0951e-01, -8.0049e+00, -4.1438e+00],
         [ 3.1207e+00, -7.6518e+00,  7.1084e+00, -1.0500e+00],
         [ 8.8823e+00, -1.1178e+00,  4.8485e+00, -8.8593e+00]]],
       requires_grad=True)], **{})
```

After:
You can see the particular DTensor spec that failed

```
RuntimeError: failed to run: torch.flatten, with (*[DTensor(local_tensor=tensor([[[-6.0136, -5.1712, -4.2667,
        [[ 0.4317, -4.9678, -1.2441, -2.3042]],
        [[ 3.9581, -0.6315, -3.3223,  4.0546]],
        [[ 8.8823, -1.1178,  4.8485, -8.8593]]], requires_grad=True),
        device_mesh=DeviceMesh('cpu', [0, 1, 2,3]), placements=(Shard(dim=1),))], **{})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152045
Approved by: https://github.com/XilunWu
2025-04-28 18:21:48 +00:00
bb90f66e70 [CUDA][conv3d] bump tolerances for test_variant_consistency_eager conv3d complex64 (#152203)
~1/1000 1.5e-5 mismatch on A100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152203
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2025-04-28 17:59:37 +00:00
79e8dc7d53 Pin to SHA for actions outside of PyTorch (#152110)
Pin actions from repos external to the PyTorch project to their shasums for security. This is a best practice as Git tags are not immutable.

https://openssf.org/blog/2024/08/12/mitigating-attack-vectors-in-github-workflows/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152110
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi
2025-04-28 17:57:32 +00:00
2246cb6e14 Fix common_distributed.py to NOT set root logger (#152319)
Using `logging.basicConfig` to set the root logger's level is not good behavior. Fix common_distributed.py to set the level for the current logger only, because the previous behavior affects downstream third-party testing plugins.
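
A minimal sketch of the difference, using the standard `logging` idioms (assumed illustration, not the exact diff in common_distributed.py):

```python
import logging

# Old pattern: mutates the root logger, affecting every library and test plugin.
# logging.basicConfig(level=logging.INFO)

# New pattern: configure only this module's logger.
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
```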

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152319
Approved by: https://github.com/Skylion007
2025-04-28 17:51:32 +00:00
8ce3d4a541 test(Conv3d): use correct class for test_Conv3d_module_same_padding (#152187)
The test for the class `Conv3d` is calling `Conv2d`. This PR just ensures that we are testing the correct module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152187
Approved by: https://github.com/Skylion007
2025-04-28 16:59:12 +00:00
c869862875 Remove cuda dependencies from non cuda buids (#152333)
These dependencies were added to fix a poetry issue on PyPI. However, including these dependencies creates an issue with poetry on download.pytorch.org, because poetry reads the first available wheel on the index for METADATA requirements. Hence the metadata requirements for CPU wheels can't list any CUDA dependencies.

Injecting these dependencies via prep for pypi will need to be done via:
https://github.com/pytorch/test-infra/blob/main/release/pypi/prep_binary_for_pypi.sh

Ref: https://github.com/pytorch/pytorch/issues/152121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152333
Approved by: https://github.com/jeanschmidt, https://github.com/malfet
2025-04-28 16:46:44 +00:00
cbf8e0fb1a use statically known true instead of guard size oblivious in bmm and mm inductor decompositions . (#148893)
This was discussed with @eellison, and he recommended using statically_known_true here. The intuition is: we already have 0/1 specializations in place; if we reach those checks with dynamic shapes that are not already specialized, then we do not want to specialize them, as "a recompilation here is not justified".
Those are all non-semantics-changing optimizations.
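
A minimal usage sketch (the helper below is made up for illustration; the real call sites are the bmm/mm decompositions):

```python
from torch.fx.experimental.symbolic_shapes import statically_known_true

def maybe_squeeze_batch(x):
    # statically_known_true(...) returns False when the answer is unknown for a
    # dynamic shape, instead of installing a guard (and potentially forcing a
    # recompilation) the way a plain size check on a symbolic shape would.
    if statically_known_true(x.shape[0] == 1):
        return x.squeeze(0)
    return x
```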

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148893
Approved by: https://github.com/eellison
2025-04-28 16:44:25 +00:00
6e5e9dc321 [benchmarking] Inc aarch64 bench shards to 15 (#152324)
As it is frequently timing out with 12 shards, and the shards also feel somewhat unbalanced.
E.g., if one looks at https://github.com/pytorch/pytorch/actions/runs/14696840776/job/41239776679,
shard 12 takes 3.6 hours, while shard 11 takes only 40 min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152324
Approved by: https://github.com/janeyx99, https://github.com/atalman
2025-04-28 16:08:39 +00:00
4bdecd94ea [modefile free][long tail] selectify fbcode/caffe2/defs.bzl (#148925)
Summary:
replace read_config with select

For more info, please refer to the [doc](https://docs.google.com/document/d/1e0Hvht8WEHhcRvlCAodq_R9xnAtKBrAhdyvxcAqQjCw/edit?tab=t.hl8j18gza0cv)

Test Plan: CI

Reviewed By: malfet

Differential Revision: D70267850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148925
Approved by: https://github.com/malfet
2025-04-28 16:04:28 +00:00
9c864f9b0f Revert "[Inductor UT] Generalize device-bias code in test_flex_attention.py (#151937)"
This reverts commit 443840080265ce6133121c91d258b619eae151bb.

Reverted https://github.com/pytorch/pytorch/pull/151937 on behalf of https://github.com/malfet due to Broke ASAN tests, probably by enabling too many tests https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=asan&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/151937#issuecomment-2835151532))
2025-04-28 12:56:49 +00:00
0b6ea0b959 [xla hash update] update the pinned xla hash (#151210)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151210
Approved by: https://github.com/pytorchbot
2025-04-28 11:45:09 +00:00
7cae7902a2 Add scripts to check xrefs and urls (#151844)
Traverses the docs and code to find any broken links
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151844
Approved by: https://github.com/huydhn
2025-04-28 09:30:07 +00:00
7e8b9b3f51 ReducedPrecisionFloatGemvFastPathKernel: Correctly type parallel_for lambda arguments as int64_t (#152233)
This plus the previous irangeification PR seem like a better fix for #150637 than #150949 to me -- should make sure we are using 64-bit math for indexing everywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152233
Approved by: https://github.com/Skylion007, https://github.com/cyyever
ghstack dependencies: #152232
2025-04-28 07:19:26 +00:00
3b7d6bbe8b irangeify ReducedPrecisionFloatGemvKernel.cpp (#152232)
We should be using irange, especially because we had 32-bit overflow issues in this file recently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152232
Approved by: https://github.com/Skylion007
2025-04-28 07:19:26 +00:00
ce00ec7ecf Enable max autotune for AOTInductor benchmark (#149309)
With this PR, AOTinductor can choose to run into max-autotune mode when benchmarking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149309
Approved by: https://github.com/desertfire

Co-authored-by: Gabriel Ferns <gabeferns@meta.com>
2025-04-28 06:54:26 +00:00
13966d0bf5 [BE] Migrate dtype_abbrs into one location (#152229)
Namely `torch.utils._dtype_abbrs.dtype_abbrs`

Before that it was defined in various forms of completeness in
c02edba863/torch/fx/graph.py (L215),
c02edba863/torch/testing/_internal/common_utils.py (L5226)
 and c02edba863/torch/testing/_internal/logging_tensor.py (L17)

TODO:
 - Add a linter ensuring that the `torch.testing._internal` module is not referenced from any of the public-facing APIs, as it can have extra dependencies such as `expect_test`

Fixes https://github.com/pytorch/pytorch/issues/152225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152229
Approved by: https://github.com/clee2000, https://github.com/Skylion007
2025-04-28 03:52:47 +00:00
899eec665c [MPS] col2im kernel implementation (#152282)
Fixes #151820
Also requested in #141287

Mainly based on the cuda kernel implementations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152282
Approved by: https://github.com/malfet
2025-04-28 03:48:41 +00:00
2503843673 Add check for 2-dim mask to COO mask computation (#151940)
Follow up on discussion on https://github.com/pytorch/pytorch/pull/151794 Related to all fixes for https://github.com/pytorch/pytorch/issues/151351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151940
Approved by: https://github.com/Skylion007
2025-04-28 03:40:46 +00:00
4438400802 [Inductor UT] Generalize device-bias code in test_flex_attention.py (#151937)
@EikanWang @etaf @guangyey please take a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151937
Approved by: https://github.com/liangan1, https://github.com/drisspg
2025-04-28 03:07:23 +00:00
98bd2bd1ab Do not generate long log messages for suppressed data dependent errors. (#151023)
TORCH_LOGS="all" python test/test_dynamic_shapes.py -k test_guard_or_true

 before:
<img width="1065" alt="Screenshot 2025-04-10 at 9 55 27 AM" src="https://github.com/user-attachments/assets/3ee20de0-2902-4eb1-8ab0-80f1b974fb78" />

after:
<img width="1124" alt="Screenshot 2025-04-10 at 9 54 35 AM" src="https://github.com/user-attachments/assets/4e7e1f0c-856c-417f-8763-bfe183e2450d" />

Note: we actually do not expect to see a log at all; this is an orthogonal issue in recording, where it logs each error seen even when recording is not enabled. I will follow up with a PR for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151023
Approved by: https://github.com/bobrenjc93
2025-04-28 00:39:52 +00:00
cyy
70d7638b0d Fix clang-tidy suppression in torch/csrc/jit (#152271)
Remove some clang-tidy suppressions in torch/csrc/jit by applying fixes or refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152271
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-27 21:18:39 +00:00
c02edba863 Revert "Update OpenBLAS commit (#151547)"
This reverts commit c4b085475062270946eeec854aa54d0739c7a0c9.

Reverted https://github.com/pytorch/pytorch/pull/151547 on behalf of https://github.com/malfet due to It breaks all aarch64 tests ([comment](https://github.com/pytorch/pytorch/pull/151547#issuecomment-2833593427))
2025-04-27 18:58:35 +00:00
cyy
b34146a093 Fix initGdsBindings declaration (#152277)
Move initGdsBindings into the correct namespace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152277
Approved by: https://github.com/Skylion007
2025-04-27 17:04:56 +00:00
861945100e [Kineto] Enable OOM observer (#152160)
Summary:
# Context:
When a memory leak happens, it usually triggers an OOM in later iterations. The snapshot of a full iteration will be huge and hard to interpret.
On the CUDA side, there is an OOM observer which, when OOM happens, generates a snapshot with the latest 1,500,000 entries for debugging.

In this diff, we implement the same feature on the MTIA side.

Test Plan:
Run this test with last diff in the stack.
```
buck run @//mode/opt  kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test
```

As shown, the memory_snapshot is generated when oom happens
Log: P1794792326
Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355}

Differential Revision: D71993315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160
Approved by: https://github.com/sraikund16
2025-04-27 15:56:44 +00:00
c4b0854750 Update OpenBLAS commit (#151547)
Motivation: Update OpenBLAS and change the build script to enable SBGEMM kernels. Update PyTorch `jammy` builds for aarch64 to use `install_openblas.sh` instead of `conda_install`.

Link to full [TorchInductor Performance Dashboard AArch64](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2009%3A35%3A26%20GMT&stopTime=Thu%2C%2017%20Apr%202025%2009%3A35%3A26%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=adi/update_openblas&lCommit=90701ab81bf61fd864d31e0aa7e88d97a1a8676c&rBranch=main&rCommit=40ce4fb24a536d175348df876f61956d4945778e)

1. This shows a promising speedup across most of the HF models in benchmark, specifically giving a significant boost to SDPA layers.
2. Overall torch-bench pass-rate increased `[87%, 65/75 → 96%, 72/75]`
<img width="676" alt="Screenshot 2025-04-17 at 10 32 10" src="https://github.com/user-attachments/assets/a92dce0c-ecee-4466-8175-065df664dd71" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151547
Approved by: https://github.com/malfet
2025-04-27 15:55:42 +00:00
bb680b5a87 [MPSInductor] Fix masked_fill decomp (#152268)
By adding `mps` to the list of accelerators that can work with CPU scalars

Fixes `GPUTests.test_masked_fill_promotion_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152268
Approved by: https://github.com/kulinseth, https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #152266
2025-04-27 15:50:46 +00:00
cbcf677223 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/lists.py (#151873)
Part of #147913

Replace `unimplemented` with`unimplemented_v2` in `torch/_dynamo/variables/lists.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151873
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-04-27 11:59:45 +00:00
0423a7b322 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/nn_module.py (#151895)
Part of #147913

Replace `unimplemented` with`unimplemented_v2` in `torch/_dynamo/variables/nn_module.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151895
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-04-27 11:54:42 +00:00
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
cbcc03c2ad [MPSInductor][BE] Only include headers when needed (#152266)
Store the headers used by a shader in `MetalKernel.headers`.
Add headers when a function depending on them gets invoked.
Generate the majority of the special ops from a template.
Delete two unused functors, `entr` and `xlog1py`, as they are decomposed by inductor anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152266
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci, https://github.com/cyyever
2025-04-27 05:09:50 +00:00
a0d440a26a [AOTI][reland] Remove typedef for half and bfloat16 (#151109)
Summary: Reland https://github.com/pytorch/pytorch/pull/150657

typedef is prone to name collisions. Explicitly spell out the actual aten types, as needed for the libtorch-free codegen.

Differential Revision: [D72878456](https://our.internmc.facebook.com/intern/diff/D72878456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151109
Approved by: https://github.com/angelayi
2025-04-26 23:17:35 +00:00
225742838b Add an additional check to trigger graph break for sparse tensor (#151897)
Fixes #151522

This PR fixes the issue that Dynamo fails to trigger a graph break for sparse tensors in certain code paths. I added an additional check to handle this case, and it resolves the original problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151897
Approved by: https://github.com/jansel
2025-04-26 21:02:32 +00:00
e4a1a16bef Check integrity of bytes in AppendingByteSerializer (#152139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152139
Approved by: https://github.com/zou3519
2025-04-26 18:10:58 +00:00
9480ed4cd3 Fix typos in multiple files (#152254)
Fix typos in multiple files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152254
Approved by: https://github.com/Skylion007
2025-04-26 17:18:39 +00:00
6a62356857 [BE][Easy]: Change typing to DimsType in dim_reduction (#151677)
Use prims_common DimsType to reduce duplication of DType

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151677
Approved by: https://github.com/albanD
2025-04-26 16:59:32 +00:00
203201255f [dynamo] remove dead code for DATA_PTR_MATCH (#152206)
Summary: Seems this guard is not created anywhere

Test Plan: CI

Differential Revision: D73682084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152206
Approved by: https://github.com/anijain2305, https://github.com/jansel
2025-04-26 15:25:01 +00:00
ee8166e94f Correctly handle duplicated arguments when merging input views. (#146275)
Fix: #135099

This PR changes how we map the original inputs into the new set of
inputs that take in the tensor input's base instead of their aliases.

**Problem:** in order to create this mapping, we had a dictionary that
mapped the hashed arguments into their respective indices. However, if
there's a group of equal arguments, we will have only one mapping for
such an argument. This breaks the assumption that there will be one
mapping for each argument.

**Solution:** map the hashed arguments into a list of indices. Then, we
will be able to correctly reconstruct the parameters for the new calling
convention.
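
A minimal sketch of the mapping change (hypothetical helper name; the real code lives in the input-view merging logic):

```python
from collections import defaultdict

def build_arg_index_map(hashed_args):
    # Before: {h: i} silently kept only one index for a group of equal arguments.
    # After: {h: [i, j, ...]} preserves one entry per occurrence, so every
    # argument can be mapped back when reconstructing the new calling convention.
    index_map = defaultdict(list)
    for i, h in enumerate(hashed_args):
        index_map[h].append(i)
    return index_map

print(build_arg_index_map(["a", "b", "a"]))  # {'a': [0, 2], 'b': [1]}
```
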
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146275
Approved by: https://github.com/bdhirsh
2025-04-26 14:50:16 +00:00
580913290c [Easy] The event_id of torch.cuda.Event and torch.xpu.Event always is 0 (#151226)
Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.

The memory of torch.Event is not useful to torch.cuda.Event and torch.xpu.Event, but we still need to inherit from torch.Event because CPython will check it.

Repro with cuda:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151226
Approved by: https://github.com/albanD, https://github.com/guangyey
ghstack dependencies: #151404, #151221, #151411
2025-04-26 14:18:22 +00:00
2ce9d2e9aa [MPS/inductor] Adjust test_to_dtype_mps so that it works on the backend. (#152230)
float64 isn't supported on MPS, but we can still test the functionality with another type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152230
Approved by: https://github.com/malfet, https://github.com/jansel
2025-04-26 13:54:53 +00:00
0f9b02c839 [Easy][torch.Event] Fix and improve the docs of torch.Event (#151411)
**Changes:**
- add detailed function or class signature
- fix the wrong display of torch.Event.wait and torch.Event.record
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151411
Approved by: https://github.com/albanD
ghstack dependencies: #151404, #151221
2025-04-26 13:52:38 +00:00
bd7dc1b17d [Easy] Fix the function signature of torch.Event (#151221)
As the title states.

There is a difference between the declaration and the implementation.
Declaration:
d5a19e4525/torch/_C/__init__.pyi.in (L157-L162)

Implementation:
d5a19e4525/torch/csrc/Event.cpp (L30-L32)

**Question**: Which one should we choose?
- Change enable_timing to False to be consistent with torch.cuda.Event
- Change enable_timing to True to avoid BC-break
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151221
Approved by: https://github.com/albanD
ghstack dependencies: #151404
2025-04-26 13:51:56 +00:00
4a46ee96d2 [Indcutor Remote Cache] Raise an exception if redis module is required but not available (#151779)
If we need redis but redis is not available, it is better to tell the user to install redis instead of continuing silently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151779
Approved by: https://github.com/aorenste
2025-04-26 11:21:54 +00:00
8d427e9e76 [AOTInductor] Inherit Buffer if not being updated (#152092)
Summary: Inherit buffer from original constants buffer if it's not being updated.

Test Plan: TBD

Differential Revision: D73571260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152092
Approved by: https://github.com/kflu, https://github.com/jingsh
2025-04-26 04:28:23 +00:00
d22c4cc353 Add option to use mempool on OOM (#151487)
MemPool is a separate pool of memory handled by the caching allocator. This PR adds an option to let the caching allocator try to use this pool as a last resort instead of OOMing, by associating a use_on_oom bool with each MemPool.

Usage:
Users can optionally specify a ``use_on_oom`` bool (which is False by default) during MemPool creation. If true, then the CUDACachingAllocator will be able to use memory in this pool as a last resort instead of OOMing.

```
pool = torch.cuda.MemPool(allocator, use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del a
# at the memory limit, this will succeed by using pool's memory in order to avoid the oom
b = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```

Testing:
```
python test/test_cuda.py -k test_mempool_limited_memory_with_allocator
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151487
Approved by: https://github.com/eqy, https://github.com/syed-ahmed, https://github.com/ngimel
2025-04-26 04:04:57 +00:00
cyy
65b845f82b Remove useless options for third-party ONNX build (#147616)
Treat ONNX CMake targets properly and remove unneeded options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147616
Approved by: https://github.com/malfet
2025-04-26 02:34:08 +00:00
d9d306e8e9 Fix inductor test_linear_with_in_out_buffer (#151548)
Without MKL there is only 1 epilogue, not 2 because `addmm` is used instead of `packed_linear/_mkl_linear`.
This fails first at `TestSelectAlgorithmCPU.test_linear_with_in_out_buffer_batch_size_8_in_features_3_in_features2_192_image_size_224_out_features_64_bias_True_cpu_float32`

Instead of skipping the whole test just adjust the count for the single check.

Final numbers of `test/inductor/test_cpu_select_algorithm.py` without MKL:
```
Ran 1337 tests
OK (skipped=1211)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151548
Approved by: https://github.com/jansel
2025-04-26 01:53:34 +00:00
0e015ef116 [ROCm][Windows] Fix HIP Caffe2 Tests (#152014)
Solves the following problems of caffe2 HIP tests building on Windows:
1. HIP tests now use `hip_add_executable` to be built with custom_command invoking hip compiler, due to lack of cmake support for HIP in 3.18 (currently used).
2. Failing with "Command line too long", which resulted from `hip_add_executable` adding the same flags over and over on top of `HIP_HIPCC_FLAGS` with every test added.
3. Disables `HasSameArgTypes` test on Windows, as `at::native::modern::detail` is nowhere to be found in the codebase (I think it must be a legacy thing). Perhaps the whole test should be removed/rewritten?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152014
Approved by: https://github.com/jeffdaily
2025-04-26 01:35:46 +00:00
3ef6d6924a [BE] Switch TestConsistency to MPS device (#147893)
Which will eventually allow moving decorators away from `common_mps.py`.

Adjust tolerances accordingly. XFAIL a bunch of tests on MacOS-13, which is going to be deprecated anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147893
Approved by: https://github.com/atalman
ghstack dependencies: #152204
2025-04-26 01:19:21 +00:00
73f11e3365 [BE] Do not allow PyTorch codebase to use c10::optional (#150464)
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature.
XPU still uses it, so exclude XPU builds until https://github.com/intel/torch-xpu-ops/pull/1615 is merged

Test plan:
 - 0def9b4acc should fail MPS builds
 ```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
                                           c10::optional<int64_t> extra) {
                                           ^~~~~~~~~~~~~
                                           std::optional
```
 - a769759dd4 should fail CUDA builds
 ```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
        input, c10::nullopt, reduce_op, group_name, out);
                    ^

1 error detected in the compilation of
```

Fixes https://github.com/pytorch/pytorch/issues/150313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
2025-04-26 01:15:53 +00:00
4647658247 [PT2] - Allowlist should have precedence (#151942)
Summary: When working on List[List[int]], the ints were being considered Constants regardless of their inclusion on the allowlist.

Test Plan:
CI + new test

https://www.internalfb.com/intern/testinfra/testrun/5066549856504774

Differential Revision: D73137631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151942
Approved by: https://github.com/laithsakka
2025-04-26 00:58:43 +00:00
fa1b4ef649 Revert "Rewrite the guts of torch::jit::Lexer to speed it up (#151850)"
This reverts commit 47d34261e06e2416e7a1e7d51a3d428e4ea51f9d.

Reverted https://github.com/pytorch/pytorch/pull/151850 on behalf of https://github.com/ZainRizvi due to This codev PR is breaking  on it's internal counterpart diff D73129443.  For codev PRs like this one, please always make sure the internal diff is green and then land the diff internally. The Github PR will be automatically merged ([comment](https://github.com/pytorch/pytorch/pull/151850#issuecomment-2831686141))
2025-04-26 00:44:11 +00:00
47d34261e0 Rewrite the guts of torch::jit::Lexer to speed it up (#151850)
The trie-based approach was, apparently, not efficient. This incidentally fixes a bug where "not inp" and "is note" were lexed incorrectly; see test_lexer.cpp update.

Differential Revision: [D73129443](https://our.internmc.facebook.com/intern/diff/D73129443/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151850
Approved by: https://github.com/Skylion007
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807, #151810, #151849
2025-04-25 23:49:35 +00:00
0f765773e3 Revert "[BE] Do not allow PyTorch codebase to use c10::optional (#150464)"
This reverts commit 490ef768cff448080083a46f362053e025f6b95b.

Reverted https://github.com/pytorch/pytorch/pull/150464 on behalf of https://github.com/clee2000 due to broke xpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/14674243034/job/41187443432) [HUD commit link](490ef768cf)? ([comment](https://github.com/pytorch/pytorch/pull/150464#issuecomment-2831608162))
2025-04-25 23:34:56 +00:00
6aa92806db [CP] Use TorchFunctionMode to dispatch SDPA for CP (#147902)
While we prefer not to use monkey patching to dispatch SDPA, TorchFunctionMode is currently not compatible with selective activation checkpointing (https://github.com/pytorch/pytorch/issues/147995). This PR adds `TorchFunctionMode` to the CP code and makes it configurable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147902
Approved by: https://github.com/XilunWu
2025-04-25 23:33:48 +00:00
e28864fc0f [MPS/inductor] Fix the approximation of polygamma for n == 0. (#152214)
Fixes #152205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152214
Approved by: https://github.com/malfet
2025-04-25 22:42:45 +00:00
cf101d66ee Add simple direct C++ tests for torch::jit::Lexer (#151849)
We have test_jit.py, but given that I'm working on
significant changes to the lexer, it seems nice to have direct C++
tests. (Also, writing the tests caught a pair of related bugs; see the
two tests with "Bug" in their name. The rewrite will fix them.)

Differential Revision: [D73402367](https://our.internmc.facebook.com/intern/diff/D73402367/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151849
Approved by: https://github.com/malfet
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807, #151810
2025-04-25 22:39:49 +00:00
490ef768cf [BE] Do not allow PyTorch codebase to use c10::optional (#150464)
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature

Test plan:
 - 0def9b4acc should fail MPS builds
 ```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
                                           c10::optional<int64_t> extra) {
                                           ^~~~~~~~~~~~~
                                           std::optional
```
 - a769759dd4 should fail CUDA builds
 ```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
        input, c10::nullopt, reduce_op, group_name, out);
                    ^

1 error detected in the compilation of
```

Fixes https://github.com/pytorch/pytorch/issues/150313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
2025-04-25 22:03:48 +00:00
9e50c21e27 Fix xrefs (#151888)
Fixes existing cross-references and removes old ones.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151888
Approved by: https://github.com/eqy, https://github.com/huydhn, https://github.com/svekars
2025-04-25 21:27:27 +00:00
1aa971a3bb [ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572)
This PR fixes https://github.com/pytorch/pytorch/issues/107183 for ROCm.

Implemented the usage of the new RNN descriptor for the MIOpen backend that takes the dropout rate into account via a dropout descriptor. This fixes the associated test_RNN_dropout_state test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144572
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-25 21:06:45 +00:00
2c5c793085 [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title stated

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
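A minimal timing sketch (assumes a CUDA device is available); the new checks are about exactly these preconditions: `elapsed_time` is only meaningful when both events were created with `enable_timing=True` and both have been recorded and completed.

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

x = torch.randn(1024, 1024, device="cuda")
start.record()
y = x @ x
end.record()
torch.cuda.synchronize()  # both events must have completed before querying the time
print(f"matmul took {start.elapsed_time(end):.3f} ms")
```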
2025-04-25 20:15:04 +00:00
91c590f048 [ONNX] add converters for sym_min, sym_max (#152196)
Conversion of Phi-4-multimodal-instruct fails because of missing converters for torch.sym_max and torch.sym_min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152196
Approved by: https://github.com/justinchuby
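For reference, a small, self-contained sketch of where these ops show up (the module is illustrative, not Phi-4): `torch.sym_max` / `torch.sym_min` accept plain ints eagerly and SymInts under tracing, so graphs with dynamic shapes can contain them and previously had no ONNX mapping.

```python
import torch

class SizeDiff(torch.nn.Module):
    def forward(self, x, y):
        # symbolic max/min over (possibly dynamic) batch sizes
        longest = torch.sym_max(x.shape[0], y.shape[0])
        shortest = torch.sym_min(x.shape[0], y.shape[0])
        return x.new_full((longest - shortest,), 1.0)

m = SizeDiff()
print(m(torch.randn(5, 4), torch.randn(3, 4)).shape)  # torch.Size([2])
```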
2025-04-25 20:01:05 +00:00
9336608307 BM FM FlashAttention Test (#151974)
Reviewed By: joebos

Differential Revision: D72880307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151974
Approved by: https://github.com/yoyoyocmu, https://github.com/Skylion007, https://github.com/malfet
2025-04-25 19:24:25 +00:00
8542d55f0c [logging] Clean up dynamo_timed usages in cudagraph_trees (#152136)
Summary: I'm investigating differences in total torch.compile overhead in our two main internal sources: dynamo_compile and pt2_compile_events. One source of discrepancy is due to cudagraphs overheads. Currently, we have a context manager that optionally attributes a dynamo_timed region to a cudagraph-related column logged to dynamo_compile, but _all_ dynamo_timed regions show up in pt2_compile_events (hence the discrepancy; pt2_compile_events is overcounting). We could filter out these specific events from pt2_compile_events when measuring overall overhead. But I'm going to argue that those timed regions that we DO NOT consider as a compiler-related overhead don't have much value in logging in the first place. So I'm suggesting we just remove those instances.

Here's the production job with the discrepancy:
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/3604eypl
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/c2dv8sty

Test Plan:
torchbench nanogpt:
* tlparse: https://fburl.com/h1n2ascc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/u37yrynp
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/s7avd0di

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152136
Approved by: https://github.com/BoyuanFeng
2025-04-25 19:18:12 +00:00
1bc0e2579d [aarch64] Fixes to build with ArmPL's cblas.h (#151126)
Summary:
Various fixes to make fbcode work w/ ArmPL's cblas header:
1) Avoid re-declaring prototypes for internal blas methods which ArmPL already declares.
2) Fix `std::complex` conversion when using these methods.
3) Drop `extern "C"` around the include of `cblas.h`.

Test Plan: CI

Differential Revision: D72808561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151126
Approved by: https://github.com/Skylion007
2025-04-25 19:02:28 +00:00
56190d2577 [MPS] Fix ICE for entr bool instantiation on M1/M2 (#152204)
By instantiating it implicitly; otherwise, attempts to run something like
```
% python3 -c "import torch; print(torch.special.entr(torch.testing.make_tensor(10, dtype=torch.bool, device='mps')))"
```
will fail with
```
Failed to created pipeline state object, error: Error Domain=AGXMetalG14X Code=3 "Compiler encountered an internal error"
```

Similar in spirit to https://github.com/pytorch/pytorch/pull/149123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152204
Approved by: https://github.com/dcci
2025-04-25 19:00:49 +00:00
d7eb3a492c [Typing] Enable torch.types.IntLikeType / FloatLikeType / BoolLikeType (#152157)
### Changes

Replace `Union[SymInt, int]` and `Union[int, SymInt]` with `IntLikeType`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152157
Approved by: https://github.com/Skylion007
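A small usage sketch, assuming the alias is importable from `torch.types` as the title states:

```python
import torch
from torch.types import IntLikeType  # alias for Union[int, torch.SymInt]

def round_up(n: IntLikeType, multiple: int) -> IntLikeType:
    # works on concrete ints eagerly and on SymInts while tracing
    return ((n + multiple - 1) // multiple) * multiple

print(round_up(10, 8))  # 16
```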
2025-04-25 19:00:10 +00:00
85bfaf8cc5 Package const folded graph's cubin file (#152145)
Summary: We need to package the const-folded graph's cubin file into the final .pt2 package.

Fix https://github.com/pytorch/pytorch/issues/152067

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_constant_folding_cuda
```

Differential Revision: D73626480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152145
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire
2025-04-25 18:38:32 +00:00
a5f2fd1017 Unskip index_put in cudagraphs (#152186)
The repro from the original skip in https://github.com/pytorch/pytorch/pull/105439 no longer fails, so unskip it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152186
Approved by: https://github.com/Skylion007
2025-04-25 18:15:49 +00:00
bcf1031cb8 [ROCm] Fixes to enable VM-based MI300 CI runners (#152133)
New VM-based MI300 CI runners tested in https://github.com/pytorch/pytorch/pull/151708 exposed some issues in CI that this PR fixes:

* HSAKMT_DEBUG_LEVEL is a debug env var that was introduced to debug driver issues. However, in the new MI300 runners being tested, since they run inside a VM, the driver emits a debug message `Failed to map remapped mmio page on gpu_mem 0` when calling `rocminfo` or doing other GPU-related work. This results in multiple PyTorch unit tests failing when doing a string match on the stdout vs expected output.

* HSA_FORCE_FINE_GRAIN_PCIE was relevant for rccl performance improvement, but is not required now.

* amdsmi doesn't return metrics like [power_info](https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html#amdsmi-get-power-cap-info) and [clock_info](https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html#amdsmi-get-clock-info) in a VM ("Guest") environment. Return 0 as the default in cases where amdsmi returns "N/A"

* amdsmi throws an exception when calling `amdsmi.amdsmi_get_clock_info` on the VM-based runners. Temporarily skipping the unit test for MI300 until we find a resolution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152133
Approved by: https://github.com/jeffdaily
2025-04-25 18:06:48 +00:00
0dae27d75b Turn on static cuda launcher in OSS (#151691)
After a few small bugfixes on tests (to make it so we throw/catch exceptions similar to Triton's), I think we're ready to flip the switch and turn StaticCudaLauncher on by default in OSS.

The initial round of benchmarks looks good, with average compilation time going down by a few percent:
<img width="828" alt="image" src="https://github.com/user-attachments/assets/cad03e09-b4d6-49a7-a9e5-6068d1c0bd5c" />

With no changes to runtime perf:
<img width="823" alt="image" src="https://github.com/user-attachments/assets/3fcd435e-1057-43f4-878b-8d66a3812a10" />

There are a few noisy models I want to double check, though, so will run some more tests before accepting review.

Full benchmark results, showing a ~5% compile time improvement across the board:
https://hud.pytorch.org/benchmark/huggingface/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2002%3A31%3A12%20GMT&stopTime=Wed%2C%2023%20Apr%202025%2002%3A31%3A12%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/jamesjwu/139/orig&lCommit=cc45c8667fa23dec16ca50002d9504a34688ca5c&rBranch=main&rCommit=2a9afdae81d0dde98e96d7e3c9ca840e241e5405
<img width="1482" alt="image" src="https://github.com/user-attachments/assets/6e6a7f39-7f44-459f-9845-9a37f084ea82" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151691
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/EikanWang
2025-04-25 17:48:53 +00:00
c03359de2d Revert "[Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#148981)"
This reverts commit fc6e37ceb23f99808265c11a37368078d5f982b8.

Reverted https://github.com/pytorch/pytorch/pull/148981 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @davidberard98 can you please help get these changes validated? Details in D73628297. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148981#issuecomment-2831044810))
2025-04-25 17:45:13 +00:00
4ea2e093ca [inductor][BE] Clean up use_mixed_mm and mixed_mm_choice usage inside pytorch (#152071)
Differential Revision: [D73551912](https://our.internmc.facebook.com/intern/diff/D73551912/)

Decided to leave the mixed_mm tests alive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152071
Approved by: https://github.com/eellison
2025-04-25 17:25:55 +00:00
67f75244ea Revert "[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)"
This reverts commit c91acad73a11825c366c51fb1e91d7e1a47d3f9e.

Reverted https://github.com/pytorch/pytorch/pull/151404 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @albanD can you please help it get relanded? To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/151404#issuecomment-2830829368))
2025-04-25 16:08:27 +00:00
d4a8e4e30c [dynamo] Guard serialization for HASATTR (#151349)
Adding guard serialization for type HASATTR

Differential Revision: [D73059073](https://our.internmc.facebook.com/intern/diff/D73059073/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151349
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #151318, #151343
2025-04-25 14:16:30 +00:00
558f45190e [dynamo] Guard serialization for NOT_PRESENT_IN_GENERIC_DICT (#151343)
Adding guard serialization for type NOT_PRESENT_IN_GENERIC_DICT

Differential Revision: [D73057304](https://our.internmc.facebook.com/intern/diff/D73057304/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151343
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #151318
2025-04-25 14:16:30 +00:00
a34c28e0d2 [dynamo] Add guard serialization for tensor matches. (#151318)
This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes.

The main behavioral change introduced in this diff is on CheckFunctionManager:

```
check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save")

guards_state: bytes = check_fn_manager.guards_state
```

Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state`, which should contain all the information needed for deserializing guards later.

When we load the guards state back, we will set `guards_serialization_mode` to `load`:

```
output_graph_state = pickle.loads(guards_state)
check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load")
```

# TENSOR_MATCH

Since we have many types of guards to support, we will break the work into small diffs instead of a single diff to support every guard.

We kick off the work with TENSOR_MATCH in this diff.

# Testing

For each type of guard we will test it like the following:
1. Use guard_filter_fn to select 1 type of guard each time.
2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager)
3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager)
4. Throw a set of example inputs at both the reference and the loaded guard manager to see if their behaviors match.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-04-25 14:16:23 +00:00
6e8602b558 Relax tolerance on test_aot_autograd_exhaustive_matmul_cpu_float32 without MKL (#152106)
When e.g. OpenBLAS is used instead of MKL, the differences get too large:
> Greatest absolute difference: 5.91278076171875e-05 at index (7,) (up to 1e-05 allowed)
> Greatest relative difference: 3.468156592134619e-06 at index (7,) (up to 1.3e-06 allowed)

I traced some of the matmul operations and there are differences of around 8e-6 between MKL and OpenBLAS, but I haven't found where exactly the backward pass is calculated, which is where the actual differences arise. So I couldn't check whether there is some difference in the low-level BLAS function used by autograd.

However, it seems odd that there is a difference at all: for the MKL case it seems to be zero up to the accuracy shown by Python.

So it seems the AOT compilation has some differences when MKL is not available.

Maybe this is also the reason why it fails for ARM and hence the test is skipped there. Maybe @zou3519 knows more as he introduced those skip markers in https://github.com/pytorch/pytorch/pull/85565

Is there any documentation on how and where `matmul_backward(_out)` is generated and how AOT transforms it with and without MKL?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152106
Approved by: https://github.com/zou3519
2025-04-25 14:03:37 +00:00
c1c8c1f8d6 [Quant][X86] add an op to compute uint8 pointwise mul (#151112)
**Summary**
Add a new op, `onednn.qmul.tensor`, for int8 elementwise mul, which accepts inputs on CPU device (instead of QuantizedCPU).
The new op is implemented with AVX512 instructions and provides similar or better performance, depending on the shape, than its QuantizedCPU counterpart `quantized.mul`.
The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported).

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_mul_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151112
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
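For reference, a sketch of the op's semantics only (dequantize, multiply, requantize); the argument names and order are illustrative and not the actual `onednn.qmul.tensor` signature:

```python
import torch

def qmul_reference(x_u8, x_scale, x_zp, y_u8, y_scale, y_zp, out_scale, out_zp):
    x = (x_u8.to(torch.float32) - x_zp) * x_scale   # dequantize input A
    y = (y_u8.to(torch.float32) - y_zp) * y_scale   # dequantize input B
    out = x * y                                     # multiply in fp32
    q = torch.round(out / out_scale) + out_zp       # requantize
    return q.clamp(0, 255).to(torch.uint8)

x = torch.randint(0, 256, (16,), dtype=torch.uint8)
y = torch.randint(0, 256, (16,), dtype=torch.uint8)
print(qmul_reference(x, 0.1, 128, y, 0.2, 128, 1.5, 0))
```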
2025-04-25 12:52:54 +00:00
ad81eeb7c7 Refactor to use torch.accelerator.device_index instead of torch.cuda.device for generic device context manager (#148880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148880
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #148864
2025-04-25 09:45:25 +00:00
33c75cae0a Add torch.accelerator.device_index as accelerator's device switch context (#148864)
# Motivation
We propose adding support for the Python with statement on `torch.accelerator.device_index` to enable device switching functionality. This enhancement would simplify writing device-agnostic code and provide benefits across all accelerators. Its device-specific counterparts include [`torch.cuda.device`](00199acdb8/torch/cuda/__init__.py (L482)) and  [`torch.cuda._DeviceGuard`](00199acdb8/torch/cuda/__init__.py (L469)).

**Design Philosophy**
It accepts either an `Int` or `None` as input. When `None` is passed, no device switch is performed. Supporting `None` is important for compatibility, as it's possible to encounter `None` values from `torch.device.index`.

Therefore, with this PR, we can do the following:

```python
src = 0
dst = 1
# Set src to current device
torch.accelerator.set_device_index(src)
with torch.accelerator.device_index(dst):
    # Inside with statement, we set dst to current device
    assert torch.accelerator.get_device_index() == dst
# Here the current device should be src
assert torch.accelerator.get_device_index() == src
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148864
Approved by: https://github.com/albanD
2025-04-25 09:45:25 +00:00
f38dae76ee [Proposal] Drop legacy CUDA support to slim down the wheels (#152069)
Proposal of dropping legacy CUDA support to slim down the Windows wheels.

With the latest release of 2.7.0 and the new Blackwell support, we've seen yet another rise in the size of the wheel, going from ~2.5GB with PyTorch 2.6.0 all the way to ~3.1GB with PyTorch 2.7.0 CUDA 12.8 on Python 3.12, and ~3.3GB with Python 3.13.

Python 3.12, Pytorch 2.7.0 Cuda 12.8
![image](https://github.com/user-attachments/assets/78a5bbcb-027e-4139-84f0-57bfae9f594e)

Python 3.13, Pytorch 2.7.0, Cuda 12.8
![image](https://github.com/user-attachments/assets/7f256860-46e3-41f6-81b3-65bd3ee5aa77)

These CI changes remove support for many GPUs that are now about 8 years old, if not older, including the GTX 960M, 950M, 940M, 930M, and some other Quadro GPUs dating all the way back to April 2016, such as the Quadro M500M, as per [Nvidia's Documentation](https://developer.nvidia.com/cuda-gpus).

This change would also save on our bandwidth 😅

@seemethere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152069
Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/atalman
2025-04-25 08:20:00 +00:00
a811d3351b [ONNX] Implement sym_not (#152111)
Implement onnx support for sym_not. Replaces https://github.com/pytorch/pytorch/pull/147472

Fix https://github.com/pytorch/pytorch/issues/136572
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152111
Approved by: https://github.com/titaiwangms
2025-04-25 07:50:37 +00:00
6120cc8ccd [executorch hash update] update the pinned executorch hash (#151728)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151728
Approved by: https://github.com/pytorchbot
2025-04-25 05:33:09 +00:00
a936d596f6 [Cutlass] Implement EVT example tensor creation (#150904)
This PR implements a translation layer from Inductor IR to "example tensors", the expected arguments of the EVT tracer. These tensors basically store the name, shape, stride, and dtype of the tensor and allow an AST-based Python parser to generate the EVT C++.

Updates to example tensor creation.

Previously merged:
* https://github.com/pytorch/pytorch/pull/150903
* https://github.com/pytorch/pytorch/pull/150346
* https://github.com/pytorch/pytorch/pull/150345
* https://github.com/pytorch/pytorch/pull/150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150904
Approved by: https://github.com/eellison
2025-04-25 04:43:37 +00:00
dda0c952e7 [audio hash update] update the pinned audio hash (#152149)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152149
Approved by: https://github.com/pytorchbot
2025-04-25 04:20:06 +00:00
e2c7ae52d5 [ONNX] Add group_norm support from opset 21 (#152138)
I didn't run the model in the test because ORT doesn't have the op yet. Nevertheless, it should be leveraged for newer opset versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152138
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1, https://github.com/cyyever
2025-04-25 03:30:07 +00:00
1a6d50d407 Reducer: add check on received data to avoid segfault (#152143)
When ncclCommAbort is called, it may return invalid/corrupted data to the reducer. This adds a check so we don't read past the end of the tensors, which led to a segfault.

While this looks like it could be a security issue, it actually isn't, since we only read past the end of the buffer, not write past it.

Fixes #149418

Test plan:

https://gist.github.com/d4l3k/b47c2c95cf9c37e78069e19f1b6ed2c6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152143
Approved by: https://github.com/fduwjj, https://github.com/fegin
2025-04-25 02:16:44 +00:00
7f28c03fac Adding fbgemm to whitelist (#152079)
Adding `torch.ops.fbgemm` to GraphPickler's allowlist. Otherwise, an fx graph module containing an `fbgemm` node will return an "Unable to pickle non-standard op" error.

The validation is done on the model, and the difference appears only in the graph name, not the node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152079
Approved by: https://github.com/aorenste
2025-04-25 01:13:51 +00:00
8313bc27f2 Revert "Add OIDC permissions to bazel workflow (#151456)"
This reverts commit 5fc1eb85fc1b9d605939830d3be3506762b3df27.

Reverted https://github.com/pytorch/pytorch/pull/151456 on behalf of https://github.com/seemethere due to This is causing downstream failures on PRs, see examples in PR comment ([comment](https://github.com/pytorch/pytorch/pull/151456#issuecomment-2829130319))
2025-04-25 00:37:15 +00:00
75c71ab371 [Break XPU] generalize newly introduced device bias code in Inductor UT. (#151926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151926
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-04-25 00:03:23 +00:00
d70490ecfe [Inductor][CPP] Optimize the epilogue for int8 GEMM Template (#152000)
**Summary**
For the int8 GEMM template, the micro GEMM computes in u8s8s32 and we do the scale/zp compensation in the epilogue. In general, it is calculated as:
```
temp = micro_gemm_output * x_scale * w_scale
temp = temp - (x_scale * w_scale * x_zp) * sum(w, 0)
```
For the case when `x_scale`, `w_scale`, and `x_zp` are constant, we can pre-calculate the compensation to save runtime computation.

**Performance**
Test with 4 cores of XEON-5 and shapes from VIT model
Before
```
GEMM(M=197,N=768,K=768) compile: 0.0939 ms (2.48 TOPS, 18.13 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.4275 ms (2.17 TOPS, 13.90 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2677 ms (3.47 TOPS, 22.20 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0148 ms (0.10 TOPS, 99.10 GB/s)
```

After
```
GEMM(M=197,N=768,K=768) compile: 0.0597 ms (3.90 TOPS, 28.53 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.2126 ms (4.37 TOPS, 27.95 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2282 ms (4.07 TOPS, 26.04 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0149 ms (0.10 TOPS, 98.71 GB/s)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152000
Approved by: https://github.com/Xia-Weiwen, https://github.com/CaoE, https://github.com/jansel
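A minimal sketch of the epilogue math above (function and variable names are illustrative, not the template's internals): with constant `x_scale`, `w_scale`, and `x_zp`, the per-output-channel term depends only on the constant weight and can be computed once.

```python
import torch

def precompute_compensation(w_s8, x_scale, w_scale, x_zp):
    # shape [N]; computed once since it only involves constants and the weight
    return (x_scale * w_scale * x_zp) * w_s8.to(torch.float32).sum(dim=0)

def int8_gemm_epilogue(acc_s32, x_scale, w_scale, comp):
    # acc_s32: [M, N] u8s8s32 micro-GEMM output; subtract the precomputed term
    return acc_s32.to(torch.float32) * (x_scale * w_scale) - comp

w = torch.randint(-128, 128, (64, 32), dtype=torch.int8)
comp = precompute_compensation(w, x_scale=0.02, w_scale=0.01, x_zp=128)
```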
2025-04-24 23:36:00 +00:00
2089b22c76 [xpu] set aot device flags in cpp_extension (#149459)
If PyTorch is compiled with only AOT text strings starting with "dg2", the `_get_sycl_arch_list()` function will pass an empty string to the `-device` argument of `ocloc` and then cause a compilation crash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149459
Approved by: https://github.com/guangyey, https://github.com/dvrogozh, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2025-04-24 22:55:52 +00:00
fc6e37ceb2 [Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#148981)
This is a follow-up PR of the reverted one https://github.com/pytorch/pytorch/pull/147019 :

Modified TorchInductor’s autotuning flow so that each best_config JSON file also includes the Triton “base32” (or base64) cache key.

Motivation

Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.

Also, since Triton already stores the cubin/hsaco in its cache, developers/researchers can avoid setting store_cubin = True: they can get the cubin/hsaco from the Triton cache and, with the code provided in this PR, easily match the best_config with the right Triton cache directory for the "best" kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148981
Approved by: https://github.com/davidberard98
2025-04-24 21:28:53 +00:00
0413358a77 Non-deterministic alert in histc_cuda for floating types only (#151701)
The note about atomic add only applies to floating point. The
implementation is deterministic for integer data types.

fixes: #151610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151701
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-04-24 21:16:46 +00:00
6ced5e6840 Python 3.11 and 3.13 support for Windows Arm64 (#152109)
This PR adds Python 3.11 and 3.13 support for Windows Arm64 wheels and creates the necessary jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152109
Approved by: https://github.com/malfet
2025-04-24 21:09:14 +00:00
eqy
d78d2af4e3 [CUDA][TF32] Account for TF32 in test_corrcoef (#151830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151830
Approved by: https://github.com/Skylion007
2025-04-24 21:06:07 +00:00
8a9c66bb70 Improve stable library apis per Scott's feedback (#152040)
Following 3 suggestions:
1. inline at::Tensor arg
2. use a unique_ptr to an array instead of std::vector
3. document the `std::optional<S>()` case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152040
Approved by: https://github.com/swolchok, https://github.com/albanD
2025-04-24 20:51:03 +00:00
dccc41581a Include other accelerators in capturable docstr for optimizers (#149770)
Fixes #149722

@ILCSFNO is this better?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149770
Approved by: https://github.com/albanD
2025-04-24 20:38:42 +00:00
bd09d87fdb add Out Notes (#151306)
Fixes #150181
@albanD Could you please take a look?

Built locally without a full PyTorch build:

![Developer-FAQ](https://github.com/user-attachments/assets/351a7e0b-588e-48ae-ad0a-03f427c86e89)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151306
Approved by: https://github.com/albanD
2025-04-24 20:25:09 +00:00
92f125e622 [export] improve error message for deserializing custom triton op (#152029)
In https://github.com/pytorch/pytorch/issues/151746, users ran into an error where a custom triton op could not be resolved into an operator from its string target. We improve the error message by reminding users to register the same custom operator at deserialization time.

Now the error looks like this:
```python
torch._export.serde.serialize.SerializeError: We failed to resolve torch.ops.triton_kernel.add.default to an operator. If it's a custom op/custom triton op, this is usally because the custom op is not registered when deserializing. Please import the custom op to register it before deserializing. Otherwise, please file an issue on github. Unsupported target type for node Node(target='torch.ops.triton_kernel.add.default', inputs=[NamedArgument(name='x', arg=Argument(as_tensor=TensorArgument(name='linear')), kind=1), NamedArgument(name='y', arg=Argument(as_tensor=TensorArgument(name='mul')), kind=1)], outputs=[Argument(as_tensor=TensorArgument(name='add'))], metadata={'stack_trace': 'File "/data/users/yidi/pytorch/test.py", line 50, in forward\n    output = triton_add(dense_output, bias)', 'nn_module_stack': 'L__self__,,__main__.SimpleModel', 'torch_fn': 'add.default_1;OpOverload.add.default'}, is_hop_single_tensor_return=None): <class 'str'>.```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152029
Approved by: https://github.com/jingsh
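A sketch of the remedy the new message points to (the module and file names below are hypothetical): the module that defines and registers the custom op must be imported in the deserializing process before loading.

```python
import torch

# Hypothetical: this module registers torch.ops.triton_kernel.add as a custom op
import my_triton_kernels  # noqa: F401

# With the registration in place, deserialization can resolve the string target
ep = torch.export.load("model.pt2")
```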
2025-04-24 20:22:05 +00:00
24bda01a93 Pin theme to a branch (#152046)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152046
Approved by: https://github.com/albanD
2025-04-24 20:20:21 +00:00
eqy
6efc572221 [CUDA][CPU] Bump system memory requirement for test_cross_entropy_large_tensor (#151812)
`/usr/bin/time` seems to show max resident pages at 119GiB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151812
Approved by: https://github.com/colesbury
2025-04-24 19:25:29 +00:00
b1d055fd6a Revert "[dynamo] Add guard serialization for tensor matches. (#151318)"
This reverts commit 81c4369d813facf39313dfd481adc71704cbc2c1.

Reverted https://github.com/pytorch/pytorch/pull/151318 on behalf of https://github.com/zhxchen17 due to macos test failing ([comment](https://github.com/pytorch/pytorch/pull/151318#issuecomment-2828638168))
2025-04-24 19:22:45 +00:00
b11c9e1808 [CI][docker] Use install_cusparselt when possible in docker image (#150600)
Spot-checked builds for lines like `Found CUSPARSELT: /usr/local/cuda/lib64/libcusparseLt.so`. I don't know if there's another way to do it.

I am slowly trying to reduce the duplicated code in docker image installs
Pros:
* less dup code

Cons:
* more docker copies
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150600
Approved by: https://github.com/atalman
2025-04-24 18:52:10 +00:00
ff075d0815 Update docs dependencies for local build (#151796)
Fixes #151786

- Changed requirements.txt to a symlink to .ci/docker/requirements-docs.txt
- Updated README.md with better doc build instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151796
Approved by: https://github.com/malfet
2025-04-24 18:40:42 +00:00
81c4369d81 [dynamo] Add guard serialization for tensor matches. (#151318)
This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes.

The main behavioral change introduced in this diff is on CheckFunctionManager:

```
check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save")

guards_state: bytes = check_fn_manager.guards_state
```

Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state`, which should contain all the information needed for deserializing guards later.

When we load the guards state back, we will set `guards_serialization_mode` to `load`:

```
output_graph_state = pickle.loads(guards_state)
check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load")
```

# TENSOR_MATCH

Since we have many types of guards to support, we will break the work into small diffs instead of a single diff to support every guard.

We kick off the work with TENSOR_MATCH in this diff.

# Testing

For each type of guard we will test it like the following:
1. Use guard_filter_fn to select 1 type of guard each time.
2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager)
3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager)
4. Throw a set of example inputs at both the reference and the loaded guard manager to see if their behaviors match.

Differential Revision: [D72987485](https://our.internmc.facebook.com/intern/diff/D72987485/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-04-24 18:07:01 +00:00
03970dfd4c Add functionality for installing free variables (#151134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151134
Approved by: https://github.com/anijain2305
ghstack dependencies: #152036
2025-04-24 17:57:54 +00:00
402d19c0bd add basic unit tests and noop config (#152036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152036
Approved by: https://github.com/anijain2305
2025-04-24 17:57:54 +00:00
9c1bc9ce46 [fake tensor] Cache None, integer and SymInts in the output (#151961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151961
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #151409, #151633, #151477, #151957
2025-04-24 16:44:45 +00:00
0eb554e96a Better error msg for too big to optimize (#151855)
Summary: In the "too big to optimize" error message, tell the user that they should use the torch._inductor.config.aot_inductor.compile_wrapper_opt_level = 'O0' flag

Test Plan:
This is not added to the unit test cases because it runs for a fairly long time before hitting the expected failure.

```

    def test_runtime_checks_error_msg(self):

        with torch.library._scoped_library("mylib", "FRAGMENT") as lib:
            torch.library.define(
                "mylib::foo",
                "(Tensor a, Tensor b) -> Tensor",
                tags=torch.Tag.pt2_compliant_tag,
                lib=lib,
            )

            torch.library.impl("mylib::foo", "cpu", lib=lib)
            def foo(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
                return a + b

            @torch.library.impl_abstract("mylib::foo", lib=lib)
            def foo_fake_impl(a, b):
                return a + b

            class Model(torch.nn.Module):
                def __init__(self) -> None:
                    super().__init__()

                def forward(self, x):
                    for i in range(10000):
                        x = torch.ops.mylib.foo(x, x)
                    return x

            inputs = (torch.ones(8, 8, 8), )
            model = Model()
            with self.assertRaisesRegex(Exception, "torch._inductor.config.aot_inductor.compile_wrapper_opt_level"):
                with torch.no_grad():
                    AOTIRunnerUtil.compile(
                        model,
                        inputs,
                    )
```

Differential Revision: D72323380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151855
Approved by: https://github.com/desertfire
2025-04-24 16:35:19 +00:00
56e67badc3 Move verbose warning to warning_once (#152044)
It was printing thousands of lines for me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152044
Approved by: https://github.com/XilunWu
2025-04-24 16:18:34 +00:00
3a170a8ce6 Revert "[Cutlass] Implement EVT example tensor creation (#150904)"
This reverts commit 253059356fc93b51c7c53246a5922db3fb14e184.

Reverted https://github.com/pytorch/pytorch/pull/150904 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking the test_example_tensor_creation test internally. See D73519195 for more details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/150904#issuecomment-2828132914))
2025-04-24 16:00:25 +00:00
d743a7bd85 [invoke_subgraph] Cache fake tensor if no unbacked symint in the output (#151957)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151957
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #151409, #151633, #151477
2025-04-24 14:17:22 +00:00
1d73b644a8 [fake tensor cache] Support index with non bool/int8 indices (#151477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151477
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #151409, #151633
2025-04-24 13:48:18 +00:00
41285f26e4 [invoke_subgraph][fake tensor] Add finalizer on subgraph instead of the functionalize ctx wrapper (#151633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151633
Approved by: https://github.com/zou3519
ghstack dependencies: #151409
2025-04-24 13:32:08 +00:00
3278ddd50c [invoke_subgraph] Compile time traces (#151409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151409
Approved by: https://github.com/zou3519
2025-04-24 13:20:50 +00:00
5e320eea66 [BE] follow autoformating and linter (#151507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151507
Approved by: https://github.com/Skylion007
2025-04-24 07:37:04 +00:00
5b368fa0b7 Add torch.cuda._compile_kernel() (#151484)
Followup work on top https://github.com/pytorch/pytorch/pull/149480

Wrapper on top of nvrtc inspired by https://gist.github.com/malfet/2c9a25976dd7396430c38af603f791da from @malfet

Compiling toy kernels with this setup takes 0.01s vs 90s using `load_inline()` on my local H100. This was primarily motivated by the timeouts I was seeing in the popcorn leaderboard, but it would also be useful to integrate into KernelBench.

This PR is in the same spirit as https://github.com/pytorch/pytorch/pull/148972 which was a similar UX for Metal

For now we are planning on landing this as a private function because we expect to iterate both on the user-facing API and the internal implementation. We will open up a separate issue to discuss the path towards making this work public and to give a broader overview of the state of custom CUDA kernel authoring in PyTorch.

Future work, as prerequisites to making this public:
* divup primitive
* support multiple kernels
* Expose _get_nvrtc_version from native code
* interop with torch.compile
* AMD support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151484
Approved by: https://github.com/malfet
2025-04-24 07:14:31 +00:00
78953ee122 [pytorch] reland of [cutlass backend] delay construction of cutlass presets to when called (#151875) (#152031)
Differential Revision: D73524978

reland of https://github.com/pytorch/pytorch/pull/151875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152031
Approved by: https://github.com/yangw-dev
2025-04-24 05:36:36 +00:00
2ea8653391 [vec128] Fix fmsub NEON defintion (#152075)
As reported in https://github.com/pytorch/pytorch/issues/149292, according to the manual, `vfmsq_f32` implements `c - a * b` rather than `a * b - c`, so its call must be prefixed with `vnegq_f32`.

Also, adjust the tests to use OpMath for FMA computation to avoid accuracy-error accumulation due to non-fused multiply-and-add over lower-precision dtypes.

Note that `Vectorized::fmsub` is not currently instantiated anywhere, so it could safely remain broken

TODO:
 - Enable C++ testing on MacOS and/or aarch64 platforms (right now Mac tests are built without C++ tests)

Fixes https://github.com/pytorch/pytorch/issues/149292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152075
Approved by: https://github.com/swolchok
ghstack dependencies: #151955
2025-04-24 05:10:45 +00:00
5e9bdc9b86 [MPS] layernorm forward kernel (#152010)
Implements the layernorm forward pass as a Metal kernel instead of MPSGraph ops. Speedups are indicated in the chart below:
![Figure_1](https://github.com/user-attachments/assets/27a4d2ef-b3e4-4650-9ce3-b939c080321e)

Script for generating timings; build torch with the old/new codebase and then run this, changing the output file name indicated at the end of the script:
```python
import csv
import time

import numpy as np

import torch
import torch.nn.functional as F

matrix_sizes = [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
batch_sizes = [1]
elementwise_affine = [False, True]
num_runs = 50
warmup_runs = 3

def create_input_tensor(n, batch_size):
    torch.manual_seed(42)
    return torch.randn(batch_size, n, dtype=torch.float32)

def run_layer_norm(A, normalized_shape, elementwise_affine):
    torch.mps.synchronize()
    start = time.perf_counter()
    out = F.layer_norm(A, normalized_shape)
    torch.mps.synchronize()
    end = time.perf_counter()
    return out, end - start

results = {"N": [], "elementwise_affine": [], "batch_size": [], "mean_time": [], "std_time": []}

for el_aff in elementwise_affine:
    for n in matrix_sizes:
        for batch_size in batch_sizes:
            print(f"\nBenchmarking LayerNorm for input size N={n}, batch_size={batch_size}, elementwise_affine={el_aff}")

            try:
                A_cpu = create_input_tensor(n, batch_size)
                A_mps = A_cpu.to("mps")

                normalized_shape = (n,)

                for _ in range(warmup_runs):
                    _, _ = run_layer_norm(A_mps, normalized_shape, el_aff)

                times = []
                for _ in range(num_runs):
                    _, t = run_layer_norm(A_mps, normalized_shape, el_aff)
                    times.append(t)

                mean_time = np.mean(times)
                std_time = np.std(times)

                results["N"].append(n)
                results["elementwise_affine"].append(el_aff)
                results["batch_size"].append(batch_size)
                results["mean_time"].append(mean_time)
                results["std_time"].append(std_time)

                print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

            except RuntimeError as e:
                print(f"Error for N={n}, batch_size={batch_size}: {e}")
                continue

with open("layernorm_benchmark_times_new.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["N", "elementwise_affine", "batch_size", "mean_time", "std_time"])
    for i in range(len(results["N"])):
        writer.writerow(
            [
                results["N"][i],
                results["elementwise_affine"][i],
                results["batch_size"][i],
                results["mean_time"][i],
                results["std_time"][i],
            ]
        )

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152010
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-24 05:07:46 +00:00
a389835313 [MPS] Adjust test_sum_dtypes so it can run on MPS. (#152064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152064
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-24 05:04:49 +00:00
2102b3b4c5 [FSDP1] print fqns when debug FlatParamHandle (#151336)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151336
Approved by: https://github.com/awgu, https://github.com/Skylion007
2025-04-24 04:49:24 +00:00
2a58d2a155 StringCordView: make iterator fast when there is only one piece (#151810)
This makes the StringCordView iterator a variant holding
either the existing implementation (when there is more than one piece)
or a simple `std::string_view::iterator` (when there is only one
piece). The latter seems to be significantly cheaper.

Differential Revision: [D73379178](https://our.internmc.facebook.com/intern/diff/D73379178/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151810
Approved by: https://github.com/Skylion007
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807
2025-04-24 04:43:34 +00:00
76cc379bec Fix missing moves in SchemaTypeParser::parseFakeAndRealType (#151807)
Was seeing a small amount of shared_ptr traffic from these.

The std::move(text) at the top is just a piggyback.

Differential Revision: [D73376720](https://our.internmc.facebook.com/intern/diff/D73376720/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151807
Approved by: https://github.com/zou3519, https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806
2025-04-24 04:43:34 +00:00
68454b9d17 Fix a missed c10::TypeFactory::create spot in function_schema_parser (#151806)
Looks like we are supposed to be using TypeFactory instead of direct creation everywhere that might run on mobile.

Differential Revision: [D73376716](https://our.internmc.facebook.com/intern/diff/D73376716/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151806
Approved by: https://github.com/Skylion007, https://github.com/iseeyuan
ghstack dependencies: #151801, #151802, #151803, #151804, #151805
2025-04-24 04:43:34 +00:00
b237211b42 Fix easy missing moves in function_schema_parser (#151805)
Just some straightforward not-moving-upon-return.

Differential Revision: [D73376718](https://our.internmc.facebook.com/intern/diff/D73376718/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151805
Approved by: https://github.com/malfet, https://github.com/cyyever
ghstack dependencies: #151801, #151802, #151803, #151804
2025-04-24 04:43:34 +00:00
89a85d0954 Add & use Token::text_view() (which returns a string_view unlike text()) (#151804)
Sadly, I can't just fix text() because that might cause lifetime issues in somebody's code.

Differential Revision: [D73376715](https://our.internmc.facebook.com/intern/diff/D73376715/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151804
Approved by: https://github.com/zou3519, https://github.com/cyyever, https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151801, #151802, #151803
2025-04-24 04:43:34 +00:00
0559741d7f Fix return type of TypeFactoryBase<c10::DynamicType>::get (#151803)
getBaseType() actually returns a reference. This was causing shared_ptr copies.

Differential Revision: [D73376717](https://our.internmc.facebook.com/intern/diff/D73376717/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151803
Approved by: https://github.com/malfet, https://github.com/Skylion007
ghstack dependencies: #151801, #151802
2025-04-24 04:43:34 +00:00
fabbcddab1 Create and use DynamicTypes for check in DispatchKeyExtractor::makeBitsetForDispatchArgs (#151802)
On mobile, many but not all things in the JIT type subsystem start using DynamicType. Not using DynamicType was imposing a startup time cost here, as explained in the comment.

Differential Revision: [D73129442](https://our.internmc.facebook.com/intern/diff/D73129442/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151802
Approved by: https://github.com/malfet
ghstack dependencies: #151801
2025-04-24 04:43:34 +00:00
5de92e676a Don't copy DynamicType argument to DynamicType::create (#151801)
This improves performance of DynamicType::isSubtypeOfExt.

Differential Revision: [D73129449](https://our.internmc.facebook.com/intern/diff/D73129449/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151801
Approved by: https://github.com/malfet
2025-04-24 04:43:34 +00:00
43f1b60ded Revert "[MPS] Adjust test_sum_dtypes so it can run on MPS. (#152064)"
This reverts commit d703f062fe7e4ead362ec0473ef33579e84532ac.

Reverted https://github.com/pytorch/pytorch/pull/152064 on behalf of https://github.com/malfet due to Lint is not green ([comment](https://github.com/pytorch/pytorch/pull/152064#issuecomment-2826305781))
2025-04-24 04:04:49 +00:00
e2cf60ff18 [MPS] Fix test_neg_index_mps (#151966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151966
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-24 04:02:09 +00:00
2ee8de54b1 [dynamic shapes] user-code friendly statically_known_true, has_static_value (#151601)
Fixes #151480

Allows `statically_known_true` in user code, and introduces `has_static_value`, which returns True if the input has a static bool/float/int value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151601
Approved by: https://github.com/laithsakka, https://github.com/zou3519, https://github.com/jingsh
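A minimal sketch of the user-code pattern this enables (assuming `statically_known_true` is importable from `torch.fx.experimental.symbolic_shapes`, and the behavior described is only a rough illustration):

```python
import torch
from torch.fx.experimental.symbolic_shapes import statically_known_true

@torch.compile(dynamic=True)
def maybe_vectorize(x):
    # True only if the compiler can prove the condition without installing a
    # new guard; with a dynamic last dimension it is not provable, so the
    # fallback branch is taken instead of specializing the shape.
    if statically_known_true(x.shape[-1] % 16 == 0):
        return x.reshape(-1, 16).sum(dim=1)
    return x.sum(dim=-1)

print(maybe_vectorize(torch.ones(4, 32)).shape)
```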
2025-04-24 02:53:59 +00:00
d703f062fe [MPS] Adjust test_sum_dtypes so it can run on MPS. (#152064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152064
Approved by: https://github.com/malfet, https://github.com/jansel
2025-04-24 02:32:36 +00:00
4ac2ee573d [sigmoid] memory planner C10 deps (#151275)
Summary: perf-sensitive util functions for use in our memory planner

Test Plan: CI

Differential Revision: D73002726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151275
Approved by: https://github.com/georgiaphillips
2025-04-24 01:46:32 +00:00
c91acad73a [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title stated

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
2025-04-24 01:28:09 +00:00
f39a1a43ee Fix typos in meta.rst (#151979)
### Fixes made:
- "allow you to the module" → corrected to "allows you to move the module"

- "allow" → changed to "allows" to agree with the singular subject "method"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151979
Approved by: https://github.com/colesbury
2025-04-24 01:25:09 +00:00
4e1d4333f7 [FlexAttention] Remove Old Constraint on lastdim strides (#151959)
Fixes: #148827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151959
Approved by: https://github.com/Chillee
ghstack dependencies: #151846
2025-04-24 01:09:52 +00:00
2455ded502 [FlexAttention] Fix device test instantation (#151846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151846
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng, https://github.com/mlazos
2025-04-24 01:09:52 +00:00
f2cfeb23e5 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-04-24 01:06:29 +00:00
8172397025 Revert "Update torch-xpu-ops commit pin (#150827)"
This reverts commit 776aa682218bad4df7b6cd46ef2a0f1d8ca1194c.

Reverted https://github.com/pytorch/pytorch/pull/150827 on behalf of https://github.com/etaf due to Inductor UT regression ([comment](https://github.com/pytorch/pytorch/pull/150827#issuecomment-2825857903))
2025-04-24 00:41:06 +00:00
4d2d833976 [CI] Update sleef submodule to v3.8 (#151955)
Should help with RISC-V cross-compilation.
3.9.0 migration is blocked by sleef project switching to C++20
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151955
Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/Skylion007
2025-04-23 23:56:05 +00:00
fd3d339e17 [dynamic shapes] be less aggressive with runtime assert CSE for bounds (#151590)
Fixes #150540
Fixes #147772

Stops trying to CSE bound expressions and only does exact deduplication for runtime asserts. Adds test cases to check that AOTAutograd doesn't raise a data-dependent error when retracing due to not seeing the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151590
Approved by: https://github.com/laithsakka
2025-04-23 23:07:00 +00:00
47ad351ff3 [DRAFT] INitial version of sticky export (#151047)
Summary: This is to make torchnative demos and benchmarking of real models simpler by not requiring people to find example inputs first.

Test Plan: CI

Differential Revision: D72815584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151047
Approved by: https://github.com/zhxchen17
2025-04-23 22:58:43 +00:00
bd191730ce [cutlass backend] Stop using GenerateSM80 for SM90 and SM100 (#150781)
Not urgent.

We don't use the GenerateSM80 ops I believe.

For SM100, we could skip SM90 as well. But I don't have data for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150781
Approved by: https://github.com/kadeng
2025-04-23 22:16:57 +00:00
dccb7a9cb2 [pytorch] use a mutex in initialize_torch_libraries (#151938)
Summary: The TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT feature is thread unsafe for calling the initializers, but we want to allow the deferred initializer call to be safe from multiple threads. Add a mutex to ensure we have thread safe construction of the libraries post launch.

Differential Revision: D73457714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151938
Approved by: https://github.com/swolchok, https://github.com/zou3519
2025-04-23 21:41:01 +00:00
562328501e Revert "Turn on static cuda launcher in OSS (#151691)"
This reverts commit e31e2d27c6739cad5327cc54e6ac9fd28a157cbf.

Reverted https://github.com/pytorch/pytorch/pull/151691 on behalf of https://github.com/malfet due to This breaks tests, see c1f51cf2c4/1 ([comment](https://github.com/pytorch/pytorch/pull/151691#issuecomment-2825427252))
2025-04-23 20:28:31 +00:00
98c53d8b39 Revert "[MPS] Fix test_neg_index_mps (#151966)"
This reverts commit 9422e24c472ccbaffc4cf3935e12d0a83f269560.

Reverted https://github.com/pytorch/pytorch/pull/151966 on behalf of https://github.com/malfet due to Looks like it broke halide testing, see https://github.com/pytorch/pytorch/actions/runs/14623941238/job/41034065229 ([comment](https://github.com/pytorch/pytorch/pull/151966#issuecomment-2825425305))
2025-04-23 20:25:49 +00:00
c1f51cf2c4 [map] defer importing AOTConfig and create_joint dependency (#151479)
Summary:
We reverted D72896450 due to a weird error that happened in a seemingly unrelated test: "buck2 run apf/data/tests:preproc_state_serializer_test -- --filter-text "test_load_artifact"
"

I did some investigation and found that moving the imports of AOTConfig and create_joint inside create_fw_bw_graph delays importing the modules they recursively import from test-construction time to test-running time. The path.exists mock then gets called multiple times due to the inspect.getsource calls in multiple places of torch.

Specifically, we set a breakpoint at the side effect of the mocked os.path.exists. P1787425831 shows the importing stack trace before the change. P1787431638 shows the importing stack trace after the change.

The notable difference is that in the second paste, we trigger an os.path.exists call when somewhere in triton inspect.getsourcelines is invoked while constructing OnDiskPreprocStateSerializer, which gets recorded by the mock.

Looking at the test, it seems what it actually wants to test is the deserialize step. So we reset_mock before that step to avoid counting mock calls that happened at import time.

Test Plan:
buck2 run apf/data/tests:preproc_state_serializer_test -- --filter-text "test_load_artifact"

and existing tests for map.

Differential Revision: D73138415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151479
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-04-23 19:16:40 +00:00
99ae7d4069 Reland fast gather and index implementation (#151917)
This PR reapplies #151490 and #151753 together, and adds some missing checks when applying the fast path.
Previously missed checks:
1) indexing path has the stride in the indexed dimension in bytes, gather path has the stride in the indexed dimension in elements. When checking if fast path is applicable, I didn't take this difference into account, and still multiplied the indexing stride by element size. Fixed and test added
2) We want to take fast path only when we are copying contiguous equally spaced slices of inputs + all the necessary alignment requirements. The effective tensor size should be 2d (after all possible flattening is applied), the index stride in the last dimension should be 0, and, since in the kernel we are not applying non-indexing-related offsets to src tensor, the src tensor stride in the second dimension should be 0. This automatically happens for gather with dim=0, so I didn't put in an explicit condition for this. Sometimes all conditions except first dim "effective" stride equal to 0 are satisfied for scatter on non-zero dim, when index size in the indexing dimension is 1 and thus it is collapsed (dimensions of size 1 are always collapsed), e.g.
```
        # test gather along the 1st dim that can accidentally trigger the fast path:
        # because the index dimension in the gather dim is 1,
        # an unexpected squashing happens in TensorIterator
        src = make_tensor((16, 2, 16), device=device, dtype=dtype)
        ind = torch.randint(2, (16, 1), device=device).view(16, 1, 1).expand(16, 1, 16)
        res = torch.gather(src, dim=1, index=ind)
        if res.device.type == "cuda":
            ref_cpu = torch.gather(src.cpu(), dim=1, index=ind.cpu())
            self.assertEqual(res.cpu(), ref_cpu, atol=0, rtol=0)
```
Note that if index size here was (16, 2, 16) instead of (16, 1, 16) then the middle dimension could not be collapsed and we wouldn't end up incorrectly taking fast path.
We could update the kernel to take this stride into account when computing offsets into src tensor, or we could specifically disallow non-zero stride on the first dimension. I took the second path for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151917
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/Skylion007
2025-04-23 19:13:13 +00:00
69e41cee04 move find_hop_schema into _higher_order_ops/schema.py (#151147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151147
Approved by: https://github.com/zou3519
2025-04-23 18:26:37 +00:00
5acc3e286a [Inductor] Add Additional Configs for persistent+TMA version of Triton mm and addmm (#150587)
Summary:
This PR introduces additional autotuning configurations for the persistent+TMA version of Triton `mm` and `addmm` operations. The new configurations are as follows:
* `(128, 128, 64, 5, 8)`
* `(256, 128, 64, 4, 8)`
* `(128, 128, 64, 5, 4)`

These configurations were selected based on exhaustive autotuning performed on commonly used shapes from an internal foundational model.

While these new configs are generally more performant across the board, we see notable gains in a few specific cases:
* In scenarios where `n >> m, k`, the configurations `(128, 128, 64, 5, 8)` and `(256, 128, 64, 4, 8)` tend to produce an additional 5-10% speedup over the aten baseline compared to the original configurations.
* Similarly, the configuration `(128, 128, 64, 5, 4)` yields approximately an 8% improvement in scenarios where `k >> m, n`.

These enhancements are expected to provide performance benefits across diverse use cases, particularly when compared to the original set of configurations.
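
For illustration, a sketch of the path on which these configs get exercised (not part of the PR; whether the persistent+TMA templates are eligible additionally depends on inductor configuration and hardware support):
```python
import torch

@torch.compile(mode="max-autotune")
def mm(a, b):
    return a @ b

if torch.cuda.is_available():
    a = torch.randn(1024, 2048, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(2048, 8192, device="cuda", dtype=torch.bfloat16)
    # Inductor benchmarks the candidate Triton mm configs against the aten
    # baseline and picks the fastest one for this shape.
    out = mm(a, b)
```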

Test Plan:
contbuild & OSS CI

Reviewers: paulzhan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150587
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/eellison
2025-04-23 18:21:35 +00:00
3c1a17a08b [Dynamo] Use LazyVariableTracker in base VT (#151847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151847
Approved by: https://github.com/StrongerXi
2025-04-23 18:18:01 +00:00
aa285e6512 Revert "[cutlass backend] delay construction of cutlass presets to when called (#151875)"
This reverts commit 8ca7953d510deb21cd99b92523f73beafa4588bf.

Reverted https://github.com/pytorch/pytorch/pull/151875 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/151875#issuecomment-2825030726))
2025-04-23 17:33:31 +00:00
5f63789dd2 [torchbind] fix error message when attr is a real tensor. (#151944)
Summary: Previously, when attr is defined, "if attr" would try to evaluate the data of attr, which is not intended, and we would get an ugly error stack if attr is not evaluable (like a fake tensor) before reaching the callable(attr) check.

Test Plan: Existing tests.

Reviewed By: yushangdi, henryoier

Differential Revision: D73460905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151944
Approved by: https://github.com/yushangdi
2025-04-23 17:32:11 +00:00
9344da8bd1 Revert "[fake tensor cache] Support index with non bool/int8 indices (#151477)"
This reverts commit bdb34f55a0c44f82d914dc9b41e785b2eed97675.

Reverted https://github.com/pytorch/pytorch/pull/151477 on behalf of https://github.com/wdvr due to reverting confusing ghstack state ([comment](https://github.com/pytorch/pytorch/pull/151477#issuecomment-2825023953))
2025-04-23 17:30:27 +00:00
348272e67e Revert "[invoke_subgraph][fake tensor] Add finalizer on subgraph instead of the functionalize ctx wrapper (#151633)"
This reverts commit 02dd096e5154867f6eb463d434b9eba0bdc85a64.

Reverted https://github.com/pytorch/pytorch/pull/151633 on behalf of https://github.com/wdvr due to reverting confusing ghstack state ([comment](https://github.com/pytorch/pytorch/pull/151633#issuecomment-2825007363))
2025-04-23 17:23:23 +00:00
2ab752d720 Make torch.jit.Error inherit from Exception (#151947)
Summary:
I can confirm that `torch.jit.Error.mro()` contains `Exception` in the inheritance hierarchy.

This avoids a bunch of `pyre-ignore`s in D73352417.
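
A minimal illustrative check of the property above (not part of the diff):
```python
import torch

# Exception is now in the MRO, so plain `except Exception` blocks catch it.
assert issubclass(torch.jit.Error, Exception)
print(torch.jit.Error.mro())
```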

Test Plan: Sandcastle

Differential Revision: D73464544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151947
Approved by: https://github.com/Skylion007
2025-04-23 17:19:25 +00:00
9422e24c47 [MPS] Fix test_neg_index_mps (#151966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151966
Approved by: https://github.com/malfet
2025-04-23 17:06:28 +00:00
a560216abb Update description for torch.random.fork_rng (#151881)
As the title stated.

Related ISSUE:
https://github.com/pytorch/pytorch/issues/151784
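
For context, a minimal usage sketch of the API whose description is updated (not taken from the PR):
```python
import torch

torch.manual_seed(0)
before = torch.rand(1)

torch.manual_seed(0)
with torch.random.fork_rng():
    _ = torch.rand(1000)  # consumes RNG state only inside the forked scope
after = torch.rand(1)

print(torch.equal(before, after))  # True: the outer RNG stream is unaffected
```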
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151881
Approved by: https://github.com/albanD
2025-04-23 16:59:29 +00:00
05114679b7 [ROCm] AtomicAdd specialization on AMD for fp64. (#151724)
Fixes https://github.com/pytorch/pytorch/issues/151039

Improve scatter add performance on MI250X.

Some numbers from the reporter's benchmark:
```
Before: dtype torch.float64 time =  3.577979326248169
After: dtype torch.float64 time =  0.0031385421752929688
```
No perf. improvement to MI300 or MI100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151724
Approved by: https://github.com/jeffdaily
2025-04-23 16:33:32 +00:00
e31e2d27c6 Turn on static cuda launcher in OSS (#151691)
After a few small bugfixes in tests (to make us throw/catch exceptions similar to triton's), I think we're ready to flip the switch and turn StaticCudaLauncher on by default in OSS.

Initial round of benchmarks look good, with average compilation time going down by a few percent:
<img width="828" alt="image" src="https://github.com/user-attachments/assets/cad03e09-b4d6-49a7-a9e5-6068d1c0bd5c" />

With no changes to runtime perf:
<img width="823" alt="image" src="https://github.com/user-attachments/assets/3fcd435e-1057-43f4-878b-8d66a3812a10" />

There are a few noisy models I want to double check, though, so I will run some more tests before accepting review.

Full benchmark results, showing a ~5% compile time improvement across the board:
https://hud.pytorch.org/benchmark/huggingface/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2002%3A31%3A12%20GMT&stopTime=Wed%2C%2023%20Apr%202025%2002%3A31%3A12%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/jamesjwu/139/orig&lCommit=cc45c8667fa23dec16ca50002d9504a34688ca5c&rBranch=main&rCommit=2a9afdae81d0dde98e96d7e3c9ca840e241e5405
<img width="1482" alt="image" src="https://github.com/user-attachments/assets/6e6a7f39-7f44-459f-9845-9a37f084ea82" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151691
Approved by: https://github.com/oulgen
2025-04-23 15:43:24 +00:00
dcc32ff5bf [CUDA][cuBLAS][cuBLASLt] Opt-in unified cuBLAS + cuBLASLt workspaces (#151163)
opt-in version of https://github.com/pytorch/pytorch/pull/145130 as there was a lack of repro for the 70% forward issue
`TORCH_CUBLASLT_UNIFIED_WORKSPACE=1`

@izaitsevfb could you comment if it was repeatable per every forward pass, on startup, or something else?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151163
Approved by: https://github.com/ngimel
2025-04-23 15:24:22 +00:00
7310049c42 Revert "[FlexAttention] Fix device test instantation (#151846)"
This reverts commit b37fa20771a7aa1ddcfaf59df7e56683d3d0be3b.

Reverted https://github.com/pytorch/pytorch/pull/151846 on behalf of https://github.com/jithunnair-amd due to PR broke rocm workflow ([comment](https://github.com/pytorch/pytorch/pull/151846#issuecomment-2824607429))
2025-04-23 15:01:36 +00:00
21b0ef520d [Easy] Remove redundant code (#151883)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151883
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-04-23 14:25:19 +00:00
b32b002a6e [BE] Replace std::runtime_error with TORCH_CHECK [1/N] (#151880)
Part of: #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151880
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/cyyever
2025-04-23 11:14:35 +00:00
6d28d61323 [CI] Remove protobuf from docker image (#151933)
Pretty sure the source should be the one in third-party

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151933
Approved by: https://github.com/huydhn
2025-04-23 10:29:09 +00:00
5b9df57b50 [dynamo] context manager/decorator for dynamo config patching during tracing (#150586)
Implement traceable config patching for Dynamo: enables restricted patching of Dynamo config where user can use a context manager/decorator to change tracing behavior for parts of the code.

The new `dont_skip_tracing` decorator/context manager for ignoring most trace rules is easily implemented with this more generic traceable config patching feature.

Implementation:
- Create a new specialized context manager class representing a wrapper around torch._dynamo.config.patch
- Dynamo doesn't trace into the context manager but updates config at compile time
- Correctness is based on our correctness for handling supported context managers
- Implementation is inspired by how `GradModeVariable` is implemented.

Previous attempts: https://github.com/pytorch/pytorch/pull/148736 (decorator-only global approach) and https://github.com/pytorch/pytorch/pull/149439 (decorator-only traceback approach)

See https://docs.google.com/document/d/1vWNwKL_jpg-PLopifcaSa338wks3GqSVF4GHRguybGg/edit?tab=t.0 for more details on implementation - including previous approaches.

NOTE: this PR fixes a bug where skipped code objects were not tracked by convert_frame.py, leading to cases where code objects would be automatically skipped even after `torch._dynamo.reset()`. This exposed some latent dynamo-wrapped test failures in CI that previously passed in CI but not locally.
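
For illustration, a minimal sketch of how the decorator is meant to be used, assuming it is exposed as `torch._dynamo.dont_skip_tracing` (the exact import path is an assumption based on this description):
```python
import torch

@torch._dynamo.dont_skip_tracing
def helper(x):
    # A function that Dynamo's trace rules would normally skip (e.g. because
    # of the file it lives in) is traced instead while this wrapper is active.
    return x + 1

@torch.compile
def f(x):
    return helper(x) * 2

print(f(torch.randn(3)))
```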

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150586
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
2025-04-23 09:12:13 +00:00
62b5649b76 [Inductor] Test ND block pointers with dynamic shapes (#151646)
With ND tiling, we can get multi-dimensional block pointers with dynamic shapes. This is an important capability, but I couldn't find any CI tests for it. This PR adds a couple of tests checking that we get the expected block pointers with dynamic shapes, both for pointwise and reduction kernels.

Example kernels:
```
@triton.jit
def triton_poi_fused_div_0(in_ptr0, out_ptr0, ks0, ks1, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    yoffset = (tl.program_id(1) + tl.program_id(2) * tl.num_programs(1)) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[:, None]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, :]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[ks0, ks0], strides=[ks1, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), boundary_check=[0, 1])
    tmp1 = (tmp0 / tmp0)
    tl.store(tl.make_block_ptr(out_ptr0, shape=[ks0, ks0], strides=[ks0, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp1, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])

@triton.jit
def triton_red_fused_prod_0(in_ptr0, out_ptr0, ks0, ks1, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr, R1_BLOCK : tl.constexpr):
    xnumel = 1
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], True, tl.int1)
    r0_base = tl.arange(0, R0_BLOCK)[None, :, None]
    r1_base = tl.arange(0, R1_BLOCK)[None, None, :]
    rbase = r1_base + r0_base*r1_numel
    block_ptr0 = tl.make_block_ptr(in_ptr0, shape=[ks0, ks0], strides=[ks1, 1], block_shape=[R0_BLOCK, R1_BLOCK], order=[1, 0], offsets=[0, 0])
    _tmp2 = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], 1, tl.float32)
    for r0_offset in range(0, r0_numel, R0_BLOCK):
        r0_index = r0_offset + r0_base
        r0_mask = r0_index < r0_numel
        for r1_offset in range(0, r1_numel, R1_BLOCK):
            r1_index = r1_offset + r1_base
            r1_mask = r1_index < r1_numel
            roffset = r1_offset + r0_offset*r1_numel
            rindex = r1_index + r0_index*r1_numel
            r0_0 = r0_index
            r1_1 = r1_index
            tmp0 = tl.load(block_ptr0, boundary_check=[0, 1], padding_option='zero', eviction_policy='evict_first')[None, :, :]
            tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
            tmp3 = _tmp2 * tmp1
            _tmp2 = tl.where(r0_mask & r1_mask, tmp3, _tmp2)
            block_ptr0 = tl.advance(block_ptr0, [0, R1_BLOCK])
        block_ptr0 = tl.advance(block_ptr0, [R0_BLOCK, (-1)*R1_BLOCK*(triton_helpers.div_floor_integer((-1) + ks0 + R1_BLOCK,  R1_BLOCK))])
    tmp4 = tl.reshape(_tmp2, [XBLOCK, RBLOCK])
    tmp2 = triton_helpers.prod(tmp4, 1)[:, None, None]
    tl.store(out_ptr0 + (tl.full([XBLOCK, 1, 1], 0, tl.int32)), tmp2, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151646
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/shunting314
2025-04-23 06:20:04 +00:00
ee81fe40c1 Support regexes in dynamic sources allowlist (#151766)
As requested by Shuai. I also included an additional refactor to capture
changes in the whitelist over time: previously, once it was set the first
time, it was impossible to override it when a new config was set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151766
Approved by: https://github.com/pianpwk
2025-04-23 06:17:16 +00:00
7c97720d16 [dynamic shapes] rewrite expand with guard_or_false (#150236)
Rewrites the expand decomposition to avoid unbacked errors, assuming the general path where `input shape == output shape or input shape == 1`.
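
For illustration, a minimal sketch of the `guard_or_false` pattern (not the actual decomposition; assumes `guard_or_false` is importable from `torch.fx.experimental.symbolic_shapes`):
```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

def pick_expanded_size(input_size, target_size):
    # For an unbacked input_size this returns False instead of raising a
    # data-dependent error, so we fall through to the general path where
    # input_size == target_size is assumed and checked at runtime.
    if guard_or_false(input_size == 1):
        return target_size  # broadcast path
    return input_size
```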

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150236
Approved by: https://github.com/laithsakka
2025-04-23 06:11:11 +00:00
097faa9217 [audio hash update] update the pinned audio hash (#151729)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151729
Approved by: https://github.com/pytorchbot, https://github.com/Skylion007
2025-04-23 06:04:32 +00:00
b247e5db33 [Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AMX (#150603)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds AMX-based GEMM templates for `torch.ops.aten_weight_int4pack_mm_for_cpu`. It brings performance benefits on platforms where AMX is available.

**Validation results**
We have run GPT-J-6B and Llama-3-8B-Instruct on a 6th gen Xeon with 96 cores. Results show that the AMX-based microkernel outperforms AVX512-based one by >5x for prefill stage with 1024 input length.

**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150603
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-04-23 05:58:55 +00:00
54f736155b [dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)
For reshape/view: removes fast paths for 0 elements, checking dimensions to skip. Modifies the loop accumulating input elements, to raise a UserError if we run out of dimensions, graph breaking for compile and erroring out for export.
For infer_size: assume that if the user passes us an unbacked size, it's probably not -1

Will think about changes in https://docs.google.com/document/d/1WYx6EZwVDXtBnWyrzoecgGWdiK0V3XZKftfpWwQ5i3E/edit?tab=t.0#heading=h.22k54zym11qp in a later PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150127
Approved by: https://github.com/laithsakka
2025-04-23 05:42:30 +00:00
b37fa20771 [FlexAttention] Fix device test instantation (#151846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151846
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng, https://github.com/mlazos
2025-04-23 05:37:25 +00:00
cc793e895e [StandaloneCompile] Autotune at compile time (#151922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151922
Approved by: https://github.com/jamesjwu
ghstack dependencies: #151921
2025-04-23 04:32:06 +00:00
f9bdfe90ae [MegaCache] Return None on no compilation (#151921)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151921
Approved by: https://github.com/jamesjwu
2025-04-23 04:32:06 +00:00
78bbb468c6 Use /var/tmp instead of /tmp for torch cache directory on fbcode (#151466)
Summary:
We've been noticing that the cache directory has been getting cleaned underneath us; let's use /var/tmp, which is supposed to be cleaned less frequently.

https://fb.workplace.com/groups/257735836456307/posts/883428143887070

Test Plan: unit tests

Reviewed By: masnesral

Differential Revision: D73008663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151466
Approved by: https://github.com/masnesral
2025-04-23 03:30:51 +00:00
253059356f [Cutlass] Implement EVT example tensor creation (#150904)
This PR implements a translation layer from inductor IR to "example tensors", the expected arguments of the EVT tracer. These tensors basically store the name, shape, stride, and dtype of the tensor and allow an AST-based Python parser to generate the EVT C++.

Updates to example tensor creation.

Previously merged:
* https://github.com/pytorch/pytorch/pull/150903
* https://github.com/pytorch/pytorch/pull/150346
* https://github.com/pytorch/pytorch/pull/150345
* https://github.com/pytorch/pytorch/pull/150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150904
Approved by: https://github.com/eellison
2025-04-23 03:26:56 +00:00
cd021d048e Fix circular imports (#151939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151939
Approved by: https://github.com/jamesjwu
2025-04-23 02:53:32 +00:00
13339ce086 [dynamic shapes] bound_sympy for size-oblivious min/max reasoning (#151242)
Differential Revision: D72978020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151242
Approved by: https://github.com/bobrenjc93
2025-04-23 02:14:05 +00:00
74074fe8d8 [inductor] handle offset in ReinterpretView for alignment (#151859)
Fix https://github.com/pytorch/pytorch/issues/151589

It's interesting that the Q4_K dequantization example in the referenced GH issue does not crash even when Inductor passes triton the wrong alignment information. I dug into this a bit. The main reason is that there are 2 things in triton that decide the vectorization size:
1. alignment
2. the max number of contiguous elements a thread needs to process

Here is the triton code that decides vectorization size [link](c5fed8e1ca/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/LoadStoreOpToLLVM.cpp (L147-L157)), and here is the triton code that considers contiguity for vectorization [link](c5fed8e1ca/lib/Analysis/AxisInfo.cpp (L1250-L1269))

When Inductor wrongly tells triton that an unaligned tensor is aligned, Triton may not vectorize (or not fully vectorize) because of the second restriction.

Check this test:
```
    @parametrize(
        "size",
        (
            128,
            1024,
            1024 * 1024,
        ),
    )
    def test_slice_view_dtype(self, size):
        offset = 1

        def f(x):
            return x[2:].view(dtype=torch.float32) + 1

        x = torch.randn((size + offset) * 2, dtype=torch.bfloat16, device=self.device)
        self.common(f, (x,), reference_in_float=False)
```

Before the fix, Inductor would tell Triton that the output of the aten.view.dtype op is aligned even though it's not. That tensor is then passed to the triton kernel for the aten.add. Triton may make different vectorization decisions depending on the tensor size:
1. when size = 128, triton picks ld.global.b32 to load data from global memory
2. when size = 1024, triton uses ld.global.v2.b32
3. when size = 1024 * 1024, triton uses ld.global.v4.b32

So whether wrong alignment metadata causes an issue depends on whether triton picks the vectorized instructions. The latter depends on the triton config (block size) decided by inductor and on triton's internal logic (how elements are assigned to each thread). We'd better make sure Inductor always generates correct metadata so that such hidden issues do not turn into crashes later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151859
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #151841
2025-04-23 01:50:49 +00:00
68a7501dab [Inductor][CPP] Fix Codegen Issue when Parallel Reduction under the vectorization (#151887)
**Summary**
Fixes [#151290](https://github.com/pytorch/pytorch/issues/151290) and [#151523](https://github.com/pytorch/pytorch/issues/151523), which are regressions introduced by [#144020](https://github.com/pytorch/pytorch/pull/144020). That PR enabled parallelization at the inner loop level.

However, a currently unsupported case arises when parallel reduction occurs under the vectorization loop level, specifically in patterns like:
```
for vec_loop_level:
    do_parallel_reduction
```
In such cases, a temporary buffer `tmp_acc_array` is allocated for tail scalar kernels, and another temporary buffer `tmp_acc_array` is also defined for parallel reduction. This results in a conflict due to overlapping temporary buffers. This PR disables the problematic case to avoid the conflict until proper support is implemented.

**Test Plan**
```
python test/inductor/test_flex_attention.py -k test_make_block_mask_cpu
python test/inductor/test_cpu_repro.py -k test_parallel_reduction_vectorization
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151887
Approved by: https://github.com/jansel
2025-04-23 00:41:14 +00:00
015b526a2a [MPSInductor] Warn-cast double as floats (#151963)
To support sqrt over dynamic shapes, i.e. make something like:
```python
torch.compile(dynamic=True)(lambda x: x * math.sqrt(x.size(0)))
```
compilable into
```metal
// Source node to ATen node mapping:
// Graph fragment:
//   %scalar_tensor_default : [num_users=1] = call_function[target=torch.ops.aten.scalar_tensor.default](args = (%arg0_1,), kwargs = {})
//   %convert_element_type_default : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%scalar_tensor_default, torch.float64), kwargs = {})
//   %sqrt_default : [num_users=1] = call_function[target=torch.ops.aten.sqrt.default](args = (%convert_element_type_default,), kwargs = {})
//   %convert_element_type_default_1 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%sqrt_default, torch.float32), kwargs = {})
//   %mul_tensor : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg1_1, %convert_element_type_default_1), kwargs = {})
 kernel void generated_kernel(
     device float* out_ptr0,
     constant float* in_ptr0,
     constant long& ks0,
     uint xindex [[thread_position_in_grid]]
 ) {
     int x0 = xindex;
     auto tmp0 = in_ptr0[x0];
     auto tmp1 = ks0;
     auto tmp2 = static_cast<float>(tmp1);
     auto tmp3 = metal::sqrt(tmp2);
     auto tmp4 = static_cast<float>(tmp3);
     auto tmp5 = tmp0 * tmp4;
     out_ptr0[x0] = static_cast<float>(tmp5);
 }
```

TODO:
 - Figure out if this could be tweaked in fx-passes, but overhead is probably too high

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151963
Approved by: https://github.com/dcci
ghstack dependencies: #151869, #151871, #151872
2025-04-23 00:30:45 +00:00
49b7ffbb15 [MPS] Implement _print_Trunc_to_Int (#151964)
Fixes `test_device_assert_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151964
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-23 00:30:00 +00:00
72f711e200 Revert "[inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)"
This reverts commit 8d81806211bc3c0ee6c2ef235017bacf1d775a85.

Reverted https://github.com/pytorch/pytorch/pull/150888 on behalf of https://github.com/henrylhtsang due to Revert because this change isn't needed ([comment](https://github.com/pytorch/pytorch/pull/150888#issuecomment-2822768377))
2025-04-23 00:26:49 +00:00
334aab0dea Updates NCCLConfig with QOS variable (#151821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151821
Approved by: https://github.com/kwen2501
2025-04-23 00:03:49 +00:00
aa61707a56 Fix extra heap allocation in Source constructor (#151800)
This was a sneaky one: the StringCordView default constructor allocates.

Differential Revision: [D73129448](https://our.internmc.facebook.com/intern/diff/D73129448/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151800
Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #151682
2025-04-22 23:36:06 +00:00
cd576fdce5 [torch][fx] Add support for EXIR dialect overload ops in normalize_function (#143689)
Summary:
I had a minor annoyance when debugging graphs using EXIR dialect ops:
all the function normalization went away. For functions with > 5 arguments,
some of which are just simple bools and ints, it's very helpful to have
the kwarg names attached.

Enhance `normalize_target` to handle EdgeOpOverload targets. To avoid
a circular dependency on Executorch from pytorch core, I just use a `hasattr`
check for "_op". This only happens if the target is not already a recognized
torch function.

Also, I noticed that the new `fx.Node.normalized_arguments` function
didn't forward an important kwarg to `normalize_target`, so I fixed that too.
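
For illustration, the existing normalization behavior this extends, shown on a plain torch op rather than an EXIR EdgeOpOverload (a hedged sketch, not from the diff):
```python
import torch
from torch.fx.operator_schemas import normalize_function

t = torch.randn(3)
# Attaches kwarg names to positional arguments (e.g. input/other), which is
# what makes large argument lists readable in graph dumps.
normalized = normalize_function(torch.add, (t, t), normalize_to_only_use_kwargs=True)
print(normalized)
```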

Test Plan: Tested with FxGraphDrawer and an fx Graph containing EXIR nodes.

Differential Revision: D67545909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143689
Approved by: https://github.com/angelayi
2025-04-22 23:36:02 +00:00
4f8adde5ce Speed up OperatorEntry construction by avoiding updateDispatchTableFull_ (#151682)
The purpose of the updateDispatchTableFull_ call is, according to the comment, just to pick up fallback kernels if there are any. We can implement that directly more efficiently.

Differential Revision: [D73129447](https://our.internmc.facebook.com/intern/diff/D73129447/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151682
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/bdhirsh
2025-04-22 23:35:53 +00:00
c98340e268 [autodeps2] Replace third-party/pyyaml with third-party/pypi/pyyaml (#151668)
Summary: We should use the pypi version.

Test Plan: CI

Differential Revision: D73211869

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151668
Approved by: https://github.com/Skylion007
2025-04-22 23:27:13 +00:00
f4ac9a160d [fx] Filter stacktrace (#151029)
Filter the stack trace so that the stack trace attached to nodes when using fx.Tracer looks nicer. I just copied the filtering we have in [proxy_tensor.py](6720d23969/torch/fx/experimental/proxy_tensor.py (L1903-L1931)).

Previously the stacktrace looked like:
```
File "/data/users/angelayi/pytorch/moo.py", line 3964, in <module>
    run_tests()
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 1342, in run_tests
    unittest.main(argv=argv)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/main.py", line 101, in __init__
    self.runTests()
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/main.py", line 271, in runTests
    self.result = testRunner.run(self.test)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/runner.py", line 184, in run
    test(result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 122, in run
    test(result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 122, in run
    test(result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 650, in __call__
    return self.run(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3324, in run
    self._run_custom(
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3296, in _run_custom
    super_run(result=result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3156, in wrapper
    method(*args, **kwargs)
  File "/data/users/angelayi/pytorch/moo.py", line 1495, in test_stack_trace
    gm = torch.fx.GraphModule(m, tracer.trace(m))
  File "/data/users/angelayi/pytorch/torch/fx/_symbolic_trace.py", line 837, in trace
    (self.create_arg(fn(*args)),),
  File "/data/users/angelayi/pytorch/moo.py", line 1485, in forward
    x = x * 2
  File "/data/users/angelayi/pytorch/torch/fx/proxy.py", line 716, in impl
    return tracer.create_proxy("call_function", target, args, kwargs)
  File "/data/users/angelayi/pytorch/torch/fx/proxy.py", line 248, in create_proxy
    proxy.node.stack_trace = "".join(CapturedTraceback.extract().format())
```
Now it looks like:
```
File "/data/users/angelayi/pytorch/moo.py", line 1485, in forward
    x = x * 2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151029
Approved by: https://github.com/jfix71, https://github.com/zou3519, https://github.com/jingsh
2025-04-22 22:50:36 +00:00
a7ccd96bbf logging start of torch elastic workers. (#150849)
Summary:
We would like to log the start of the workers. It will help make the logging complete.

Test Plan:
unit tests

https://www.internalfb.com/intern/testinfra/testrun/6473924724652056

e2e tests
https://www.internalfb.com/mlhub/pipelines/runs/mast/f712311762-27449483648-TrainingApplication_V403K?job_attempt=0&version=0&tab=execution_details&env=PRODUCTION

Reviewed By: tnykiel

Differential Revision: D72297314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150849
Approved by: https://github.com/d4l3k, https://github.com/kiukchung
2025-04-22 22:35:06 +00:00
6a1b820255 [export] Enable symint inputs for AdditionalInputs and ShapesCollection (#151842)
With `AdditionalInputs`, the behavior is the same as with tensors:
```python
class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

additional_inputs = torch.export.AdditionalInputs()
additional_inputs.add((5, 5))
additional_inputs.add((3, 5))
additional_inputs.add((5, 4))
ep = torch.export.export(
    M(), (6, 7), dynamic_shapes=additional_inputs, strict=False
)
```

With `ShapesCollection`, we now need to wrap integer inputs as `_IntWrapper` so that we can have a unique identifier for each integer input.
```python
class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

from torch.export.dynamic_shapes import _IntWrapper

args = (_IntWrapper(5), _IntWrapper(5))
# Or we can do `args = pytree.tree_map_only(int, lambda a: _IntWrapper(a), orig_args)`
shapes_collection = torch.export.ShapesCollection()
shapes_collection[args[0]] = Dim.DYNAMIC
shapes_collection[args[1]] = Dim.DYNAMIC
ep = torch.export.export(
    M(), args, dynamic_shapes=shapes_collection, strict=False
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151842
Approved by: https://github.com/pianpwk
2025-04-22 22:29:18 +00:00
43de9b75c3 Remove mention of magma-cuda in readme.md, refactor magma_conda install (#147476)
Related to https://github.com/pytorch/pytorch/issues/138506: we migrated the magma-cuda build from anaconda to AWS.
The last version of magma-cuda published was 12.6: https://anaconda.org/pytorch/magma-cuda126

Here is the PR that moved from anaconda to tarball: https://github.com/pytorch/pytorch/pull/140417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147476
Approved by: https://github.com/albanD
2025-04-22 22:08:49 +00:00
c0b70f94e2 [Testing] Enable test_mutations_loop_fusion_mps (#151872)
By testing it against float32 rather than double dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151872
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151869, #151871
2025-04-22 22:00:16 +00:00
2f851ac8f8 [MPSInductor] Implement atomic_add store mode (#151871)
This fixes `GPUTests.test_index_put2_mps`, `GPUTests.test__unsafe_masked_index_put_accumulate_mps` and dozens of scatter/gather tests that relied on the atomic_add store mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151871
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #151869
2025-04-22 22:00:16 +00:00
3aecf2dc52 [MPS] Extend index_put to half precision floats (#151869)
By reusing `c10/metal/atomic.h`.
This also fixes `GPUTests.test_index_put_fallback[12]_mps`, which is unrolled by inductor, so no dedicated atomic_add support is needed.

TODOs:
 - Get rid of indexing kernel and compute it directly when kernel is run
 - Simulate atomic_add for int64 types as series of int32 atomic-add-and-fetch
 - Setup tolerances correctly to pass float16/bfloat16 tests (as CPU always takes sequential strategy)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151869
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-04-22 22:00:08 +00:00
b8f4dc5a9f [ROCm] opportunistic fastatomics for ReduceAdd operations for MI300 GPUs (#146264)
In this approach, we catch any lanes within a wave that are doing fastatomics to the same destination address and compute the sum on the CU. This leads to a 3x improvement in scatter_add performance and a 2x improvement in index_select.

scatter_add performance on MI300x:
dtype|Baseline (before optimizations)|opportunistic fastatomics
-------|----------------------------------|----------------------------------
f32|1.389425039|0.430447996
fp16|2.195472956|0.779729486
bf16|2.194051027|0.784599513

Using the following reproducer
```
import torch
import triton

def main():
    dtype = torch.float32
    dim = 1305301
    a = torch.rand(100, device="cuda", dtype=dtype)
    index = torch.randint(0, 100, (dim,), device="cuda")
    src = torch.rand(dim, device="cuda", dtype=dtype)

    print("=" * 20)
    print(
        triton.testing.do_bench(
            lambda: a.scatter_add(0, index, src),
            return_mode="median",
        )
    )
    print("=" * 20)

if __name__ == "__main__":
    main()
```

co-authored by: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146264
Approved by: https://github.com/jeffdaily, https://github.com/mxz297

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-04-22 21:55:40 +00:00
e05ac9b794 Use folder tagged docker images for binary builds (#151706)
Should be the last part of https://github.com/pytorch/pytorch/pull/150558, except maybe for the s390x stuff, where I'm still not sure what's going on.

For binary builds, do what we do in CI: tag each image with a hash of the .ci/docker folder to ensure a docker image built from that commit gets used.  Previously it would use imagename:arch-main, which could be a version of the image based on an older commit.

After this, changing a docker image and then tagging with ciflow/binaries on the same PR should use the new docker images

Release and main builds should still pull from docker io

Cons:
* if someone rebuilds the image from main or a PR where the hash is the same (e.g. the folder is unchanged but the docker build is retriggered for some reason), the release would use that image instead of one built on the release branch
* spin wait for docker build to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151706
Approved by: https://github.com/atalman
2025-04-22 21:50:10 +00:00
017a6bd593 add min/max_seqlen to non_differentiable (#151750)
Fixes #148988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151750
Approved by: https://github.com/soulitzer
2025-04-22 21:46:02 +00:00
835413baed Revert "[Optimus][Observability] Improve tlparse logging (#151635)"
This reverts commit 06a3c3c8cdb2424d42d7926a49a18ee6852a40cb.

Reverted https://github.com/pytorch/pytorch/pull/151635 on behalf of https://github.com/clee2000 due to broke dynamo/test_structured_trace.py::StructuredTraceTest::test_ddp_graphs [GH job link](https://github.com/pytorch/pytorch/actions/runs/14600342064/job/40970324075) [HUD commit link](06a3c3c8cd), test did fail on PR but dr ci says it matches an existing failure, which it does, but also this PR breaks the test too ([comment](https://github.com/pytorch/pytorch/pull/151635#issuecomment-2822538113))
2025-04-22 21:39:23 +00:00
bc6c0bc344 Revert "Do not generate long log messaged for suppressed data dependent errors. (#151023)"
This reverts commit dfdf731579d7472a009f8edf35994b8701e79065.

Reverted https://github.com/pytorch/pytorch/pull/151023 on behalf of https://github.com/laithsakka due to breaking other PRs ([comment](https://github.com/pytorch/pytorch/pull/151023#issuecomment-2822483635))
2025-04-22 21:08:30 +00:00
459c62ee1d Revert "Do not log exception when recording is disabled or already recording (#151038)"
This reverts commit 73d95893a2b844ba8ee523e0e3915adf54017411.

Reverted https://github.com/pytorch/pytorch/pull/151038 on behalf of https://github.com/laithsakka due to breaking other PRs ([comment](https://github.com/pytorch/pytorch/pull/151023#issuecomment-2822483635))
2025-04-22 21:08:30 +00:00
aaf71a481b Revert "Log information about suppressed data dependent errors (#151041)"
This reverts commit ccd00359da3423ff7bae8ee682df10590fc844ce.

Reverted https://github.com/pytorch/pytorch/pull/151041 on behalf of https://github.com/laithsakka due to breaking other PRs ([comment](https://github.com/pytorch/pytorch/pull/151023#issuecomment-2822483635))
2025-04-22 21:08:30 +00:00
2f74cffab2 Remove reinterpret_casts with undefined behavior from stable/library.h (#151595)
There is a list of valid uses of `reinterpret_cast` (see https://en.cppreference.com/w/cpp/language/reinterpret_cast), and the use here was not on the list, hence undefined behavior. Implement what we meant using memcpy, which is well-defined.

Differential Revision: [D73200791](https://our.internmc.facebook.com/intern/diff/D73200791/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151595
Approved by: https://github.com/janeyx99
2025-04-22 20:24:47 +00:00
3380a46b44 Fix DTensorTestBase to barrier with device ids (#150896)
Try to get rid of the annoying warnings below when running the unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150896
Approved by: https://github.com/fegin
2025-04-22 20:22:55 +00:00
a48ccf02f9 [Inductor] move alignment tests to a separate file (#151841)
This is a pure code movement. test_torchinductor.py is already 15K lines of code. Move the alignment-related tests I added recently to a separate file. I need to add more tests of this kind.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151841
Approved by: https://github.com/jansel, https://github.com/eellison
2025-04-22 20:18:58 +00:00
596296fb0b [standalone_compile] Dynamic shape handling (#151788)
standalone_compile needs to get dynamic shape information from
somewhere. We add a new `dynamic_shapes` argument with three options:

1. from the passed-in graph (dynamic="from_graph"). This is the default.
2. from the example inputs, thereby specializing on them. (dynamic="from_example_inputs")
3. from the current tracing context (dynamic="from_tracing_context")

1 and 3 are not exactly the same. 2 can also be used for more advanced
things... (specialize on one input but not the other).

Most of this PR is tests.
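
A hedged sketch of how the new argument might be passed; the entry point name and exact signature shown here (`torch._inductor.standalone_compile`) are assumptions based on this description, not confirmed by this log:
```python
import torch

def f(x):
    return x.sin() + 1

x = torch.randn(8)
gm = torch.fx.symbolic_trace(f)  # assumes an FX graph captured for f

# Option 2 from the list above: specialize on the example inputs.
compiled = torch._inductor.standalone_compile(
    gm, (x,), dynamic_shapes="from_example_inputs"
)
```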

Test Plan:
- a lot of new tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151788
Approved by: https://github.com/oulgen
2025-04-22 20:17:24 +00:00
7e4b89ac6c fix spammy library deinit errors when user passes an invalid TORCH_LOGS argument (#151678)
fixes https://github.com/pytorch/pytorch/issues/151055. Thanks @desertfire for the patch that fixed this.

I was a bit careful about the test - I wanted to make sure the test accurately ensures that we don't regress and our error message is not spammy when users enter an invalid `TORCH_LOGS=....` argument. But I tried to avoid using expecttests, since people occasionally add new logging artifacts and I didn't want to add too much churn by forcing this to fail CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151678
Approved by: https://github.com/desertfire, https://github.com/zou3519
2025-04-22 20:13:52 +00:00
0bb9b89fb7 Revert "[compile][compile time traces] Add more dynamo traces (#151357)"
This reverts commit 607443b16be705788ab06e9a31e4569e0f1516c3.

Reverted https://github.com/pytorch/pytorch/pull/151357 on behalf of https://github.com/wdvr due to stack in a weird state - reverting for now ([comment](https://github.com/pytorch/pytorch/pull/151357#issuecomment-2822369232))
2025-04-22 20:12:44 +00:00
d0d4e992f1 [associative_scan] Fixes for assoc_scan testcases (#149988)
This PR fixes some issues with the testcases of `associative_scan`, in particular the problem where the compile_mode is inadvertently always set to `none`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149988
Approved by: https://github.com/ydwu4
2025-04-22 20:09:12 +00:00
8ca7953d51 [cutlass backend] delay construction of cutlass presets to when called (#151875)
In hindsight, always constructing the dict is a bit silly. We should only construct it when we need it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151875
Approved by: https://github.com/yangw-dev
2025-04-22 20:03:10 +00:00
6cd1741985 [ONNX] Update decomposition logic to loop over onnx registry (#151826)
Fixes #150367

This PR builds the decomposition table from the ONNX registry, which includes all registered ops, not only ATen and prim ops. This helps keep the custom ops that are specified in the custom_translation table from being decomposed during ONNX export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151826
Approved by: https://github.com/justinchuby
2025-04-22 19:40:52 +00:00
69ee6a9280 [Sana][HybridCache] Fix bug in detect_attr_assignment (#151824)
Summary: tree_flatten_with_map internally calls the unflatten function with a user-supplied function. But this function was not returning anything, causing the leaves to be None. This is wrong when the constructor is sensitive to this behaviour.

Test Plan: CI

Differential Revision: D73388529

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151824
Approved by: https://github.com/bdhirsh
2025-04-22 19:39:50 +00:00
337caacd4c Use more efficient mask to index computation (#151372)
This change addresses the third time/mem "spike" observed in

https://github.com/pytorch/pytorch/issues/151351

The change seems to perform better (time/mem) for both very sparse and very dense cases. It runs faster and claims less memory, as observed on both CPU and GPU. It even avoids OOM for larger cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151372
Approved by: https://github.com/eqy
2025-04-22 19:31:12 +00:00
fbd29527d8 [MPS] Move ops modifiers to testing utils so other tests can reuse (#151781)
Test collection check:
```
python -m pytest test/test_mps.py --collect-only
```
Before:
```
6390 tests collected in 8.34s
```

After:
```
6390 tests collected in 7.71s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151781
Approved by: https://github.com/malfet
2025-04-22 19:19:52 +00:00
982062dfc4 Cache the value of torch_key in subproc (#151057)
No need to recalculate torch_key in subprocesses; let's pass it from the main process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151057
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-04-22 18:54:06 +00:00
fa0f13b90b Fix doc requirements install error (#151787)
Fixes #151786

Change the version in the docs requirements to be consistent with the version in the [CI version file](https://github.com/pytorch/pytorch/blob/main/.ci/docker/requirements-docs.txt), which changed in #149331.

### Test Result

![image](https://github.com/user-attachments/assets/f8646c03-116f-4f1c-b017-11b70995626b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151787
Approved by: https://github.com/malfet
2025-04-22 18:33:44 +00:00
4bf09562e4 [EZ/Profiler] Update Submodule (#151843)
Summary: Update to d82680bbd4

Test Plan: CI

Differential Revision: D73397323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151843
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2025-04-22 18:19:43 +00:00
834a017fe3 Optimize register_full_backward_hook description when all input no grad (#151785)
Fixes #100528
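
For context, a small sketch of the behavior the updated description covers (not from the PR): the full backward hook does not fire when none of the module's inputs require grad.
```python
import torch
import torch.nn as nn

calls = []
m = nn.Linear(4, 2)
m.register_full_backward_hook(lambda mod, gin, gout: calls.append("hook"))

x = torch.randn(3, 4)        # requires_grad=False
m(x).sum().backward()        # gradients flow only to the parameters
print(calls)                 # [] - the hook was not called
```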

## Test Result

### Before

![image](https://github.com/user-attachments/assets/5dd2e1d3-3bb1-49d0-84bf-8a7a6b18fa4b)

### After

![image](https://github.com/user-attachments/assets/2e16d17b-1586-40d8-b0ef-35559fc064f4)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151785
Approved by: https://github.com/soulitzer
2025-04-22 17:57:31 +00:00
2c27597d6a Infra for handling builtin ops (min, max, math.pow) (#151348)
Reapply of https://github.com/pytorch/pytorch/pull/150003

Differential Revision: [D73050801](https://our.internmc.facebook.com/intern/diff/D73050801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151348
Approved by: https://github.com/zhxchen17
ghstack dependencies: #151347
2025-04-22 17:20:09 +00:00
264e8fb151 More fix for aot_export_module name collision during unlifting (#151684)
Summary: Also check the module's named buffers and parameters when resolving name collisions.

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```

Differential Revision: D73264885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151684
Approved by: https://github.com/angelayi
2025-04-22 16:59:33 +00:00
06a3c3c8cd [Optimus][Observability] Improve tlparse logging (#151635)
Summary: We improve tlparse logging for Optimus graph transformations to enable easier debugging.

Test Plan:
```
TORCH_TRACE=~/my_trace_log_dir CUDA_VISIBLE_DEVICES=5 buck2 run mode/opt //aps_models/ads/ecosystem/tooling/tools/efficient_module_suite/pyper_models:pyper_model_perf_benchmark -- --flow_id 720055919 --shrink_model --mfu_profile_module "impl.shared_arch.dense_sparse_interaction" --use_synthetic_data
```

Differential Revision: D73229681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151635
Approved by: https://github.com/Yuzhen11
2025-04-22 16:56:08 +00:00
5fc1eb85fc Add OIDC permissions to bazel workflow (#151456)
Update workflow to use OIDC authentication to access AWS resources rather than assuming the runner's default role. This is part of the multicloud effort to prepare jobs to support being run in non-AWS clouds.

The JWT ID token requires `id-token: write` in order to create the token for the job. See: https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-cloud-providers#adding-permissions-settings

Ref: pytorch-fdn/multicloud-ci-infra#3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151456
Approved by: https://github.com/malfet
2025-04-22 16:54:14 +00:00
5d316ce0d0 Add device check for inputs (#151828)
Summary: Generate device checks for inputs in AOTI. Enable with AOTI_RUNTIME_CHECK_INPUTS=1

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_runtime_checks_device_type_failed
```

Differential Revision: D73382824

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151828
Approved by: https://github.com/angelayi
2025-04-22 16:36:27 +00:00
3804aed32e Revert "[Inductor] Add Additional Configs for persistent+TMA version of Triton mm and addmm (#150587)"
This reverts commit 99aeee2c5f07f7fe6ec3f34aacb7db71569a60c5.

Reverted https://github.com/pytorch/pytorch/pull/150587 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally (see D73410693). To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/150587#issuecomment-2821828926))
2025-04-22 16:15:55 +00:00
4504910843 Revert "[ez] Make relaxed constraint error message more user friendly (#151407)"
This reverts commit e0f05229e9ff84aa6138df2bd51f5044bc743afb.

Reverted https://github.com/pytorch/pytorch/pull/151407 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally (see D73198095). To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts. ([comment](https://github.com/pytorch/pytorch/pull/151407#issuecomment-2821819654))
2025-04-22 16:12:42 +00:00
f072bf27a7 Revert "faster gather implementation (#151490)"
This reverts commit 541f8cd34cbccfcaf04a377f747390f83658d6ec.

Reverted https://github.com/pytorch/pytorch/pull/151490 on behalf of https://github.com/malfet due to Looks like it breaks demucs accuracy, though may be bogus, but let's try to revert, see c729f7dbee/3 ([comment](https://github.com/pytorch/pytorch/pull/151490#issuecomment-2821803788))
2025-04-22 16:09:14 +00:00
ed0d2ebaa0 Revert "Non-deterministic alert in histc_cuda for floating types only (#151701)"
This reverts commit b7a7741411585817daa81780b078fd15816f2d2d.

Reverted https://github.com/pytorch/pytorch/pull/151701 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing inductor tests to fail. See here for more info: test_torch.py::TestTorchDeviceTypeCUDA::test_nondeterministic_alert_histc_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/14586002763/job/40913547718) [HUD commit link](b7a7741411) ([comment](https://github.com/pytorch/pytorch/pull/151701#issuecomment-2821800837))
2025-04-22 16:07:25 +00:00
c729f7dbee [provenance_tracking][reland] Fix UT error and re-land ExternKernel support (#151709)
Summary:
ATT.

Previously reverted diff: D72572050

Test Plan:
```
 TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing_extern_kernel
```

Differential Revision: D73281217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151709
Approved by: https://github.com/jingsh
2025-04-22 15:44:56 +00:00
d778c92e16 [Metal][BE] Move atomic ops to c10/metal/atomic.h (#151868)
To be reused from indexing and the MPSInductor implementation of atomic_add stores
Added wrapper for `metal::atomic<int>`(to be used by followup PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151868
Approved by: https://github.com/Skylion007
2025-04-22 14:11:29 +00:00
159e2f96e3 [dynamo][ci] Fix recently broken test (#151877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151877
Approved by: https://github.com/masnesral, https://github.com/jansel
2025-04-22 06:42:03 +00:00
3aeeb77a3a [Dynamo][Easy] Remove unreachable code (#151739)
This line is unreachable:

f6c1cf04b5/torch/_dynamo/output_graph.py (L275)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151739
Approved by: https://github.com/Skylion007
2025-04-22 06:27:00 +00:00
ccd00359da Log information about suppressed data dependent errors (#151041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151041
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #151023, #151038
2025-04-22 06:07:57 +00:00
73d95893a2 Do not log exception when recording is disabled or already recording (#151038)
I am not sure why we log all exceptions here and re-raise them, but at least when recording is disabled this should be transparent; in particular, logging DDEs could be spamming.

before:
(screenshot before: https://github.com/user-attachments/assets/f90d4557-d958-4558-a917-0d687366cad1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151038
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #151023
2025-04-22 06:07:57 +00:00
dfdf731579 Do not generate long log messaged for suppressed data dependent errors. (#151023)
TORCH_LOGS="all" python test/test_dynamic_shapes.py -k test_guard_or_true

before:
(screenshot before: https://github.com/user-attachments/assets/3ee20de0-2902-4eb1-8ab0-80f1b974fb78)

after:
(screenshot after: https://github.com/user-attachments/assets/4e7e1f0c-856c-417f-8763-bfe183e2450d)

Note: we actually do not expect to see a log at all; this is an orthogonal issue in recording, which logs each error seen even when recording is not enabled. I will follow up with a PR for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151023
Approved by: https://github.com/bobrenjc93
2025-04-22 06:07:57 +00:00
a09a3f4c30 [Hierarchical compile] Ensure output nodes are sorted last (#151295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151295
Approved by: https://github.com/anijain2305
ghstack dependencies: #151293, #151294
2025-04-22 05:13:07 +00:00
283884b224 [Hierarchical Compile] Handle autocast ctx manager (#151294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151294
Approved by: https://github.com/anijain2305
ghstack dependencies: #151293
2025-04-22 05:13:07 +00:00
4a643af992 [Hierarchical Compile] Fix small bug (#151293)
This technically would never be exposed because we never check that a node is an ancestor of itself, but it is good for it to be correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151293
Approved by: https://github.com/anijain2305
2025-04-22 05:13:07 +00:00
e76c0b159a Revert "[dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)"
This reverts commit a02eae8142ddd8fbf068a3e17fc0dd276d92fc78.

Reverted https://github.com/pytorch/pytorch/pull/150127 on behalf of https://github.com/malfet due to Caused TestDynamoTimed.test_dynamo_timed to fail on macOS, see https://github.com/pytorch/pytorch/actions/runs/14584536979/job/40908019050 ([comment](https://github.com/pytorch/pytorch/pull/150127#issuecomment-2820081721))
2025-04-22 05:05:50 +00:00
0ff302e8e0 Revert "reroute index to fast implementation for indexing on 0th dimension (#151753)"
This reverts commit 4d78e19365c4e2189693c7a81b665d4ec2d2cf53.

Reverted https://github.com/pytorch/pytorch/pull/151753 on behalf of https://github.com/malfet due to Looks like it breaks bunch of distributed tests with DSA, see 4d78e19365 ([comment](https://github.com/pytorch/pytorch/pull/151753#issuecomment-2820078298))
2025-04-22 05:03:03 +00:00
95abc0f515 [c10d][fr] Fix another bug when we should continue when the op list is empty (#151798)
Differential Revision: D73375318

We shouldn't check the op list when it is empty. Later, when the empty list is popped from the queue, we will check for collective matching. Added a unit test for this case, which also covers the case fixed in https://github.com/pytorch/pytorch/pull/151683.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151798
Approved by: https://github.com/d4l3k, https://github.com/wconstab, https://github.com/fegin
2025-04-22 04:43:31 +00:00
6f327128a9 [MKLDNN] Check that strides are positive (#151848)
For pooling ops. Prevents division-by-zero when argument is wrong

Fixes https://github.com/pytorch/pytorch/issues/149274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151848
Approved by: https://github.com/atalman
2025-04-22 04:25:47 +00:00
29811f68d2 [Inductor][FlexAttention] fix vars_and_sizes divisor error (#151634)
Triton codegen currently [sorts vars by divisor](ae6f6b8efb/torch/_inductor/codegen/simd.py (L233-L237)). When there are two vars with the same divisor, the order is undecided.

```python
nodes.sort(
   key=lambda x: V.graph.sizevars.size_hint(
       x.divisor, fallback=config.unbacked_symint_fallback
   )
)
```

The test case leads to the following nodes:
```
(Pdb) nodes[0]
IterationRangesEntry(x1, ((s37 + 127)//128), 2, (xindex//ps0), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})

(Pdb) nodes[1]
IterationRangesEntry(x0, 1, ((s37 + 127)//128), ModularIndexing(xindex, 1, ps0), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})

(Pdb) nodes[2]
IterationRangesEntry(x2, 2*(((s37 + 127)//128)), ((s12 + 127)//128), (xindex//(2*(((s37 + 127)//128)))), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})

(Pdb) V.graph.sizevars.statically_known_equals(nodes[0].length, 2)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[1].length, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[2].length, 1)
True

(Pdb) V.graph.sizevars.statically_known_equals(nodes[0].divisor, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[1].divisor, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[2].divisor, 2)
True
```

Since x1 and x0 both have divisor 1, the relative order is random across runs.
In some runs, we have order [x1, x0, x2] with divisors as [1,1,2] and lengths as [2,1,1]. After x1, we have [divisor = divisor * node.length](ae6f6b8efb/torch/_inductor/codegen/simd.py (L246)) = 1 * 2 = 2. Then, when processing x0, we have node.divisor=1, divisor=2, and [FloorDiv(node.divisor, divisor)](ae6f6b8efb/torch/_inductor/codegen/simd.py (L251)) = 0, which indicates an iteration length of 0 and leads to errors later.

The fix is to sort by both divisor and length_is_one. So for two nodes with the same divisor, we process the node with length=1 first.
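A minimal, self-contained sketch of that tie-breaking rule (illustrative names only; `Range`, `sort_ranges`, and `size_hint` are stand-ins, not the actual Inductor `IterationRangesEntry` API):

```python
# Break divisor ties by handling length-1 ranges first, so the running divisor
# is not multiplied up before a length-1 node is consumed.
from collections import namedtuple

Range = namedtuple("Range", ["name", "divisor", "length"])

def sort_ranges(nodes, size_hint=lambda v: v):
    return sorted(
        nodes,
        key=lambda n: (size_hint(n.divisor), 0 if size_hint(n.length) == 1 else 1),
    )

nodes = [Range("x1", 1, 2), Range("x0", 1, 1), Range("x2", 2, 1)]
print([n.name for n in sort_ranges(nodes)])  # ['x0', 'x1', 'x2']
```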

Fixes #149789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151634
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-04-22 04:24:56 +00:00
529f698ad4 [logging] Put "everything" WaitCounters in dynamo_timed (#151757)
Summary: The main motivation is to capture the cudagraphs overhead in a WaitCounter. We'll combine that with Triton autotuning, and therefore rename to "compile_runtime_overheads". Since we have a couple WaitCounters where we want to capture all runtime and compile overheads, let's put the accounting in dynamo_timed so we'll automatically capture any toplevel timed regions that get added in the future. Also, dynamo_timed already has to figure out if we're timing a runtime vs. compile-time event, so we can reuse some of that logic.

Test Plan:
Ran an internal model with `TORCHINDUCTOR_BENCHMARK_FUSION=1` (to get benchmarking at compile time in addition to runtime).

Overall compile time from various sources matches up:
* tlparse: https://fburl.com/9fgsstkr. Eyeballing, total time should be 32 ranks x 2175 = ~69.6k s
* ods: https://fburl.com/canvas/r4clhnb7. Right on.
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/ax71aqox. Right on.
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/shcjd9ql. Right on.

And the runtime overhead:
* ods: https://fburl.com/canvas/nvgjb282
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/f2dtv0qh

If we compare that to a run of the same model without the changes in this stack, results can mismatch by a lot:
* tlparse: https://fburl.com/cchxwd1s. Eyeballing, total time should be 32 ranks x 2300s = ~73.5k s
* ods: https://fburl.com/canvas/x1i3wvf4. It's kinda close
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/l7sgxdxd. Waaay too high.
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/jb4s9z1u. This is the only one that's actually correct.

The discrepancy is even worse if we focus on the runtime events:
* ods: https://fburl.com/canvas/a4o9f7ou
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/95izaes1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151757
Approved by: https://github.com/ppanchalia
ghstack dependencies: #151749
2025-04-22 03:29:13 +00:00
edba20b853 [logging] Fix duration logging for dynamo_compile (#151749)
Summary: There are a few issues I'm solving:
1. It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use `end_time - start_time` in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events (a minimal sketch of this idea follows the list).
2. The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., `CachingAutotuner.benchmark_all_configs`. For example, `TORCHINDUCTOR_BENCHMARK_FUSION=1` enables benchmarking during compile time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingly.
3. If `log_compilation_events` were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety.
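A minimal sketch of the depth-counter idea from item 1, assuming a thread-local counter and a plain list accumulator (illustrative names; not the real dynamo_timed/MetricsContext code):

```python
# Only the outermost timed region contributes its duration, so nested events
# that get collapsed into a single row are not double counted.
import contextlib
import threading
import time

_tls = threading.local()

@contextlib.contextmanager
def timed_region(durations):
    depth = getattr(_tls, "depth", 0)
    _tls.depth = depth + 1
    start = time.perf_counter()
    try:
        yield
    finally:
        _tls.depth = depth
        if depth == 0:  # outermost region: accumulate its duration
            durations.append(time.perf_counter() - start)
```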

Test Plan:
Ran internal models and compared dynamo_compile to pt2_compile_events:
`TORCHINDUCTOR_BENCHMARK_FUSION=0`
* tlparse: https://fburl.com/itciwnxc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/yvkif5vb
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/segijet7

`TORCHINDUCTOR_BENCHMARK_FUSION=1`
* tlparse: https://fburl.com/jgurcvkw
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/uum91ceb
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/x4xnisez

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151749
Approved by: https://github.com/Skylion007
2025-04-22 03:29:13 +00:00
b7a7741411 Non-deterministic alert in histc_cuda for floating types only (#151701)
The note about atomic add only applies for floating point. The
implementation is deterministic for integer data types.

fixes: #151610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151701
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-04-22 03:24:36 +00:00
14e3ffb1ff Deprecate host allocator legacy APIs (#151437)
# Motivation
This PR aims to deprecate the host allocator legacy APIs and recommends that users use the unified `getHostAllocator(device_type)` API, for example:
```cpp
at::getHostAllocator(device_type)->allocate(...);
at::getHostAllocator(device_type)->empty_cache();
at::getHostAllocator(device_type)->record_event(...);
at::getHostAllocator(device_type)->get_stats();
at::getHostAllocator(device_type)->reset_accumulated_stats();
at::getHostAllocator(device_type)->reset_peak_stats();
```

# Additional Context
TODO:
- [ ] Move is_pinned from `AcceleratorHookInterface` to `HostAllocator`
- [ ] Deprecate `getPinnedMemoryAllocator` inside `AcceleratorHookInterface` and recommend using `getHostAllocator` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151437
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #151403, #151431
2025-04-22 03:13:24 +00:00
a4fdae5c84 Lift guard checking logic to AOTAutogradCache (#151563)
This somewhat complicated PR does a few things:
- It separates out a lot of the guard checking logic into its own class, GuardedCache[T]
- It adds a new `check_guard_hit` lambda to FXGraphCache._lookup_graph, which allows callers to define their own guard checking logic
- It then uses these two combined parts to lift guard checking to AOTAutogradCache. This means that AOTAutogradCache stores its own guard expressions and evaluates them.
- FXGraphCache's guard checking logic is completely unchanged, just refactored. As part of the work, I'm able to extend a bit of the logging functionality of AOTAutogradCache into FXGraphCache, so that you can know if FXGraphCache missed due to a guard failure or a full cache miss.

# Why do this?
Lifting guards to AOTAutogradCache has a few benefits:
- First, it fixes a long standing bug in guard checking logic. Backward passes can have different symint inputs than forward passes depending on forward output, if AOTAutograd chooses to store symints for the backward. These symint inputs have the same underlying symbols as the forward, but on AOTAutogradCache hit, we don't have access to the hints backing these exact symints (we only have hints for the symints on the forward function). By lifting guard checking logic to AOTAutogradCache, we no longer need to check the backward guards, as they'll be included in the AOTAutogradCache guard expression. **I've added a unit test that failed before my diff, and now passes, as an example of this**
- Secondly, this is the first step necessary to bundle CompiledFxGraph into AOTAutogradCache. Doing so will simplify our cache logic significantly, and also make precompile logic simpler, as precompiles will only need to store AOTAutogradCacheEntrys, without needing to match them up with inductor FXGraphCache entries.
- Finally, adding guard checking logic to AOTAutogradCache may allow us in the future to handle more complicated cases like a single forward with multiple backwards, as guard checks are now storable on the cache entry itself.

# Guard checking logic of AOTAutogradCache
When AOTAutogradCache evaluates guard expressions, it no longer needs to evaluate the forward/backward guards in the FXGraphCacheEntry (since the AOTAutogradCache guard expressions will encompass them). Because of this, we still need a way for AOTAutogradCache to distinguish between multiple FXGraphCache local entries. To do so, AOTAutogradCache stores the guard string from FXGraphCache, which it uses as a second "cache key". It doesn't need to **evaluate** these guards, it just needs to find the cache entry from FXGraphCache that had the same guards as when it was stored.

After this, I will work on putting the FXGraphCache entries directly into AOTAutogradCache. If I can put CompiledFxGraphs in the cache directly, I no longer need this complicated `check_guard_hit` overriding logic.
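A hedged, self-contained sketch of the caller-supplied guard check described above (illustrative names only, not the real FXGraphCache/AOTAutogradCache signatures):

```python
# The cache lookup delegates "do these guards count as a hit?" to the caller.
from dataclasses import dataclass

@dataclass
class Entry:
    guards: str      # stored guard expression, kept as a string for simplicity
    payload: object

def lookup(entries, key, check_guard_hit):
    for entry in entries.get(key, []):
        if check_guard_hit(entry.guards):
            return entry
    return None

cache = {"fw": [Entry("s0 <= 64", "graph_a"), Entry("s0 > 64", "graph_b")]}
# AOTAutogradCache-style matching: compare against the guard string it stored,
# instead of re-evaluating the guards with hints.
hit = lookup(cache, "fw", lambda g: g == "s0 > 64")
print(hit.payload)  # graph_b
```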

## Test Plan
Added a new unit test. There are comprehensive guard checking unit tests in `test_aot_autograd_cache` already, and those pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151563
Approved by: https://github.com/oulgen
2025-04-22 03:01:08 +00:00
40cf49d460 Revert "[Intel GPU] Allow XPU backend in Depthwise_conv2d&3d operators (#149114)"
This reverts commit 08831f30bbe745cd9f0c07d1868583a68f613514.

Reverted https://github.com/pytorch/pytorch/pull/149114 on behalf of https://github.com/guangyey due to CI is broken ([comment](https://github.com/pytorch/pytorch/pull/149114#issuecomment-2819890341))
2025-04-22 02:22:42 +00:00
a02eae8142 [dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)
For reshape/view: removes fast paths for 0 elements, checking dimensions to skip. Modifies the loop accumulating input elements, to raise a UserError if we run out of dimensions, graph breaking for compile and erroring out for export.
For infer_size: assumes if user passes us an unbacked, it's probably not -1
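A hedged sketch of the wildcard inference being discussed (not the real utils implementation; the unbacked case is only noted in a comment):

```python
# Infer a single -1 ("wildcard") dimension from the remaining numel. Per the
# description above, an unbacked symbolic size is assumed not to be -1, so it
# would be treated like the known dims here.
def infer_size(shape, numel):
    wildcard = None
    known = 1
    for i, d in enumerate(shape):
        if isinstance(d, int) and d == -1:
            wildcard = i
        else:
            known *= d
    out = list(shape)
    if wildcard is not None:
        out[wildcard] = numel // known
    return out

print(infer_size((2, -1, 4), 24))  # [2, 3, 4]
```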

Will think about changes in https://docs.google.com/document/d/1WYx6EZwVDXtBnWyrzoecgGWdiK0V3XZKftfpWwQ5i3E/edit?tab=t.0#heading=h.22k54zym11qp in a later PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150127
Approved by: https://github.com/laithsakka
2025-04-22 01:14:15 +00:00
80a3877b3d [easy] Fix test_dynamo_timed (#151816)
Summary: The structured logging counter is a global that might have been affected by earlier tests. Clear it explicitly.
Fixes #148093

Test Plan: `pytest test/dynamo/test_utils.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151816
Approved by: https://github.com/ppanchalia
2025-04-22 00:12:31 +00:00
b3b1616560 Add explict type info in the try-catch for dynamo logging (#151733)
Differential Revision: D73295871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151733
Approved by: https://github.com/hl475
2025-04-21 23:29:10 +00:00
a35e73b91f [c10] add #pragma once to leftright (#151710)
Summary: I am getting duplicate definitions when including this in a binary that already includes the dispatcher.

Test Plan: CI

Differential Revision: D73237748

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151710
Approved by: https://github.com/georgiaphillips
2025-04-21 23:18:49 +00:00
99aeee2c5f [Inductor] Add Additional Configs for persistent+TMA version of Triton mm and addmm (#150587)
Summary:
This PR introduces additional autotuning configurations for the persistent+TMA version of Triton `mm` and `addmm` operations. The new configurations are as follows:
* `(128, 128, 64, 5, 8)`
* `(256, 128, 64, 4, 8)`
* `(128, 128, 64, 5, 4)`

These configurations were selected based on exhaustive autotuning performed on commonly used shapes from an internal foundational model.

While these new configs are generally more performant across the board, we see notable gains in a few specific cases:
* In scenarios where `n >> m, k`, the configurations `(128, 128, 64, 5, 8)` and `(256, 128, 64, 4, 8)` tend to produce an additional 5-10% speedup over the aten baseline compared to the original configurations.
* Similarly, the configuration `(128, 128, 64, 5, 4)` yields approximately an 8% improvement in scenarios where `k >> m, n`.

These enhancements are expected to provide performance benefits across diverse use cases, particularly when compared to the original set of configurations.
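For readers, the five-tuples above are read here as (BLOCK_M, BLOCK_N, BLOCK_K, num_stages, num_warps); that interpretation is an assumption based on the usual Inductor mm-config layout and is not stated in the PR:

```python
# Illustrative only: the added persistent+TMA configs under the assumed
# (BLOCK_M, BLOCK_N, BLOCK_K, num_stages, num_warps) ordering.
extra_persistent_tma_configs = [
    (128, 128, 64, 5, 8),
    (256, 128, 64, 4, 8),
    (128, 128, 64, 5, 4),
]
for bm, bn, bk, stages, warps in extra_persistent_tma_configs:
    print(f"block {bm}x{bn}x{bk}, num_stages={stages}, num_warps={warps}")
```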

Test Plan:
contbuild & OSS CI

Reviewers: paulzhan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150587
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/eellison
2025-04-21 23:18:33 +00:00
4d78e19365 reroute index to fast implementation for indexing on 0th dimension (#151753)
Per title, improve x[index] cuda perf for the common case of indexing along the first dim, using vectorized gather kernel
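For reference, the access pattern this targets is plain advanced indexing with an integer index tensor along dim 0 (a sketch of the pattern only; nothing here is specific to the new kernel):

```python
# x[idx] with a 1-D integer index gathers rows along the 0th dimension,
# equivalent to torch.index_select(x, 0, idx).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 256, device=device)
idx = torch.randint(0, x.size(0), (4096,), device=device)
out = x[idx]
print(out.shape)  # torch.Size([4096, 256])
```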

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151753
Approved by: https://github.com/eqy
2025-04-21 23:15:30 +00:00
01f1cc44cb Rename register_fake_profile to unsafe_generate_fake_kernels (#151797)
Fixes https://docs.google.com/document/d/1BZsuUR1zJ-52Y7wP4yWX8beB4dwYbgdu5o1qKam_iWg/edit?disco=AAABiJdX1XU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151797
Approved by: https://github.com/zou3519
2025-04-21 23:08:15 +00:00
efdcc981d0 Back out "Do not propagate real tensor in extern kernel" (#151813)
Summary:
D73002775 breaks aot_compile for many draft exported models on PT2I dashboard. Revert.

Example error msg:

```
OrderedSet([]) >= OrderedSet([u1185, u1186, u1187]) (inductor >= fx)
fx node is: %embedding_bag_byte_prepack : [num_users=4] = call_function[target=torch.ops.quantized.embedding_bag_byte_prepack.default](args = (%view_10,), kwargs = {})
new operations are:
```

Differential Revision: D73381032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151813
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-04-21 22:54:03 +00:00
79a9447f0e FlexAttention add decorator for large test cases (#151459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151459
Approved by: https://github.com/Skylion007
2025-04-21 22:53:13 +00:00
6ea2e6a2d2 Do not do proper const fold during tensorify_python_scalars (#151494)
Chatting with Bob, the goal of this is to const-fold the floats that were tensorified, by calling
guard_scalar(val) on them and then replacing their usages with their values.
Hence we do not need to do this for nodes with no float symbols.

We do not want to do proper const folding because we need to preserve statements that deferred
runtime asserts depend on (see the added test).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151494
Approved by: https://github.com/bobrenjc93
2025-04-21 22:39:50 +00:00
cd1317f92f [export] suggest dynamic re-export in input constraints hook (#151624)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151624
Approved by: https://github.com/angelayi
2025-04-21 22:29:46 +00:00
c312d8c501 [Dynamo] Clean up old torch function flag (#149711)
This is tracked via `SymbolicTorchFunctionState` now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149711
Approved by: https://github.com/StrongerXi, https://github.com/anijain2305
2025-04-21 21:33:58 +00:00
25a11850e9 [symmem] Add some code comments to rendezvous code (#151716)
While reading and learning the rendezvous code, I just want to add some comments to explain the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151716
Approved by: https://github.com/kwen2501
2025-04-21 20:45:39 +00:00
352019bf9e [BE]: Better cleanup optimized code from #151474 (#151794)
This change addresses the first/second time/mem "spike" observed in the issue below. It improves on #151474 by removing unnecessary stride calculations and unused arguments to the helper function.

https://github.com/pytorch/pytorch/issues/151351

Fixes https://github.com/pytorch/pytorch/issues/151351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151794
Approved by: https://github.com/albanD, https://github.com/eqy
2025-04-21 20:32:11 +00:00
1f0d764b65 stage 2 of deprecating silent fallback of tuning gemm (#148622)
context: https://github.com/pytorch/pytorch/issues/147479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148622
Approved by: https://github.com/eellison
ghstack dependencies: #151506
2025-04-21 20:14:34 +00:00
02cecd1018 [inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)
Differential Revision:
[D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/)

Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506
Approved by: https://github.com/ColinPeppler
2025-04-21 20:14:34 +00:00
191b0237a6 Added to docs for out_dtype arg in torch gemms (#151704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151704
Approved by: https://github.com/bdhirsh
2025-04-21 20:09:17 +00:00
1a6effc5d8 [torch] Expose PCI info from CUDA device (#151672)
Summary:
PR #125083 added CUDA device UUID info, but due to the Meta-internal [version of ROCM the code was excluded](https://github.com/pytorch/pytorch/pull/125083?fbclid=IwY2xjawJvLnNleHRuA2FlbQIxMQABHlY55crrkTqWBWTsr2HVfuqnZ3R1GHR3o9Kf1o3h3uvyawEmCEdhdT48iY1P_aem_8tfrGrWE9SxFYasGfH8kCQ#issuecomment-2103315320).

This change ensures the Meta-internal code is built and PCI info is available.

Test Plan: pass CI

Differential Revision: D73253426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151672
Approved by: https://github.com/Skylion007
2025-04-21 19:55:19 +00:00
2fb1326483 Add dates to pages (#151602)
re: #150873
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151602
Approved by: https://github.com/albanD
2025-04-21 19:53:55 +00:00
b7c7000728 Ensure runners have the required prefix (#151815)
Clone changes from https://github.com/pytorch/pytorch/pull/151696/ since that PR wouldn't merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151815
Approved by: https://github.com/seemethere
2025-04-21 19:09:17 +00:00
9680016bcf [MergeBot] Update PullRequestResolved Regex (#151814)
By copying an updated one from cff091f3f3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151814
Approved by: https://github.com/izaitsevfb, https://github.com/albanD
2025-04-21 19:02:05 +00:00
d79144da52 [BE] Move aarch64 docker build to larger node (#151808)
They happen once a week or so; not sure why they need to run on the slowest machine possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151808
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2025-04-21 18:54:31 +00:00
fd04c79878 Revert "[aot autograd][logging] Profile large missing gaps in compile time tracing (#151256)"
This reverts commit 8e373592c8be3e28a5f5a774fc1d517aa3dbe8b4.

Reverted https://github.com/pytorch/pytorch/pull/151256 on behalf of https://github.com/Camyll due to breaking internal tests, cannot import ([comment](https://github.com/pytorch/pytorch/pull/151256#issuecomment-2819244186))
2025-04-21 18:49:23 +00:00
f37e138bc4 [MPS] Enable log1p and sigmoid for int64 (#151791)
It works on MacOS-15, but likely will need a skip for MacOS-13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151791
Approved by: https://github.com/Skylion007
ghstack dependencies: #151790
2025-04-21 18:30:04 +00:00
e2b1c06319 [cutlass] Define GELU_taylor<float> only if CUTLASS version is <= 380 (#151702)
Summary:
#buildmore

df8a550d39/include/cutlass/epilogue/thread/activation.h (L610)
was added in v3.9 (not tagged yet)

Test Plan:
mostly ci.

Logic seems same.

Reviewed By: drisspg

Differential Revision: D72615240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151702
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-21 18:23:46 +00:00
0f8613bf5c Introduce unsafe way to mark functions as cacheable (#151603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151603
Approved by: https://github.com/jamesjwu
ghstack dependencies: #151768, #151609
2025-04-21 17:37:38 +00:00
67c2869a38 Unpack the output code in the standalone_compile (#151609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151609
Approved by: https://github.com/zou3519
ghstack dependencies: #151768
2025-04-21 17:37:38 +00:00
287998b87f Run standalone compile tests on cpu/gpu (#151768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151768
Approved by: https://github.com/zou3519
2025-04-21 17:37:29 +00:00
cea43f721a [Testing] Unskip expm1 log1p for MPS (#151790)
But don't test them for unsupported dtypes (which is float64 for MPS)
- Skip int64 for log1p for now (next PR will fix that)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151790
Approved by: https://github.com/Skylion007
2025-04-21 17:18:47 +00:00
9374064483 Revert "[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)"
This reverts commit 783be8f93248ca3af24b968bdf84188f5a3257d1.

Reverted https://github.com/pytorch/pytorch/pull/151404 on behalf of https://github.com/malfet due to suspected of breaking linux builds and breaks internal tests as well ([comment](https://github.com/pytorch/pytorch/pull/151404#issuecomment-2819041756))
2025-04-21 17:11:53 +00:00
33808f0ebd Revert "[Easy] The event_id of torch.cuda.Event and torch.xpu.Event always is 0 (#151226)"
This reverts commit 8e5fefedf4af3f31ccd05290c1b21eedf6a4ad1b.

Reverted https://github.com/pytorch/pytorch/pull/151226 on behalf of https://github.com/malfet due to Reverting to unblock revert of https://github.com/pytorch/pytorch/pull/151404 ([comment](https://github.com/pytorch/pytorch/pull/151226#issuecomment-2819030735))
2025-04-21 17:07:49 +00:00
515a0f606b [ez] fix typo in comment (#151755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151755
Approved by: https://github.com/Skylion007
2025-04-21 14:52:39 +00:00
2eacdb91c3 Add OIDC permissions to xpu workflow (#151455)
The reusable workflow requires OIDC authentication to work and is configured via its only caller, xpu.yml; however, we set it here too to clarify that it is required. This setting also flags jobs that call this workflow without the required permissions set, reminding them that it needs to be set.

JWT ID token requires `id-token: write` permissions as documented here https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-cloud-providers#adding-permissions-settings

Ref: pytorch-fdn/multicloud-ci-infra#3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151455
Approved by: https://github.com/chuanqi129, https://github.com/atalman
2025-04-21 14:39:40 +00:00
bf28d1cafc Expose bicubic mode for torch::nn::functional::grid_sample in LibTorch (#150817)
When bicubic interpolation was added to grid_sampler in #44780, `GridSampleFuncOptions` was not updated to allow a user to use bicubic mode in LibTorch, even though the function could handle it. This PR fixes the parity such that LibTorch's  `torch::nn::functional::grid_sample` behaves the same as PyTorch's `torch.nn.functional.grid_sample`.

Existing users can directly use `torch::grid_sampler` but must know what int to pass for the interpolation (2 for bicubic) and padding mode parameters, which is not ideal.
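For reference, the Python-side behavior the LibTorch options struct is being brought to parity with (bicubic mode supports 4-D inputs):

```python
# torch.nn.functional.grid_sample with mode="bicubic"; the LibTorch change
# exposes the same option through GridSampleFuncOptions.
import torch
import torch.nn.functional as F

inp = torch.randn(1, 1, 8, 8)
grid = torch.rand(1, 4, 4, 2) * 2 - 1  # normalized sample locations in [-1, 1]
out = F.grid_sample(inp, grid, mode="bicubic", padding_mode="zeros", align_corners=False)
print(out.shape)  # torch.Size([1, 1, 4, 4])
```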

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150817
Approved by: https://github.com/Skylion007
2025-04-21 08:55:27 +00:00
2a9afdae81 [Benchmarking] Add sam and stable_diffusion to MPS benchmarked models (#151748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151748
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #151747
2025-04-21 05:51:46 +00:00
f7ddc5125e [Easy] Fix the compilation warning of BlasKernel. (#151736)
As the title stated.

Change Before:
```C++
[2/21] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/BlasKernel.cpp.o
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:346:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  346 | void gemv_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:329:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  329 | bool gemv_use_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:301:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  301 | void gemv_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:273:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  273 | bool gemv_use_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151736
Approved by: https://github.com/shink, https://github.com/Skylion007
2025-04-21 03:31:46 +00:00
8eb21dffa9 consolidate ATen/test/dispatch_key_set_test.cpp with rest of DispatchKeySet tests (#151697)
Doesn't seem to be a reason to have two test files for this.

Differential Revision: [D73274020](https://our.internmc.facebook.com/intern/diff/D73274020/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151697
Approved by: https://github.com/Skylion007
ghstack dependencies: #151626, #151627, #151628, #151629, #151630
2025-04-21 02:58:12 +00:00
9c2ac2b876 [pytorch][triton] Enable warp spec for FlexAttention kernel (#150470)
Summary:
Given inductor support for warp-specialization for `TritonTemplateKernel`, this change adds:
- num_consumer_groups
- num_buffers_warp_spec

to the flexattention template generated by inductor in `torch.compile`.

NOTE: Currently default config doesn't enable warp-spec and needs explicit args for num_consumer_groups, num_buffers_warp_spec in the kernel options to enable.

Test Plan:
### Functional Testing
```Py
import torch
from torch.nn.attention.flex_attention import flex_attention
from triton.testing import do_bench
make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16)
q, k, v = make_tensor(), make_tensor(), make_tensor()
flex_compiled = torch.compile(flex_attention, fullgraph=True)
print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4, "num_consumer_groups": 2,
                "num_buffers_warp_spec": 3,})))
```
- (best config) without WS: 11.06
- with WS: 9.35

Differential Revision: D70501880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150470
Approved by: https://github.com/drisspg
2025-04-21 02:00:55 +00:00
fc2dd6d408 [Inductor] Update should_decompose_mm condition for CPU (#151730)
Summary:
Similar to what we did previously in D70033166

Previously, for cpu we decompose addmm if
```
check_device(mat1, mat2, device="cpu")
        and statically_known_true(mat1.shape[0] == 1)
        and statically_known_true(mat2.shape[0] <= 64)
        and statically_known_true(mat2.shape[1] <= 512)
```
We have a new case where `mat1.shape[0] = 80`, and benchmark shows that it will beneficial if we decompose, so update the condition to
```
check_device(mat1, mat2, device="cpu")
        and statically_known_true(mat1.shape[0] == 1)
        and statically_known_true(mat2.shape[0] <= 128)
        and statically_known_true(mat2.shape[1] <= 512)
```

Differential Revision: D73292985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151730
Approved by: https://github.com/kflu, https://github.com/houseroad
2025-04-21 01:56:47 +00:00
470132c6a1 [MPS] Add support for hermite_polynomial_he (inductor/eager). (#151754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151754
Approved by: https://github.com/malfet, https://github.com/jansel
2025-04-20 17:44:40 +00:00
c3a7278278 Use more efficient row/col computation (#151474)
This change addresses the first/second time/mem "spike" observed in

https://github.com/pytorch/pytorch/issues/151351

Fixes #151351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151474
Approved by: https://github.com/eqy, https://github.com/amjames, https://github.com/Skylion007
2025-04-20 16:02:19 +00:00
6b45b6e6c9 run lintrunner for Export d68846308 (#151725)
fixes broken lint tests in https://github.com/pytorch/pytorch/pull/151481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151725
Approved by: https://github.com/exclamaforte, https://github.com/Skylion007

Co-authored-by: Gabriel Ferns <gabeferns@meta.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-20 14:58:17 +00:00
a40e876b08 Support fp8 dtypes in assert_close (#150002)
Fixes #135998

Adds support for fp8. These are compared bitwise, without atol and rtol. The implementation uses the same comparison functions, just with atol and rtol forced to zero. The error message is different from the default case; it only tells the user the first mismatch. This is to avoid triggering the error from #135998.
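A minimal usage sketch of the behavior described above (per this PR's description, fp8 inputs are compared bitwise, so only exact matches pass):

```python
import torch

a = torch.tensor([1.0, 2.0], dtype=torch.float8_e4m3fn)
b = a.clone()
torch.testing.assert_close(a, b)  # passes: bit-identical fp8 tensors
```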

Test Plan:
New unit test covers new code paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150002
Approved by: https://github.com/cyyever, https://github.com/zou3519
2025-04-20 01:24:21 +00:00
48761e9737 Revert "[Easy] Fix the function signature of torch.Event (#151221)"
This reverts commit 92baeecbdd3fb717880485e529df4efb02627c9d.

Reverted https://github.com/pytorch/pytorch/pull/151221 on behalf of https://github.com/malfet due to This broke rocm tests, see 92baeecbdd (40818271233-box) ([comment](https://github.com/pytorch/pytorch/pull/151221#issuecomment-2816883409))
2025-04-19 22:06:24 +00:00
c4482565cc Revert "[Easy][torch.Event] Fix and improve the docs of torch.Event (#151411)"
This reverts commit 1e1d0a4be63b354f762ee21bdccec03c1e5b371c.

Reverted https://github.com/pytorch/pytorch/pull/151411 on behalf of https://github.com/malfet due to This broke rocm tests, see 92baeecbdd (40818271233-box) ([comment](https://github.com/pytorch/pytorch/pull/151221#issuecomment-2816883409))
2025-04-19 22:06:24 +00:00
9b74ea2490 [Benchmarking] Run MPS benchmarks for [b]float16 (#151747)
And implicitly pass `--float32` when collecting results for "notset" option. Speedups for some models are much higher for float16 dtype, but it's important to track accuracy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151747
Approved by: https://github.com/Skylion007
2025-04-19 16:40:08 +00:00
ed511cd537 [Testing] Make test_add_complex3 run on different devices (#151732)
By constructing tensor on that device, because it does not call `self.common` but rather executes test directly.

Otherwise `test_add_complex3_mps` will test CPU inductor, rather than MPS one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151732
Approved by: https://github.com/dcci
2025-04-19 14:29:13 +00:00
483e61bfec [BE][Easy]: Simplify reversed call in graph matcher (#151674)
Another list() call on reversed() that is no longer necessary, since ItemViews support reversed().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151674
Approved by: https://github.com/albanD
2025-04-19 14:14:31 +00:00
68f748a992 Revert "[Testing] Make test_add_complex3 run on different devices (#151732)"
This reverts commit 414ce713fb329b20f93002fa4ffd6bb23bc3b93b.

Reverted https://github.com/pytorch/pytorch/pull/151732 on behalf of https://github.com/malfet due to It breaks MacOS-13 ([comment](https://github.com/pytorch/pytorch/pull/151732#issuecomment-2816690571))
2025-04-19 12:35:41 +00:00
1e1d0a4be6 [Easy][torch.Event] Fix and improve the docs of torch.Event (#151411)
**Changes:**
- add detailed function or class signature
- fix the wrong display of torch.Event.wait and torch.Event.record
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151411
Approved by: https://github.com/albanD
ghstack dependencies: #151226, #151221
2025-04-19 12:21:02 +00:00
92baeecbdd [Easy] Fix the function signature of torch.Event (#151221)
As the title stated.

There is a difference between the declaration and the implementation.
declaration:
d5a19e4525/torch/_C/__init__.pyi.in (L157-L162)

Implementation:
d5a19e4525/torch/csrc/Event.cpp (L30-L32)

**Question**: Which one should we choose?
- Change enable_timing to False to be consistent with torch.cuda.Event
- Change enable_timing to True to avoid BC-break
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151221
Approved by: https://github.com/albanD
ghstack dependencies: #151226
2025-04-19 11:56:37 +00:00
8e5fefedf4 [Easy] The event_id of torch.cuda.Event and torch.xpu.Event always is 0 (#151226)
Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.

The memory of torch.Event is not useful to torch.cuda.Event and torch.xpu.Event, but we still need to inherit from torch.Event because CPython will check it.

Repro with cuda:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151226
Approved by: https://github.com/albanD
2025-04-19 10:42:00 +00:00
92d0c40c49 Revert "Cache the value of torch_key in subproc (#151057)"
This reverts commit 5f5805a6ac44179520291b2aa6e18d286dc93669.

Reverted https://github.com/pytorch/pytorch/pull/151057 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/151057#issuecomment-2816614510))
2025-04-19 08:48:12 +00:00
f6c1cf04b5 [ROCm][TunableOp] Support submatrices in offline tuning (#151138)
This PR adds support for submatrices in offline tuning for:
- GEMM
- GEMM and bias
- ScaledGEMM
- Batch Strided GEMM

New UTs to cover submatrices. Submatrix support for the strided batch API is not part of this PR and will be done separately.

There is also a bug fix for offline tuning for full matrix for GEMM and bias in the `NT` case. Offline and online UTs were updated to cover this corner case.

To improve code readability, swapped definition of transA and transB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151138
Approved by: https://github.com/jeffdaily
2025-04-19 04:14:27 +00:00
2673ea4131 Add api to enable/disable NaN detector per-PG (#151723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151723
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2025-04-19 03:55:25 +00:00
414ce713fb [Testing] Make test_add_complex3 run on different devices (#151732)
By constructing tensor on that device, because it does not call `self.common` but rather executes test directly.

Otherwise `test_add_complex3_mps` will test CPU inductor, rather than MPS one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151732
Approved by: https://github.com/dcci
2025-04-19 03:14:46 +00:00
6261db7719 Revert "inductor.config.descriptive_names = False is not actually supported (#145523) (#146051) (#151481)"
This reverts commit cfc4d74b0c9a0d21debbebb41e1dfa4dd2acf2a0.

Reverted https://github.com/pytorch/pytorch/pull/151481 on behalf of https://github.com/malfet due to It indeed breaks lint, it followup PR contains it's own issues ([comment](https://github.com/pytorch/pytorch/pull/151481#issuecomment-2816490764))
2025-04-19 03:12:56 +00:00
843e4d11ba [Benchmarking] Enable HF_GPT2 benchmarking on Metal (#151721)
By building the wheel with USE_DISTRIBUTED=1.

Otherwise an attempt to run
```
python3 benchmarks/dynamo/torchbench.py --performance --only hf_T5 --backend inductor --inference --devices mps
```
will fail with
```
  File "/Users/nshulga/Library/Python/3.10/lib/python/site-packages/transformers/modeling_utils.py", line 40, in <module>
    import torch.distributed.tensor
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/__init__.py", line 4, in <module>
    import torch.distributed.tensor._ops  # force import all built-in dtensor ops
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/__init__.py", line 2, in <module>
    from ._conv_ops import *  # noqa: F403
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/_conv_ops.py", line 5, in <module>
    from torch.distributed.tensor._dtensor_spec import DTensorSpec, TensorMeta
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_dtensor_spec.py", line 6, in <module>
    from torch.distributed.tensor.placement_types import (
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/placement_types.py", line 8, in <module>
    import torch.distributed._functional_collectives as funcol
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/_functional_collectives.py", line 9, in <module>
    import torch.distributed.distributed_c10d as c10d
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 23, in <module>
    from torch._C._distributed_c10d import (
ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151721
Approved by: https://github.com/wdvr, https://github.com/dcci, https://github.com/huydhn
2025-04-19 02:57:03 +00:00
cfc4d74b0c inductor.config.descriptive_names = False is not actually supported (#145523) (#146051) (#151481)
Summary:

This config is not supported (it throws an error when set), and doesn't really make sense imo.

Approved by: https://github.com/eellison

Test Plan: contbuild & OSS CI, see edf266e9bb

Reviewed By: masnesral

Differential Revision: D68846308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151481
Approved by: https://github.com/masnesral
2025-04-19 01:13:35 +00:00
adf5f38eae Don't specialize min/max (#151347)
address https://github.com/pytorch/pytorch/issues/149635
Differential Revision: [D73041489](https://our.internmc.facebook.com/intern/diff/D73041489/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151347
Approved by: https://github.com/bobrenjc93
2025-04-19 00:11:15 +00:00
359e1d517c [Profiler] Remove Decref From Python Context (#151625)
Summary: When doing on-demand profiling with stack, the decref causes a segfault. I tried checking the refcount and the object itself and they both look fine, but it still segfaults every time. Let's remove it for now and revisit.

This will induce a small memory leak, but it should be small enough that it does not have any significant impact on the jobs run.

Test Plan:
Removed decref and got clean traces
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1744933624/localhost/libkineto_activities_2936811.json.gz&bucket=gpu_traces

Differential Revision: D73225468

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151625
Approved by: https://github.com/davidberard98
2025-04-18 23:55:19 +00:00
e48189cf03 Don't eagerly create AliasInfo in parseAliasDeclaration (#151630)
No need to create an AliasInfo...unless we need it.

Differential Revision: [D73129452](https://our.internmc.facebook.com/intern/diff/D73129452/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151630
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151626, #151627, #151628, #151629
2025-04-18 22:51:37 +00:00
cac8d35503 Use fmt::format for debug strings in Library init (#151629)
Observed several ms taken during `import torch` by c10::str here.

Differential Revision: [D73129453](https://our.internmc.facebook.com/intern/diff/D73129453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151629
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet
ghstack dependencies: #151626, #151627, #151628
2025-04-18 22:51:37 +00:00
313ceb4da3 Reserve vector in StringCordView ctor (#151628)
A clearly missing reserve (we should expect that pieces are not empty).

Differential Revision: [D73129445](https://our.internmc.facebook.com/intern/diff/D73129445/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151628
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151626, #151627
2025-04-18 22:51:29 +00:00
704a504e8a Reserve vectors in FunctionSchema::cloneWithRealTypes (#151627)
1) Reserving is much better than not reserving.
2) std::transform for a loop with a one-line body is generally not considered an improvement (and doesn't seem to get boiled away by clang under -Oz).

Differential Revision: [D73013363](https://our.internmc.facebook.com/intern/diff/D73013363/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151627
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151626
2025-04-18 22:51:23 +00:00
fc7d493908 Overload Library::def rather than templating it (#151626)
It ends up being templated over a bunch of reference-to-array-of-characters types with different lengths, such as `char const (&) [88]`, which is an annoyance when profiling and possibly a source of code bloat.

Differential Revision: [D73129450](https://our.internmc.facebook.com/intern/diff/D73129450/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151626
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-04-18 22:51:16 +00:00
97d97aef24 Revert "[dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)"
This reverts commit 1dd2033c0a1de460ee2bad8d64c36a0344886071.

Reverted https://github.com/pytorch/pytorch/pull/150127 on behalf of https://github.com/clee2000 due to maybe caused export test to fail? export/test_draft_export.py::TestDraftExport::test_masked_linear [GH job link](https://github.com/pytorch/pytorch/actions/runs/14538768138/job/40794985504) [HUD commit link](1dd2033c0a), bad TD ([comment](https://github.com/pytorch/pytorch/pull/150127#issuecomment-2816232086))
2025-04-18 21:38:47 +00:00
bd77c3e054 [easy] Update test/dynamo/test_structured_trace.py (#151606)
Summary: test/dynamo/test_structured_trace.py is out of date because of some new fields. (I guess the test is disabled?). Bring it up to date.

Test Plan: `python test/dynamo/test_structured_trace.py`

Fixes #149671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151606
Approved by: https://github.com/Skylion007
ghstack dependencies: #151599
2025-04-18 21:33:13 +00:00
56d318bfac [ONNX][Easy] Update onnx program doc formatting and improve robustness (#151623)
- Update docstring list formatting
- Use a try/finally block to keep the model unmodified if save() fails (a minimal sketch of the pattern follows).
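A minimal sketch of that try/finally pattern (illustrative helper names; not the actual ONNXProgram.save implementation):

```python
# Any temporary modification made for saving is rolled back even if save fails,
# so the in-memory model is left unmodified.
def save_with_rollback(model, path, apply_tmp_changes, undo_tmp_changes):
    apply_tmp_changes(model)
    try:
        model.save(path)
    finally:
        undo_tmp_changes(model)
```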

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151623
Approved by: https://github.com/titaiwangms
2025-04-18 21:31:31 +00:00
02dd096e51 [invoke_subgraph][fake tensor] Add finalizer on subgraph instead of the functionalize ctx wrapper (#151633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151633
Approved by: https://github.com/zou3519
ghstack dependencies: #151330, #151256, #151357, #151477
2025-04-18 21:23:21 +00:00
b74be52454 [CUDA][NVTX] Move nvtx3 code from cmake/public/cuda.cmake to cmake/Dependencies.cmake (#151583)
Fixes [#147220]

Context: In the CUDA NVTX world, there are NVTX v2 and NVTX v3. As announced in CUDA release notes, e.g. [CUDA 12.8 Update 1]( https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-or-dropped-operating-systems) "`NVTX v2 is deprecated. To migrate to NVTX v3. Change your code from: #include <nvtoolsext.h> to #include "nvtx3/nvtoolsext.h`". This header is included in the toolkit."
On the PyTorch side, TORCH_CUDA_USE_NVTX3 compile time macro is used and it is set to true when (most of the time) nvtx3 is found. nvtx3 is found in two cases: 1) USE_SYSTEM_NVTX=0 (default), torch build process would automatically look for the nvtx3 in pytorch/third_party/nvtx. This is the most common and default case. 2) when USE_SYSTEM_NVTX=1 is used, nvtx3 is found from the installed CUDA toolkit (e.g. CUDA 12.8 and even some earlier cuda versions).
As described in #147220, the reason it can find pytorch/third_party/nvtx is because it used
6f035d8462/cmake/public/cuda.cmake (L176)
note the "PROJECT_SOURCE_DIR" usage in [pytorch/cmake/public/cuda.cmake](6f035d8462/cmake/public/cuda.cmake (L176))

Before this PR:
PyTorch build would succeed in finding nvtx3 due to the above described process, everything is good. But downstream projects like torchvision *can* fail, and would by default fail because the following are happening:
1) USE_SYSTEM_NVTX=0 is used (and most likely it is this case because it is the default)
2) NVTX v2 can no longer be found (e.g. future CUDA versions because deprecation would eventually become removal)
3) TorchVision cannot find NVTX3 either because torchvision was invoking [pytorch/cmake/public/cuda.cmake] but the PROJECT_SOURCE_DIR is no longer the pytorch source but torchvision source!
4) One workaround is to "USE_SYSTEM_NVTX=1" but users have to explicitly set this and do the plumbing work

After this PR:
PyTorch can still find nvtx3 because the part of the code that finds nvtx3 is just moved to a new place. The CI logs are showing it being able to find nvtx3. e.g. [this job](https://productionresultssa14.blob.core.windows.net/actions-results/47f8efaa-0afe-4e1f-bc94-0a82629941cb/workflow-job-run-dc8201b1-845b-5da1-a6ea-d3360ce1b508/logs/job/job-logs.txt?rsct=text%2Fplain&se=2025-04-18T20%3A38%3A05Z&sig=yMd6egC%2Banl3lR%2BudXFX18bfUH189z0DTGLtscHQJwY%3D&ske=2025-04-19T06%3A21%3A45Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2025-04-18T18%3A21%3A45Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2025-01-05&sp=r&spr=https&sr=b&st=2025-04-18T20%3A28%3A00Z&sv=2025-01-05), which reads "`Found nvtx3: C:/actions-runner/_work/pytorch/pytorch/pytorch/third_party/NVTX/c/include`"
For torchvision, it still invokes [pytorch/cmake/public/cuda.cmake], but it no longer tries to find nvtx3 since torchvision does not use nvtx3 (if it does in the future, it can set USE_SYSTEM_NVTX=1 by default). So it avoids the error reported in [#147220].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151583
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/malfet
2025-04-18 21:18:09 +00:00
6e7b6e8d57 [c10d][fr] Fix a bug when first rank is not zero in the script (#151683)
Summary: While further testing the script, we found that we shouldn't always assume rank 0 is the first rank, so we need to check all entries and see whether there is a P2P op for this coalesced group.

Test Plan: Directly test with corner case.

Differential Revision: D73266257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151683
Approved by: https://github.com/fegin
2025-04-18 20:55:06 +00:00
a6e46faff4 Use reusable binary docker build action for manywheel (#151489)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Similar to https://github.com/pytorch/pytorch/pull/151483 but for manywheel

Changed the job name

s390x doesn't have access to AWS ECR, so it doesn't use the action. The manylinuxs390x-builder ECR repo doesn't exist in Docker Hub, so it's unclear why the image name is that.

Testing:
Can't really test, since PRs don't have the credentials to push to docker.io, which is the registry used for everything, including PRs, right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151489
Approved by: https://github.com/seemethere
2025-04-18 20:38:33 +00:00
b0f26e81a5 Use reusable binary docker build action for libtorch (#151488)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Similar to https://github.com/pytorch/pytorch/pull/151483 but for libtorch

Changed the job name

Testing:
Can't really test, since PRs don't have the credentials to push to docker.io, which is the registry used for everything, including PRs, right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151488
Approved by: https://github.com/atalman
2025-04-18 20:37:38 +00:00
88b0553c58 [AMD] Remove fbcode limit for uuid (#151652)
Summary: We're now on a later ROCm version, so it's OK to add uuid back.

Test Plan: sandcastle

Differential Revision: D73240086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151652
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/houseroad
2025-04-18 20:37:09 +00:00
7ffa9000ed Replace perf-nightly-macos with inductor-perf-nightly-macos (#151698)
The name was updated by https://github.com/pytorch/pytorch/pull/151155.  Without this follow-up, the benchmark results weren't being updated on the dashboard.

For the PT2 compiler perf benchmark, we are still relying on this old workflow.  To get rid of it, we need to update the PT2 benchmark dashboard to use the new benchmark database (cc @yangw-dev)

The results are there on the new database:

```
SELECT
    *
FROM
    oss_ci_benchmark_v3
WHERE
    workflow_id = 14510035576
```

but not on the old database:

```
SELECT
    *
FROM
    inductor_torch_dynamo_perf_stats
WHERE
    workflow_id = 14510035576
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151698
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-04-18 20:31:36 +00:00
1b267a58a1 Revert "[export] allow partially specifying keys for dynamic shapes dict spec (#151597)"
This reverts commit c8240e3492e4813e822d7265eb3afb7f1168db39.

Reverted https://github.com/pytorch/pytorch/pull/151597 on behalf of https://github.com/clee2000 due to broke some export test export/test_converter.py::TestConverter::test_aten_len [GH job link](https://github.com/pytorch/pytorch/actions/runs/14538615968/job/40792673415) [HUD commit link](c8240e3492), bad TD ([comment](https://github.com/pytorch/pytorch/pull/151597#issuecomment-2816127271))
2025-04-18 20:17:44 +00:00
f20a266512 [easy] Update test/dynamo/test_utils.py (#151599)
Summary: test/dynamo/test_utils.py is out of date because of some new dynamo_timed fields (presumably the test is disabled?). Bring it up to date.

Test Plan: `python test/dynamo/test_utils.py`

Fixes #148093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151599
Approved by: https://github.com/Skylion007
2025-04-18 18:49:24 +00:00
e434a9152e Revert "[inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)"
This reverts commit 6246c7d62ca2f091838d5c707e3d932994c5e35a.

Reverted https://github.com/pytorch/pytorch/pull/151506 on behalf of https://github.com/henrylhtsang due to seems to be breaking some rocm mi300 run ([comment](https://github.com/pytorch/pytorch/pull/151506#issuecomment-2815999009))
2025-04-18 18:40:17 +00:00
cccfc146fe [BE][Easy]: Simplify ModuleList reversed method (#151673)
Removes unnecessary list calls now that we are on Python 3.9 and dict key views implement reversed() directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151673
Approved by: https://github.com/albanD
2025-04-18 18:39:32 +00:00
b7807759de Revert "stage 2 of depreate silent fallback of tuning gemm (#148622)"
This reverts commit 181b3883e71b9771e8a3cdaf43d627f68e9f0fa6.

Reverted https://github.com/pytorch/pytorch/pull/148622 on behalf of https://github.com/henrylhtsang due to seems to be breaking some rocm mi300 run ([comment](https://github.com/pytorch/pytorch/pull/148622#issuecomment-2815995105))
2025-04-18 18:37:09 +00:00
b73606dcc5 Add jk for force_disable_caches (#151621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151621
Approved by: https://github.com/jamesjwu
2025-04-18 18:19:40 +00:00
9ccdeae7db Fix uint view copy (#151598)
Fix for https://github.com/pytorch/pytorch/issues/151156. We have some logic to undo our upcast prior to dtype bitcast. This pr cleans up that logic using dtypes in codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151598
Approved by: https://github.com/zou3519
ghstack dependencies: #151562
2025-04-18 18:13:39 +00:00
28974a1ec3 Revert "[Easy] Fix the compilation warning of BlasKernel. (#151302)"
This reverts commit 32c79da789af84312a0db2de19211a7c57196ba7.

Reverted https://github.com/pytorch/pytorch/pull/151302 on behalf of https://github.com/malfet due to Breaks builds without OpenMP, see https://github.com/pytorch/pytorch/issues/151680 ([comment](https://github.com/pytorch/pytorch/pull/151302#issuecomment-2815954855))
2025-04-18 18:10:45 +00:00
115a0c6413 add privateuse1 device type to pre forward hook of fsdp (#149487)
add privateuse1 device type to pre forward hook of fsdp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149487
Approved by: https://github.com/FFFrog, https://github.com/cyyever, https://github.com/shink, https://github.com/albanD
2025-04-18 17:50:23 +00:00
1a48382a4c [Easy] Optimize container.py typing (#151653)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151653
Approved by: https://github.com/albanD
2025-04-18 17:33:43 +00:00
931bd05560 Do not propagate real tensor in extern kernel (#151377)
Summary: See internal Diff for more details.

In ExternKernel, the FakeTensors do not have associated real tensors, because they are just created from ir.Node's shape and stride.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_data_dependent_ex

buck2 run mode/dev-nosan  fbcode//caffe2/test/inductor:aot_inductor_arrayref_cpu -- -r data_dependent_extern_kernel_op
```

Differential Revision: D73002775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151377
Approved by: https://github.com/angelayi
2025-04-18 17:28:13 +00:00
181b3883e7 stage 2 of deprecate silent fallback of tuning gemm (#148622)
context: https://github.com/pytorch/pytorch/issues/147479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148622
Approved by: https://github.com/eellison
ghstack dependencies: #151506
2025-04-18 17:26:16 +00:00
6246c7d62c [inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)
Differential Revision:
[D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/)

Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506
Approved by: https://github.com/ColinPeppler
2025-04-18 17:26:16 +00:00
1dd2033c0a [dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)
For reshape/view: removes the fast paths for 0 elements and for checking which dimensions to skip. Modifies the loop that accumulates input elements to raise a UserError if we run out of dimensions, graph-breaking for compile and erroring out for export.
For infer_size: assumes that if the user passes us an unbacked symbol, it's probably not -1.

Will think about changes in https://docs.google.com/document/d/1WYx6EZwVDXtBnWyrzoecgGWdiK0V3XZKftfpWwQ5i3E/edit?tab=t.0#heading=h.22k54zym11qp in a later PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150127
Approved by: https://github.com/laithsakka
2025-04-18 17:05:11 +00:00
c8240e3492 [export] allow partially specifying keys for dynamic shapes dict spec (#151597)
Fixes #148564

Should help with exporting HF-style models, so users don't have to specify 100 Nones

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151597
Approved by: https://github.com/angelayi
2025-04-18 16:53:01 +00:00
9eaaca2ece Turn off symm_mem when cuda version is <12.3 (#151203)
Summary: It looks like symmetric memory only supports CUDA 12.3+. We do have the definition for CUDA < 12.3 but no implementation, so it's a good idea to disable the definition as well.

Test Plan: CI

Reviewed By: jianyuh, houseroad, ngimel, jiawenliu64

Differential Revision: D72936993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151203
Approved by: https://github.com/ngimel, https://github.com/houseroad
2025-04-18 16:37:12 +00:00
783be8f932 [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title stated

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
2025-04-18 15:26:13 +00:00
29317f8585 [standalone_compile] Some misc fixes (#151502)
This PR fixes two things.

The first problem is that in the vLLM-style usage, standalone_compile is
called from within a custom torch.compile backend. If there already is a
FakeTensorMode (which there is), we shouldn't create a new
FakeTensorMode with the same shape_env; instead we should just reuse the
same FakeTensorMode.

The second is that compile_fx can mutate the passed-in gm, so we
deepcopy it (since standalone_compile should be standalone).

Test Plan:
- new test
- updated old tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151502
Approved by: https://github.com/oulgen
ghstack dependencies: #151501, #151551
2025-04-18 12:34:13 +00:00
58310a0043 [standalone_compile] support multiple returns (#151551)
We were only returning the first one. There's an edge case regarding what to do
if the original function returns a single Tensor: capture(f) returns a
function that returns a tuple of one Tensor in this case, and we were
originally converting this back to a single Tensor. I think it's fine
to return a tuple of one Tensor (that is what the graph passed to
standalone_compile asked for!) but we can revisit.

Test Plan:
- modified one test to used multiple outputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151551
Approved by: https://github.com/Skylion007, https://github.com/oulgen
ghstack dependencies: #151501
2025-04-18 12:34:13 +00:00
ac715e96b4 [standalone_compile] Don't check if path is directory if it doesn't exist (#151501)
os.path.isdir(path) will return False if the path doesn't exist.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151501
Approved by: https://github.com/Skylion007, https://github.com/oulgen
2025-04-18 12:34:13 +00:00
14293c2377 [MPS] Allow isin for mixed types (#151600)
To follow the pattern set by the CPU and CUDA impls: define a common_dtype and optionally cast `elements` and `test_elements` to the common dtype if needed.
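A rough Python-level sketch of that promotion behavior (illustrative only; not the kernel code), mirroring what `torch.result_type` computes:
```python
import torch

elements = torch.arange(4.0, device="mps")        # float32
test_elements = torch.arange(1, 3, device="mps")  # int64
common = torch.result_type(elements, test_elements)
# Casting both operands to the common dtype mirrors what the CPU/CUDA paths do.
print(torch.isin(elements.to(common), test_elements.to(common)))
```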

- Add regression test, though skip it on MacOS-13, as `isin` seems to produce garbage there even for same dtypes
```
>>> import torch
>>> x=torch.arange(4.0, device='mps')
>>> y=torch.arange(1.0, 3.0, device='mps')
>>> x, y, torch.isin(x, y), torch.isin(y, x)
(tensor([0., 1., 2., 3.], device='mps:0'), tensor([1., 2.], device='mps:0'), tensor([False,  True, False, False], device='mps:0'), tensor([False, False], device='mps:0'))
>>> torch.__version__
'2.6.0'
```
- Cleanup code a bit

Fixes https://github.com/pytorch/pytorch/issues/151443
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151600
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/kulinseth
2025-04-18 12:30:32 +00:00
675f69f40f collect_env: gracefully handle no pip (#151607)
If pip is not installed:

### Before

```console
> python3 torch/utils/collect_env.py
Collecting environment information...
Traceback (most recent call last):
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 694, in <module>
    main()
    ~~~~^^
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 677, in main
    output = get_pretty_env_info()
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 672, in get_pretty_env_info
    return pretty_str(get_env_info())
                      ~~~~~~~~~~~~^^
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 497, in get_env_info
    pip_version, pip_list_output = get_pip_packages(run_lambda)
                                   ~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 450, in get_pip_packages
    for line in out.splitlines()
                ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'splitlines'
```

### After

```console
> python3 torch/utils/collect_env.py
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: macOS 15.4 (arm64)
GCC version: Could not collect
Clang version: 20.1.0
CMake version: version 3.31.6
Libc version: N/A

Python version: 3.13.2 (main, Apr  8 2025, 15:27:33) [Clang 17.0.0 (clang-1700.0.13.3)] (64-bit runtime)
Python platform: macOS-15.4-arm64-arm-64bit-Mach-O
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Apple M2 Pro

Versions of relevant libraries:
[pip3] Could not collect
[conda] Could not collect
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151607
Approved by: https://github.com/malfet
2025-04-18 12:28:58 +00:00
776aa68221 Update torch-xpu-ops commit pin (#150827)
Update the torch-xpu-ops commit to [b51dd3ef4f4d0f6b44c59e61431c5d29354dcaf6](b51dd3ef4f), including:
- Update commit pin to xpu-ops main branch
- Fixes batch_norm numeric error by adding additional boundary check
- Enable two operators: fft & jagged_to_padded_dense
- XCCL relevant changes:
1. Cache `cclStream` to improve performance.
2. Add support for complex datatypes in `allgather` and `broadcast`.
3. Support `coalescing` operations and `batch_isend_irecv`.
4. Introduce additional logging; use `export TORCH_CPP_LOG_LEVEL=INFO`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150827
Approved by: https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-18 10:12:59 +00:00
0376bbf5b3 [XPU] skip a subprocess UT for Windows (#150999)
This case creates a subprocess inside a subprocess. On Windows it can't load the function in this scenario, hence I have to skip it.
```
File "C:\ProgramData\miniforge3\envs\lfq\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\ProgramData\miniforge3\envs\lfq\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'run_model' on <module '__main__' (built-in)>
Traceback (most recent call last):
  File "<string>", line 25, in <module>
  File "<string>", line 16, in test_multi_process
AssertionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150999
Approved by: https://github.com/guangyey, https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-18 08:55:47 +00:00
541f8cd34c faster gather implementation (#151490)
So far it's only for `gather`, but we'll move index_select and index to this implementation too. Torchtitan and fbgemm have noticed that gather/index_select perf is bad; this PR brings the core implementation on par with those customized implementations. Added benefits: all dtypes are supported, and it is a bit less strict on tensor dimensions/contiguity because we pick the fast path after TensorIterator has collapsed the dimensions.

The biggest part of this PR is not even the kernel (it's simple; vectorized loads are enough), but moving the utilities for vectorized loads and stores from SymmetricMemory so that they are generally accessible in MemoryAccess.cuh.
Additional tests are coming to make sure this implementation doesn't break anything.

`gather` is equivalent to x[indices] for 1d indices via
```
def fn_gather(x, indices):
    return torch.gather(x, dim=0, index=indices.unsqueeze(1).expand(-1, x.shape[1]))

def fn_index(x, indices):
    return x[indices]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151490
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-18 07:48:31 +00:00
eb1f85a2a0 Support C++ statically_known_true (#151346)
Differential Revision: [D73040543](https://our.internmc.facebook.com/intern/diff/D73040543/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151346
Approved by: https://github.com/laithsakka
2025-04-18 06:42:12 +00:00
8895c290f4 [Easy] enable PYFMT for torch/quantization/eager (#150761)
All modifications are done through tools; the detailed commands are as follows:

```bash
lintrunner -a --take "PYFMT" --all-files
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150761
Approved by: https://github.com/jerryzh168
2025-04-18 05:53:33 +00:00
91b090c912 [executorch hash update] update the pinned executorch hash (#151632)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151632
Approved by: https://github.com/pytorchbot
2025-04-18 05:07:28 +00:00
6649ed9deb [ez] fix code owners typo (#151499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151499
Approved by: https://github.com/laithsakka
2025-04-18 04:24:16 +00:00
bedefa46a9 Document non-pytorch CUDA memory allocation and how to query it (#150880)
This PR documents the fact that PyTorch does not have visibility into how every CUDA memory allocation happened - it only knows about allocations that went through the PyTorch CUDA allocator.

It also adds a code snippet showing how to use pynvml to query current GPU memory usage.
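
For reference, a minimal pynvml sketch along those lines (assuming the `pynvml`/`nvidia-ml-py` package is installed and an NVIDIA GPU is visible):
```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
# This reports driver-level usage, including allocations PyTorch's allocator never saw.
print(f"GPU0 memory: used={info.used / 2**20:.0f} MiB / total={info.total / 2**20:.0f} MiB")
pynvml.nvmlShutdown()
```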

## Preview
Added a note at the top of "Understanding CUDA Memory Usage" doc:
<img width="732" alt="image" src="https://github.com/user-attachments/assets/69e28d2a-841a-4b1b-b886-e96fb5d76582" />

which links to a section below:
<img width="733" alt="image" src="https://github.com/user-attachments/assets/cab4f252-9ac2-4fc6-a45d-fdb958fc7dbc" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150880
Approved by: https://github.com/kwen2501, https://github.com/ngimel
2025-04-18 03:48:54 +00:00
7d282da449 Add automatic categorization for release notes: inductor (aoti) (#151569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151569
Approved by: https://github.com/desertfire
ghstack dependencies: #151453
2025-04-18 03:39:06 +00:00
2426258789 [doc fix] fix torch export docs for preserve_module_call_signature (#151140)
The preserve_module_call_signature explanation is missing from __init__.py. Copy it over from _trace.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151140
Approved by: https://github.com/angelayi
2025-04-18 02:55:35 +00:00
33cfe30ee1 Add HostAllocator as the unified parent class (#151431)
# Motivation
This PR introduces a unified parent class `HostAllocator` with the following goals:
1. Enable backend-specific host allocator registration, including support for out-of-tree backends.
2. Provide a unified and extensible API surface for host memory management across all backends, especially accelerators.

The new interface includes:
- `at::getHostAllocator()->allocate`
- `at::getHostAllocator()->empty_cache`
- `at::getHostAllocator()->record_event`
- `at::getHostAllocator()->get_stats`
- `at::getHostAllocator()->reset_accumulated_stats`
- `at::getHostAllocator()->reset_peak_stats`

# Additional Context
We plan to deprecate legacy APIs such as `at::cuda::CachingHostAllocator_emptyCache` and recommend users migrate to the new backend-specific API, for example:
```cpp
at::getHostAllocator(at::kCUDA)->empty_cache();
```
This refactor will help standardize host memory management across devices and simplify backend integration in the future.
Another key improvement I plan to make is moving the `is_pinned` functionality into the `HostAllocator` class, which enables centralized pinned memory verification through calls like `at::getHostAllocator(at::kCUDA)->is_pinned(ptr)`.
Benefits include:
 - Consistent host memory handling across all device backends
 - Decouple pinned memory functionality from `AcceleratorHooksInterface` in a more modular way
 - Clearer separation between device memory allocation and pinned host memory management

This architecture makes the system more maintainable and extensible for future device support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151431
Approved by: https://github.com/albanD
ghstack dependencies: #151403
2025-04-18 02:44:17 +00:00
1cc5a8452b [Openreg][PrivateUse1] Fix releasing tensor issue when using pin_memory (#151091)
As the title stated.

Related PR: https://github.com/pytorch/pytorch/pull/147066

Co-authored-by: Zhenbin Lin <lin-zhenbin@qq.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151091
Approved by: https://github.com/albanD
ghstack dependencies: #151007
2025-04-18 02:40:07 +00:00
3528488061 [Openreg][PrivateUse1] Enable CI for openreg (#151007)
Changes:
- move test_openreg.py from test/cpp_extensions/open_registration_extension/ to test/
- update README.md for openreg
- enable CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151007
Approved by: https://github.com/albanD
2025-04-18 02:40:07 +00:00
09e8ff92cc refresh benchmark results (#151622)
updating due to <1.5% increases in https://github.com/pytorch/pytorch/pull/151469
not all benchmarks were updated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151622
Approved by: https://github.com/oulgen
2025-04-18 02:39:13 +00:00
98c892749b c10d/Store: add nonblocking mode to queue_pop (#151485)
This adds a non-blocking mode to queue_pop. This allows workers to poll whether work is ready without blocking the main loop, which is useful when you want to keep a GPU at maximum utilization while items are only sent on the queue periodically.

We also expose a `torch.distributed.QueueEmptyError` so users can catch the error and handle it accordingly.
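
A hedged sketch of the polling pattern this enables; the exact non-blocking keyword is an assumption here, so treat it as illustrative only:
```python
import torch.distributed as dist

# Illustrative only: the store/queue method names come from this PR's description,
# but the non-blocking keyword (`block=False`) is an assumption and may differ.
store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)
try:
    item = store.queue_pop("work", block=False)  # poll without blocking (assumed kwarg)
except dist.QueueEmptyError:
    item = None  # nothing queued yet; go do other (e.g. GPU) work and retry later
```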

Test plan:

```
pytest test/distributed/test_store.py -k queue -v -s -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151485
Approved by: https://github.com/fduwjj, https://github.com/tianfengfrank
2025-04-18 02:14:50 +00:00
3ed5f1fb77 [CUDA][cuBLAS] Aten GEMM overload for FP32 output from FP16/BF16 inputs (#150812)
Enable FP32 output from FP16/BF16 GEMMs in aten with cuBLAS. Accumulation for these GEMMs is generally already done in FP32. Adds the functionality to the following aten operators:
* mm
* bmm
* addmm
* baddbmm

Follow up of customer issue: https://github.com/pytorch/pytorch/issues/146241#issuecomment-2781889390

Differential Revision: [D73126191](https://our.internmc.facebook.com/intern/diff/D73126191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150812
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-04-18 01:53:26 +00:00
a6182903cd Update PyTorchStreamReader API to take cpu allocator override (#150439)
Summary: Add allocator param in getRecord

Test Plan:
newly added UT
```
buck test caffe2/caffe2/serialize:inline_container_test
```

Differential Revision: D72252585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150439
Approved by: https://github.com/albanD
2025-04-18 01:53:14 +00:00
b434322075 Fix has_free_symbols (#151492)
It used to fail for:
```
self.assertFalse(has_free_symbols(sympy.S.true))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151492
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #151170, #151171
2025-04-18 01:19:01 +00:00
c2a202169d Fix implicit state dict modification (#151436)
Summary: Previously we were modifying ep.state_dict while running decomp, which we shouldn't.

Test Plan: CI

Fixes: https://github.com/pytorch/pytorch/issues/151366

Differential Revision: D73102315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151436
Approved by: https://github.com/angelayi
2025-04-18 00:58:55 +00:00
34266836d5 [Inductor] Suppress cuda init error for CPU only Inductor (#151528)
**Summary**
After https://github.com/pytorch/pytorch/pull/151255, invoking `torch.compile` on a non-CUDA device prints the following error:
`E0416 23:39:55.953000 418833 torch/_inductor/codegen/cuda/cuda_env.py:22] Error getting cuda arch: Torch not compiled with CUDA enabled.`
This PR updates the code to initialize `PRESETS` only when CUDA is available, preventing this error message from being printed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151528
Approved by: https://github.com/jansel, https://github.com/henrylhtsang
2025-04-18 00:55:01 +00:00
9e235c549c [C10D] avoid computing global_rank when group_rank is used (#151373)
collective APIs accept either group or global rank for src/dst rank.

We provide a helper `_canonicalize_group_rank` which converts from maybe
group or maybe global to one particular format (defined by the kwarg
return_global: bool=False).

In this PR we stop performing the mapping lookup that converts group to
global or global to group in the case that the caller wants us to return
the same value that was passed in.  The PR should be functionally
equivalent, except in cases where the mapping itself would raise an
exception but the mapping was not necessary in the first place.

This has come up in cases where people create new process groups outside
of 'init_process_group' APIs and group-specific ranks may not have a
valid mapping to the 'global' rank.
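
For context, a small sketch of the group-rank/global-rank mapping using existing public `torch.distributed` helpers (not `_canonicalize_group_rank` itself):
```python
import torch.distributed as dist

# Assumes an initialized default process group with world_size >= 4.
subgroup = dist.new_group(ranks=[1, 3])  # built from global ranks 1 and 3

# Mapping in both directions; this is the kind of lookup the PR now skips
# when the caller asked for the same convention the value already uses.
assert dist.get_global_rank(subgroup, 1) == 3  # group rank 1 -> global rank 3
assert dist.get_group_rank(subgroup, 3) == 1   # global rank 3 -> group rank 1
```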

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151373
Approved by: https://github.com/xunnanxu, https://github.com/d4l3k
2025-04-17 23:53:50 +00:00
d8bafd23ab [DDP] add one option to allow skipping all reduce unused parameters (#151503)
Summary: add an option to allow skipping the all-reduce of unused parameters; this could help improve training throughput significantly when the model has a large number of unused parameters.

Test Plan: unit tests, CI

Differential Revision: D72282069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151503
Approved by: https://github.com/mrshenli
2025-04-17 23:30:19 +00:00
6d46b530fc Remove libdevice ops in inductor (#151562)
Now that we track dtypes during codegen, we can delete all these extra ops that worked around the problem by doing dispatch at lowering time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151562
Approved by: https://github.com/isuruf, https://github.com/jansel
2025-04-17 22:18:00 +00:00
bdb34f55a0 [fake tensor cache] Support index with non bool/int8 indices (#151477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151477
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #151330, #151256, #151357
2025-04-17 21:51:08 +00:00
0129c3a4e1 Use reusable binary docker build action for almalinux, clean up script (#151483)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Use the binary docker build action from https://github.com/pytorch/pytorch/pull/151471

Change the workflow trigger to be all of .ci/docker so it will make a new image + tag whenever it changes.

build script:
* change to be independent of the CUDA_VERSION env var, since all the info should be in the imagename:tag
* remove docker push parts since that will happen during the workflow
* clean up a bit
* make the build script more like the CI build script (use a temp image name)

I don't think this image is actually used anywhere

Also push the docker image to imagename:tag; I had removed this in the PR that added the reusable workflow because I thought it was not in the original scripts, but it actually is there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151483
Approved by: https://github.com/ZainRizvi
2025-04-17 21:32:56 +00:00
652fa451a4 [dynamo] support fb internal bytecode EAGER_IMPORT_NAME (#151362)
Differential Revision: [D73127097](https://our.internmc.facebook.com/intern/diff/D73127097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151362
Approved by: https://github.com/oulgen
2025-04-17 21:19:45 +00:00
d5dda82586 [export] Integrate meta kernel generation with draft-export (#150809)
If a custom operator does not contain a fake impl, currently draft-export will use the real-tensor propagation to get an output for the operator and continue tracing. However if we retrace the exported model using `ep.run_decompositions`, or `export`, or run the exported program with fake tensors, we'll still fail because there's no fake impl.

With this PR, after draft-export we will generate an operator profile for each operator call that we encounter, and store this on the report attached to the exported program `ep._report.op_profiles`. Users can then use `torch._library.fake_profile.register_fake_profile` to temporarily generate and register a fake impl based on these operator profiles. This way future fake tensor retracing will work.

The workflow would look something like:
```python
class M(torch.nn.Module):
    def forward(self, a, b):
        res = torch.ops.mylib.foo8(a, b)  # no fake impl
        return res

ep = export(M(), (torch.ones(3, 4), torch.ones(3, 4)))  # this fails bc no fake impl
ep = draft_export(M(), (torch.ones(3, 4), torch.ones(3, 4)))

ep.run_decompositions()  # this fails bc no fake impl
# this registers fake impls based on the profiles
with torch._library.fake_profile.register_fake_profile(ep._report.op_profiles):
    decomp = ep.run_decompositions()  # this works

new_inp = (
    torch.ones(2, 3, 4),
    torch.ones(2, 3, 4),
)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150809
Approved by: https://github.com/zou3519
2025-04-17 20:52:31 +00:00
4f62dccbda [Cutlass] Implement Epilogue Argument emitter (#150903)
This implements epilogue visitor tree argument generation (example type [here](3fe62887d8/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp (L332))).

Details:
The codegen task here is to implement a function which can generate a tree of C++ structs and properly extract the correct properties from Inductor buffers, writing them to the correct locations in the generated struct. To implement this with the minimum amount of code, I generate the cutlass DAGIR (the EVT internal representation), which has a specific pass, [pass_argument_type.py](5e497243f7/python/cutlass/backend/evt/passes/pass_argument_type.py (L4)), that generates a nested tree of custom argument types for each node in the DAGIR. This nested tree of constructors is then passed kwargs to fill in the proper values, where each node's name is used to differentiate between the different values in the kwarg dictionary. This, however, is non-customizable; the nested tree of EVT args is a nested tree of ctypes which expects *actual values* so that the object can be passed directly to the cutlass-python C++ runner. Inductor, on the other hand, needs to fill this struct with string C++ expressions representing the values (or extracting the values from kernel launcher args). So `_render_argument_type` implements this: it iterates over the tree of types created by pass_argument_type.py and generates a string representing the nested structs, filling in C++ expressions representing the different fields.

Long term plan:
Long term, I will ask the nvidia to provide an overridable [visitor_factory](5e497243f7/python/cutlass/backend/evt/passes/pass_argument_type.py (L82)) which could allow us to override the behavior of pass_argument_type.py to generate the string we would like during DAGIR generation.

Previously merged:
* #150346
* #150345
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150903
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-04-17 20:30:21 +00:00
8e0f9fbccf [c10] helpers for runtime c10::alias re-use (#151361)
Summary: we need these to check whether the input tensor was re-sized/strided between executions when choosing to alias

Test Plan: CI

Reviewed By: henryoier

Differential Revision: D73061676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151361
Approved by: https://github.com/SherlockNoMad
2025-04-17 20:27:17 +00:00
da580123a0 [BE][Easy]: Dedupe a TypeAlias in PrimsCommon (#151565)
Replaces a duplicate TypeAlias with a reference to the existing global constant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151565
Approved by: https://github.com/albanD
2025-04-17 19:59:41 +00:00
c4688af254 Fix lint
Introduced by fb6ac2f16132f7953711ce6924bc2ee4a033228c
2025-04-17 12:48:52 -07:00
473a38b562 [DCP] Add logging for _stateful_to_state_dict(), stage_state_dict(), and synchronize_staging() (#151320)
Summary: As titled.

Test Plan: CI

Differential Revision: D73040700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151320
Approved by: https://github.com/saumishr
2025-04-17 12:48:39 -07:00
c5b10ff119 [BE][Easy]: Normalize Dim typing in torch distributed (#151566)
Improve typing using prims_common dtypes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151566
Approved by: https://github.com/albanD
2025-04-17 19:30:09 +00:00
2ed2cb5805 add generalized pareto distribution (GPD) (#135968)
Add the GPD as a distribution class
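
A hedged usage sketch, assuming the class is exposed as `torch.distributions.GeneralizedPareto` with `loc`/`scale`/`concentration` parameters (these names are assumptions; check the actual signature):
```python
import torch
from torch import distributions as D

# Hypothetical parameter names for illustration only.
gpd = D.GeneralizedPareto(loc=0.0, scale=1.0, concentration=0.1)
samples = gpd.sample((5,))
log_probs = gpd.log_prob(samples)
print(samples, log_probs)
```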

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135968
Approved by: https://github.com/albanD

Co-authored-by: Alexander März <statmixedmlgit@gmail.com>
2025-04-17 18:51:02 +00:00
7e2081fa93 Optimize interpolate saturate description (#151304)
Fixes #108225

## Test Result

### Before

![image](https://github.com/user-attachments/assets/bdbf8a5c-d5a4-44a5-b81e-2cbb5b8bfd02)

### After

![image](https://github.com/user-attachments/assets/1c21a27d-1700-4661-9988-dbb1cdc81fa2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151304
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-04-17 18:34:29 +00:00
055e59e709 [bazel] Build flatbuffers within bazel (#151364)
This is similar to how we handle protobufs and it makes it more convenient for bazel users to handle their version of flatbuffers. (Flatbuffers is very picky about the generated code matching the runtime). Instead of using the checked in generated code, we generate it on the fly.

This is relevant to https://github.com/pytorch/pytorch/issues/112903, because having the version of flatbuffers tied to pytorch will make pytorch difficult to use as an external workspace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151364
Approved by: https://github.com/malfet
2025-04-17 18:33:51 +00:00
3a6b3c8e0e Combine windows x64 and arm64 yaml template files (#149850)
While introducing Windows-Arm64 nightly workflows, we created a separate template file for win-arm64. This PR combines x64&arm64 and deletes the win-arm64 one.
Fixes #148776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149850
Approved by: https://github.com/ozanMSFT, https://github.com/malfet
2025-04-17 17:58:55 +00:00
1ce7969e81 Revert "[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)"
This reverts commit 90c5b86cd8fcbbe6ee7c46ad17a05767f884794b.

Reverted https://github.com/pytorch/pytorch/pull/151404 on behalf of https://github.com/clee2000 due to broke a cpp extension test? test_cpp_extensions_stream_and_event.py::TestCppExtensionStreamAndEvent::test_stream_event [GH job link](https://github.com/pytorch/pytorch/actions/runs/14519277500/job/40736981315) [HUD commit link](90c5b86cd8), bad TD ([comment](https://github.com/pytorch/pytorch/pull/151404#issuecomment-2813649667))
2025-04-17 17:45:41 +00:00
ae6f6b8efb [Inductor] Remove singleton tiling splits when prefer_nd_tiling=True (#151508)
# Issue
Users who want block pointers are likely to use the config settings `{"triton.use_block_ptr": True, "triton.prefer_nd_tiling": True, "triton.max_tiles": 3}`. Among other things, these settings allow us to generate 3D block pointers for broadcasts. However, broadcasts which don't truly require 3D often end up introducing a superfluous tiling dimension of size 1.

For example, given this function with elementwise multiplication:
```
def foo(x, y, z):
    a = x * y
    b = 128.0
    c = a * b
    d = a * z
    e = x * z
    return a, c, d, e

inps = [
    torch.randn((8, 11, 128), device=self.device),
    torch.randn((128,), device=self.device),
    torch.randn((8, 11, 128), device=self.device),
]

torch.compile(foo)(*inps)
```

We get the following Triton kernels:
```
@triton.jit
def triton_poi_fused_mul_0(in_ptr0, in_ptr1, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    znumel = 88
    ynumel = 1
    xnumel = 128
    zoffset = tl.program_id(2) * ZBLOCK
    zindex = zoffset + tl.arange(0, ZBLOCK)[:, None, None]
    zmask = zindex < znumel
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
    ymask = tl.full([ZBLOCK, YBLOCK, XBLOCK], True, tl.int1)
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, None, :]
    xmask = xindex < xnumel
    x1 = xindex
    z0 = zindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), boundary_check=[0, 1], eviction_policy='evict_last')[:, None, :]
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[128], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, None, :]
    tmp2 = tmp0 * tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), tl.reshape(tl.broadcast_to(tmp2, [ZBLOCK, YBLOCK, XBLOCK]), [ZBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')

@triton.jit
def triton_poi_fused_mul_1(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, xnumel, XBLOCK : tl.constexpr):
    xnumel = 11264
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp3 = tl.load(tl.make_block_ptr(in_ptr1, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp5 = tl.load(tl.make_block_ptr(in_ptr2, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp1 = 128.0
    tmp2 = tmp0 * tmp1
    tmp4 = tmp0 * tmp3
    tmp6 = tmp5 * tmp3
    tl.store(tl.make_block_ptr(out_ptr0, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
    tl.store(tl.make_block_ptr(out_ptr1, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp4, [XBLOCK]).to(tl.float32), boundary_check=[0])
    tl.store(tl.make_block_ptr(out_ptr2, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp6, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

Note that one kernel has `ynumel=1`. The extra dimension results in more expensive address calculations, and also seems to prevent fusion.

# Fix

To fix this, this PR filters out any splits of size 1 from the `prefer_nd_tiling` algorithm. This results in the following fused kernel, with 2D tiling:

```
@triton.jit
def triton_poi_fused_mul_0(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 88
    xnumel = 128
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[:, None]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, :]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[128], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
    tmp5 = tl.load(tl.make_block_ptr(in_ptr2, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp2 = tmp0 * tmp1
    tmp3 = 128.0
    tmp4 = tmp2 * tmp3
    tmp6 = tmp2 * tmp5
    tmp7 = tmp0 * tmp5
    tl.store(tl.make_block_ptr(out_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp2, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
    tl.store(tl.make_block_ptr(out_ptr1, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp4, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
    tl.store(tl.make_block_ptr(out_ptr2, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp6, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
    tl.store(tl.make_block_ptr(out_ptr3, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp7, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```

# Test plan
Added the test case above to CI. Checked that a single kernel is generated with 2D tiling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151508
Approved by: https://github.com/jansel
2025-04-17 17:37:45 +00:00
b4550541ea [ROCm] upgrade nightly wheels to rocm6.4 (#151355)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151355
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-17 17:29:07 +00:00
ef64beb232 Include post grad gm and fx runnable in cache artifacts for tlparse (#151469)
Fixed #151462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151469
Approved by: https://github.com/bdhirsh
2025-04-17 17:14:13 +00:00
ee3366dbb2 [MegaCache] Encode key in base64 (#151472)
I have noticed that there are some errors like
```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 169302: invalid start byte
```

I haven't been able to repro this locally yet, but this change should fix the encoding issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151472
Approved by: https://github.com/masnesral
2025-04-17 17:12:22 +00:00
8404c09b15 [MegaCache] Rename the PGO artifact when used between different jobs (#151482)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151482
Approved by: https://github.com/bobrenjc93, https://github.com/jamesjwu
2025-04-17 17:09:29 +00:00
fe90a5c140 [Easy] Optimize clip_grad param description (#151532)
Fix missing optional description in `clip_grad_norm_` and `clip_grad_value_`
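
For reference, a minimal usage sketch of the two utilities whose parameter docs are touched here:
```python
import torch
from torch.nn.utils import clip_grad_norm_, clip_grad_value_

model = torch.nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()

total_norm = clip_grad_norm_(model.parameters(), max_norm=1.0)  # returns the pre-clip norm
clip_grad_value_(model.parameters(), clip_value=0.5)            # clips grads in place
```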

## Test Result

### Before

![image](https://github.com/user-attachments/assets/3393dd4b-a730-4dd4-8304-9b895ac669d4)

![image](https://github.com/user-attachments/assets/220c4738-a728-474b-b06d-b5be7660d150)

### After

![image](https://github.com/user-attachments/assets/5637bb68-3b6d-49a3-8ee1-3af636950aa0)

![image](https://github.com/user-attachments/assets/c0f1d966-a9ba-4fac-a874-9d4955f6e0d6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151532
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-04-17 16:47:38 +00:00
c3a18f6126 [AOTInductor] Add states for constant folding process (#151273)
Summary:
We add states to the constant folding process for AOTInductor.
There are 3 states:
(1) None: no constants are loaded and the model is uninitialized.
(2) Initialized: constants are loaded, but not yet folded.
(3) Folded: the model is fully ready with folded constants.

Note that even if constant folding is not enabled, we still only run when the
state is FOLDED; this is okay because without constant folding, the
transition from INITIALIZED to FOLDED is just a pass-through.
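
A purely illustrative sketch of that state machine (identifier names here are assumptions, not the actual implementation):
```python
from enum import Enum, auto

class ConstantState(Enum):
    NONE = auto()         # no constants loaded, uninitialized
    INITIALIZED = auto()  # constants loaded but not yet folded
    FOLDED = auto()       # constants folded; model ready to run

def can_run(state: ConstantState) -> bool:
    # Execution is gated on FOLDED even when folding is disabled, because the
    # INITIALIZED -> FOLDED transition is then just a pass-through.
    return state is ConstantState.FOLDED
```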

Test Plan:
python test/inductor/test_aot_inductor.py -k test_constant_folding_with_update

Differential Revision: [D73002538](https://our.internmc.facebook.com/intern/diff/D73002538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151273
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-04-17 16:41:38 +00:00
4843ce7611 [BE] Remove outdated script to check namespace BC (#151453)
Now that we have bc_lint in CI, this script is no longer needed (nor has it ever been conclusive). I've already updated the Runbook to not need this script.

Suppressing bc_lint as this script is not shipped as part of torch--it is not user facing! For context, this script is (rarely) used by the release notes manager to ensure BC across releases. It had been broken since at least 2.6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151453
Approved by: https://github.com/albanD, https://github.com/jbschlosser
2025-04-17 15:43:53 +00:00
90c5b86cd8 [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title stated

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
2025-04-17 15:30:12 +00:00
7f528751cc [Inductor] fix torch._inductor.exc.InductorError: KeyError (#151424)
Fixes #151423, which is a regression after #150845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151424
Approved by: https://github.com/eellison
2025-04-17 15:07:43 +00:00
bb11122e12 Update docker image names for s390x (#151426)
Disable switching the tag for s390x docker images.

Keep it that way unless they are published; otherwise there is no way to determine in advance which docker image names are needed for building s390x binaries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151426
Approved by: https://github.com/malfet, https://github.com/seemethere
2025-04-17 12:47:23 +00:00
fa6e842527 [MPS] Make fused rms_norm traceable (#150661)
This fixes a regression introduced by https://github.com/pytorch/pytorch/issues/150629#issue-2970312779, which I should have reviewed more thoroughly.

- Defined `_fused_rms_norm`, added an MPS-only implementation for it, and dispatch to it from `rms_norm_symint`, which is registered as `CompositeImplicitAutograd`, i.e. it is not supposed to do any computations over Tensors, only dispatch to other ops
- Register `_fused_rms_norm` as a fallback in `torch/_inductor/lowering.py`
- Added unit test to avoid those regressions in the future

TODO:
- Get rid of this op, change `rms_norm_symint` definition to `CompositeExplicitAutograd` and implement backward function in `tools/autograd/derivatives.yaml`
- Benchmark compiler and re-enable decomp as follows when compiled code is faster
```python
@register_decomposition(aten._rms_norm_fused)
def rms_norm_fused(
    self: torch.Tensor, ndim: int, weight: torch.Tensor, eps: float
) -> torch.Tensor:
    dtr = [self.dim() - i - 1 for i in range(ndim)]
    return self * weight * (self.pow(2).mean(dtr, keepdim=True).add(eps).rsqrt())
```

Fixes https://github.com/pytorch/pytorch/issues/150629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150661
Approved by: https://github.com/manuelcandales, https://github.com/jansel
2025-04-17 11:32:00 +00:00
41b82611ee Revert "[Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device (#144756)"
This reverts commit 300e0ee13c08ef77e88f32204a2e0925c17ce216.

Reverted https://github.com/pytorch/pytorch/pull/144756 on behalf of https://github.com/malfet due to Broke rocm torch bench runs with  TypeError: unsupported operand type(s) for |: 'set' and 'list' ([comment](https://github.com/pytorch/pytorch/pull/144756#issuecomment-2812525970))
2025-04-17 11:09:01 +00:00
e4fe67f623 Revert "[MPS] Make fused rms_norm traceable (#150661)"
This reverts commit 682f09ec51526aefe6b504ac8081944baa866556.

Reverted https://github.com/pytorch/pytorch/pull/150661 on behalf of https://github.com/malfet due to Has decomp started to fail again ([comment](https://github.com/pytorch/pytorch/pull/150661#issuecomment-2812520408))
2025-04-17 11:06:05 +00:00
32c79da789 [Easy] Fix the compilation warning of BlasKernel. (#151302)
As the title stated.

Change Before:
```C++
[2/21] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/BlasKernel.cpp.o
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:346:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  346 | void gemv_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:329:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  329 | bool gemv_use_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:301:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  301 | void gemv_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:273:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  273 | bool gemv_use_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151302
Approved by: https://github.com/malfet, https://github.com/aditew01
ghstack dependencies: #151427
2025-04-17 10:50:22 +00:00
f29fe78cf2 [Dynamo] Implement sourceless named tuple support (#151266)
Fixes https://github.com/pytorch/pytorch/issues/140903

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151266
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/anijain2305
2025-04-17 08:43:03 +00:00
49c91b4be9 [Easy][Building] Fix the warning of int4mm.cu when building (#151427)
As the title stated.

**Changes Before:**

```C++
[999/1526] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/int4mm.cu.o
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/cuda/int4mm.cu(142): warning #177-D: variable "at::native::kWarpSize" was declared but never referenced
  constexpr int32_t kWarpSize = 32;
                    ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151427
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-04-17 08:21:32 +00:00
a05cc9f494 Remove Clear Cache Time from do_bench_using_profiling (#150696)
Summary: In most instances, this action would take ~33% of the total run time, which means that our benchmark would previously differ from the end results by a lot.

Test Plan:
We can compare the benchmark results for
```
CUDA_VISIBLE_DEVICES=4,5 buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100a //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-snapshot-id=672308665_0 --lower-backend=AOT_INDUCTOR --node-replacement-dict="{'torch.nn.Linear':{'(autotune)': 'fp8_float_model_dynamic_quantization_rowwise'}}" --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024
```
before and after the diff, and notice that on average, the benchmark results decrease by ~0.1ms per iteration, which is more closely aligned with the lowered modules.

Differential Revision: D72469845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150696
Approved by: https://github.com/frank-wei
2025-04-17 07:25:41 +00:00
e0f05229e9 [ez] Make relaxed constraint error message more user friendly (#151407)
Fixes #151356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151407
Approved by: https://github.com/Skylion007
2025-04-17 06:43:10 +00:00
10a54ffe5a [inductor] Reduce runtime of CPU OpInfo tests (#151435)
`has_triton()` returns True if Triton is present on the system and supports _any_ backend we care about. In this case, that means we _always_ check gradients, even though the intended behavior is to skip gradients when testing on CPU.

Fixes a bug from #146911.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151435
Approved by: https://github.com/masnesral
2025-04-17 05:25:14 +00:00
b7d9f44602 [executorch hash update] update the pinned executorch hash (#151493)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151493
Approved by: https://github.com/pytorchbot
2025-04-17 05:14:12 +00:00
682f09ec51 [MPS] Make fused rms_norm traceable (#150661)
This fixes a regression introduced by https://github.com/pytorch/pytorch/issues/150629#issue-2970312779, which I should have reviewed more thoroughly.

- Defined `_fused_rms_norm`, added an MPS-only implementation for it, and dispatch to it from `rms_norm_symint`, which is registered as `CompositeImplicitAutograd`, i.e. it is not supposed to do any computations over Tensors, only dispatch to other ops
- Register `_fused_rms_norm` as a fallback in `torch/_inductor/lowering.py`
- Added unit test to avoid those regressions in the future

TODO:
- Get rid of this op, change `rms_norm_symint` definition to `CompositeExplicitAutograd` and implement backward function in `tools/autograd/derivatives.yaml`
- Benchmark compiler and re-enable decomp as follows when compiled code is faster
```python
@register_decomposition(aten._rms_norm_fused)
def rms_norm_fused(
    self: torch.Tensor, ndim: int, weight: torch.Tensor, eps: float
) -> torch.Tensor:
    dtr = [self.dim() - i - 1 for i in range(ndim)]
    return self * weight * (self.pow(2).mean(dtr, keepdim=True).add(eps).rsqrt())
```

Fixes https://github.com/pytorch/pytorch/issues/150629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150661
Approved by: https://github.com/manuelcandales, https://github.com/jansel
2025-04-17 04:15:24 +00:00
17ea9d1478 Revert "[DCP] Add logging for _stateful_to_state_dict(), stage_state_dict(), and synchronize_staging() (#151320)"
This reverts commit fb6ac2f16132f7953711ce6924bc2ee4a033228c.

Reverted https://github.com/pytorch/pytorch/pull/151320 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/151320#issuecomment-2811669325))
2025-04-17 03:57:03 +00:00
a94483329c [MPS] Start benchmarking compile results (#151155)
To know passrate and speedup
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151155
Approved by: https://github.com/dcci
2025-04-17 02:45:39 +00:00
f5851efed9 Fix torch.autograd.backward inputs validation (#150975)
- Fixes #150883
- Fixes #70504

This is my first PR to pytorch, so please tell me if I'm forgetting anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150975
Approved by: https://github.com/soulitzer
2025-04-17 02:11:13 +00:00
6f9ffaa991 [c10d][fr] Fix script for uneven reduce scatter and update test cases (#151475)
Somehow the type string for reduce scatter is "REDUCE_SCATTER", not "REDUCESCATTER". This PR fixes that and adds more test cases.

Differential Revision: [D73141245](https://our.internmc.facebook.com/intern/diff/D73141245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151475
Approved by: https://github.com/fegin
2025-04-17 02:11:08 +00:00
cd1db55817 Fix tensor_constant name collision in aot_export_module (#151123)
Summary:
When we have an exported program that looks like this:

```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, b__tensor_constant0: "f32[1]", ... c_lifted_tensor_0: "i64[925]", …. , tupleized_input_0_0: "f32[10, 2139]",

            clone: "i64[925]" = torch.ops.aten.clone.default(c_lifted_tensor_0);  c_lifted_tensor_0 = None

            index_select: "f32[10, 925]" = torch.ops.aten.index_select.default(tupleized_input_0_0, 1, clone);  clone = None
```

The graph after `aot_export_module` could have a name collision: notice that the `_tensor_constant0` arg of `clone` is different from the `_tensor_constant0` in the input module.

```
def forward(self):
        arg9_1: "f32[10, 2139]"

        _tensor_constant0: "f32[1]" = self._tensor_constant0 # this should be int64, conflicted with the original _tensor_constant0, had a clone on this constant before lifting

        index: "f32[10, 925]" = torch.ops.aten.index.Tensor(arg9_1, [None, _tensor_constant0]);  _tensor_constant0 = None
```

This caused the `tensors used as indices must be binary, int...` AOTI error on the PT2I dashboard, because `clone` is later used as an index.

We had this error because we created a new `_tensor_constant0` [here](https://github.com/pytorch/pytorch/blob/main/torch/fx/_symbolic_trace.py#L403-L412), and the new `_tensor_constant0` overrides the original `_tensor_constant0` on the input Module in `_unlift_graph`. The `arg` for `clone` is created in `create_proxy` in `proxy.py`.

To fix this, we do a graph pass before we unlift the graph inputs to avoid the name collision.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding

buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```

Differential Revision: D72761937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151123
Approved by: https://github.com/tugsbayasgalan, https://github.com/jingsh
2025-04-17 01:52:21 +00:00
bf92c9883b Refine host caching allocator (#151403)
# Motivation
This stack of PRs aims to generalize and improve PyTorch host allocator code.

This PR introduces a `DeleterFnPtr` template parameter to `CachingHostAllocatorInterface` to resolve circular dependency issues. This change allows for better code reuse and simplifies the implementation of host allocators.

# Additional Context
TODO:
- [ ] Unify host allocator related API
- [ ] Deprecate those device-specific legacy API
- [ ] Move `is_pinned` to host allocator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151403
Approved by: https://github.com/gujinghui, https://github.com/albanD
2025-04-17 01:50:47 +00:00
fb6ac2f161 [DCP] Add logging for _stateful_to_state_dict(), stage_state_dict(), and synchronize_staging() (#151320)
Summary: As titled.

Test Plan: CI

Differential Revision: D73040700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151320
Approved by: https://github.com/saumishr
2025-04-17 01:08:32 +00:00
300e0ee13c [Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device (#144756)
Reopen the previous stale closed PR https://github.com/pytorch/pytorch/pull/134192

We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device.
This pull request preserves the original tolerance threshold for the CUDA device and introduces a new key, `higher_fp16_bf16_xpu`, which only impacts the XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144756
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/desertfire
2025-04-17 00:26:55 +00:00
2fd26925c4 improve noop elimination for view (#151095)
This PR improves noop elimination.

### View Noop

```python
>>> torch.Size([1,2,3]) == [1,2,3]
False
>>> torch.Size([1,2,3]) == (1,2,3)
True
```
So we add `tuple(size)` in `view_noop`.

Example:
```python
import torch

@torch.compile()
def f(x):
    batch_size = x.shape[0]
    x = x.transpose(1, 2) # (batch_size, 2, 3)
    x = x.reshape(batch_size, 2, 3) # noop
    return x

x = torch.randn((2,3,2))
f(x)

x = torch.randn((4,3,2))
f(x)
```

Before:
![image](https://github.com/user-attachments/assets/be488881-6c99-43a9-b088-fa481f675775)

After:
![image](https://github.com/user-attachments/assets/6d93be3d-128b-44d4-ad6a-d3d18e272329)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151095
Approved by: https://github.com/eellison
2025-04-16 23:55:32 +00:00
9a2624c712 Fix keepdim param optional description (#151197)
Fixes #151104

Fix the optional-parameter description of `dim` and `keepdim`, except for `torch.quantile`, which was already fixed in #146485.
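
A quick illustration of the optional parameters being documented:
```python
import torch

x = torch.randn(3, 4)
print(torch.sum(x).shape)                       # dim omitted: reduce over all dims -> scalar
print(torch.sum(x, dim=0).shape)                # torch.Size([4])
print(torch.sum(x, dim=0, keepdim=True).shape)  # torch.Size([1, 4])
```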

## Test Result

### Before

![image](https://github.com/user-attachments/assets/69f1824d-3d15-407e-8c92-f25a22e16914)

### After

![image](https://github.com/user-attachments/assets/e5aac674-ab8f-4988-a5f1-7400c36bdc99)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151197
Approved by: https://github.com/mikaylagawarecki
2025-04-16 23:15:30 +00:00
9e6ad274dc Action for building docker binary builds (#151471)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Uses the calculate docker image action with the new custom tag prefix, so the naming convention of the docker images is slightly different for images built on PRs

based off of a582f04608/.github/workflows/build-manywheel-images.yml (L101)

Also moves the push of the docker images from inside the build scripts to inside the workflow

Currently not used anywhere, but the binary docker builds are very similar so I'm going to change them to use this instead

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151471
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/ZainRizvi
2025-04-16 23:01:35 +00:00
cd7bc60e11 Migrate to new theme (#149331)
- Migrate pytorch docs, cpp docs and functorch docs to the pytorch_sphinx_theme2
- Migrate index.rst to markdown and restructure to use high-level horizontal bar sections Python API, Developer Notes
- Added python-api.md which becomes the main container for the API docs. This file will be used to add all api references in the toctree. It would be great to have lint for this file: https://github.com/pytorch/pytorch/issues/150718
- Enabled mermaid sphinx extension and opengraph sphinx extension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149331
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/albanD
2025-04-16 21:35:19 +00:00
1ffaa00ad7 [MPS] Migrate bitwise_not to unary operator (#151460)
That kills two birds with one stone:
 - Makes implementations more standardized (and faster for strided inputs/outputs)
 - Fixes a bug in strided in-place bitwise_not

I.e. before this change
```python
import torch
x=torch.arange(32, device="mps")
x[::2].bitwise_not_()
print(x)
```
produced
```
tensor([ -1,  -2,  -3,  -4,  -5,  -6,  -7,  -8,  -9, -10, -11, -12, -13, -14,
        -15, -16,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
         28,  29,  30,  31], device='mps:0')
```
after, it generates reasonable output
```
tensor([ -1,   1,  -3,   3,  -5,   5,  -7,   7,  -9,   9, -11,  11, -13,  13,
        -15,  15, -17,  17, -19,  19, -21,  21, -23,  23, -25,  25, -27,  27,
        -29,  29, -31,  31], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151460
Approved by: https://github.com/dcci, https://github.com/qqaatw, https://github.com/Skylion007
2025-04-16 21:34:45 +00:00
f252f9df5e Revert "[Openreg][PrivateUse1] Enable CI for openreg (#151007)"
This reverts commit abbca37fe882541e0259b43dd314a324180550ed.

Reverted https://github.com/pytorch/pytorch/pull/151007 on behalf of https://github.com/clee2000 due to At least test_record_event needs to also be skipped on dynamo too, its failing and then somehow causing a hang? https://github.com/pytorch/pytorch/actions/runs/14487625709/job/40637535027#step:25:73 ([comment](https://github.com/pytorch/pytorch/pull/151007#issuecomment-2810789483))
2025-04-16 21:05:17 +00:00
e0535e823f Revert "[Openreg][PrivateUse1] Fix releasing tensor issue when using pin_memory (#151091)"
This reverts commit e229ce34c4ab8cd4e2800227615be32fb362b1e6.

Reverted https://github.com/pytorch/pytorch/pull/151091 on behalf of https://github.com/clee2000 due to At least test_record_event needs to also be skipped on dynamo too, its failing and then somehow causing a hang? https://github.com/pytorch/pytorch/actions/runs/14487625709/job/40637535027#step:25:73 ([comment](https://github.com/pytorch/pytorch/pull/151007#issuecomment-2810789483))
2025-04-16 21:05:17 +00:00
5b5399bfcd [graph partition] reorder to reduce #partitions for simple dependencies (#150814)
This PR reduces #graph partitions by reordering nodes when the `should_partition` nodes have simple dependencies. Specifically, for `should_partition` nodes:
    a. If a node has no dependency or only depends on graph inputs: move to the front. Use case is when we move symints to cuda tensor for PaddedTensorSubclass
    b. If the only user of a node is OutputNode: move it to the end.

#### Example

The following example shows a padded tensor subclass use case where we copy symint to a cuda tensor (aka mask) in the middle of function. Reordering still generates 1 cudagraph by moving the mask to the front.

```python
import torch

torch._inductor.config.graph_partition = True

# Two reasons for this:
# 1. We want to reuse the same mask for many masked_fill calls
# 2. Prevent inductor from fusing this op into other ops (e.g. masked_fill)
#    so we can still reorder in scheduler
@torch.library.custom_op("mylib::create_mask", mutates_args=(), tags=(torch._C.Tag.cudagraph_unsafe,))
def create_mask(padded_size: int, original_size: int, device: torch.device) -> torch.Tensor:
    mask = torch.zeros((padded_size,), dtype=torch.bool, device=device)
    mask[original_size:] = True
    return mask

@create_mask.register_fake
def _(padded_size, original_size, device):
    return torch.empty((padded_size,), dtype=torch.bool, device=device)

def f(padded_tensor, original_tensor, weight):
    original_size = original_tensor.size()[0]
    padded_size = padded_tensor.size()[0]

    # element wise op so we don't care padding value
    padded_tensor = padded_tensor + 1
    padded_tensor = torch.nn.functional.relu(padded_tensor)

    # dot product requires padding with 0
    dot_res = padded_tensor.dot(weight)
    padded_tensor += dot_res

    # min requires padding with inf, so we create mask now
    mask = create_mask(padded_size, original_size, padded_tensor.device)
    min_res = torch.min(
        torch.ops.aten.masked_fill(padded_tensor, mask, float("inf"))
    )

    # max requires padding with inf. we can reuse previous mask
    max_res = torch.max(
        torch.ops.aten.masked_fill(padded_tensor, mask, -float("inf"))
    )

    return min_res+max_res+padded_tensor

compiled_f = torch.compile(f, mode="reduce-overhead")

def run(padded_size, original_size):
    padded_tensor = torch.randn(padded_size, device="cuda")
    padded_tensor[original_size:] = 0
    original_tensor = torch.randn(original_size, device="meta")

    weight = torch.randn(padded_size, device="cuda")
    eager_out = f(padded_tensor, original_tensor, weight)
    compiled_out = compiled_f(padded_tensor, original_tensor, weight)
    assert torch.allclose(eager_out[0], compiled_out[0])
    assert torch.allclose(eager_out[1], compiled_out[1])

# new cudagraph
run(8, 4)

# new cudagraph due to recompile
run(8, 6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150814
Approved by: https://github.com/eellison
2025-04-16 20:49:20 +00:00
a582f04608 Revert "[ez] Make relaxed constraint error message more user friendly (#151407)"
This reverts commit bc934f57d7c14b07e7497eb72a90d893270bc662.

Reverted https://github.com/pytorch/pytorch/pull/151407 on behalf of https://github.com/izaitsevfb due to breaks export tests ([comment](https://github.com/pytorch/pytorch/pull/151407#issuecomment-2810716135))
2025-04-16 20:40:22 +00:00
607443b16b [compile][compile time traces] Add more dynamo traces (#151357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151357
Approved by: https://github.com/williamwen42
ghstack dependencies: #151330, #151256
2025-04-16 20:37:08 +00:00
8e373592c8 [aot autograd][logging] Profile large missing gaps in compile time tracing (#151256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151256
Approved by: https://github.com/bdhirsh, https://github.com/masnesral
ghstack dependencies: #151330
2025-04-16 20:37:08 +00:00
c58b3f6be3 [invoke_subgraph][inductor] Run pre and post grad passes on invoke_subgraph (#151330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151330
Approved by: https://github.com/eellison, https://github.com/zou3519
2025-04-16 20:37:01 +00:00
4c4a5df73b Allow to run flex_attention on HPU (#148656)
HPU-specific implementation details are to be located in the out-of-tree HPU library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148656
Approved by: https://github.com/drisspg
2025-04-16 19:49:15 +00:00
9400f53903 [Inductor] Broadcast to range tree shape before block pointer store (#151399)
# Feature

This fixes a bug related to block pointer stores. Since Triton's block pointer stores don't support implicit broadcasting, in certain cases we need to generate a `reshape->broadcast->reshape` pattern to ensure that the tensor being stored has the same shape as the block pointer. This happens when the block indexing expression involves strides of 0 or dimensions of 1, both of which we eliminate from the block pointer.

The existing logic missed an important edge case.  We may need a broadcast prior to the first `reshape` of this pattern, in case the tensor comes from a load with implicit broadcasting. For example, if the range trees have shape `[YBLOCK, XBLOCK]`, but the load has a shape `[1, XBLOCK]`, we need to broadcast this to `[YBLOCK, XBLOCK]` prior to storing. See the example kernel below, which comes from `expand` -> `clone` with 3D tiling. The load has an implicit broadcast, and the store has a reshape. Thus, we need to insert an explicit broadcast between them.

```
@triton.jit
def triton_poi_fused_clone_0(in_ptr0, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    znumel = 32
    ynumel = 1
    xnumel = 32
    zoffset = tl.program_id(2) * ZBLOCK
    zindex = zoffset + tl.arange(0, ZBLOCK)[:, None, None]
    zmask = zindex < znumel
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
    ymask = tl.full([ZBLOCK, YBLOCK, XBLOCK], True, tl.int1)
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, None, :]
    xmask = xindex < xnumel
    x1 = xindex
    z0 = zindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[32], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, None, :]
    tl.store(tl.make_block_ptr(out_ptr0, shape=[32, 32], strides=[32, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), tl.reshape(tl.broadcast_to(tmp0, [ZBLOCK, YBLOCK, XBLOCK]), [ZBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```

The tricky part is that we don't want to emit redundant broadcasts in the store. This PR reworks the logic a bit to make sure we don't emit a second broadcast unless it actually changes the shape.

# Test plan

Added a CI test for this case, which would fail on trunk. Checked that only one broadcast was emitted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151399
Approved by: https://github.com/jansel, https://github.com/eellison
2025-04-16 19:03:40 +00:00
eqy
17bf59340c [cuSPARSE][B200] Bump tolerances for test_sparse_csr matvec (#148721)
Small tolerance bump for blackwell (appears to use same kernel as prev. arches)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148721
Approved by: https://github.com/nWEIdia, https://github.com/ngimel
2025-04-16 18:44:18 +00:00
1f29190b59 [dynamo] unimplemented -> unimplemented_v2 in variables/builtin.py (#151145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151145
Approved by: https://github.com/Skylion007, https://github.com/StrongerXi, https://github.com/jansel, https://github.com/zou3519
2025-04-16 17:16:05 +00:00
bc934f57d7 [ez] Make relaxed constraint error message more user friendly (#151407)
Fixes #151356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151407
Approved by: https://github.com/Skylion007
2025-04-16 17:00:06 +00:00
cedcdda0ed Add ccode for CeilToInt and IntTrueDiv (#151375)
Summary: As titled

Test Plan: Test in D73052653 -- shape calculator generates successfully

Differential Revision: D73073845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151375
Approved by: https://github.com/kalpit-meta-1, https://github.com/Skylion007
2025-04-16 16:47:55 +00:00
6a3a6d22dc Revert "[dynamo] context manager/decorator for dynamo config patching during tracing (#150586)"
This reverts commit 40ce4fb24a536d175348df876f61956d4945778e.

Reverted https://github.com/pytorch/pytorch/pull/150586 on behalf of https://github.com/clee2000 due to broke some inductor tests? inductor/test_fuzzer.py::TestConfigFuzzer::test_config_fuzzer_dynamo_bisect [GH job link](https://github.com/pytorch/pytorch/actions/runs/14486513628/job/40635178179) [HUD commit link](40ce4fb24a), bad TD ([comment](https://github.com/pytorch/pytorch/pull/150586#issuecomment-2810064322))
2025-04-16 16:13:47 +00:00
0c77af3576 [MPSInductor] Add pow, log2 and FloorToInt ops (#151449)
That enables `test_pow_by_natural_log2_dynamic_shapes_mps`

Not sure why log2 printer function suffix is `OpaqueUnaryFn_log2`, rather than just `log2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151449
Approved by: https://github.com/jansel
2025-04-16 15:56:21 +00:00
e229ce34c4 [Openreg][PrivateUse1] Fix releasing tensor issue when using pin_memory (#151091)
As the title stated.

Related PR: https://github.com/pytorch/pytorch/pull/147066

Co-authored-by: Zhenbin Lin <lin-zhenbin@qq.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151091
Approved by: https://github.com/albanD
ghstack dependencies: #151005, #151007
2025-04-16 13:12:17 +00:00
c7400d0026 [inductor][comms] skip reorder_for_locality for wait nodes (#150074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150074
Approved by: https://github.com/eellison, https://github.com/bdhirsh
ghstack dependencies: #150258
2025-04-16 10:18:33 +00:00
159d8a14a6 [inductor][comms] fix node_summary for composite scheduler nodes (#150258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150258
Approved by: https://github.com/yf225
2025-04-16 10:18:33 +00:00
41c97a72a1 [export] Add draft-export to error msg (#151065)
Given an exception in torch.export, I want to try/catch it to add the message "hey try out draft-export!". Currently I only add this message for errors that draft-export is known to fix, like DataDependentErrors, ConstraintViolationErrors, and no fake impl.

Originally the error message looks like:
```
  File "/data/users/angelayi/pytorch/torch/_library/custom_ops.py", line 626, in fake_impl
    raise RuntimeError(
RuntimeError: There was no fake impl registered for <CustomOpDef(mylib::foo2)>. This is necessary for torch.compile/export/fx tracing to work. Please use `foo2_impl.register_fake` to add an fake impl.
```

Now, the error msg now looks something like:
```
  File "/data/users/angelayi/pytorch/torch/_library/custom_ops.py", line 626, in fake_impl
    raise RuntimeError(
RuntimeError: There was no fake impl registered for <CustomOpDef(mylib::foo2)>. This is necessary for torch.compile/export/fx tracing to work. Please use `foo2_impl.register_fake` to add an fake impl.

The error above occurred when calling torch.export.export. If you would like to view some more information about this error, and get a list of all other errors that may occur in your export call, you can rerun your program with the `DRAFT_EXPORT=1` envvar, or replace your `export()` call with `draft_export()`.
```

In Python versions >= 3.11, we can use `exception.add_note` to add to the error message. However, for previous versions I did a hack that modifies `e.args`.
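
A minimal sketch of that pattern (the hint text and helper name here are made up for illustration):
```python
import sys

HINT = "The error above occurred when calling torch.export.export. Consider draft_export() for more diagnostics."


def attach_hint(e: Exception) -> Exception:
    if sys.version_info >= (3, 11):
        # PEP 678: notes are appended to the rendered traceback
        e.add_note(HINT)
    elif e.args:
        # older Pythons: fold the hint into the exception's first argument
        e.args = (f"{e.args[0]}\n\n{HINT}",) + e.args[1:]
    else:
        e.args = (HINT,)
    return e
```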

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151065
Approved by: https://github.com/pianpwk
ghstack dependencies: #151051
2025-04-16 08:56:02 +00:00
84e633e09d [export] Make draft-export predispatch=True by default (#151051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151051
Approved by: https://github.com/pianpwk
2025-04-16 08:56:02 +00:00
a5c61668d7 fix ambiguous error message (#150086)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150086
Approved by: https://github.com/anijain2305
2025-04-16 08:48:05 +00:00
0a489f924d Fix: missing () in generated runtime assert c++ code (#151171)
Address one of the issues in https://github.com/pytorch/pytorch/issues/151127
The generated code used to be

not a==5 or b==5

but should be

not (a==5 or b==5)

This addresses one of the issues raised in the comments of https://github.com/pytorch/pytorch/issues/151127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151171
Approved by: https://github.com/aorenste, https://github.com/eellison
ghstack dependencies: #151170
2025-04-16 08:10:17 +00:00
55595e0c85 Fix Issues in deferring runtime assertions. (#151170)
This PR fixes two bugs:
1) Update self.bound_unbacked_symbols before emitting runtime asserts:
set self.bound_unbacked_symbols before emitting runtime asserts so that runtime asserts depending on the current node are included.

2) In the pass that removes unused graph inputs, we should not remove symbols that are used by runtime assertions.

Address some of the issues in https://github.com/pytorch/pytorch/issues/151127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151170
Approved by: https://github.com/bobrenjc93, https://github.com/eellison
2025-04-16 08:10:17 +00:00
abbca37fe8 [Openreg][PrivateUse1] Enable CI for openreg (#151007)
Changes:
- move test_openreg.py from test/cpp_extensions/open_registration_extension/ to test/
- update README.md for openreg
- enable CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151007
Approved by: https://github.com/albanD
ghstack dependencies: #151005
2025-04-16 07:55:51 +00:00
a9dbbe1aee [OpenReg][PrivateUse1] Refactoring the csrc files of pytorch_openreg (#151005)
As the title stated.

**Changes:**
- Remove unnecessary header file
- Remove unnecessary registry logic around PrivateUse1HooksRegistry, such as TORCH_DECLARE_REGISTRY, C10_DEFINE_REGISTRY, etc.
- Use a static global variable to do initialization instead of call_once

**Next Step:**
Enable test_openreg.py in CI/CD to guard the quality of PrivateUse1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151005
Approved by: https://github.com/albanD
2025-04-16 07:55:50 +00:00
40ce4fb24a [dynamo] context manager/decorator for dynamo config patching during tracing (#150586)
Implement traceable config patching for Dynamo: this enables restricted patching of the Dynamo config, where the user can use a context manager/decorator to change tracing behavior for parts of the code.

The new `dont_skip_tracing` decorator/context manager for ignoring most trace rules is easily implemented with this more generic traceable config patching feature.
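
A hedged usage sketch; the exact public import path (`torch._dynamo.dont_skip_tracing` below) is an assumption based on the description above:
```python
import torch


@torch._dynamo.dont_skip_tracing  # assumed location of the new decorator
def helper(x):
    # with the decorator, this body is traced even if a skip rule would
    # normally apply to the file it lives in
    return x + 1


@torch.compile
def f(x):
    return helper(x) * 2


print(f(torch.randn(4)))
```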

Implementation:
- Create a new specialized context manager class representing a wrapper around torch._dynamo.config.patch
- Dynamo doesn't trace into the context manager but updates config at compile time
- Correctness is based on our correctness for handling supported context managers
- Implementation is inspired by how `GradModeVariable` is implemented.

Previous attempts: https://github.com/pytorch/pytorch/pull/148736 (decorator-only global approach) and https://github.com/pytorch/pytorch/pull/149439 (decorator-only traceback approach)

See https://docs.google.com/document/d/1vWNwKL_jpg-PLopifcaSa338wks3GqSVF4GHRguybGg/edit?tab=t.0 for more details on implementation - including previous approaches.

NOTE: this PR fixes a bug where skipped code objects were not tracked by convert_frame.py, leading to cases where code objects would be automatically skipped even after `torch._dynamo.reset()`. This exposed some latent dynamo-wrapped test failures in CI that previously passed in CI but not locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150586
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
2025-04-16 06:49:58 +00:00
daf2ccf023 [custom ops] Fix destroy function (#151299)
Summary:
D72906445 seemed to cause a SIGABRT when running the test in the test plan. The change I narrowed it down to was where in fake_impls the [`deregister_fake_kernel` no longer calls `lib.destroy`](https://github.com/pytorch/pytorch/pull/150806/files#diff-7fd3f4222276c63b91f3a895530bb5efe137fd23165b48f25afcf3c06a5d2a8fL65-L69).

Calling `lib.destroy` in that handle results in a maximum recursion error: library.destroy calls the handle, which calls back into library.destroy.

So I compared the implementations of `_del_library` and `lib.destroy`, and the main difference seemed to be the deletion of `self.m`. Adding that fixed my issue!

Side note, I feel like we can combine `_del_library` and `library._destroy`? But I won't do it in this diff to make sure we don't break too many things 😅

Test Plan:
`buck test 'fbcode//mode/opt' fbcode//aiplatform/gmpp/bulk_eval/reader/service/tests:reader_service_handler_tests -- --exact 'aiplatform/gmpp/bulk_eval/reader/service/tests:reader_service_handler_tests - aiplatform.gmpp.bulk_eval.reader.service.tests.reader_service_handler_tests.ReaderServiceHandlerTests: test_add_preproc_output_into_queue'`
https://www.internalfb.com/intern/testinfra/testrun/10977524170296078

Differential Revision: D73017613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151299
Approved by: https://github.com/zou3519
2025-04-16 06:18:09 +00:00
585d03fa39 Record how many parameters we're parsing within dynamo (#148508)
This allows us to track how many parameters we have in compilations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148508
Approved by: https://github.com/jansel, https://github.com/anijain2305

Co-authored-by: Sam Larsen <slarsen@meta.com>
2025-04-16 06:15:11 +00:00
b4cee2bf57 [executorch hash update] update the pinned executorch hash (#151280)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151280
Approved by: https://github.com/pytorchbot
2025-04-16 05:39:06 +00:00
107121dfad [AOTInductor] Add interface for user managed buffer in package api. (#151325)
Summary:
https://github.com/pytorch/pytorch/pull/151141
We add an interface for user-managed buffers in the package API.

Test Plan:
Included in commit.

Reviewed By: henrylhtsang

Differential Revision: D72985440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151325
Approved by: https://github.com/angelayi
2025-04-16 04:25:40 +00:00
82200e33b5 Make torch._chunk_cat support non-contiguous inputs (#151263)
Currently, `torch._chunk_cat` only supports contiguous inputs (due to `.view()` usage in `_pad_chunk()` supporting only contiguous tensors). This doesn't work for internal models, where there can be non-contiguous input tensors:

- size=[8192, 16416], stride=[16448, 1]  # stride[0] is larger than size[1]
- size=[1152, 384], stride=[1, 1152]  # column-major tensor

In this PR, we relax the assumption of contiguous input tensors by switching from `.view()` to `.reshape()`. Note that since `.reshape()` tries to use `.view()` under the hood whenever possible, this should not cause regressions for existing use cases.
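
A small self-contained illustration of why the switch matters, using a column-major layout like the second example above:
```python
import torch

# column-major tensor: size=[1152, 384], stride=[1, 1152]
t = torch.randn(384, 1152).t()
print(t.shape, t.stride())  # torch.Size([1152, 384]) (1, 1152)

try:
    t.view(-1)  # .view() requires compatible strides and fails here
except RuntimeError as e:
    print("view failed:", e)

flat = t.reshape(-1)  # .reshape() falls back to a copy when needed
print(flat.shape)     # torch.Size([442368])
```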

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151263
Approved by: https://github.com/BoyuanFeng
2025-04-16 04:18:46 +00:00
30101aa450 [c10d][fr] Add counters for FR dump and reduce its timeout to finish dump before watchdog timeout (#151329)
After https://github.com/pytorch/pytorch/pull/150652, we still see some ranks missing dumps. Upon further investigation, the issue is that the FR dump timed out on its first attempt:
watchdog thread: notify FR dump -> wait for 1 min -> throw watchdog timeout -> notify elastic to kill the process
FR dump thread: receive FR dump signal -> time out after 1 min on the first attempt -> start 2nd attempt -> get killed.

So we want to make the FR dump timeout shorter; in reality, the logs show that the dump finishes within one second. Even assuming a very slow speed like 200KB/s, the usual FR size (1MB at most) takes around 5 seconds to dump, so 15 seconds gives roughly a 3x buffer.

We still let the watchdog sleep for 1 min so that there is enough time for two dumps to time out and for the following checks, like the GIL checker, to execute.

Also, if we get stuck acquiring the GIL or hit a CUDA hang, 15 seconds should be enough to detect that hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151329
Approved by: https://github.com/fegin
2025-04-16 03:48:03 +00:00
3a90fd481e fix test_einsum: use initialized values (#151363)
Summary: `empty` returns uninitialized values, which could be NaNs; thus, the assert_close kept failing in fbcode.

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints_cpu -- --exact 'caffe2/test/inductor:unbacked_symints_cpu - test_einsum_cpu (caffe2.test.inductor.test_unbacked_symints.TestUnbackedSymintsCPU)' --env TORCH_LOGS="+output_code" --print-passing-details --env TORCH_LOGS_FORMAT="%(filename)s:%(lineno)s: %(message)s"
```

Differential Revision: D73067722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151363
Approved by: https://github.com/Camyll

Co-authored-by: Camyll Harajli <camyllh@meta.com>
2025-04-16 03:10:29 +00:00
6124dabd30 [CI][NoOp] Update skip reason for argmin_with_nan (#151374)
Which is https://github.com/pytorch/pytorch/issues/130295 (i.e. torch.compile produces correct results, but eager is not)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151374
Approved by: https://github.com/dcci
2025-04-16 02:33:20 +00:00
ae53510b9e Fix setUpClass() / tearDownClass() for device-specific tests (#151129)
Finishes up the work started in #121686 + adds test

Update: this was not as straightforward as I originally imagined. Context below.

**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.

**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.

Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.

(1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:

070f389745/test/test_ops.py (L218-L221)

(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:

070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)

070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)

To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been setup with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).
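
A hedged, minimal example of the pattern described above (the test body is made up; the point is that `super().setUpClass()` and per-class state now behave as expected):
```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class TestFoo(TestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()        # now resolves through TestFoo as expected
        cls.shared = torch.ones(3)  # stored on the device-specific class

    def test_bar(self, device):
        x = self.shared.to(device)
        self.assertEqual(x.sum().item(), 3.0)


# generates TestFooCPU / TestFooCUDA, which now actually derive from TestFoo
instantiate_device_type_tests(TestFoo, globals())

if __name__ == "__main__":
    run_tests()
```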

**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)

To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
2025-04-16 02:18:42 +00:00
067a7b1d4a Disable -Werror for s390x test module compilation (#150413)
This change should make the nightly test suite green again for s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150413
Approved by: https://github.com/seemethere
2025-04-16 02:15:17 +00:00
aacac88bee [ROCM] Fix in-place aten sum with specialized templated kernels. (#151230)
We noticed a regression when doing aten.sum in-place (a += b) and the output type is not the same as the functor's type.

Co-authored by: Jerry Mannil <jerry.mannil@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151230
Approved by: https://github.com/jeffdaily
2025-04-16 02:07:46 +00:00
cyy
cadd832c19 [1/N] Use std::string_view in torchgen (#146403)
Moves remaining c10::string_view usages to std::string_view

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146403
Approved by: https://github.com/albanD
2025-04-16 01:50:22 +00:00
dd11613f94 [cutlass backend][experimental] Try out presets for cutlass instead of searching all configs (#151255)
Differential Revision: [D72668861](https://our.internmc.facebook.com/intern/diff/D72668861/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151255
Approved by: https://github.com/mlazos
2025-04-16 01:48:06 +00:00
532025fbd0 [cutlass backend][ez] Ban FP32 output dtype from using CUTLASS GEMM backend (#151279)
FP32 not supported: https://github.com/pytorch/pytorch/issues/145952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151279
Approved by: https://github.com/ColinPeppler
2025-04-16 01:12:18 +00:00
8780d18f64 [ONNX] Add a comment for handling bf16/fp8 tensor to numpy conversion (#151371)
Follow up of https://github.com/pytorch/pytorch/pull/151259
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151371
Approved by: https://github.com/titaiwangms
2025-04-16 00:49:38 +00:00
4bbb61812c [BE][1/2] Move original_weights_lookup attribute to constant (#151241)
Summary: As title. Cleaning usages by using global constant.

Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_fx -- --exact 'caffe2/test:quantization_fx - test_keep_original_weights (quantization.fx.test_quantize_fx.TestQuantizeFx)'`

Differential Revision: D72892815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151241
Approved by: https://github.com/Skylion007, https://github.com/hl475
2025-04-16 00:41:25 +00:00
44a522dd78 [BE] Fix extra-semi warning in attention.cpp (#151367)
Introduced by https://github.com/pytorch/pytorch/pull/149512

Before this change, following warning was generated
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/transformers/attention.cpp:452:71: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
  452 | REGISTER_HPU_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_meta);
      |                                                                       ^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151367
Approved by: https://github.com/drisspg
2025-04-16 00:31:45 +00:00
8e6415fd32 [cutlass backend] "Fix" FlexibleLayout (#151284)
So Horace was right, Triton does fix the layout when rendering the template (i.e. roughly at the same time).

You can double check this by running the unit test with the GEMM backend set to "TRITON,CUTLASS". You will notice that the layout is fixed if Triton is among the GEMM backends, but flexible if Triton is not there.

code pointer: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/select_algorithm.py#L927

In the future, we should remove `fix_op_layout` from class CUTLASSGemmTemplate. But maybe we can monitor it for a bit first.

Differential Revision: [D72996143](https://our.internmc.facebook.com/intern/diff/D72996143/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151284
Approved by: https://github.com/ColinPeppler
2025-04-16 00:10:52 +00:00
e55eb5c870 [Cutlass] Integrate EVT codegen into 3x gemm template (#150346)
Previously merged:
* #150345
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150346
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #150344, #150345
2025-04-16 00:08:22 +00:00
3cf0e2d8ec Add inductor standalone_compile API (#150670)
This PR adds a standalone_compile API that does precompilation via caching, to support the vLLM use case in the short term while we work on the longer-term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
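
A hedged usage sketch based only on the signatures above; the import path and exact argument handling are assumptions:
```python
import torch
from torch._inductor import standalone_compile  # assumed import location


def f(x):
    return torch.sin(x) + 1.0


gm = torch.fx.symbolic_trace(f)
example_inputs = [torch.randn(8)]

artifact = standalone_compile(gm, example_inputs)
artifact.save(path="./f_compiled", format="unpacked")
# later / elsewhere: CompiledArtifact.load(path="./f_compiled", format="unpacked")
```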

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-15 23:38:15 +00:00
9917feff50 [ONNX] Produce correct dtypes for bf16/f8 in IR TorchTensor (#151259)
Split the changes from https://github.com/pytorch/pytorch/pull/151069 to address https://github.com/microsoft/onnxscript/issues/2187, where the output np arrays do not have the correct ml_dtypes types as expected.
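
A small sketch of the kind of round-trip involved (assuming the third-party `ml_dtypes` package; illustrative, not the ONNX IR implementation):
```python
import torch
import ml_dtypes  # third-party package providing bfloat16/float8 NumPy dtypes

t = torch.randn(4, dtype=torch.bfloat16)
# NumPy has no native bfloat16, so reinterpret the raw bits via uint16 first
arr = t.view(torch.uint16).numpy().view(ml_dtypes.bfloat16)
print(arr.dtype)  # bfloat16
```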
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151259
Approved by: https://github.com/titaiwangms
2025-04-15 23:21:04 +00:00
331423e5c2 Fix tensorpipe compilation with clang-17 (#151344)
By suppressing `missing-template-arg-list-after-template-kw` warning, which seems to be required to compile Google's libnop, which is in a semi-abandoned state now
```
In file included from /Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/base/variant.h:21:
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:241:30: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  241 |     index_ = value_.template Construct(std::forward<Args>(args)...);
      |                              ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:258:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  258 |     if (!value_.template Assign(TypeTag<T>{}, index_, std::forward<U>(value))) {
      |                          ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:265:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  265 |     if (!value_.template Assign(index_, std::forward<T>(value))) {
      |                          ^
3 errors generated.
```

Fixes https://github.com/pytorch/pytorch/issues/151316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151344
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere
2025-04-15 22:18:06 +00:00
98b1e82ba8 Revert "Fix setUpClass() / tearDownClass() for device-specific tests (#151129)"
This reverts commit bd4cf30e31a2a0b0a57f54c7eedd3a39d5778cbe.

Reverted https://github.com/pytorch/pytorch/pull/151129 on behalf of https://github.com/jbschlosser due to flex attention tests failing ([comment](https://github.com/pytorch/pytorch/pull/151129#issuecomment-2807632119))
2025-04-15 22:07:25 +00:00
e1d8b3f838 [inductor] Check NoneLayout in update_zero_dim_cpu_tensor (#151321)
Summary:
This fixes the error in https://fb.workplace.com/groups/1075192433118967/permalink/1640802133224658/
I tried really hard but I couldn't come up with a test case to repro the issue, but I confirmed with the OP that this issue has been fixed.
```
Traceback (most recent call last):
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 746, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1343, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1232, in codegen_and_compile
    compiled_module = graph.compile_to_module()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2087, in compile_to_module
    return self._compile_to_module()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2095, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2002, in codegen
    self._update_scheduler()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 1996, in _update_scheduler
    self.scheduler = Scheduler(self.operations)
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1954, in __init__
    self._init(nodes)
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1974, in _init
    self.update_zero_dim_cpu_tensor()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 4433, in update_zero_dim_cpu_tensor
    and buffer.get_size() == []
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3903, in get_size
    return [*self.get_layout().size]
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3914, in get_layout
    raise NotImplementedError(type(self.layout).__name__)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NotImplementedError: NoneLayout
```

Test Plan: OP said the issue is fixed

Differential Revision: D72575808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151321
Approved by: https://github.com/BoyuanFeng
2025-04-15 21:58:09 +00:00
4518b30680 Clarify that x and dx are mutually exclusive in torch.trapezoid doc (#151190)
This PR addresses [#151105](https://github.com/pytorch/pytorch/issues/151105) by stating that x and dx are mutually exclusive parameters in torch.trapezoid()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151190
Approved by: https://github.com/soulitzer
2025-04-15 21:42:05 +00:00
630cf46039 [Cutlass] Codegen for EVT Epilogue (#150345)
Previously merged:
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150345
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #150344
2025-04-15 21:31:21 +00:00
27ef3f6cdc [ROCm][CI/CD] Create ROCm6.4 magma tarball (#151345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151345
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-15 21:12:48 +00:00
71e7dcda87 [c10d][fr] Record each individual collective being coalesced (#151238)
During FR recording for coalesced collectives we are not consistent: for P2P ops we log individual collectives into FR, but for non-P2P ops we don't. This PR makes non-P2P ops also log each individual collective into FR, so that we can use the script to check the correctness of every collective that gets coalesced.

The added unit test also addresses the unit-test ask in the comment in https://github.com/pytorch/pytorch/pull/150863?fbclid=IwZXh0bgNhZW0CMTEAAR4a5Rd_JyJlrbKZcacbIv5WX5b4MqBRNn0hpgl-VTSD0eeXRlPZ9Ty_CPOYhQ_aem_ALEG1ibRajwie-rn1B4n5w#pullrequestreview-2751254224.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151238
Approved by: https://github.com/d4l3k, https://github.com/wconstab
ghstack dependencies: #151247
2025-04-15 20:56:37 +00:00
ae648f047c [c10d][fr] Enable FR analysis script for rest of all coalesce op (#151247)
We revisited how coalesced collectives work in https://github.com/pytorch/pytorch/pull/151243 and now want to enable the script to work for the slow path. The change is indeed BC-breaking, but it is needed to make this work, and the API is for internal use only; it is not user facing. For the slow path, each individual collective has its input sizes and output sizes recorded but no state; only the final one has the state ready. We check the correctness of each individual collective one by one, but we don't check the state match for these collectives; we can only check the state match for the last one, which is the work item with the coalesced label.

Added more unit tests for the slow path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151247
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-15 20:53:03 +00:00
f98150fc8e Warn user of existing lock file to avoid infinite waiting (#149382)
Sometimes the Python script doesn't exit normally and the lock file remains in place. In this case, `file_baton.py` may sleep forever waiting for the lock file to be released. This PR adds a warning that shows the path of the existing lock file, so the user can better understand which file to delete when the wait takes too long.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149382
Approved by: https://github.com/soulitzer
2025-04-15 20:25:29 +00:00
bd4cf30e31 Fix setUpClass() / tearDownClass() for device-specific tests (#151129)
Finishes up the work started in #121686 + adds test

Update: this was not as straightforward as I originally imagined. Context below.

**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.

**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.

Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.

(1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:

070f389745/test/test_ops.py (L218-L221)

(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:

070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)

070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)

To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been setup with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).

**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)

To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
2025-04-15 20:13:26 +00:00
d77e0cddfe [Cutlass] Import cutlass python API for EVT (#150344)
This imports the pieces of the cutlass python API that are needed for python EVT tracing. It builds on existing importing for cutlass_library. Once EVT tracing has been added to cutlass_library (should be later this year) this can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150344
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-04-15 20:11:40 +00:00
91923f0ee1 [inductor] disable alignment asserts in fbcode (#151274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151274
Approved by: https://github.com/Mingming-Ding, https://github.com/Microve, https://github.com/eellison
2025-04-15 19:59:54 +00:00
a2632d5241 [HOP] Reworked DispatchKey.Autograd (#151107)
This PR intends to rework the dispatching of the autograd key.
I.e., currently the DispatchKey.Autograd of the HOPs was triggered even if none of the operands of the HOP have `requires_grad=True`. With this rework, autograd is bypassed if none of the operands require gradients and is only invoked if any of the operands require gradients.
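
A hedged sketch of the gating condition (not the actual HOP dispatch code):
```python
import torch
from torch.utils._pytree import tree_flatten


def any_operand_requires_grad(*args, **kwargs) -> bool:
    # only take the Autograd path when this returns True; otherwise bypass it
    flat, _ = tree_flatten((args, kwargs))
    return any(isinstance(a, torch.Tensor) and a.requires_grad for a in flat)
```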

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151107
Approved by: https://github.com/ydwu4
2025-04-15 19:55:46 +00:00
19a33b20c2 [ROCm][CI/CD] create ROCm 6.4 images, part 1, skip magma tarball (#151236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151236
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-15 19:45:15 +00:00
8d5f7ab06c Replace all random is_fbcode imports to environment (#151283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151283
Approved by: https://github.com/masnesral, https://github.com/Skylion007
2025-04-15 19:42:58 +00:00
eea4a7b424 update expected results for comptime benchmark (#151319)
This PR https://github.com/pytorch/pytorch/pull/150594 bumped the benchmark up by ~1%, a bit under our 1.5% "regression" mark.

Modeled this PR after https://github.com/pytorch/pytorch/pull/144274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151319
Approved by: https://github.com/jamesjwu, https://github.com/laithsakka
2025-04-15 19:40:13 +00:00
e45a6a9300 [inductor][test] Disable Triton GEMM backend tests for SM89 (#150485)
Motivation: To deprecate a silent fallback behavior https://github.com/pytorch/pytorch/issues/150390

Problem: On SM89, the Triton GEMM backend isn't working. This seems to be a pre-existing issue. I don't have access to SM89 to debug further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150485
Approved by: https://github.com/xmfan, https://github.com/eellison
2025-04-15 19:03:52 +00:00
f1adf22b5f improve noop elimination for slice and slice_scatter (#151175)
Improves noop elimination for `slice` and `slice_scatter`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151175
Approved by: https://github.com/zou3519
2025-04-15 18:56:50 +00:00
d7050ef48b [CI] Run test_torchinductor for MPS device (#150821)
There are only 118 failures atm, mark them all with xfail to avoid new regressions

Add `xfail_if_mps_unimplemented` decorator to distinguish between tests that call unimplemented eager op vs ones that fail for some other reason.

Added an `aten._scaled_dot_product_attention_math_for_mps` fallback to make test behavior consistent between MacOS-15 (where the fallback is in place) and MacOS-14

Weird MacOS-14 specific skips:
- test_torchinductor.py::GPUTests::test_cat_extern_kernel_mps
- test_torchinductor.py::GPUTests::test_sort_transpose_mps (likely an eager bug)
- test_torchinductor.py::GPUTests::test_unaligned_input_mps

Numerous MacOS-13 skips, including a few eager hard crashes; for example, running `test_torchinductor.py::GPUTests::test_scatter5_mps` causes
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayScatter.mm:309: failed assertion `Rank of destination array (1) must be greater than or equal to inner-most dimension of indices array (3)'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150821
Approved by: https://github.com/ZainRizvi, https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272, #151282, #151288
2025-04-15 18:42:39 +00:00
7e5f6dcf7f Add @requires_multicast_support to test_multimem_all_gather (#151227)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151227
Approved by: https://github.com/jeffdaily
2025-04-15 18:41:12 +00:00
83d88d128d [reland] Make export._trace._WrapperModule work in strict mode (#146919) (#151264)
Summary:

as title

`export._trace._WrapperModule` is used to wrap functions into a Module so we can export the function.

We add `export._wrapper_utils` to `dynamo`'s `MOD_INLINELIST` so dynamo traces into `_WrapperModule`

Fixes https://github.com/pytorch/pytorch/issues/146867

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r wrapper_module
```

Differential Revision: D72986826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151264
Approved by: https://github.com/angelayi
2025-04-15 18:35:34 +00:00
61f127aac5 [Export] fix automatically convert instances of _check(u>=0) to check_is_size() (#148844)
Fixes #148826

Understanding:

1. PyTorch should automatically convert instances of _check(u>=0) to check_is_size()
2. The export mechanism should suggest using check_is_size() instead of _check(u>=0) when applicable (see the sketch below)
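A hedged sketch of the two spellings involved (eager-mode illustration only; the suggestion logic itself lives in export):

```python
import torch

x = torch.tensor([5])
u = x.item()               # a data-dependent value (unbacked under export)
torch._check(u >= 0)       # the spelling users often write
torch._check_is_size(u)    # the spelling export should suggest instead
```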

Changes made:
1. Added a helper function to detect non-negative checks: is_non_negative_check
2. Modified the suggestion logic in _suggest_torch_checks to detect and handle non-negative checks
3. Added unit tests test_is_non_negative_check_function, test_suggest_torch_checks_with_non_negative_check, and test_suggest_torch_checks_with_regular_check

unit tests:

```
(base) sany@sandishs-Laptop pytorch % pytest test/export/test_export.py::TestExport::test_suggest_torch_checks_with_non_negative_check
========================== test session starts ==========================
platform darwin -- Python 3.9.19, pytest-7.3.2, pluggy-1.5.0
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, flakefinder-1.1.0, anyio-4.6.0, rerunfailures-14.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/export/test_export.py .                                        [100%]

=========================== 1 passed in 1.67s ===========================
(base) sany@sandishs-Laptop pytorch % pytest test/export/test_export.py::TestExport::test_suggest_torch_checks_with_regular_check
========================== test session starts ==========================
platform darwin -- Python 3.9.19, pytest-7.3.2, pluggy-1.5.0
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, flakefinder-1.1.0, anyio-4.6.0, rerunfailures-14.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/export/test_export.py .                                        [100%]

=========================== 1 passed in 1.61s ===========================
(base) sany@sandishs-Laptop pytorch % pytest test/export/test_export.py::TestExport::test_is_non_negative_check_function
========================== test session starts ==========================
platform darwin -- Python 3.9.19, pytest-7.3.2, pluggy-1.5.0
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, flakefinder-1.1.0, anyio-4.6.0, rerunfailures-14.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/export/test_export.py .                                        [100%]

=========================== 1 passed in 1.62s ===========================
(base) sany@sandishs-Laptop pytorch %
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148844
Approved by: https://github.com/laithsakka
2025-04-15 17:41:11 +00:00
74f6bc28a7 Revert "Add inductor standalone_compile API (#150670)"
This reverts commit c9aef508984a31f03821eaad381468673ef29c0a.

Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/Camyll due to breaking internal builds with torch module not found error ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2806975267))
2025-04-15 17:35:59 +00:00
c0a0761871 [Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.

# Feature

This PR refactors the existing wrapper codegen into `WrapperLine` subclasses, extending the existing Memory Planning IR into a fully-fledged Wrapper IR. See the diagram below.

![wrapper_ir](https://github.com/user-attachments/assets/a61db21b-caf3-45d2-bfdb-91066ae4ba6b)

The IR currently supports the following ops:
- All existing memory planning IR ops (`AllocateLine`, `FreeIfNotReusedLine`, etc.)
- Reinterpret views (`ReinterpretLine`)
- Kernel definitions (`KernelDefinitionLine`)
- Calls to defined kernels (`KernelCallLine`)
- Calls to extern kernels (`ExternKernelLine`, `ExternKernelAllocLine`)
- Ops with multiple outputs (`MultiOutputLine`)
- Tensor cleanup at the end of a graph (`FreeLine`)
- Leaving comments in code (`CommentLine`)

There are two main motivations for this refactor:
1. Unlike free-form C++ and Python code, Wrapper IR lines provide structured information about what the wrapper code does. This serves as a natural extension point for other types of wrapper codegen. For example, the parent PR generates FX IR from Wrapper IR. Wrapper IR aims to give new backends enough information to generate wrapper code without needing to modify core Inductor files such as `ir.py`.
2. This design will hopefully promote stronger modularity and encapsulation.
   a. Inductor's core compilation passes don't need to worry about whether they're targeting Python, C++, FX or anything else. They can simply focus on generating Wrapper IR, and target-specific code can be refactored into the various backends.
   b. Backends do not need to know about all the details and internal state of `V.graph` IR. For example, they don't need to consider whether a buffer has been removed from the graph when generating code. Wrapper IR will hopefully provide a simpler interface for generating wrapper code, which abstracts away the details of device code.

# Implementation details

The implementation mainly consists of separating direct C++/Python codegen into two phases:
 1. Emit Wrapper IR lines describing what the wrapper code is supposed to do.
 2. Inside the `codegen()` method of each `WrapperLine`, call backend methods which generate pure Python/C++ code using the information stored in the Wrapper IR line. For example, `KernelCallLine` calls `wrapper._generate_kernel_call_helper`, which is overriden by the various Python and C++ backends to generate the final wrapper code.

The main difficulty in implementing this is that we need to be careful that code is generated in the correct order. Wrapper codegen happens in two passes: first we write code into `self.lines` which mainly contains wrapper IR, but can also contain raw Python or C++ lines in some situations. Then, we convert the wrapper IR into the final Python/C++ code in `self.wrapper_call`. Since the same macros may be used in both passes, it's difficult to ensure that code is written to the correct buffer. The easiest solution for this was to implement a context manager overriding the `writeline` method to write to  `self.wrapper_call` after memory planning is finished. This way, `writeline` writes to `self.lines` in the first pass, and `self.wrapper_call` in the second. This obviated the need to pass `code` or `writeline` variables all the way through the call stack, which would have touched most of the existing macros.
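A toy sketch of the two-phase pattern described above; the class and helper names mirror the commit text, but the bodies are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class KernelCallLine:
    # Phase 1 stores structured information about the kernel call...
    kernel_name: str
    call_args: tuple

    def codegen(self, wrapper) -> None:
        # ...and phase 2 delegates to a backend-specific helper that turns
        # those fields into the final Python/C++ wrapper text.
        wrapper._generate_kernel_call_helper(self.kernel_name, self.call_args)
```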

# Test plan

Since this refactor touches all the existing wrapper codegen classes, the existing CI provides good coverage.

The parent PR introduces new tests for the FX IR backend. Among other things, these tests assert that `self.lines` only contains Wrapper IR lines, and no free-form code. While this would not be true of all programs today, the tests suggest that the IR implemented in this PR is sufficient to cover basic PyTorch usage.

# Future directions

These two goals are only partially realized by this PR. There are several important steps which still undergo direct Python/C++ codegen in core files:
 - User-defined Triton kernels.
 - Reinterpret views on outputs, from `gen_output_refs()`. (In the parent PR, the FX converter has a custom way of handling this. This can eventually be ported into Wrapper IR.)
 -  Fallback ops with custom `codegen()` methods, e.g. `ScatterFallback`.
 -  Misc. C++ lines emitted by the various cpp backends, e.g. declaring constants.

These cases will gradually be handled in subsequent PRs, as the Inductor->FX converter expands its coverage. Given that these refactors are pretty tricky to do, it seems wiser to execute them in stages, as opposed to porting everything to Wrapper IR at once. Some Python and C++ codegen still lives in core files such as `ir.py`, as described in previous sections. Hopefully, this PR will serve as a starting point which moves the codebase towards a more modular design. Over time, we can gradually refactor the remaining codegen (mainly in `ir.py`) into backend classes.

One limitation of this PR is that codegen still happens in two phases during `PythonWrapperCodegen`. First, we generate Wrapper IR into `self.lines`, and from there we generate Python or C++ code into `self.wrapper_call`, `self.header`, etc. In the long term, it would be cleaner to split wrapper IR into its own class which doesn't deal with Python/C++ codegen at all. (See the diagram at the top.) That would strictly enforce the boundary between Wrapper IR and Python/C++ wrapper code. However, this would probably be a much larger refactor.

Another limitation of the current code is that the helper functions have a lot of call args. It's also possible to clean this up by passing Wrapper IR ops e.g. `KernelCallLine` into helper functions like `_generate_kernel_call_helper`, since they store all the arguments. However, that change would likely be prone to merge conflicts, so I would like to save it for follow-up PRs if possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150458
Approved by: https://github.com/eellison
2025-04-15 17:28:36 +00:00
8f440a8e70 don't return logits for benchmark script (#151075)
PT2 benchmark scripts has a pattern like:
```
    def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
        cloned_inputs = clone_inputs(inputs)
        self.optimizer_zero_grad(mod)
        with self.autocast(**self.autocast_arg):
            pred = mod(**cloned_inputs)
            loss = self.compute_loss(pred)
        self.grad_scaler.scale(loss).backward()
        self.optimizer_step()
        if collect_outputs:
            return collect_results(mod, pred, loss, cloned_inputs)
        return None
```
for training.

The collect_outputs argument is True only for accuracy testing and it's false for performance testing.

For the HF benchmark suite, a model usually returns a tuple (loss, logits). For performance testing, even though the logits are never used anywhere, dynamo has to keep them due to the control flow.

A few bad things happen if we keep the logits here:
1. The peak memory will be higher, since the logits are large and we cannot release their memory earlier.
2. We cannot do optimizations like chunking for the logits, because the tensor needs to be returned from the pre-grad graph.

Actually I think it's fine to not return logits at all.
- For training cases, checking loss and gradients for accuracy is good enough. It's hard to see two runs have mismatch logits but matching loss/gradients.
- Also, discarding logits as soon as possible for perf benchmarking makes it more fair for us.

On the other hand, it may be interesting to let dynamo support something like dynamo.constexpr (similar to tl.constexpr). A variable annotated as dynamo.constexpr will be specialized at compile time and we can do more optimization (DCE e.g.) at compile time. (A small [repro](https://gist.github.com/shunting314/0912a8947028a904c34f361021b8024d))

Benchmark results here [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2004%20Apr%202025%2018%3A03%3A26%20GMT&stopTime=Fri%2C%2011%20Apr%202025%2018%3A03%3A26%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/204/head&lCommit=fe25dab3f65e1b0e9db0af03f7664af70fcc9c66&rBranch=main&rCommit=55e62ff74ad5614faf80b060c7bfc551e3b7af5a)
- HF 15% (1.51 -> 1.66 compression ratio) peak memory improvement
- I also see 5% (2.74 -> 2.79x) perf win for HF. It could be true. We may generate more efficient kernels since we don't need keep logits and return it from the pre-grad graph. But I'll double check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151075
Approved by: https://github.com/eellison, https://github.com/jansel
2025-04-15 17:13:00 +00:00
7d205b22b5 [profiler][retry] don't disable CUPTI_LAZY_REINIT for cuda >= 12.6 (#151124)
Retry of https://github.com/pytorch/pytorch/pull/150957, which was reverted due to internal meta failures

Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604

Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.

This bug is believed to be fixed in CUDA >= 12.6, so this PR ensures that DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 are only applied if CUDA < 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.
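A hedged sketch of the version gate being described; the escape-hatch name above comes from the commit text, while this helper and its structure are illustrative:

```python
import torch

def _apply_cupti_lazy_reinit_workaround() -> bool:
    if torch.version.cuda is None:
        return False
    major, minor = (int(v) for v in torch.version.cuda.split(".")[:2])
    # The cudagraphs/CUPTI crash is believed fixed in CUDA >= 12.6, so the
    # DISABLE_CUPTI_LAZY_REINIT=1 / CUPTI_TEARDOWN=0 workaround only applies
    # to older toolkits (subject to the escape hatch mentioned above).
    return (major, minor) < (12, 6)
```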

Differential Revision: [D72842114](https://our.internmc.facebook.com/intern/diff/D72842114/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72842114/)!

Differential Revision: [D72842114](https://our.internmc.facebook.com/intern/diff/D72842114)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151124
Approved by: https://github.com/sraikund16
2025-04-15 16:11:49 +00:00
c5de6ff079 Remove ls from filesystem base (#151117)
Summary: A user reported an issue where they inherit from the filesystem base class but don't have the ls method, which was added in the PR https://github.com/pytorch/pytorch/pull/150701#discussion_r2039840129. Removing the method from the base class but keeping it in the derived class.

Test Plan: buck test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage

Differential Revision: D72867722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151117
Approved by: https://github.com/Skylion007, https://github.com/lw
2025-04-15 14:45:20 +00:00
f1f18c75c9 Gracefully handle optree less than minimum version, part 2 (#151257)
If optree is less than the minimum version, we should pretend it doesn't
exist.

The problem right now is:
- Install optree==0.12.1
- `import torch._dynamo`
- This raises an error "min optree version is 0.13.0"

The fix is to pretend optree doesn't exist if it is less than the min
version.
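A minimal sketch of the gating pattern being described, assuming a naive version check at import time (the names here are illustrative, not the actual torch.utils._pytree internals):

```python
import importlib.metadata

MIN_OPTREE_VERSION = (0, 13, 0)

def optree_usable() -> bool:
    try:
        version = importlib.metadata.version("optree")
    except importlib.metadata.PackageNotFoundError:
        return False
    # Naive parse of "X.Y.Z"; an optree older than the minimum is treated
    # as if it were not installed at all.
    parsed = tuple(int(p) for p in version.split(".")[:3])
    return parsed >= MIN_OPTREE_VERSION
```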

There are ways to clean up this PR more (e.g. have a single source of
truth for the version, some of the variables are redundant), but I am
trying to reduce the risk as much as possible for this to go into 2.7.

Test Plan:

I verified the above problem was fixed. Also tried some other things,
like the following, which now gives the expected behavior.
```py
>>> import torch
>>> import optree
>>> optree.__version__
'0.12.1'
>>> import torch._dynamo
>>> import torch._dynamo.polyfills.pytree
>>> import torch.utils._pytree
>>> import torch.utils._cxx_pytree
ImportError: torch.utils._cxx_pytree depends on optree, which is
an optional dependency of PyTorch. To use it, please upgrade your optree package to >= 0.13.0
```

I also audited all non-test callsites of optree and torch.utils._cxx_pytree.
Follow along with me:

optree imports
- torch.utils._cxx_pytree. This is fine.
- [guarded by check] f76b7ef33c/torch/_dynamo/polyfills/pytree.py (L29-L31)

_cxx_pytree imports
- [guarded by check] torch.utils._pytree (changed in this PR)
- [guarded by check] torch/_dynamo/polyfills/pytree.py (changed in this PR)
- [guarded by try-catch] f76b7ef33c/torch/distributed/_functional_collectives.py (L17)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_op_schema.py (L15)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_dispatch.py (L35)
- [guarded by try-catch] f76b7ef33c/torch/_dynamo/variables/user_defined.py (L94)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/experimental/_func_map.py (L14)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151257
Approved by: https://github.com/malfet, https://github.com/XuehaiPan
2025-04-15 13:08:26 +00:00
12cb11a268 [Inductor UT] Refactor FlexAttention UT and add CPU tests (#144953)
This PR extends and refines all rest UTs for CPU and more devices in `test/inductor/test_flex_attention.py`  and `test/inductor/test_flex_decoding.py`, as a follow-up to https://github.com/pytorch/pytorch/pull/141453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144953
Approved by: https://github.com/drisspg
2025-04-15 12:44:49 +00:00
2180e87d7c [fbgemm_gpu] Incorporate Torch DSA (#151148)
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1035

X-link: https://github.com/pytorch/FBGEMM/pull/3950

- Incorporte the PyTorch DSA infrastructure into the FBGEMM kernel launcher
  utility

Test Plan:
```
# Nvidia
buck2 test 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder
buck2 test 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder_with_memcheck
buck2 run 'fbcode//mode/opt'  -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=a100  -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher

# AMD
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder_with_memcheck
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:split_embeddings_utils
```

Differential Revision: D72759030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151148
Approved by: https://github.com/huydhn
2025-04-15 11:34:04 +00:00
70e7b76707 [AOTInductor] Add Python interface for user managed buffer. (#151141)
Summary: Add pybind for user managed buffer in update_constants_buffer.

Test Plan:
Included in commit.
```
python test/inductor/test_aot_inductor.py -k user_managed
```

Differential Revision: D72892310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151141
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire
2025-04-15 09:36:30 +00:00
bd9c436c99 [Intel GPU][PT2E] Register qconv impls to general qconv_pointwise schema (#151092)
# Motivation
Refer to https://github.com/pytorch/pytorch/pull/150751: a general schema for `qconv_pointwise` was added and `qconv2d_pointwise` was removed from callers. This PR registers the XPU backend implementations for this operator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151092
Approved by: https://github.com/EikanWang, https://github.com/guangyey
2025-04-15 08:42:14 +00:00
a756c50315 [Intel GPU] Avoid using fp32 in sdp math path when benchmark performance. (#150996)
sdp on xpu will fall back to the math path in some cases (i.e. training). In the dynamo benchmarks, we prefer to use fp16 for better performance. Although `allow_fp16_bf16_reduction_math_sdp` is under backends.cuda, its implementation applies to all devices.

I didn't add an `if device == xpu` check here; I suppose CUDA devices will not run into the math path anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150996
Approved by: https://github.com/drisspg, https://github.com/EikanWang
2025-04-15 08:08:01 +00:00
ccfce9ae86 Fix score_mod.py dynamic max autotune for backward (#151270)
Same as https://github.com/pytorch/pytorch/pull/148991 but this PR fixes the backward path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151270
Approved by: https://github.com/drisspg, https://github.com/bobrenjc93
2025-04-15 06:33:37 +00:00
afaadce083 [MPSInductor] Adjust memory format detection (#151288)
MPS conv implementation will only yield channels last if input is in channels_last format
Fixes `TestGPUTests.test_conv2d_backward_channels_last` on MacOS-15

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151288
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272, #151282
2025-04-15 06:25:00 +00:00
b8a2824755 [MPS] Fix logit output for half/bfloat (#151282)
Which also fixes MPSInductor pointwise test
TODO: (as followup PRs): get rid of special native_function.yaml dispatches and use stub
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151282
Approved by: https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272
2025-04-15 06:25:00 +00:00
a2f7764507 [Dynamo] Fix the unimplemented_v2 of EventVariable.call_method in ctx_manager.py (#151208)
Changes:
- The `explanations` field should be `str` instead of `tuple`
- Not only `torch.cuda.Event` but also `torch.xpu.Event` can trigger this message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151208
Approved by: https://github.com/Skylion007
2025-04-15 05:26:39 +00:00
9e20a8411b make einsum unbacked friendly (#151032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151032
Approved by: https://github.com/pianpwk
2025-04-15 04:35:17 +00:00
5a51de5ab1 [cutlass backend] Add more logs for cutlass backend benchmark (#150639)
Goal is to have a way to compare if a change make it better or worse.

```
Average edge over aten (max(-edge, 0), higher is better):
triton: 8.596507086950552 (from 6 valid values)
triton_persistent_tma: 9.517193693923307 (from 6 valid values)
cutlass_lvl_default: 3.3234737908691785 (from 6 valid values)
cutlass_lvl_1111: 7.088173348313991 (from 6 valid values)
cutlass_lvl_2222: 7.291869722320318 (from 6 valid values)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150639
Approved by: https://github.com/ColinPeppler
2025-04-15 04:19:51 +00:00
48b4bc1640 [c10d][fr] Enable FR analysis script for all fast-path coalesce op (#151243)
This PR is to enable FR for all coalesce ops for fast path. (batch p2p is enabled in the current script, so we will mainly focus on non-P2P ops). To explain what is fast path, let's revisit how coalesced collective is working today:

For non-P2P coalesced ops, there are several ways to call them (for legacy reasons):

- Way one: Directly call a python API like all_reduce_coalesced in python; this will be deprecated soon.
- Way two: Directly call the API inside PGNCCL, like allreduce_coalesced. Way one eventually calls into this. This is not deprecated and will not be deprecated, IIUC.
- Way three: Using _coalescing_manager in python, like:
```
with _coalescing_manager():
    for i in range(num_colls):
           dist.all_reduce(tensors[i])
```
This way has two path:
   - Fast path: when users call all-reduce, all-gather-into-tensor or reduce-scatter, we will only launch one big collective by calling the api from case 1.
   - Slow path: we call startCoalescing() in the beginning and then a bunch of collectives (each one will generate a FR entry) and then endCoalescing(). Inside startCoalescing(), groupStart() is called and inside endCoalescing(), groupEnd() is then called. So although this is going to be one collective, we call into PGNCCL for each collective coalesced in the slow path case.
   - For uneven all-gather (allgather_v) and reduce-scatter, it follows the pattern mentioned in the slow path. It directly calls the cpp API inside PGNCCL.

This PR addresses the fast path because it is the easy case: we store the collective info on the python side, and we only call into PGNCCL once, so there will be only one work and one FR entry. We can just treat them as regular coalesced collectives.

We add some e2e unit tests for the build_db function so that the change to FR is more thoroughly tested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151243
Approved by: https://github.com/d4l3k, https://github.com/wz337
2025-04-15 04:08:28 +00:00
f66229de2b [dynamo] Remove traceable_tensor_subclasses-related code (#151062)
Since #149792 deprecates `traceable_tensor_subclasses` and it's been
landed for over a week, we can safely remove all the old code that uses
`traceable_tensor_subclasses` (they were primarily for testing purposes
and are equivalent to no-ops now).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151062
Approved by: https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #151060, #151061
2025-04-15 03:55:35 +00:00
6a1499d209 [dynamo] handle tensor subclass with non-classmethod __torch_function__ (#151061)
As title, this patch fixes bugs in
1. emulating `has_torch_function`
2. emulating calling `__torch_function__`
3. building a callable VT for non-classmethod `__torch_function__`

Fixes #120799, #150265, #150848.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151061
Approved by: https://github.com/anijain2305, https://github.com/mlazos
ghstack dependencies: #151060
2025-04-15 03:55:34 +00:00
73129b8974 [dynamo] Properly handle super().some_classmethod(...) (#151060)
Previously we were passing in the instance as first argument to a
`super().some_classmethod(...)` call, but we should've passed in the
type object instead, per semantics of `@classmethod`.
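A plain-Python illustration of the semantics being matched (no Dynamo involved):

```python
class Base:
    @classmethod
    def make(cls):
        return cls

class Child(Base):
    def build(self):
        # Per @classmethod semantics, Python binds the *type* (Child) as the
        # first argument here, not the instance `self`.
        return super().make()

assert Child().build() is Child
```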

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151060
Approved by: https://github.com/Skylion007, https://github.com/mlazos, https://github.com/anijain2305
2025-04-15 03:55:34 +00:00
e178a3aa94 clang-format CUDASymmetricMemory.cu (#151260)
Ported from #146592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151260
Approved by: https://github.com/Skylion007
2025-04-15 02:00:34 +00:00
25803d3a22 Optimize typing in lr_scheduler.py (#151219)
## Changes

- Add typing annotation in `lr_scheduler.py`

## Test Result

```bash
pytest test/optim/test_lrscheduler.py -vv
```

![image](https://github.com/user-attachments/assets/34a91965-ff3a-462a-9ab0-b46ad4b290e9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151219
Approved by: https://github.com/janeyx99
2025-04-15 01:00:13 +00:00
4ede6705b5 test_store: fix timeout for test_queues (#151252)
Fixes #151216, #151215

Previously I forgot to revert the timeout after setting it for the timeout test.

To prevent this in the future I split the test into 3 different tests so timeout testing is isolated.

Test plan:

Stress tested

```
pytest test/distributed/test_store.py -k queue -v -s --minutes 10
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151252
Approved by: https://github.com/XilunWu
2025-04-15 00:44:19 +00:00
263f08e119 [PP] Add schedule visualizer (#150347)
Added a new private file (`_schedule_visualizer.py`) with some helper methods that can be used to visualize the operations of a schedule and plot with matplotlib.

InterleavedZeroBubble(pp_group=4, microbatches=8):
![image](https://github.com/user-attachments/assets/610ba9a8-7d18-4a99-bcad-6f43e5b23c8c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150347
Approved by: https://github.com/kwen2501
2025-04-15 00:38:18 +00:00
070357b61a [MPSInductor] Fix silent correctness in bitcast (#151272)
By using Metal `as_type` which according to documentation does exactly
that:
> Metal adds an as_type<type-id> operator to allow any scalar or vector data type (that is not
a pointer) to be reinterpreted as another scalar or vector data type of the same size. The bits in
the operand are returned directly without modification as the new type. The usual type
promotion for function arguments is not performed.

Using `reinterpret_cast` created a potential silent correctness error when dtypes of different sizes were bitcast to each other
Add an explicit cast to src_type to avoid errors due to type promotion (i.e. something like `(x+1).view(dtype=torch.float16)` would work correctly in eager mode for an int16 dtype, but would fail in compile, as arithmetic operations promote int16 to int32).
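For reference, the eager-mode behavior the compiled kernel needs to match (illustrative values):

```python
import torch

x = torch.tensor([1, 2, 3], dtype=torch.int16)
# Arithmetic keeps int16 in eager mode, and view(dtype=...) reinterprets the
# 16-bit payload as float16 without changing any bits.
y = (x + 1).view(dtype=torch.float16)
```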

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151272
Approved by: https://github.com/dcci
ghstack dependencies: #151224, #151246
2025-04-14 23:39:42 +00:00
508b882513 [dynamo][invoke_subgraph] Use FxGraphModule comparison instead of hashing (#150911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150911
Approved by: https://github.com/zou3519
2025-04-14 23:34:26 +00:00
a24a9c42fb [ROCm] Improve behavior of get_torch_rocm_version helper function on non-ROCm systems. (#151040)
Fixes #150041

Return a zero tuple when ROCm is _not_ supported, similar to what is done for the CUDA version of this function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151040
Approved by: https://github.com/jeffdaily
2025-04-14 22:50:07 +00:00
c9aef50898 Add inductor standalone_compile API (#150670)
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
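A hedged usage sketch based on the signature above; the import location and the exact save/load arguments are assumptions, not the finalized interface:

```python
import torch
from torch._inductor import standalone_compile  # assumed location

def fn(x):
    return torch.sin(x) + 1

gm = torch.fx.symbolic_trace(fn)
example_inputs = [torch.randn(4)]

artifact = standalone_compile(gm, example_inputs)   # options omitted
artifact.save(path="/tmp/fn_artifact", format="binary")
# Later, possibly in another process (CompiledArtifact import path assumed):
# loaded = CompiledArtifact.load(path="/tmp/fn_artifact", format="binary")
```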

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-14 22:00:09 +00:00
4a47dd9b3f Revert "[map] always turn on dynamo for map (#150962)"
This reverts commit a72d56cb6be8c6ded5678b0b98003c90fd1b5a71.

Reverted https://github.com/pytorch/pytorch/pull/150962 on behalf of https://github.com/Camyll due to breaking internal builds {SHORT_REASON} ([comment](https://github.com/pytorch/pytorch/pull/150962#issuecomment-2803006282))
2025-04-14 21:09:22 +00:00
6a77a0a50c Revert "[map] make proxy mode re-dispatch to fake key (#151034)"
This reverts commit ca2e8cd3528635526a3fe09444139ffa748e97be.

Reverted https://github.com/pytorch/pytorch/pull/151034 on behalf of https://github.com/Camyll due to breaking internal builds {SHORT_REASON} ([comment](https://github.com/pytorch/pytorch/pull/150962#issuecomment-2803006282))
2025-04-14 21:09:21 +00:00
070f389745 Mark auto_functionalized HOPs as cacheable (#151194)
Fixes #151188

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151194
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #151193
2025-04-14 20:05:32 +00:00
dea50b0778 Improve sort with non-constant keys error message (#151193)
Fixes https://github.com/pytorch/pytorch/issues/143505

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151193
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/williamwen42
2025-04-14 20:05:32 +00:00
46ce8f7df6 [MPSInductor] Cast halfs to floats (#151246)
To avoid accuracy issues when small reductions are unrolled, cast half to float during the `load` op, as `op_math_t<half>` is indeed float.

This fixes `test_unroll_small_reduction` for reduced precision types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151246
Approved by: https://github.com/dcci
ghstack dependencies: #151224
2025-04-14 19:47:04 +00:00
0a6e1d6b9b Expand docs for nn.functional, and make the wording consistent (#148436)
Expands the docs for the loss functions, and makes the wording consistent.

Fixes #148353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148436
Approved by: https://github.com/albanD
2025-04-14 19:37:12 +00:00
23a3cef5d9 [c10d] Add _allgather_base , reduce_scatter , and _reduce_scatter_base into ProcessGroupMPI to enable FSDP with MPI backend (#150162)
This PR implements _allgather_base, reduce_scatter, and _reduce_scatter_base in the MPI backend (ProcessGroupMPI), enabling support for Fully Sharded Data Parallel (FSDP) in environments that use MPI for distributed communication.

### Context

As noted in https://github.com/pytorch/pytorch/issues/85628, FSDP currently supports only the NCCL backend. Due to this limitation, FSDP cannot run on legacy HPC environments or clusters that rely on MPI.

By implementing just these three collective operations, we can enable FSDP to work with the MPI backend. These collectives are implemented in a similar manner to existing operations such as allgather.

### Testing

We validated this PR using pytorch/build/bin/ProcessGroupMPITest with OpenMPI, and all tests passed successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150162
Approved by: https://github.com/H-Huang
2025-04-14 19:31:38 +00:00
7deed1946f Fix assert_tensor_meta (#150808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150808
Approved by: https://github.com/pianpwk
ghstack dependencies: #150806, #150807
2025-04-14 19:28:54 +00:00
53528440e1 Generate meta kernel with operator profiles (#150807)
Added a context manager, `torch._library.fake_profile.register_fake_profile(op_profiles)`, where given an operator profile, it will generate and register a fake impl for the operator based on the operator profile.

The input to `register_fake_profile` is a dictionary mapping operator name to a set of profiles which describe the inputs and outputs of the operator. Here's an example of a profile for `mylib.foo.default`:
```
"mylib.foo.default": {
    OpProfile(
        args_profile=(
            TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
            TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
        ),
        out_profile=TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
    )
}
```
`foo`'s profile contains only one profile, which says that for 2 input tensors of rank 2, dtype float32, device cpu, we will return one tensor of rank 2, dtype float32, and device cpu.

This will then generate a fake kernel where given 2 input tensors of rank 2 (and the other tensor metadata), we will output one tensor of rank 2 (and the other tensor metadata). If the operator also supports other input ranks, then we can add to the profile for the fake impl to support more input types.

This profile can either be manually written or created by draft-export, and then checked into the codebase.
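A hedged usage sketch based on the description above; the module path and context-manager form come from the commit text, while the export call and module are placeholders:

```python
import torch
from torch._library.fake_profile import register_fake_profile  # per commit text

# `op_profiles` is a mapping like the `mylib.foo.default` example above;
# `MyModule` stands in for a module that calls the custom op.
with register_fake_profile(op_profiles):
    ep = torch.export.export(MyModule(), (torch.randn(2, 3), torch.randn(2, 3)))
```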

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150807
Approved by: https://github.com/zou3519
ghstack dependencies: #150806
2025-04-14 19:28:54 +00:00
901e37515f [ONNX] Fix bfloat16 support in onnx_program callable (#151121)
- Added a test to guard bfloat16. The optimizer incorrectly turns bfloat16 initializers into uint16, but this is not relevant to export logic.
- Fix bfloat16 support in onnx_program callable

Tested with the following with cuda

```py
import torch

class BfloatModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.param = torch.nn.Parameter(torch.tensor(2.0, dtype=torch.bfloat16))

    def forward(self, x):
        return x * torch.tensor(1.0, dtype=torch.bfloat16) * self.param

input = torch.randn(1, 10, dtype=torch.bfloat16)
model = BfloatModel()
onnx_program = torch.onnx.export(model, (input,), dynamo=True, optimize=False, verify=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151121
Approved by: https://github.com/titaiwangms

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-04-14 19:27:29 +00:00
f76b7ef33c Add error check for out variant of tensordot function with requries_grad tensor (#150270)
Fixes #147846. Previously there was no error for the out variant of `tensordot` when `requires_grad=True`. This can cause a potential issue when the out tensor is part of a computation graph.

Enforces that the out variant of tensordot runs only when `requires_grad=True` is not set on its inputs. The change is similar to #117067.
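An illustrative sketch of the pattern that is now rejected (the exact error message is not reproduced here):

```python
import torch

a = torch.randn(3, 4, requires_grad=True)
b = torch.randn(4, 5)
out = torch.empty(3, 5)
# With this change, the out= variant is expected to raise instead of silently
# producing a result that is detached from the autograd graph.
torch.tensordot(a, b, dims=1, out=out)
```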

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150270
Approved by: https://github.com/soulitzer
2025-04-14 18:43:14 +00:00
1f5af12cd9 Using hasattr for _boxed_call is asking for trouble (#151130)
Summary:
There are a number of places in the code checking for the existence of `_boxed_call` instead of checking for a `True` value. This is somewhat dangerous because one would assume that setting it to `None` or `False` would be the same as not setting it (output_code.py does this, for example).

Change `hasattr()` to `getattr(..., False)` for these cases.

Test Plan: unit tests pass

Differential Revision: D72806693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151130
Approved by: https://github.com/Skylion007
2025-04-14 18:36:30 +00:00
6dddd6520d [dynamic shapes] add sym_and, sym_or (#150456)
This has been pretty helpful for the size-oblivious rewrite. Wanted the variadic args version to avoid `sym_or(a, sym_or(b, sym_or(c, d)))` in favor of `sym_or(a, b, c, d)`. Happy to change this to ban the 1-arg version.

This is better than plain and/or because the whole symbolic expression gets preserved, and if we guard on it or defer as a runtime assert, we preserve all branches.
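A hedged sketch of the variadic form; the import location is an assumption based on where the other symbolic-shape helpers live:

```python
import torch
from torch.fx.experimental.symbolic_shapes import sym_or  # assumed location

def check_broadcastable(u0, u1):
    # One flat call instead of sym_or(u0 == 1, sym_or(u1 == 1, u0 == u1));
    # the whole expression stays visible to guards / runtime asserts.
    torch._check(sym_or(u0 == 1, u1 == 1, u0 == u1))
```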

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150456
Approved by: https://github.com/laithsakka
2025-04-14 18:18:06 +00:00
785495ee29 [dynamo][error message] Hint for dict_items as inputs to the compiled region (#151169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151169
Approved by: https://github.com/zou3519
ghstack dependencies: #151164, #151168
2025-04-14 17:38:20 +00:00
3c46808a14 [dynamo] Graph break fixes while tracing inspect module (#151168)
Fixes https://github.com/pytorch/pytorch/issues/139374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151168
Approved by: https://github.com/jansel
ghstack dependencies: #151164
2025-04-14 17:38:20 +00:00
b0bdd76f2e [scan] Autograd with partial gradient support (#146285)
This PR introduces the Autograd feature for scan with partial gradient support. It is a combination of the already opened PRs: https://github.com/pytorch/pytorch/pull/135631 and https://github.com/bohnstingl/pytorch/pull/4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146285
Approved by: https://github.com/ydwu4

Co-authored-by: Yidi Wu <yidi@meta.com>
2025-04-14 17:01:31 +00:00
50abc1ecc4 Super tiny fix typo (#151212)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151212
Approved by: https://github.com/Skylion007
2025-04-14 16:47:40 +00:00
184ac8c7f7 [MPSInductor] Fix noop codegen (#151224)
By adding `pass` in front of the comment for the fake set_device call.
This fixes `TestGPU.test_zero_element_mutation_mps`, which previously
failed with
```
torch._inductor.exc.InductorError: RuntimeError: Failed to import /var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmp2emka_sx/7k/c7kmnwhb363ysalhewglr3cwtej6tiz3t4ppqa4bvhubaokmlprw.py
IndentationError: expected an indented block after 'with' statement on line 38 (c7kmnwhb363ysalhewglr3cwtej6tiz3t4ppqa4bvhubaokmlprw.py, line 40)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151224
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
2025-04-14 16:38:47 +00:00
001695c397 [ROCm][CI] Enable distributed CI on MI300 (#150667)
* Enable distributed CI on MI300 runners, same schedule-based and release-branch triggers as `periodic.yml`; also uses label `ciflow/periodic-rocm-mi300` for triggering on PRs.
* Disabled failing distributed tests on MI300 via Github issues: [151077](https://github.com/pytorch/pytorch/issues/151077), [151078](https://github.com/pytorch/pytorch/issues/151078), [151081](https://github.com/pytorch/pytorch/issues/151081), [151082](https://github.com/pytorch/pytorch/issues/151082), [151083](https://github.com/pytorch/pytorch/issues/151083), [151084](https://github.com/pytorch/pytorch/issues/151084), [151085](https://github.com/pytorch/pytorch/issues/151085), [151086](https://github.com/pytorch/pytorch/issues/151086), [151087](https://github.com/pytorch/pytorch/issues/151087), [151088](https://github.com/pytorch/pytorch/issues/151088), [151089](https://github.com/pytorch/pytorch/issues/151089), [151090](https://github.com/pytorch/pytorch/issues/151090), [151153](https://github.com/pytorch/pytorch/issues/151153)
* Disable failing distributed tests via `skipIfRocm`: ea9315ff95

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150667
Approved by: https://github.com/jeffdaily
2025-04-14 16:19:04 +00:00
cyy
eb19f5abab [2/N] Use internal linkage in aten C++ files (#151070)
Turn functions and variables into static ones if they are not used outside their cpp files. In some cases, missing header inclusions are added. In other cases, unused functions are removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151070
Approved by: https://github.com/Skylion007
2025-04-14 16:07:17 +00:00
24b3ab9255 Revert "Add inductor standalone_compile API (#150670)"
This reverts commit bbc5fe850454df6860814ab77a1f3a4ca3698157.

Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/albanD due to Broke profiler test ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2802067144))
2025-04-14 15:22:33 +00:00
d99236b68c Optimize cdist param description (#151178)
Fixes #151101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151178
Approved by: https://github.com/soulitzer
2025-04-14 13:53:10 +00:00
8497491f38 [ez] remove unused arg in _create_wrapped_callback (#151179)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151179
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #150753, #150754, #150755, #150828
2025-04-14 12:54:23 +00:00
d5a19e4525 [ez] dynamo fix typo in comment (#150828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150828
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #150753, #150754, #150755
2025-04-14 10:09:28 +00:00
5eebcb991a Add scripts to generate plots of LRSchedulers (#149189)
Fixes #92007

## Changes

- Add script to generate plots for `lr_scheduler`
- Add plots to `lr_scheduler` docs
- Add example section if it missing in `lr_scheduler` docs

## Test Result

### LambdaLR

![image](https://github.com/user-attachments/assets/37fc0894-e2ec-48f2-a2d6-3514e51e1ea2)

### MultiplicativeLR

![image](https://github.com/user-attachments/assets/2122b3a0-a4ce-42c7-bb45-559c1fc73e0f)

### StepLR

![image](https://github.com/user-attachments/assets/47bc9d96-4b60-4586-a000-f213583bbe8f)

### MultiStepLR

![image](https://github.com/user-attachments/assets/c822b849-d5be-4b94-aa7a-0017a2c9ff15)

### ConstantLR

![image](https://github.com/user-attachments/assets/83107cdd-7b00-44a6-b09d-e8ee849b4a12)

### LinearLR

![image](https://github.com/user-attachments/assets/60190105-691a-4101-8966-5b0c396093a4)

### ExponentialLR

![image](https://github.com/user-attachments/assets/dfcbcbca-89e5-4a2f-b1bd-33e25d2405ec)

### PolynomialLR

![image](https://github.com/user-attachments/assets/7c3d4fce-c846-40a0-b62e-f3e81c7e08bd)

### CosineAnnealingLR

![image](https://github.com/user-attachments/assets/26712769-dde9-4faa-b61b-e23c51daef50)

### ChainedScheduler

![image](https://github.com/user-attachments/assets/20734a8b-e939-424f-b45a-773f86f020b1)

### SequentialLR

![image](https://github.com/user-attachments/assets/2cd3ed67-2a0a-4c42-9ad2-e0be090d3751)

### ReduceLROnPlateau

![image](https://github.com/user-attachments/assets/b77f641e-4810-450d-b2cd-8b3f134ea188)

### CyclicLR

![image](https://github.com/user-attachments/assets/29b8666f-41b3-45e4-9159-6929074e6108)

### OneCycleLR

![image](https://github.com/user-attachments/assets/d5b683ef-41e8-4ca8-9fe8-0f1e6b433866)

### CosineAnnealingWarmRestarts

![image](https://github.com/user-attachments/assets/1d45ea80-dea8-494d-a8ab-e9cfc94c55d6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149189
Approved by: https://github.com/janeyx99
2025-04-14 09:53:38 +00:00
5a64476ed6 [Easy] Add output_size in forward method of ConvTranspose2d (#150609)
Fixes #74593

Add description for `forward` in [ConvTranspose2d](https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html) doc

## Test Result

![image](https://github.com/user-attachments/assets/eebad7a2-f782-4219-9756-344e0f34fada)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150609
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2025-04-14 09:53:22 +00:00
01f226bfb8 Add check for ctc_loss targets param (#150981)
Fixes #150835

## Test Result

```python
# cuda
>>> import torch
>>> import torch.nn.functional as F
>>> device = "cuda" # "cpu" is fine
>>> num_classes = 4
>>> log_probs = torch.rand(0, 0, num_classes, device=device)
>>> targets = torch.tensor([], device=device, dtype=torch.long)
>>> input_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> target_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> result = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, reduction='none')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3079, in ctc_loss
    return torch.ctc_loss(
           ^^^^^^^^^^^^^^^
RuntimeError: log_probs tensor must not be empty

# cpu
>>> device = "cpu"
>>> num_classes = 4
>>> log_probs = torch.rand(0, 0, num_classes, device=device)
>>> targets = torch.tensor([], device=device, dtype=torch.long)
>>> input_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> target_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> result = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, reduction='none')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3079, in ctc_loss
    return torch.ctc_loss(
           ^^^^^^^^^^^^^^^
RuntimeError: log_probs tensor must not be empty

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150981
Approved by: https://github.com/eqy
2025-04-14 07:24:30 +00:00
bbc5fe8504 Add inductor standalone_compile API (#150670)
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-14 07:07:10 +00:00
189bc9283e [ez] move GuardsContext code comment to the right place (#150755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150755
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #150753, #150754
2025-04-14 07:03:23 +00:00
9757092aed [executorch hash update] update the pinned executorch hash (#151195)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151195
Approved by: https://github.com/pytorchbot
2025-04-14 05:46:54 +00:00
0d09a33819 [Attention] Always pad in preprocess_mask to avoid recompilations (#150403)
Motivation: for the following script:

```
// demo.py
import torch
import json
from transformers import BertModel, BertConfig

CONFIG = """
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.6.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
"""

config = json.loads(CONFIG)
bloom_config = BertConfig(**config)
model = BertModel(bloom_config).half().cuda()

torch.compiler.reset()
torch.cuda.empty_cache()
compiled_fn = torch.compile(model)
vocab_size = 30522

for b in range(1, 3):
    for s in range(1, 10):
        print(f"🚀 {b} {s}")
        input_ids = torch.randint(0, vocab_size, (b, s)).cuda()
        attention_mask = torch.ones(b, s).cuda()

        with torch.no_grad():
            out = compiled_fn(input_ids, attention_mask).last_hidden_state
```

when we run it with:

```
time TORCH_LOGS=recompiles python demo.py
```

We can see there are 7 recompilations and it takes 2 mins (fresh build) or 1 min (cached build)  in my machine.

One root cause of the recompilations is that there are guards checking the alignment of the inputs (see the patch), so there are unexpected recompilations for the `(1, 4)`, `(1, 8)`, `(2, 4)` and `(2, 8)` inputs.

In this patch, we always pad the inputs if we don't know their shape at compilation time, to avoid the guards on alignment. It is fine to always pad the tensor; it won't change the semantics.
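A hedged sketch of the padding idea (the alignment constant and helper are assumptions for illustration, not the actual patch):

```python
import torch
import torch.nn.functional as F

ALIGN = 8  # assumed alignment that the removed guards cared about

def pad_attention_mask(mask: torch.Tensor) -> torch.Tensor:
    seq_len = mask.size(-1)
    padded_len = (seq_len + ALIGN - 1) // ALIGN * ALIGN
    # Padding unconditionally leaves no data-dependent alignment guard to
    # specialize on, so dynamic sequence lengths no longer recompile.
    return F.pad(mask, (0, padded_len - seq_len))
```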

Now there are only 3 recompilations and it takes 1 min (fresh build) and 17s (cached build) in my machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150403
Approved by: https://github.com/drisspg
2025-04-14 04:18:22 +00:00
9458b83729 [HPU] Add HPU as a supported device for NestedTensor (#148659)
This change enables basic NestedTensor operations on HPU, fixing the runtime error when creating a NestedTensor on HPU.

- Extended `NestedTensorImpl` to recognize `hpu` as a valid storage device.
- Added `NestedTensorHPU` to `DispatchKey` parsing in `DispatchKey.cpp`.
- Updated `torchgen/model.py` to include `NestedTensorHPU` in `dispatch_keys`.
- Modified `native_functions.yaml` to enable `NestedTensorHPU` support for various ops.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148659
Approved by: https://github.com/jeromean, https://github.com/albanD, https://github.com/sujoysaraswati
2025-04-14 03:42:34 +00:00
9aca00102f [ez]][dynamo] remove useless super().__init__() (#150754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150754
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #150753
2025-04-14 03:37:42 +00:00
101c4f482a Docs: Fix typos in the Symbolic Numbers docstrings (#151181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151181
Approved by: https://github.com/soulitzer
2025-04-14 01:46:02 +00:00
ddfc14b3ae [MPS] Fix where (#151176)
Fixes #150967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151176
Approved by: https://github.com/kulinseth, https://github.com/malfet
2025-04-13 20:44:50 +00:00
8494d5582a Propagate callable parameter types using ParamSpec (#142306) (#151014)
Partially addresses #142306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151014
Approved by: https://github.com/Skylion007
2025-04-13 20:38:11 +00:00
3f0931b1de [ez][dynamo] some code movement (#150753)
`optimize_assert` already does the lookup for `backend` and
`backend_ctx_ctor`. This simply moves the lookups within `optimize`
lower so we don't end up calling these functions twice unnecessarily
in the `optimize_assert` path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150753
Approved by: https://github.com/anijain2305, https://github.com/jansel
2025-04-13 15:44:42 +00:00
b0810168a3 Generalize poison fork logic for each device backend (#144664)
# Motivation
Generalize the poison_fork code to make it reusable across different devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-04-13 09:54:30 +00:00
304633152c Clean up duplicated code in lr_scheduler (#150984)
## Changes

- Remove duplicated code in `ReduceLROnPlateau`
- Remove redundant `noqa` comment

## Test Result

```bash
pytest test/optim/test_lrscheduler.py
```

![image](https://github.com/user-attachments/assets/37f91f31-0e77-4abf-9dd1-75538c0f0792)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150984
Approved by: https://github.com/janeyx99
2025-04-13 09:18:50 +00:00
b59f3d3ae0 [Intel GPU] skip a cuda api call in amp to save some host overhead on xpu (#151111)
This can save ~0.2ms on non-CUDA devices by skipping the call to `amp_definitely_not_available()`. It can improve small models in torchbench, like lennard_jones, on xpu by 10% in both eager and inductor in the dynamo benchmarks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151111
Approved by: https://github.com/soulitzer
2025-04-13 06:37:07 +00:00
1c5619ef9c [DTensor] Add DTensor redistribute fwd/bwd datatype conversion to enable SimpleFSDP mixed precision training (#150740)
As titled, this pr adds additional `forward_dtype` and `backward_dtype` conversion in DTensor `redistribute` API to enable SimpleFSDP's mixed precision training.

In the forward pass, the DTensor can be configured to be cast to `forward_dtype`; in the backward pass, the DTensor can be configured to be cast to `backward_dtype`.

1. **Correctness**: The end-to-end SimpleFSDP mixed precision training integration has been proved to work properly in the PR from this fork: https://github.com/tianyu-l/pytorch_intern24/pull/20. We are now migrating the code to official PyTorch DTensor.

2. **Example Usage**: There is an example in TorchTitan's SimpleFSDP implementation: https://github.com/pytorch/torchtitan/pull/1060.

In the example below, a DTensor `x` is all-gather'ed along the `self.compute_placements`, with datatype cast to `self.param_dtype`. In the backward pass, additionally, the computed gradients are reduce-scatter'ed along the `self.grad_placements`, with datatype cast to `self.reduce_dtype`.

```python
output = x.redistribute(
        placements=self.compute_placements,
        forward_dtype=self.param_dtype,
        backward_dtype=self.reduce_dtype,
).to_local(grad_placements=self.grad_placements)
```

Under the hood, in `class Redistribute(torch.autograd.Function):`, the `forward` function first takes `x`'s local tensor and converts it to `forward_dtype` before all-gathering `x`.

The `backward` function takes `grad_output` and converts it to `backward_dtype` before reduce-scattering `grad_output`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150740
Approved by: https://github.com/tianyu-l
2025-04-13 05:49:03 +00:00
00c6caaf3d [executorch hash update] update the pinned executorch hash (#150722)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150722
Approved by: https://github.com/pytorchbot
2025-04-13 05:37:33 +00:00
587aec2b4f [dynamo][nn_module] Use method.__self__ to find source for patched methods (#151164)
Fixes https://github.com/pytorch/pytorch/issues/137476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151164
Approved by: https://github.com/jansel
2025-04-13 04:50:19 +00:00
7b1a2373e8 [dynamo][super variable] Fix bug to use correct source (#151154)
Fixes https://github.com/pytorch/pytorch/issues/150994

We should cherry-pick to 2.7 branch if possible, because this breaks torch.compile on some HF models. Look at the issue referenced here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151154
Approved by: https://github.com/jansel
2025-04-13 04:48:52 +00:00
8157e76b79 Revert "[Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)"
This reverts commit fe7f425de7b76ef33d308d0a03779b97a914d186.

Reverted https://github.com/pytorch/pytorch/pull/150458 on behalf of https://github.com/clee2000 due to broke a lot of tests internally? D72906459 ([comment](https://github.com/pytorch/pytorch/pull/150458#issuecomment-2799578597))
2025-04-13 03:52:42 +00:00
67188cd38d [Testing] Skip test_unspec_inputs_float64_mps (#151167)
As the backend does not support float64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151167
Approved by: https://github.com/dcci
ghstack dependencies: #151166
2025-04-13 00:41:51 +00:00
d289d1177c [CI] Fix GPUTests.test_scheduler_vertical_fusion1 (#151166)
By enabling the test_operators on MPS device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151166
Approved by: https://github.com/dcci
2025-04-13 00:41:51 +00:00
9699cc3eb9 [MPSInductor] Fix larger-than-threadgroup Welford reductions (#151152)
By using `welford_combine` primitive in the loop
This fixes `GPUTests.test_multilayer_var_lowp_mps`
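
For reference, a pure-Python sketch of the pairwise Welford combine primitive referenced here (mirroring the triton_helpers logic, not the Metal code):

```python
def welford_combine(mean_a, m2_a, weight_a, mean_b, m2_b, weight_b):
    # Merge two partial (mean, M2, count) accumulators (Chan et al.).
    delta = mean_b - mean_a
    new_weight = weight_a + weight_b
    w_frac = weight_b / new_weight if new_weight else 0.0
    new_mean = mean_a + delta * w_frac
    new_m2 = m2_a + m2_b + delta * delta * weight_a * w_frac
    return new_mean, new_m2, new_weight

# combining [1, 2] and [3, 4]: mean 2.5, M2 5.0 (population variance 1.25)
print(welford_combine(1.5, 0.5, 2, 3.5, 0.5, 2))
```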

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151152
Approved by: https://github.com/jansel
ghstack dependencies: #151042, #150824, #151151
2025-04-12 21:44:51 +00:00
7762bddd87 Revert "[MPSInductor] Fix larger-than-threadgroup Welford reductions (#151152)"
This reverts commit 71073caa00836c23e3fc7fcfe1d69b77ffb9d9c9.

Reverted https://github.com/pytorch/pytorch/pull/151152 on behalf of https://github.com/malfet due to Another lint failure ([comment](https://github.com/pytorch/pytorch/pull/151152#issuecomment-2799027274))
2025-04-12 20:27:48 +00:00
3dcb46c30e [easy] Add cache bypass traceback information to cache_info on autograd_cache_bypass (#151025)
This will help us better debug pickling errors, etc, in internal models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151025
Approved by: https://github.com/masnesral
2025-04-12 19:56:32 +00:00
9d4de265db [AMD] Block mem efficient attention for FP32 in CK backend (#151132)
Summary: CK doesn't support FP32 attention, but aotriton does. If we prefer CK and the input dtype is FP32, memory-efficient attention would be selected even though CK can't run it, so we exclude memory-efficient attention and pick the math backend instead.

Differential Revision: D72880985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151132
Approved by: https://github.com/yoyoyocmu
2025-04-12 19:36:20 +00:00
71073caa00 [MPSInductor] Fix larger-than-threadgroup Welford reductions (#151152)
By using `welford_combine` primitive in the loop
This fixes `GPUTests.test_multilayer_var_lowp_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151152
Approved by: https://github.com/jansel
ghstack dependencies: #151042, #150824, #151151
2025-04-12 19:16:33 +00:00
3b86cb8dff [MPSInductor][BE] Implement reduction caching (#151151)
This avoids double/triple invocation of Welford reductions when both the
mean and the standard deviation must be returned.

The code has been copy-n-pasted from the Halide implementation:
575f348965/torch/_inductor/codegen/halide.py (L1189-L1191)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151151
Approved by: https://github.com/jansel
ghstack dependencies: #151042, #150824
2025-04-12 19:16:33 +00:00
2653498ff3 [Openreg][PrivateUse1] Refactor csrc files of Pytorch_openreg (#151004)
I want to format and refactor the csrc file of pytorch_openreg. To make the code review clearer and easier to understand, I divide the code refactoring into two parts:

- Part 1: Code formatting
- Part 2: Code refactoring and optimization (Next PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151004
Approved by: https://github.com/albanD
ghstack dependencies: #151000
2025-04-12 17:22:28 +00:00
c181403063 [Openreg][PrivateUse1] Improve openreg module capabilities (#151000)
----

- Add more functionalities for openreg in openreg module
- Remove related functionalities from test_cpp_extensions_open_device_registration.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151000
Approved by: https://github.com/albanD
2025-04-12 17:21:35 +00:00
be24e7b4b4 [dynamo] Use sentinel value for guard filter. (#151131)
Summary: `None` can collide with the real values in the scope, so we should use a separate value. Also added "has_value" to the struct so that it's more clear whether the value is absent or not.
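
For illustration, a generic sketch of the sentinel pattern (the names here are made up, not the actual dynamo structs):

```python
_MISSING = object()  # unique sentinel: cannot collide with any real value

def lookup(scope: dict, name: str):
    value = scope.get(name, _MISSING)
    has_value = value is not _MISSING
    return has_value, (value if has_value else None)

print(lookup({"x": None}, "x"))  # (True, None)  -- None is a real stored value
print(lookup({"x": None}, "y"))  # (False, None) -- genuinely absent
```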

Test Plan: CI

Differential Revision: D72881300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151131
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-04-12 15:29:57 +00:00
5b16a0704e Fix license check for setuptools>=77 (#151158)
Fixes #151157

See issue for more information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151158
Approved by: https://github.com/malfet
2025-04-12 13:41:12 +00:00
7dd2ed1197 [dtensor] add op support for torch._grouped_mm (#151072)
This PR would make TP work with Grouped MM in MoE implementations like https://github.com/pytorch/torchtitan/pull/1084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151072
Approved by: https://github.com/wanchaol, https://github.com/wwwjn
2025-04-12 07:07:44 +00:00
0c59a031c8 [OpenReg][PrivateUse1] add device context for OpenReg Module (#150997)
Add device context support for the OpenReg module, which some tests, such as
``torch.serialization.default_restore_location``, depend on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150997
Approved by: https://github.com/albanD
2025-04-12 06:32:30 +00:00
3e9f4f3f78 docs: allow empty targets tensor in ctc_loss (#151080)
Allow an empty targets tensor in ctc_loss when target_lengths are zero, as described in the issue.

Fixes #150995
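
A small runnable sketch of the case the docs now describe (assuming, per the issue, that zero-length targets are accepted):

```python
import torch
import torch.nn.functional as F

T, N, C = 50, 2, 20                                 # time steps, batch, classes
log_probs = torch.randn(T, N, C).log_softmax(2)
targets = torch.zeros(0, dtype=torch.long)          # empty targets tensor
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.zeros(N, dtype=torch.long)   # all target lengths are zero

loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss)
```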

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151080
Approved by: https://github.com/albanD
2025-04-12 05:26:54 +00:00
2f899f07aa Revert "Make export._trace._WrapperModule work in strict mode (#146919)"
This reverts commit dad5e5e2622c82ca272290225abe16ee461d9ac9.

Reverted https://github.com/pytorch/pytorch/pull/146919 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/14415686353/job/40431799827 ([comment](https://github.com/pytorch/pytorch/pull/146919#issuecomment-2798446930))
2025-04-12 04:12:36 +00:00
dad5e5e262 Make export._trace._WrapperModule work in strict mode (#146919)
Summary:
as title

`export._trace._WrapperModule` is used to wrap functions into a Module so we can export the function.

We add `export._wrapper_utils` to `dynamo`'s `MOD_INLINELIST` so dynamo traces into `_WrapperModule`

Fixes https://github.com/pytorch/pytorch/issues/146867

Test Plan:
```
 buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r wrapper_module
```

Differential Revision: D69434316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146919
Approved by: https://github.com/angelayi
2025-04-12 03:22:08 +00:00
19b76bd873 hack to try to fix not empty triton dir (#151119)
Differential Revision: D72741938

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151119
Approved by: https://github.com/hl475, https://github.com/muchulee8, https://github.com/Skylion007
2025-04-12 03:21:41 +00:00
c1470d4dc4 [graph partition] support graphsafe_run_with_rng_state (#150958)
Prior to this PR, `rng_state` is in `V.graph.graph_inputs` but not in the read_writes of any IRNode. As a result, it is not identified as a partition input:
```python
def partition_0(args):
    primals_2, primals_1 = args
    ...
    buf0 = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype=torch.float32, device=device(type='cuda', index=1), pin_memory=False, rng_state=fwd_rng_state_0)
    # <----- access fwd_rng_state_0 but it's not an input
    ...

def call(self, args):
    primals_1, primals_2, fwd_rng_state_0 = args
    ...
    partition0_args = [primals_2, primals_1]
    (buf2, primals_2, primals_1) = self.partitions[0](partition0_args)
     # <---- fwd_rng_state_0 is graph_inputs but is not passed to partitions[0]
     ...
```

This PR fixes this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150958
Approved by: https://github.com/eellison
2025-04-12 03:17:08 +00:00
397d37acc5 [MPSInductor] Naive welford_reduce implementation (#150824)
Literal Python-to-Metal translation of
85549fe6de/torch/_inductor/runtime/triton_helpers.py (L217-L225)

Fixed missing barrier in `welford_combine`
And this is sufficient to make `GPUTests.test_batch_norm_2d_2_mps` pass
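
For context, a pure-Python sketch of the single-element Welford update being translated here (illustrative only; the real code operates on Metal/Triton vectors):

```python
def welford_update(value, mean, m2, weight):
    # One step of Welford's online mean/variance accumulation.
    weight += 1
    delta = value - mean
    mean += delta / weight
    m2 += delta * (value - mean)
    return mean, m2, weight

mean = m2 = weight = 0.0
for v in [1.0, 2.0, 3.0, 4.0]:
    mean, m2, weight = welford_update(v, mean, m2, weight)
print(mean, m2 / weight)  # 2.5 1.25
```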

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150824
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151042
2025-04-12 03:11:38 +00:00
32f0f414ab Add some autograd producer consumer stream sync tests (#150952)
Thanks @ngimel and @albanD for some ideas on test cases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150952
Approved by: https://github.com/albanD
2025-04-12 02:44:09 +00:00
397b7f9b82 [custom ops] Override fake registration (#150806)
Added a flag, `allow_override`, to allow overriding existing kernel implementations in `torch.library.register_fake` and `library.impl`. The default is False: if a user tries to register a kernel to a dispatch key that already contains a kernel, it errors. This flag doesn't apply to CustomOpDefs, where overriding a fake kernel is already allowed.
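
A hedged sketch of how the flag could be used (the keyword name and placement are taken from this description, not from released documentation):

```python
import torch

torch.library.define("mylib::add_one", "(Tensor x) -> Tensor")

@torch.library.impl("mylib::add_one", "CPU")
def _(x):
    return x + 1

@torch.library.register_fake("mylib::add_one")
def _(x):
    return torch.empty_like(x)

# A second registration would normally error; per this PR the new flag lets it
# replace the existing fake kernel instead (keyword assumed from the description).
@torch.library.register_fake("mylib::add_one", allow_override=True)
def _(x):
    return torch.empty_like(x)
```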

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150806
Approved by: https://github.com/zou3519
2025-04-12 02:43:47 +00:00
77407b38a9 Revert "[MPSInductor] Naive welford_reduce implementation (#150824)"
This reverts commit 575f348965abe8ea428eba7098f67ec9764a7f9a.

Reverted https://github.com/pytorch/pytorch/pull/150824 on behalf of https://github.com/malfet due to Linter fails again, landrace this time? ([comment](https://github.com/pytorch/pytorch/pull/150824#issuecomment-2798392241))
2025-04-12 02:22:22 +00:00
f6e9e064a7 [CI][CUDA] xfail grouped gemm unit tests on blackwell (#150982)
On SM100OrLater, expect failures like:

RuntimeError: torch._grouped_mm is only supported on CUDA devices with compute capability = 9.0

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_grouped_gemm_3d_2d_strided_False_a_row_major_True_b_row_major_False_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

```
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0005s] (Issue with numpy versi...) [  2%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [  4%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [  6%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [  8%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 10%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 12%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 14%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 16%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versi...) [ 18%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 20%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 22%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 25%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 27%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 29%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 31%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 33%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0002s] (Issue with numpy versi...) [ 35%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 37%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 39%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 41%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 43%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 45%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 47%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 50%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versi...) [ 52%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 54%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 56%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 58%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 60%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 62%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 64%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 66%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_False_strided_False_cuda XFAIL [0.8166s]                                        [ 68%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_False_strided_True_cuda XFAIL [0.0017s]                                         [ 70%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_True_strided_False_cuda XFAIL [0.0012s]                                         [ 72%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_True_strided_True_cuda XFAIL [0.0012s]                                          [ 75%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_False_strided_False_cuda XFAIL [0.0033s]                                        [ 77%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_False_strided_True_cuda XFAIL [0.0012s]                                         [ 79%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_False_cuda XFAIL [0.0015s]                                         [ 81%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_True_cuda XFAIL [0.0012s]                                          [ 83%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_False_strided_False_cuda XFAIL [0.0012s]                                        [ 85%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_False_strided_True_cuda XFAIL [0.0012s]                                         [ 87%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_True_strided_False_cuda XFAIL [0.0011s]                                         [ 89%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_True_strided_True_cuda XFAIL [0.0012s]                                          [ 91%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_False_strided_False_cuda XFAIL [0.0014s]                                        [ 93%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_False_strided_True_cuda XFAIL [0.0012s]                                         [ 95%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_True_strided_False_cuda XFAIL [0.0011s]                                         [ 97%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_True_strided_True_cuda XFAIL [0.0011s]                                          [100%]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150982
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-04-12 01:53:12 +00:00
fe7f425de7 [Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.

# Feature

This PR refactors the existing wrapper codegen into `WrapperLine` subclasses, extending the existing Memory Planning IR into a fully-fledged Wrapper IR. See the diagram below.

![wrapper_ir](https://github.com/user-attachments/assets/a61db21b-caf3-45d2-bfdb-91066ae4ba6b)

The IR currently supports the following ops:
- All existing memory planning IR ops (`AllocateLine`, `FreeIfNotReusedLine`, etc.)
- Reinterpret views (`ReinterpretLine`)
- Kernel definitions (`KernelDefinitionLine`)
- Calls to defined kernels (`KernelCallLine`)
- Calls to extern kernels (`ExternKernelLine`, `ExternKernelAllocLine`)
- Ops with multiple outputs (`MultiOutputLine`)
- Tensor cleanup at the end of a graph (`FreeLine`)
- Leaving comments in code (`CommentLine`)

There are two main motivations for this refactor:
1. Unlike free-form C++ and and Python code, Wrapper IR lines provide structured information about what the wrapper code does. This serves as a natural extension point for other types of wrapper codegen. For example, the parent PR generates FX IR from Wrapper IR. Wrapper IR aims to give new backends enough information to generate wrapper code without needing to modify core Inductor files such as `ir.py`.
2. This design will hopefully promote stronger modularity and encapsulation.
   a. Inductor's core compilation passes don't need to worry about whether they're targeting Python, C++, FX or anything else. They can simply focus on generating Wrapper IR, and target-specific code can be refactored into the various backends.
   b. Backends do not need to know about all the details and internal state of `V.graph` IR. For example, they don't need to consider whether a buffer has been removed from the graph when generating code. Wrapper IR will hopefully provide a simpler interface for generating wrapper code, which abstracts away the details of device code.

# Implementation details

The implementation mainly consists of separating direct C++/Python codegen into two phases:
 1. Emit Wrapper IR lines describing what the wrapper code is supposed to do.
 2. Inside the `codegen()` method of each `WrapperLine`, call backend methods which generate pure Python/C++ code using the information stored in the Wrapper IR line. For example, `KernelCallLine` calls `wrapper._generate_kernel_call_helper`, which is overriden by the various Python and C++ backends to generate the final wrapper code.

The main difficulty in implementing this is that we need to be careful that code is generated in the correct order. Wrapper codegen happens in two passes: first we write code into `self.lines` which mainly contains wrapper IR, but can also contain raw Python or C++ lines in some situations. Then, we convert the wrapper IR into the final Python/C++ code in `self.wrapper_call`. Since the same macros may be used in both passes, it's difficult to ensure that code is written to the correct buffer. The easiest solution for this was to implement a context manager overriding the `writeline` method to write to  `self.wrapper_call` after memory planning is finished. This way, `writeline` writes to `self.lines` in the first pass, and `self.wrapper_call` in the second. This obviated the need to pass `code` or `writeline` variables all the way through the call stack, which would have touched most of the existing macros.

# Test plan

Since this refactor touches all the existing wrapper codegen classes, the existing CI provides good coverage.

The parent PR introduces new tests for the FX IR backend. Among other things, these tests assert that `self.lines` only contains Wrapper IR lines, and no free-form code. While this would not be true of all programs today, the tests suggests that the IR implemented in this PR is sufficient to cover basic PyTorch usage.

# Future directions

These two goals are only partially realized by this PR. These are several important steps which still undergo direct Python/C++ codegen in core files:
 - User-defined Triton kernels.
 - Reinterpret views on outputs, from `gen_output_refs()`. (In the parent PR, the FX converter has a custom way of handling this. This can eventually be ported into Wrapper IR.)
 -  Fallback ops with custom `codegen()` methods, e.g. `ScatterFallback`.
 -  Misc. C++ lines emitted by the various cpp backends, e.g. declaring constants.

These cases will gradually be handled in subsequent PRs, as the Inductor->FX converter expands its coverage. Given that these refactors are pretty tricky to do, it seems wiser to execute them in stages, as opposed to porting everything to Wrapper IR at once.Some Python and codegen still lives in core files such as `ir.py`, as described in previous sections. Hopefully, this PR will serve as a starting point which moves the codebase towards a more modular design. Over time, we can gradually refactor the remaining codegen (mainly in `ir.py`) into backend classes.

One limitation of this PR is that codegen still happens in two phases during `PythonWrapperCodegen`. First, we generate Wrapper IR into `self.lines`, and from there we generate Python or C++ code into `self.wrapper_call`, `self.header`, etc. In the long term, it would be cleaner to split wrapper IR into its own class which doesn't deal with Python/C++ codegen at all. (See the diagram at the top.) That would strictly enforce the boundary between Wrapper IR and Python/C++ wrapper code. However, this would probably be a much larger refactor.

Another limitation of the current code is that the helper functions have a lot of call args. It's also possible to clean this up by passing Wrapper IR ops e.g. `KernelCallLine` into helper functions like `_generate_kernel_call_helper`, since they store all the arguments. However, that change would likely be prone to merge conflicts, so I would like to save it for follow-up PRs if possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150458
Approved by: https://github.com/eellison
2025-04-12 01:15:19 +00:00
575f348965 [MPSInductor] Naive welford_reduce implementation (#150824)
Literal Python-to-Metal translation of
85549fe6de/torch/_inductor/runtime/triton_helpers.py (L217-L225)

Fixed missing barrier in `welford_combine`
And this is sufficient to make `GPUTests.test_batch_norm_2d_2_mps` pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150824
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151042
2025-04-12 00:46:01 +00:00
83f14c0b06 Revert "[MPSInductor] Naive welford_reduce implementation (#150824)"
This reverts commit 5edfb4c4fad1bb9504482d930a2540d22427d383.

Reverted https://github.com/pytorch/pytorch/pull/150824 on behalf of https://github.com/malfet due to I should have waited for lint ([comment](https://github.com/pytorch/pytorch/pull/150824#issuecomment-2798249264))
2025-04-12 00:21:14 +00:00
ca2e8cd352 [map] make proxy mode re-dispatch to fake key (#151034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151034
Approved by: https://github.com/zou3519
ghstack dependencies: #150962
2025-04-11 23:28:06 +00:00
a72d56cb6b [map] always turn on dynamo for map (#150962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150962
Approved by: https://github.com/zou3519
2025-04-11 23:28:06 +00:00
5edfb4c4fa [MPSInductor] Naive welford_reduce implementation (#150824)
Literal Python-to-Metal translation of
85549fe6de/torch/_inductor/runtime/triton_helpers.py (L217-L225)

Fixed missing barrier in `welford_combine`
And this is sufficient to make `GPUTests.test_batch_norm_2d_2_mps` pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150824
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151042
2025-04-11 23:21:35 +00:00
c4f826d5e8 [CUDA][TF32] Account for TF32 in test_alexnet_prefix (#150970)
Mainly seems to be an issue on Blackwell with e.g.,
```
Mismatched elements: 1 / 746496 (0.0%)
Greatest absolute difference: 0.005461275577545166 at index (2, 32, 11, 9)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150970
Approved by: https://github.com/soulitzer
2025-04-11 23:13:54 +00:00
2d187bf7e6 Support tuning of _scaled_grouped_mm (#150421)
This includes the default aten implementation, as well as a Triton
implementation imported from FBGEMM
(https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150421
Approved by: https://github.com/ngimel
2025-04-11 23:03:49 +00:00
c3bc6b3542 [DTensor] Fix empty shard global-offset calculation (#150862)
`compute_local_shape_and_global_offset` util computes the local shape of
a particular shard of a DTensor, and the global offset (which describes
how the shard fits into the global tensor).

When the tensor dim does not evenly divide into the mesh dim, uneven
sharding occurs.  In some cases, uneven sharding results in an empty
shard.

e.g.
   tensor dim size: 4096
   mesh dim size: 30
   ranks 0..27 have local size 18
   rank 28 has local size 8
   rank 29 has local size 0 <--- empty shard

The global offset for an empty shard was previously undefined and
returned values that were computed based on logic that assumes no empty
shards. This caused DCP to fail to save a checkpoint, because the
deduplication logic could 'throw away' real (non-empty) shards, thinking
they were duplicates of zero-sized shards with the same offset.

Now, we define the global offset of an empty shard to be the dim-size,
which is out of bounds of the tensor and can't overlap with any
non-empty shards.
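
A small standalone sketch of the chunk-style split and the new empty-shard convention (the numbers are chosen so the per-rank sizes match the 18/8/0 pattern above; this is not the actual DTensor util):

```python
def local_size_and_offset(dim_size: int, num_shards: int, rank: int):
    chunk = -(-dim_size // num_shards)          # ceil division, torch.chunk-style
    start = rank * chunk
    end = min(start + chunk, dim_size)
    local = max(end - start, 0)
    offset = start if local > 0 else dim_size   # empty shard -> out-of-bounds offset
    return local, offset

for rank in (0, 27, 28, 29):
    print(rank, local_size_and_offset(512, 30, rank))
# 0 (18, 0), 27 (18, 486), 28 (8, 504), 29 (0, 512)
```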

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150862
Approved by: https://github.com/teja-rao, https://github.com/XilunWu
2025-04-11 22:25:57 +00:00
85549fe6de Add __all__ for torch.utils.dlpack (#149026)
Fixes the issue:

```python
torch.utils.dlpack.to_dlpack(tensor)  # "to_dlpack" is not exported from module "torch.utils.dlpack" Pylance[reportPrivateImportUsage](https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportPrivateImportUsage)
```

the docs for `torch.utils.dlpack`: https://pytorch.org/docs/stable/dlpack.html
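
A minimal sketch of the fix: declaring `__all__` in the module marks the names as public re-exports (the exact contents follow the module's public API):

```python
# torch/utils/dlpack.py (illustrative)
__all__ = ["from_dlpack", "to_dlpack"]
```
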
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149026
Approved by: https://github.com/mikaylagawarecki
2025-04-11 22:03:24 +00:00
2a909cab16 Update ninja missing error message (#147698)
In cpp_extensions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147698
Approved by: https://github.com/Skylion007
2025-04-11 21:56:53 +00:00
a78ac409b5 [AOTI] Add _weight_int4pack_mm to the C shim fallback list (#151059)
Summary: As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151059
Approved by: https://github.com/yushangdi
2025-04-11 21:22:35 +00:00
12281f9c18 [dynamo] Deprecate enable_cpp_framelocals_guard_eval config variable - default: True (#151008)
[dynamo] Deprecate enable_cpp_framelocals_guard_eval config variable - default: True

Reading the feature-enabling param `enable_cpp_framelocals_guard_eval` at the C++ level is time consuming and slows down dynamo, since it is done every time the function using this param is called. Reading the value only once at init isn't an option either, as it would prevent modifying the param at runtime. Since this feature has been enabled by default for some time and doesn't cause known issues, this commit deprecates the `enable_cpp_framelocals_guard_eval` configuration param and hardcodes its value to true.

Local microbenchmark dynamo_guard_eval.py:
- 931.9 us -> 538.9 us (3.10)

@williamwen42 @jansel @anijain2305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151008
Approved by: https://github.com/williamwen42
2025-04-11 21:07:59 +00:00
8910e4f2bb Fix 32-bit indexing overflows in ReducedPrecisionGemV (#150949)
By changing the `lda` type from `int` to ~~`long`~~ `int64_t`

Add a regression test (but probably restrict it to CPUs, or maybe skip float32 testing on GPUs)

Fixes https://github.com/pytorch/pytorch/issues/150637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150949
Approved by: https://github.com/Skylion007
2025-04-11 20:55:20 +00:00
05236b5045 Allow OpaqueTensorImpl to be used for views (#151028)
Summary:
When creating an `OpaqueTensorImpl`, currently there's only an option to create it for a non-view tensor, but it can be useful to create one for view tensors as well.

View tensors should contain the same autograd parameters as the original tensor, whereas non-view tensors get created with whatever `inference_mode` option is currently enabled. For this reason, `TensorImpl` has a special view constructor that takes `TensorImpl::ImplType` as its first parameter, so adding a new constructor to `OpaqueTensorImpl` that does the same thing allows us to create views with it.

Test Plan: CI

Reviewed By: scottxu0730

Differential Revision: D71748460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151028
Approved by: https://github.com/scottxu0730, https://github.com/chaos5958
2025-04-11 20:07:47 +00:00
bb60e82672 c10d/Store: add queues (#150969)
This adds queue operations as described in https://github.com/pytorch/pytorch/issues/150943.

This works by adding two new operations `queue_push` and `queue_pop`. The semantics are designed to be blocking with a timeout. Pushing will always succeed as the queue is infinite size. Popping will first call `wait` until the key is ready and then pop the value from the queue.

This implements queues only for HashStore and TCPStore w/ libuv. FileStore and the legacy backends are not supported.

`wait` and `check` work for queue operations though queue_push will only wake up the first waiter rather than all of them.

This also has a few cleanups to error types/documentation in related code.
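
A hedged usage sketch of the semantics described above, assuming the Python `Store` bindings expose the same `queue_push`/`queue_pop` names as the backend ops (HashStore is one of the supported backends):

```python
import torch.distributed as dist

store = dist.HashStore()
store.queue_push("jobs", b"task-1")
store.queue_push("jobs", b"task-2")

print(store.queue_pop("jobs"))  # b'task-1' (FIFO); blocks up to the timeout if empty
print(store.queue_pop("jobs"))  # b'task-2'
```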

Example trace:

```
[I409 16:51:43.963833529 TCPStoreLibUvBackend.cpp:829] [c10d - trace] validate magic:1015412686 address:[localhost]:55816
[I409 16:51:43.963845838 TCPStoreLibUvBackend.cpp:842] [c10d - trace] ping nonce:2840795 address:[localhost]:55816
[I409 16:51:43.963902914 TCPStoreLibUvBackend.cpp:911] [c10d - trace] add key:init/ val:1 address:[localhost]:55816
[I409 16:51:43.963939389 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:init/ address:[localhost]:55816
[I409 16:51:43.963974842 TCPStoreLibUvBackend.cpp:893] [c10d - trace] get key:init/ address:[localhost]:55816
[I409 16:51:43.964071909 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/test_queue_support address:[localhost]:55816
[I409 16:51:43.964080221 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964108584 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964123207 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964128194 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964156347 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964187493 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964217709 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964324300 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964354495 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964416299 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964458733 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/non_existant address:[localhost]:55816
[W409 16:51:43.974516585 socket.cpp:460] [c10d] waitForInput: poll for socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) returned 0, likely a timeout
[W409 16:51:43.974559169 socket.cpp:485] [c10d] waitForInput: socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) timed out after 10ms
[I409 16:51:43.974600451 TCPStoreLibUvBackend.cpp:1101] [c10d - trace] cancel_wait address:[localhost]:55816
```

Test plan:

```
$ pytest test/distributed/test_store.py -k queue -v -s

test/distributed/test_store.py::FileStoreTest::test_queues SKIPPED [0.4351s] (Store does not support queues)
test/distributed/test_store.py::HashStoreTest::test_queues PASSED [0.0009s]
test/distributed/test_store.py::PrefixFileStoreTest::test_queues SKIPPED [0.0006s] (Store does not support queues)
test/distributed/test_store.py::TCPStoreTest::test_queues SKIPPED [0.0012s] (Store does not support queues)
test/distributed/test_store.py::LibUvTCPStoreTest::test_queues PASSED [0.0014s]
test/distributed/test_store.py::PrefixTCPStoreTest::test_queues PASSED [0.0014s]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150969
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2025-04-11 19:24:17 +00:00
83ae61fd8e [Inductor] Add Subgraph as a Autotuning Choice (#150653)
Add the option for providing a Subgraph as an autotuning choice in Inductor. This is crucial for implementing the split-k optimization for GEMMs by decomposing a mm -> bmm. https://github.com/pytorch/pytorch/pull/150654 uses these changes to add decomposeK as a default autotuning choice for aten.mm in Inductor.

Using https://github.com/pytorch/pytorch/pull/150654 and a simple script:

```
import torch

def f(a, b):
    return torch.matmul(a, b)

def decompose_func(a_in, b_in):
    M, K = a_in.shape
    K, N = b_in.shape

    # TODO: Ideally we want to autotune over this parameter
    kPartitions = 256
    assert K % kPartitions == 0, "K must be divisible by Kmini"
    B = K // kPartitions

    a_reshaped = a_in.reshape(M, B, kPartitions).transpose(
        0, 1
      )  # Shape: (B, M, kPartitions)
    b_reshaped = b_in.reshape(B, kPartitions, N)  # Shape: (B, kPartitions, N)
    result = torch.bmm(a_reshaped, b_reshaped)  # Shape: (B, M, N)
    return result.sum(dim=0).to(torch.float16)  # Sum over B dimension, Shape: (M, N)

for k in [4096, 8192, 12288, 16384, 20480, 24576, 28672, 32768]:
    a = torch.randn(32, k, dtype=torch.float16, device="cuda", requires_grad=True)
    b = torch.randn(k, 32, dtype=torch.float16, device="cuda", requires_grad=True)

    compiled_res = torch.compile(f, dynamic=False)(a, b)
    decompose_res = decompose_func(a, b)

    print(f"Compiled mm result close to aten: {torch.allclose(f(a, b), compiled_res, atol=1e-5, rtol=0.5)}")
    print(f"Compiled mm result close to decompose: {torch.allclose(decompose_res, compiled_res, atol=1e-5, rtol=0.5)}")
```

we are able to autotune the decomposeK optimization against aten and the traditional Triton templates in Inductor. DecomposeK is faster than aten by about ~10% on average and gives a >4x speedup over the best Triton templates on an H100 machine, e.g.:

```
AUTOTUNE mm(32x28672, 28672x32)
  decompose_k_mm 0.0126 ms 100.0%
  mm 0.0144 ms 87.5%
  triton_mm_69 0.0579 ms 21.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_75 0.0677 ms 18.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_76 0.0850 ms 14.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_68 0.1444 ms 8.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_72 0.1546 ms 8.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_74 0.1819 ms 6.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_67 0.1917 ms 6.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_73 0.2766 ms 4.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
```

https://pastebin.com/g3FMaauT is the generated code from Inductor containing the subgraph decomposition for aten.mm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150653
Approved by: https://github.com/eellison
2025-04-11 19:08:43 +00:00
ad5e9065ac [Profiler/Easy] Remove temp flag for on-demand Memory Snapshot (#151068)
Summary: Now that the profiler implementation is in, we don't need the temporary flag. Submodule update too.

Test Plan: CI

Reviewed By: sanrise

Differential Revision: D72672186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151068
Approved by: https://github.com/davidberard98
2025-04-11 18:50:25 +00:00
fe961679d5 [Inductor] add support for disabling atomic adds (#151033)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151033
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-04-11 18:41:56 +00:00
67d3053d4b Revert "update benchamark result due to <1% regression (#150937)"
This reverts commit 860765d621e14730f8b6e7344da0053c4f00d540.

Reverted https://github.com/pytorch/pytorch/pull/150937 on behalf of https://github.com/laithsakka due to regression diff reverted ([comment](https://github.com/pytorch/pytorch/pull/150937#issuecomment-2797611127))
2025-04-11 17:36:47 +00:00
6b32255e37 [c10d][fr] Add logging of nccl_version into fr and its dump (#151048)
Users also want to see the nccl version in the FR dump, so let's add it to FR. We only add it per rank per PG nccl comm, so this really only adds a couple of bytes to FR memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151048
Approved by: https://github.com/kwen2501
2025-04-11 17:36:09 +00:00
5f5805a6ac Cache the value of torch_key in subproc (#151057)
No need to recalculate torch_key in subprocesses; let's pass it from the main process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151057
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-04-11 17:30:23 +00:00
fc1cccd012 Register also future allocations in mempool with NCCL (#150684)
This is the final PR, where everything comes together.

The problem I'm trying to solve is the following: when we register a MemPool with the NCCL ProcessGroup, it calls `ncclCommRegister` on all the allocations that are _currently_ in the pool. However, any later allocation will _not_ be registered with the NCCL communicator!

This is terribly inconvenient, because it means that every piece of code that allocates a tensor must be changed to become aware of whether it's doing so within a private pool, and it must become aware of NCCL and of all the PGs in existence, in order to re-register that pool with them.

Moreover, I believe there can be performance implications because allocating tensors is usually done in the critical path (i.e., during the forward and backward of every step of a training), whereas registering memory is a slow operation that should be done once at init time.

With this PR, once the user registers a Mempool with the NCCL PG, we install some hooks into the CachingAllocator in order to listen for all future memory allocations and, if they belong to the pool, we automatically call `ncclCommRegister` on them! (In fact, we reuse the hooks that already exist for `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`).
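
A plain-Python sketch of the hook pattern described above (illustrative only, not the CachingAllocator API): registration covers what is already in the pool and everything allocated afterwards.

```python
class Pool:
    def __init__(self):
        self._hooks = []
        self.blocks = []

    def register_hook(self, fn):
        self._hooks.append(fn)
        for block in self.blocks:      # register existing allocations
            fn(block)

    def allocate(self, size):
        block = ("block", size)
        self.blocks.append(block)
        for fn in self._hooks:         # ...and every future allocation
            fn(block)
        return block

pool = Pool()
pool.allocate(1024)
pool.register_hook(lambda b: print("ncclCommRegister", b))  # registers the 1024 block
pool.allocate(2048)                                         # registered automatically
```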

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150684
Approved by: https://github.com/kwen2501
ghstack dependencies: #150683
2025-04-11 17:26:37 +00:00
99642182f2 Add mempool to allocator's trace events (#150683)
In the NCCL ProcessGroup we want to support being able to "register" with NCCL all the allocations that belong to a certain private MemPool. In order to do so on-the-fly for every new allocation, we register a hook for the CachingAllocator's TraceEvents. However, we were lacking a way to know whether a given TraceEvent belonged to the MemPool that we cared about or not. With this PR, we add a MempoolId_t field to the TraceEvents.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150683
Approved by: https://github.com/syed-ahmed, https://github.com/kwen2501
2025-04-11 17:26:37 +00:00
d385179886 [dtensor] add op support for torch.cumsum (#151071)
For `torch.cumsum`, any sharding placement should propagate through if the cumsum `dim` is not sharded; otherwise it needs to be replicated first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151071
Approved by: https://github.com/wanchaol
2025-04-11 16:42:19 +00:00
1fe260f7c4 [cutlass backend] Add and fix logs, fix types, and make cutlass generator only generate GEMM (#150973)
Differential Revision: [D72760205](https://our.internmc.facebook.com/intern/diff/D72760205/)

We hardcoded to only use GEMM anyway.

This also highlights a problem with high instantiation levels: as the instantiation level goes higher (here it is 3333), just listing the configs can already take a long time (here >3 minutes).

If we know exactly which configs we care about, we should have a way to generate them without calling the generators. But let's see if we need that.

using this script
```
import os

os.environ["TORCH_LOGS"] = "inductor"

import torch

import torch._inductor.config

torch._inductor.config.max_autotune = True
torch._inductor.config.force_disable_caches = True
torch._inductor.config.max_autotune_gemm_backends = "Aten,CUTLASS"
# intentionally use no cutlass ops
torch._inductor.config.cuda.cutlass_max_profiling_configs = 0
torch._inductor.config.cuda.cutlass_instantiation_level = "3333"

def main():
    M = 128
    dtype = torch.float16
    A = torch.randn(M, M, device="cuda", dtype=dtype)
    B = torch.randn(M, M, device="cuda", dtype=dtype)

    compiled_model = torch.compile(torch.mm)

    _ = compiled_model(A, B)
    print("done")

if __name__ == "__main__":
    main()
```

before, with logs:
```
CUTLASS library generated 7 operations in 235.03 seconds
Got cutlass configs: total number of ops: 4753. Filtering took 10.51 seconds
```

after:
```
CUTLASS library generated 1 operations in 207.39 seconds
Got cutlass configs: total number of ops: 4753. Filtering took 9.53 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150973
Approved by: https://github.com/ColinPeppler
2025-04-11 16:24:26 +00:00
f1364431f0 Add debug_lines of FXGraphCacheKey to AOTAutogradCacheEntry (#150594)
Previously we didn't save debug_lines because it's pretty large, but compared to the size of FXGraphCache entries it's still pretty small. So let's add it to AOTAutogradCache for easier debugability.

Differential Revision: [D72361611](https://our.internmc.facebook.com/intern/diff/D72361611/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150594
Approved by: https://github.com/oulgen
2025-04-11 15:24:13 +00:00
38bec787fa cleanup JK for duplicate pt2 compile callbacks prevention (#148704)
Summary: This diff cleans up the JK we used for enabling `add pt2 callbacks for backward pass and prevent duplicate callbacks` feature.

Differential Revision: D70643543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148704
Approved by: https://github.com/mlazos
2025-04-11 15:17:06 +00:00
91920661b4 Don't log benchmarking event to Scuba (#151053)
These two events are really common, and they make up a huge portion of the logs (~70%) we get internally in PT2 Compile Events. I don't think it's actually that useful to aggregate them, so instead of logging them to PT2 Compile Events, let's just log them only to chromium.

These two events will still be visible from tlparse: they just won't be in our internal tables. Please let me know if folks disagree.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151053
Approved by: https://github.com/oulgen, https://github.com/masnesral
2025-04-11 14:56:36 +00:00
d94cc0e994 Optimize ConvTranspose2d stride description (#150819)
Fixes #150775

## Test Result

### Before

![image](https://github.com/user-attachments/assets/81cd932f-9447-4924-9553-a5cb88fc5d0e)

### After

![image](https://github.com/user-attachments/assets/6365c71c-7268-4226-b722-ee7446cb2467)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150819
Approved by: https://github.com/jbschlosser
2025-04-11 09:37:56 +00:00
183bca41de [dynamo] unimplemented -> unimplemented_v2 in variables/builder.py (#151044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151044
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-04-11 09:07:01 +00:00
d6f1c72354 [PrivateUse1] Allow out-of-tree devices to pass check when validating csr tensor args (#149374)
Fixes #149303
Fllow-up: #147306

Because we have a dispatch key named `DispatchKey::SparseCsrPrivateUse1` for this case, we allow users to create a csr tensor on out-of-tree devices, so we should also let that pass the check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149374
Approved by: https://github.com/FFFrog, https://github.com/albanD
2025-04-11 09:05:20 +00:00
5590a0692c [aotinductor] fix std::{min.max} compilation error for sympy expr with multiple args (#150894)
### Compilation error
The issue is that u0 (an unbacked symint) can come from a smaller int dtype e.g. int16, int32.
```
error: no matching function for call to ‘min(int64_t&, short int&)’
  759 |     call_add_kernel_with_scaling_0(... std::min(100L, s97, u0) ...);
```

### Diff
The fix is to explicitly specify `int64_t` in the std::min template.
```
int64_t s97 = arg0_1_size[0];
int16_t u0_raw;      # not a long
auto u0 = u0_raw;

# Before
std::min({100L, s97, u0})
# After
std::min<int64_t>({100L, s97, u0})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150894
Approved by: https://github.com/desertfire
2025-04-11 07:32:47 +00:00
44ed0c9fbb Revert "[profiler] don't disable CUPTI_LAZY_REINIT for cuda >= 12.6 (#150957)"
This reverts commit 37812009fd123d5c4a038ce798eedd4a89eeffad.

Reverted https://github.com/pytorch/pytorch/pull/150957 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150957#issuecomment-2795878848))
2025-04-11 05:38:58 +00:00
6c7336cb31 [Profiler][HPU] Enable profiler.key_averages().table() for HPU devices (#150770)
Fixes #150769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150770
Approved by: https://github.com/sraikund16, https://github.com/jeromean
2025-04-11 05:17:12 +00:00
85ada5d6dd [Dynamo] Allow dynamo to handle 'or' operator between two dicts (#147305)
Fixes #146538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147305
Approved by: https://github.com/anijain2305
2025-04-11 04:47:31 +00:00
6f6ff8837a [Inductor UT][Break XPU] Fix UTs for XPU broken by community. (#150830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150830
Approved by: https://github.com/anmyachev, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #149862
2025-04-11 04:30:46 +00:00
d186c933f8 [Inductor UT][Break XPU] Apply CUDA tolerances changes on XPU that introduced by #144579. (#149862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149862
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-04-11 04:30:46 +00:00
a22d3e778e [dynamo][guards] Print relational guards only once (#150810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150810
Approved by: https://github.com/anijain2305
2025-04-11 04:10:37 +00:00
8b5e717601 c10d/Store: add clone feature (#150966) (#150966) (#151045)
Summary:
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: https://github.com/pytorch/pytorch/issues/150943
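
A hedged usage sketch (the `clone()` method name is taken from this summary; TCPStore construction details may differ by version):

```python
import threading
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)

def worker(s):
    s.set("worker_key", "ready")   # use the cloned handle from another thread

t = threading.Thread(target=worker, args=(store.clone(),))
t.start()
t.join()
print(store.get("worker_key"))     # b'ready'
```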

Approved by: https://github.com/fduwjj

Test Plan:
contbuild & OSS CI, see 205881ea4a

Test plan from GitHub:
```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Differential Revision: D72789690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151045
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2025-04-11 04:00:23 +00:00
75162aa7de [ONNX] Support running bfloat16 models with ONNX Runtime (#149646)
Use ORTValue objects to support bfloat16 and other dtypes as inputs. This only supports cuda as ort only implements bfloat16 on cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149646
Approved by: https://github.com/titaiwangms
2025-04-11 03:38:26 +00:00
86370fd658 [dynamo] Allow guards to be dropped with custom filter functions. (#150936)
Summary: A follow up of https://github.com/pytorch/pytorch/pull/150689.

Test Plan: test_dynamo -k test_guard_filter_fn

Differential Revision: D72722322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150936
Approved by: https://github.com/jansel
2025-04-11 03:06:34 +00:00
4b0cf9fc00 Optimize transformer encoder/decoder init suggestion (#146882)
Fixes #72253

Add hint message for users to manually initialize after created.

## Test Result

**Before**

![image](https://github.com/user-attachments/assets/1914223f-008e-4ff7-aea1-c54c55679f65)

![image](https://github.com/user-attachments/assets/fd4110c1-26f7-48fe-9582-80581ab72328)

**After**

![image](https://github.com/user-attachments/assets/12270ba2-b384-4fe6-b351-4287b272d102)

![image](https://github.com/user-attachments/assets/0194e3a0-700a-40da-a9de-e9854c2d5d2e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146882
Approved by: https://github.com/jbschlosser
2025-04-11 02:31:56 +00:00
1e92579126 Add torch._scaled_mm for CPU (#150410)
This PR is a duplicate of https://github.com/pytorch/pytorch/pull/139975.

This PR adds torch._scaled_mm for the CPU backend.

_scaled_mm_out_cpu and _scaled_mm_cpu are newly added and included in the torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410
Approved by: https://github.com/atalman
2025-04-11 02:23:03 +00:00
24ca7e91e6 [1/N] Use internal linkage in torch/csrc C++ files. (#150930)
Make more functions and variables static if they are not used outside their cpp files. Unused functions are removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150930
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-11 02:19:31 +00:00
48132de4af [c10d][fr] Fix the false positive in the dtype check in fr analysis script (#151063)
When checking dtype in the FR analysis script, we should only check it when the input or output numel is larger than zero. For gather or scatter, the output/input size will be an empty list for non-src or non-dst ranks, and we should just skip the check in that case.

Differential Revision: [D72826823](https://our.internmc.facebook.com/intern/diff/D72826823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151063
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-04-11 02:11:58 +00:00
df4e5294a6 Reapply "ProcessGroupGloo: support lazy_init (#150801)" (#151031)
This reverts commit 73f3d6d9aaa128d9917e8b3790933ba2855066cc.

Reapplies #150801

Test plan:

See #150801

submodule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151031
Approved by: https://github.com/fduwjj
2025-04-11 01:58:35 +00:00
b7c0fda163 [MPS] Fix determine_backend_memory_format logic (#151042)
If the input is channels-last, then MPS will return a channels-last output

This fixes `GPUTests.test_convolution_4_mps` from test_torchinductor.py,

which previously failed with
```
AssertionError: expected size 3==3, stride 1==192 at dim=1; expected size 12==12, stride 48==16 at dim=2; expected size 16==16, stride 3==1 at dim=3
```
The FakeTensor implementation of conv returned a `Contiguous` rather than a `ChannelsLast` layout on MacOS-15 or later.
This doesn't seem to be very well documented, so I will try to document the call path for the `ExternKernel` invocation of `aten::convolution`:
 - First inductor decomp defined here is called
 c93e4b8290/torch/_inductor/kernel/conv.py (L424-L425)

- Then it goes thru FakeTensor decomposition implemented here
320914f1b6/torch/_subclasses/fake_impls.py (L739-L740)
- Finally it goes down to convolution meta registrations implemented here
320914f1b6/torch/_meta_registrations.py (L2416-L2417)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151042
Approved by: https://github.com/dcci
2025-04-11 01:51:34 +00:00
320914f1b6 [c10d][libuv] Add back correct EOF case check (#151052)
We removed the wrong EOF case in https://github.com/pytorch/pytorch/pull/150987, and we add the correct one back in this PR. Since https://github.com/pytorch/pytorch/pull/150987 is a fix, we merged that PR first and use this PR as a follow-up to further make the logic more complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151052
Approved by: https://github.com/XilunWu
2025-04-11 01:37:30 +00:00
c93e4b8290 [BC-breaking] Set NonStrict as default for export_for_training (#150941)
Summary:
- Flip the default value of the `strict` argument from True to False on the torch.export.export_for_training API.
- All callsites have been updated to provide this argument explicitly to avoid a behavior change.
- If you see any breakage, that means you may have a new callsite that was missed; please set `strict=True` explicitly at the callsite to mitigate (see the sketch below).
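A minimal sketch of the new default and the explicit opt-out (trivial module for illustration, not from this PR):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

args = (torch.randn(2),)
ep_nonstrict = torch.export.export_for_training(M(), args)            # strict now defaults to False
ep_strict = torch.export.export_for_training(M(), args, strict=True)  # opt back into the old behavior
```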

Test Plan: CI

Differential Revision: D72724975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150941
Approved by: https://github.com/ydwu4
2025-04-11 00:50:05 +00:00
e945247f05 Revert two recent prologue prs (#151013)
These were landed in a bit of a rush to try to make the release. Reverting; we will then re-land with https://github.com/pytorch/pytorch/pull/151009 applied and do a full benchmark run with max-autotune.

Differential Revision: [D72791103](https://our.internmc.facebook.com/intern/diff/D72791103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151013
Approved by: https://github.com/zou3519
2025-04-10 23:48:41 +00:00
c9a35c2a6e [C10D] Document object collectives limitations (#150815)
Adds louder warning labels to the doc page and docstrings for object
collectives, in hopes of raising awareness of several footgun issues,
including the accidental creation of CUDA contexts by serializing and
sending 'device-local' GPU tensors over the object-* APIs.

Preview:
<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />

addresses #150798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150815
Approved by: https://github.com/kwen2501
2025-04-10 22:48:39 +00:00
dbcd0b571d Back out "[AOTI] Always use oss schema for ExternKernelNodes serialization" (#151026)
Summary: Revert for FC breaking

Test Plan: CI

Differential Revision: D72802075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151026
Approved by: https://github.com/hl475
2025-04-10 22:36:35 +00:00
f304483e95 [ONNX] Add asdict method to VerificationInfo class (#151024)
This pull request introduces a new method to convert `VerificationInfo` objects to dictionaries and includes a corresponding test to ensure the method works correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151024
Approved by: https://github.com/titaiwangms
2025-04-10 22:23:33 +00:00
8d81806211 [inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)
context: https://github.com/pytorch/pytorch/issues/150390#issuecomment-2790272814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150888
Approved by: https://github.com/jansel
2025-04-10 22:10:55 +00:00
e786b3bf54 Revert "[inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)"
This reverts commit 115a165f9b24e3aaaeb2d0994678116758bd636f.

Reverted https://github.com/pytorch/pytorch/pull/150888 on behalf of https://github.com/malfet due to This indeed broke all those inductor tests ([comment](https://github.com/pytorch/pytorch/pull/150888#issuecomment-2795231901))
2025-04-10 21:46:23 +00:00
6a65f2c4fe Revert "Support tuning of _scaled_grouped_mm (#150421)"
This reverts commit 8efcf21fff327d155350bf26ccba769bab58c077.

Reverted https://github.com/pytorch/pytorch/pull/150421 on behalf of https://github.com/malfet due to Looks like it broke lint, see a0ab243c3a/1 ([comment](https://github.com/pytorch/pytorch/pull/150421#issuecomment-2795218547))
2025-04-10 21:36:41 +00:00
a0ab243c3a Revert "Generalize poison fork logic for each device backend (#144664)"
This reverts commit 83bd0b63b55f224fada6d5f6dd7eb5b4cb3072fb.

Reverted https://github.com/pytorch/pytorch/pull/144664 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/144664#issuecomment-2795157082))
2025-04-10 21:02:14 +00:00
8efcf21fff Support tuning of _scaled_grouped_mm (#150421)
This includes the default aten implementation, as well as a Triton
implementation imported from FBGEMM
(https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150421
Approved by: https://github.com/ngimel
2025-04-10 20:34:16 +00:00
abe41c5c9c Revert "c10d/Store: add clone feature (#150966)"
This reverts commit 205881ea4a451574c3a3de87c42484043a955d6e.

Reverted https://github.com/pytorch/pytorch/pull/150966 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150966#issuecomment-2795063574))
2025-04-10 20:17:53 +00:00
8fdd61bc45 Fix torchscript issues with reference quantized modules (#150870)
Summary:
The reference quantized modules for linear / conv / etc fail to torchscript due to two issues

(1) The type of torch.qscheme doesn't script
(2) The "_DTYPE_TO_QVALUE_BOUNDS" values were resolving to union[float, int] instead of just int. We fix that with a hard cast.

See: <internal post> + comments for more context

Test Plan: unit tests + fixing this NB N6923590

Differential Revision: D72652616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150870
Approved by: https://github.com/jerryzh168
2025-04-10 20:14:45 +00:00
31162214d8 Revert "[AOTI] Remove typedef for half and bfloat16 (#150657)"
This reverts commit 357814c85c00a2b5b3fb9add97735e4789caa7e0.

Reverted https://github.com/pytorch/pytorch/pull/150657 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150657#issuecomment-2795042772))
2025-04-10 20:08:03 +00:00
252029b294 [Inductor] assert fallback output alignment (#150804)
The previous PR (https://github.com/pytorch/pytorch/pull/150777) fixes the alignment problem for fallback kernels assuming the meta kernel is correct. This PR handles the case where the meta kernel is incorrect: an assertion is added whenever the compiler assumes a fallback kernel output is aligned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150804
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #150777
2025-04-10 20:01:06 +00:00
115a165f9b [inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)
context: https://github.com/pytorch/pytorch/issues/150390#issuecomment-2790272814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150888
Approved by: https://github.com/jansel
2025-04-10 19:46:35 +00:00
4161c752bb [dynamo] unpack sequence lazily for list extend/deque extendleft (#150965)
Fixes https://github.com/pytorch/pytorch/issues/133063.

We were unpacking generators/iterators eagerly when we should be unpacking them one-by-one.
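A hypothetical repro of the pattern being fixed (not the exact test from the issue): extending a list with a generator inside a compiled function, where the generator should now be consumed lazily.

```python
import torch

@torch.compile
def f(x):
    out = []
    out.extend(x + i for i in range(3))  # consumed one element at a time
    return out

print(f(torch.ones(2)))
```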

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150965
Approved by: https://github.com/jansel
2025-04-10 19:31:31 +00:00
389cd15265 [export] check tuple length mismatch for dynamic_shapes spec (#150976)
Summary: we weren't checking this.

Test Plan: test_export

Differential Revision: D72761995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150976
Approved by: https://github.com/angelayi
2025-04-10 19:08:43 +00:00
f663aa4e81 [c10d][tcp_store] Fix connection reset caused by wrong socket close (#150987)
While fixing the memory leak in https://github.com/pytorch/pytorch/pull/145757, we accidentally closed the socket when nread == 0, assuming it meant the connection was closed. That is not true. According to the libuv docs (https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb):

> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).

We found this bug when debugging a broken-pipe issue where users first call a set and then wait for all keys right afterwards on 128 ranks. This might also explain other broken-pipe issues we have seen in prod jobs recently.

Added a unit test to test this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150987
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-10 18:48:57 +00:00
e7ed50f27b [async TP] Fix handling of case where scatter dim = 0 for 2D output tensor (#150935)
## Summary of changes

1. Change the assertion to a warning when no all-gather or reduce-scatter patterns are found, and remove the corresponding unit test. It seems some valid TP graphs may not have any pattern matches, from what I can see.
2. Fix wrong variable name being referenced (`A_with_scatter_dim_0` instead of just `A`)
3. Simplify reshaping to target output shape (don't need to recalculate output shape)
4. When "A" tensor is 2D, so we are doing doing a 2D x 2D scaled mm, we need to fix our handling of the case where the scatter dim is 0. When scatter dim is 0 for the 2D scaled mm output shape, this is actually dim 1 in the unreduced stacked partial scaled mm outputs, which has a (logical) shape of `(group_size, M//group_size, N)`. To summarize:
    - Unreduced stacked partials are of shape `(M, N)`
    - We view as `(group size, M//group_size, N)` and reduce along the scatter dim (`group_size` / dim 0).
    - Reduced output (`reduced_out`) has shape (M//group_size, N)
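A small sketch of the view-and-reduce described above (illustrative shapes only, not code from this PR):

```python
import torch

group_size, M, N = 4, 16, 8
unreduced = torch.randn(M, N)  # stacked partial scaled-mm outputs, shape (M, N)
reduced = unreduced.view(group_size, M // group_size, N).sum(dim=0)
assert reduced.shape == (M // group_size, N)
```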

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150935
Approved by: https://github.com/lw
2025-04-10 18:25:48 +00:00
08831f30bb [Intel GPU] Allow XPU backend in Depthwise_conv2d&3d operators (#149114)
This modification adds XPU kernel support for depthwise_conv2d and depthwise_conv3d.
Currently, when running depthwise_conv on XPU devices, it is calculated with Mkldnn via the ConvBackend::Overrideable path.
After this modification, depthwise_conv will be calculated directly using XpuDepthwise3d when the Mkldnn backend is disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149114
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-04-10 17:49:27 +00:00
37812009fd [profiler] don't disable CUPTI_LAZY_REINIT for cuda >= 12.6 (#150957)
Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604

Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.

This bug is believed to be fixed in CUDA >= 12.6, so this PR qualifies that DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 should only be applied if CUDA >= 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.

Differential Revision: [D72745929](https://our.internmc.facebook.com/intern/diff/D72745929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150957
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
2025-04-10 17:45:01 +00:00
6720d23969 Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690)
Detail of the issue:

If PyTorch issues send/recv on each 2-rank comm, and these comms are managed by a single ProcessGroupNCCL instance, then the comms need to abort either in sequence or in a group.

I.e. the following sequential abort will cause a hang in NCCL:

recv(..., comm0, stream);
send(..., comm1, stream);
abort(comm1);
abort(comm0);

Fixes #119797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150690
Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman
2025-04-10 17:33:26 +00:00
1250106630 [pytorch] Remove numpy dependency from Knapsack Evaluator (#150825)
Summary:
The two implementations are functionally equivalent. They both calculate the memory budget at the knee point in the Pareto frontier using the same algorithm.

1. np.linspace -> basic list comprehension
2. runtime and memory values -> lists instead of numpy arrays
3. np.ptp -> max - min
4. np.norm -> diff with min value / range
5. np.sqrt -> **0.5
6. np.argmin -> .index(min(_)) (see the sketch below)
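A small sketch of the numpy-free equivalents listed above (illustrative values; not the actual Knapsack Evaluator code):

```python
runtimes = [3.0, 1.0, 2.0]

linspace = [i / 4 for i in range(5)]                  # np.linspace(0, 1, 5)
ptp = max(runtimes) - min(runtimes)                   # np.ptp(runtimes)
norm = [(x - min(runtimes)) / ptp for x in runtimes]  # (value - min) / range
dist = (norm[0] ** 2 + norm[1] ** 2) ** 0.5           # np.sqrt(...) -> ** 0.5
argmin = runtimes.index(min(runtimes))                # np.argmin(runtimes)
```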

Test Plan:
# Unit Testing

```
buck test mode/opt //caffe2/test/functorch:test_ac_knapsack; pingme "tests done"
Buck UI: https://www.internalfb.com/buck2/f4e41eb8-e775-4f04-b4e7-8e567599deb8
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099236155875
Network: Up: 24KiB  Down: 1.9GiB  (reSessionID-7cd11487-f3e7-43ab-982a-805510771c8d)
Executing actions. Remaining      0/259826                                                                                                  98:15:40.5s exec time total
Command: test.     Finished 3 local, 5 remote, 103467 cache (99% hit)                                                                       98:15:14.8s exec time cached (99%)
Time elapsed: 1:09.9s
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```

# End to End Testing

### Baseline Run with DP

Let's confirm everything we are running on works.

- Optimization Algo: DP
- Memory Budget: 0.05
- AIX Link: apf_local-basilwong-2025-03-22_20:39:10
- TLParse rank 0: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpDJaWp5/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
- TLParse rank 1:  https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpDJaWp5/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

### Dynamic Memory Budget (Before Change)

- Revision: 2c95489b7f79
- Optimization Algo: Dynamic Memory Budget
- Memory Budget: 0.05
- AIX Link: https://www.internalfb.com/mlhub/pipeline/4088035428184866
- TLParse:
   - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpykEy8U/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
   - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpykEy8U/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

### Dynamic Memory Budget (After Change)

- Revision: 14353eef3c9e
- Optimization Algo: Dynamic Memory Budget
- Memory Budget: 0.05
- AIX Link: https://www.internalfb.com/mlhub/pipeline/1613558749306737
- TLParse Links:
   - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpZKNWFw/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
    - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpZKNWFw/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

As a sanity check lets take the AC information for the following compile id: 7_0_0 from the rank 0 of each TLParse.

 {F1976883124}

* Baseline: P1779400819
   * Saved node values show we are storing much more compared to dynamic memory:

```
  "Knapsack Saved Nodes": [
    16,
    17,
    19,
    20,
    21,
    22,
    24,
    25,
    26,
    27,
    28,
    29,
    30,
    31,
    32,
    33,
    34,
    35,
    36,
    37,
    38,
    39,
    40,
    41,
    42,
    43,
    44,
    45,
    46,
    47,
    49,
    50,
    51,
    52,
    53,
    54,
    55,
    56,
    57,
    58,
    59,
    60
  ]
```

* Before Change: P1779401775
   * Saved nodes are similar to after change but not exactly.

```
  "Knapsack Saved Nodes": [
    24,
    25,
    26,
    27,
    28,
    29,
    30,
    31,
    32,
    33,
    34,
    35,
    36,
    37,
    38,
    39,
    40,
    41,
    42,
    43,
    44,
    45,
    46,
    47,
    49,
    50
  ]
```

* After Change: P1779402106
   * Here we see that the largest nodes saved are about the same, but there is a small discrepancy for the smallest nodes.

```
  "Knapsack Saved Nodes": [
    24,
    25,
    26,
    27,
    28,
    29,
    30,
    31,
    32,
    33,
    34,
    35,
    36,
    37,
    38,
    39,
    40,
    41,
    42,
    43,
    44,
    45,
    46,
    47,
    50,
    51,
    57,
    58,
    59,
    60,
    61,
    62
  ],
```

The discrepancy can be explained by looking at the estimated memory values. This is the non-deterministic part (below are the top 5 memory values for the considered candidates):

```
    0.05774741703905514,
    0.007333005338292718,
    0.007333005338292718,
    0.007333005338292718,
    0.007333005338292718,
```

vs

```
    0.049254204820440746,
    0.006254502199421049,
    0.006254502199421049,
    0.006254502199421049,
    0.006254502199421049,
```

Given that the dynamic memory implementations performed similarly in an E2E test and that the memory estimates are non-deterministic, we should be good to land.

Differential Revision: D71692245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150825
Approved by: https://github.com/seemethere, https://github.com/jansel
2025-04-10 17:07:03 +00:00
5471e80fb4 Remove guard_size_oblivious from vector_norm decomposition. (#148809)
This PR removes the usage of guard_size_oblivious in vector_norm by inlining it in the runtime check.
This prevents any data-dependent error from ever appearing at the locations where guard_size_oblivious
used to exist; before this PR it could potentially break. This is NOT BC-breaking and does not change semantics from eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148809
Approved by: https://github.com/bobrenjc93
2025-04-10 16:19:00 +00:00
e6969c1bd8 [export] Symint support (nonstrict, Dim.DYNAMIC) (#150198)
Fixes https://github.com/pytorch/pytorch/issues/113682, but only in the non-strict export case. Also, we only support Dim.DYNAMIC/Dim.AUTO, not named Dims.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150198
Approved by: https://github.com/pianpwk
2025-04-10 15:06:23 +00:00
596e44d26a [inductor] Enable docstring_linter on _inductor (#144622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144622
Approved by: https://github.com/eellison
ghstack dependencies: #144621
2025-04-10 14:32:26 +00:00
ba35793226 [inductor] Add tests for new docstring_linter features (fix #142496) (#144621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144621
Approved by: https://github.com/eellison
2025-04-10 14:32:26 +00:00
73f3d6d9aa Revert "ProcessGroupGloo: support lazy_init (#150801)"
This reverts commit f237ee54bfb35d16cd10e358d4b78578c88a5781.

Reverted https://github.com/pytorch/pytorch/pull/150801 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150801#issuecomment-2793161239))
2025-04-10 13:44:31 +00:00
7b7b9d707e [CI] Add XPU compiled check in CICD (#150771)
Address the suggestion from https://github.com/pytorch/pytorch/issues/150001#issuecomment-2753407421

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150771
Approved by: https://github.com/malfet, https://github.com/atalman
2025-04-10 13:33:27 +00:00
4273e5d15c Expose is_available API for torch.backends.mkldnn (#147432)
As the title states.

Like torch.backends.mkl, torch.backends.openmp and so on, torch.backends.mkldnn now also exposes an
is_available API for users.
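A minimal usage sketch:

```python
import torch

# True if this build of PyTorch has MKL-DNN (oneDNN) support available.
print(torch.backends.mkldnn.is_available())
```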
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147432
Approved by: https://github.com/albanD
2025-04-10 05:05:37 +00:00
1a1a32ce5a [elastic][test] fix race condition in test_barrier_timeout_rank_tracing (#150768)
# Root cause
The barrier timeout of 0.1s is too short; some threads may not have enough time to reach the barrier.

# How to reproduce
Adding a short sleep makes it easy to reproduce.
```python
    def test_barrier_timeout_rank_tracing(self):
        N = 3

        store = dist.HashStore()

        def run_barrier_for_rank(i: int):
            if i != 0:
                import time;time.sleep(1)  # Let some thread sleep for a while
            try:
                store_util.barrier(
                    store,
                    N,
                    key_prefix="test/store",
                    barrier_timeout=0.1,
                    rank=i,
                    rank_tracing_decoder=lambda x: f"Rank {x} host",
                    trace_timeout=0.01,
                )
            except Exception as e:
                return str(e)
            return ""

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150768
Approved by: https://github.com/d4l3k
2025-04-10 04:40:16 +00:00
a6933a1c42 [Inductor] Remove triton dtype patch which has landed (#149611)
As this [pr][0] has already landed, we should remove its patch.

Having [mentioned][1] this before, I am making this change now to avoid omissions.

[0]: https://github.com/triton-lang/triton/pull/3342
[1]: https://github.com/pytorch/pytorch/pull/147583/files#r1970440062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149611
Approved by: https://github.com/eellison
2025-04-10 03:42:55 +00:00
b80bb87689 cpp_wrapper: Miscellaneous fixups (#150143)
1. Revisit the preprocessing code in cpp_builder.py, removing a hack that channels it through stdout.
2. Fix ops that return None.

Differential Revision: [D72053414](https://our.internmc.facebook.com/intern/diff/D72053414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150143
Approved by: https://github.com/desertfire
2025-04-10 03:31:12 +00:00
cd80778ac8 Fix issue in optimized_add issue: make_optimized should be called on non args only (#150955)
PR https://github.com/pytorch/pytorch/pull/149665 made a change to optimized_add that is causing an issue internally.
In general, make_optimized should only be called with valid new_args; new_args can become None
when the element already exists, and we should break out of the loop in that case.

Note that I also only kept the optimized summation when both the lhs and rhs lengths are <= 2.
This is ok because the optimization is based on the inductive property of adding one symbol at a time;
the [2]+[2] case here serves as the base case (I feel we could also remove it).

Note that keeping it for all sizes, while correct, may not be as efficient (we would do N log(n) insertions);
there is no current justification for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150955
Approved by: https://github.com/Mingming-Ding, https://github.com/atalman, https://github.com/bobrenjc93
2025-04-10 03:00:21 +00:00
bf7d8ef10d [Docs] Clarify behavior when integer dtype is used with requires_grad=True in tensor.to() (#150913)
Fixes #150618

Related comment: https://github.com/pytorch/pytorch/issues/3226#issuecomment-489362234
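A small sketch of the underlying constraint being documented, namely that only floating point and complex dtypes can require gradients (not code from this PR):

```python
import torch

try:
    torch.ones(3, dtype=torch.int64, requires_grad=True)
except RuntimeError as e:
    print(e)  # integer dtypes cannot require gradients
```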

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150913
Approved by: https://github.com/janeyx99, https://github.com/soulitzer, https://github.com/cyyever

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-04-10 02:52:58 +00:00
78b3d71ece Docs: Add missing whitespace in the cmake warning message (#150929)
A trailing whitespace is needed to be concatenated to the following string correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150929
Approved by: https://github.com/Skylion007
2025-04-10 02:50:56 +00:00
3d3fcaaf7b Delegate torch.accelerator.device_count to torch.xxx.device_count for multi-process usage (#149924)
# Motivation
Adapt `torch.accelerator.device_count` for multi-process usage. For example, `torch.cuda.device_count` avoids poisoning fork, so `torch.accelerator.device_count` should meet the same requirement.
Now that `torch.get_device_module(device).device_count` supports this, `torch.accelerator.device_count` should align with this behavior as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149924
Approved by: https://github.com/albanD
ghstack dependencies: #147507
2025-04-10 02:37:37 +00:00
6972255dad Document poison fork note for accelerator APIs (#147507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147507
Approved by: https://github.com/sraikund16, https://github.com/kwen2501, https://github.com/albanD
2025-04-10 02:37:37 +00:00
83bd0b63b5 Generalize poison fork logic for each device backend (#144664)
# Motivation
Generalize the poison_fork code to make it reusable across different devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-04-10 02:34:53 +00:00
cyy
322f883c0c Remove unneeded CUDA logic from _create_build_env (#145822)
Because FindCUDAToolkit.cmake has that logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145822
Approved by: https://github.com/albanD
2025-04-10 02:17:28 +00:00
cyy
54827752a4 [5/N] Remove unnecessary once flag usage (#147445)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147445
Approved by: https://github.com/albanD
2025-04-10 01:48:10 +00:00
205881ea4a c10d/Store: add clone feature (#150966)
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.
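A hedged sketch of how a cloned store might be used (the clone() call is assumed from the description above; the real coverage lives in test/distributed/test_store.py):

```python
import torch.distributed as dist

store = dist.TCPStore("localhost", 29500, world_size=1, is_master=True)
store.set("key", "value")

clone = store.clone()    # independent handle, safe to hand to another thread
print(clone.get("key"))  # b'value'
```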

Related issue: https://github.com/pytorch/pytorch/issues/150943

Test plan:

```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150966
Approved by: https://github.com/fduwjj
2025-04-10 01:41:50 +00:00
061832bc7a Gracefully handle optree less than minimum version (#150956)
Summary:
- We are saying the minimum version of optree that PyTorch can use is
  0.13.0.
- If a user imports torch.utils._cxx_pytree, it will raise an
  ImportError if optree doesn't exist or is older than the
  minimum version.

Fixes https://github.com/pytorch/pytorch/issues/150889. There are
actually two parts to that issue:
1. dtensor imports torch.utils._cxx_pytree, but the optree installed in
   the environment might be too old. Instead, raising ImportError in
   torch.utils._cxx_pytree solves the issue.
2. We emit an "optree too low version" warning. I've deleted the
   warning in favor of the more explicit ImportError (see the sketch below).
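A hedged sketch of how downstream code might cope with the new ImportError (hypothetical fallback, not from this PR):

```python
try:
    import torch.utils._cxx_pytree as pytree  # raises ImportError if optree is missing or too old
except ImportError:
    import torch.utils._pytree as pytree      # pure-Python fallback
```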

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150956
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/XuehaiPan
2025-04-10 01:22:50 +00:00
9d1528186f Fix static functions when using module in MSVC (#148675)
If you try to use torch in C++ using modules, it will not compile because static functions are not supported in MSVC when using modules: https://developercommunity.visualstudio.com/t/10323558.

It's also aligned with [C++20 standard](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/n4849.pdf) (ISO/IEC 14882:2020) 10.2.7 Export declaration [module.interface]: "Exported names have either external linkage or no linkage".

Fixes https://github.com/pytorch/pytorch/issues/71309
Tested using the following code.

```c++
export module testModule;

import <torch/torch.h>;
import <memory>;
import <string>;
import <tuple>;
import <iostream>;

export namespace testModule
{

    export void test()
    {
        torch::Tensor tensor1 = torch::rand({ 2, 3 });
        torch::Tensor tensor2 = torch::rand({ 3, 2 });
        // Perform tensor multiplication
        torch::Tensor result = torch::matmul(tensor1, tensor2);

        // Print the tensors
        std::cout << "Tensor 1: " << tensor1 << std::endl;
        std::cout << "Tensor 2: " << tensor2 << std::endl;
        std::cout << "Result of multiplication: " << result << std::endl;
    }
}
```

```c++
import testModule;

int main()
{
	testModule::test();
	return 0;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148675
Approved by: https://github.com/albanD, https://github.com/malfet

Co-authored-by: mantaionut <ionut@janeasystems.com>
2025-04-10 01:19:54 +00:00
69cee91a55 Code Clean: Using the new builtin function provides by python 3.8 later (#150839)
Changes:
- reversed
- math.perm
- inspect.getfile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150839
Approved by: https://github.com/Skylion007
2025-04-10 01:17:39 +00:00
f3cf3ec591 [AOTInductor] Add User Managed buffer for AOTI constant buffer. (#150276)
Summary:
We add the functionality to allow users to directly pass an at::Tensor
into AOTInductor to be used as a constant.
This user-managed buffer skips the copying step in AOTInductor and lets
users directly manage the memory usage themselves.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/data/users/$USER/pytorch/build/bin/test_aoti_inference

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D72589514](https://our.internmc.facebook.com/intern/diff/D72589514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2025-04-10 00:15:44 +00:00
92e81cf41a Add real_tensor to the FakeTensor in node.meta["val"] (#150948)
Summary: We need real_tensor on the FakeTensor in node.meta["val"] in order to aot_compile the draft exported programs. Otherwise, we cannot propagate real tensors even when fake_mode.propagate_real_tensors = True.

This also fixes real tensor propagation in `run_decomposition()`.

Test Plan:
```
 buck2 run @mode/dev-nosan  caffe2/test:test_export -- -r test_dedup_data_dependent_failure
```

Differential Revision: D72732714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150948
Approved by: https://github.com/angelayi
2025-04-10 00:11:46 +00:00
91d1826539 Add dynamic version for mm_loop benchmark (#150865)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150865
Approved by: https://github.com/eellison
2025-04-09 23:37:43 +00:00
a8b48ff14c [DTensor] clean up _local_shard_size_and_offset (#150650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150650
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #150490
2025-04-09 22:07:48 +00:00
3532dd4f1e [DTensor] StridedShard support uneven sharding (#150490)
This enables using FSDP+TP on parameters with dimensions that aren't
evenly divisible by the DP/TP mesh sizes.

- this may not support all possible combinations of strided shardings
  and shardings, but the support before this PR is not complete anyway

This contains several fixes for different aspects of DTensor behavior
relating to uneven strided sharding:
- original creation of the strided tensor requires fixes in
  StridedShard._split_tensor
- full_tensor() reconstruction requires fixes in
  StridedShard._to_replicate_tensor to correctly reshuffle the data into
  the original pre-sharded order
- Distributed Checkpointing support requires correct computation of the
  compute_local_shape_and_global_offset util so it knows how a local
  shard maps to the global tensor, for reconstruction during
  load/reshard.

This PR also adds a util `_explicit_order_placements` which converts a list of
placements with StridedSharding into a list of placements with only
regular sharding, with the order shuffled such that it is equivalent.

Builds on and completes the work started in https://github.com/pytorch/pytorch/pull/148894

Uneven Sharding Example
-------
(copied from _StridedShard._to_replicate_tensor docstring)

mesh = (DP=2, TP=2)
original = torch.arange(5)

**Applying Sharding**

Step 1 - Apply TP sharding
`tp = distribute_tensor(x, world_mesh['tp'], [Shard(0)])`

local_tensors:
rank0: [0,1,2]    rank1: [3,4]
rank2: [0,1,2]    rank3: [3,4]

Step 2 - Apply FSDP sharding
`dp_tp = ...` (the process of creating a strided-shard tensor is skipped over as it is hacky and complicated)
dp_tp has placement (_StridedShard(0, split_factor=2), Shard(0))
local_tensors:
rank0: [0,1]  rank1: [3]
rank2: [2]    rank3: [4]

**Reconstructing the Full Tensor**
Now, say someone wants to reconstruct dp_tp's full tensor. This will invoke 'redistribute' to replicate.
redistribute will first replicate the "Shard(0)" placement on the rightmost mesh dim, then replicate the
StridedShard placement second, which is implemented by this function.
So our starting point (`local_tensor` arg) is the result of replicating the Shard(0) placement across the
TP dim, which looks like this.

Note the discrepancy with the 'tp sharded tensor' line above!  We'll fix it by locally shuffling data.

local_tensors:
rank0: [0,1,3]  rank1: [0,1,3]
rank2: [2,4]    rank3: [2,4]

Step 1: replicate over the DP dimension.  Afterwards, each rank can locally sort the values.
  note: we need padding to do this allgather, and we'll need to keep track of the padding amount for later
	local_tensors:
rank0: [0,1,3,2,4]    rank1: [0,1,3,2,4]
rank2: [0,1,3,2,4]    rank3: [0,1,3,2,4]

Step 2: chunk and shuffle values around to account for the wrong order of operations above
and get the original tensor content back

01324#       <- our allgather includes padding, if padding was applied in step 1
01324        <- Remove the padding
013, 24      <- chunk once, 'undoing' the DP allgather
01, 3, 2, 4  <- chunk each chunk, 'undoing' the initial (wrong) TP allgather performed by Shard(0)->Replicate()
012, 34      <- interleave with stride=TP mesh dim size
01234        <- concatenate

Co-authored-by: Luca Wehrstedt <lw@meta.com>
Co-authored-by: Will Constable <whc@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2025-04-09 22:07:48 +00:00
cc2decdb25 [CI][CUDA][Distributed]Update test_composability.py (#148578)
world_size = int(os.getenv("WORLD_SIZE", 4)) in subsequent lines indicates that the tests in this file do not just require more than 1 GPU, but at least 4 GPUs. skip_if_lt_x_gpu(4) does not properly skip this on a platform with 2 GPUs.

skip_if_lt_x_gpu being broken is potentially related to a similar issue: https://github.com/pytorch/pytorch/issues/146094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148578
Approved by: https://github.com/atalman
2025-04-09 21:57:05 +00:00
786422a4d7 Remove a workaround added in #149381 (#150693)
Remove a workaround added in https://github.com/pytorch/pytorch/pull/149381.

Fixes https://github.com/pytorch/xla/issues/8934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150693
Approved by: https://github.com/albanD
2025-04-09 21:48:03 +00:00
087e8587cd support backed_size_oblivious in guard_or_false/guard_or_true (#150231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150231
Approved by: https://github.com/pianpwk
2025-04-09 21:47:20 +00:00
31fe258efc [inductor] Add features to docstring_linter (see #142496) (#145834)
## Improvements to `docstring_linter`

* Add a "grandfather list" of existing undocumented classes and functions (`--grandfather`, `--grandfather-tolerance`, `--no-grandfather`, `--write-grandfather`)
* In classes, now just one of the class itself or its `__init__()` method needs to be documented (`--lint-init` turns the old behavior back on)
* Now classes and functions defined local to other functions do not need to be documented (`--lint-local` turns the old behavior back on)
* New `--report` flag produces a compact report of long, undocumented classes or function definitions: see attached example run over all pytorch: [pytorch-docs.json](https://github.com/user-attachments/files/18455981/pytorch-docs.json)

## Help text

```
$ python tools/linter/adapters/docstring_linter.py --help
usage: docstring_linter.py [-h] [-l] [-v] [--grandfather GRANDFATHER] [--grandfather-tolerance GRANDFATHER_TOLERANCE] [--lint-init]
                           [--lint-local] [--lint-protected] [--max-class MAX_CLASS] [--max-def MAX_DEF]
                           [--min-docstring MIN_DOCSTRING] [--no-grandfather] [--report] [--write-grandfather]
                           [files ...]

`docstring_linter` reports on long functions, methods or classes without docstrings

positional arguments:
  files                 A list of files or directories to lint

optional arguments:
  -h, --help            show this help message and exit
  -l, --lintrunner      Run for lintrunner and print LintMessages which aren't edits
  -v, --verbose         Print more debug info
  --grandfather GRANDFATHER, -g GRANDFATHER
                        Set the grandfather list
  --grandfather-tolerance GRANDFATHER_TOLERANCE, -t GRANDFATHER_TOLERANCE
                        Tolerance for grandfather sizes, in percent
  --lint-init, -i       Lint __init__ and class separately
  --lint-local, -o      Lint definitions inside other functions
  --lint-protected, -p  Lint functions, methods and classes that start with _
  --max-class MAX_CLASS, -c MAX_CLASS
                        Maximum number of lines for an undocumented class
  --max-def MAX_DEF, -d MAX_DEF
                        Maximum number of lines for an undocumented function
  --min-docstring MIN_DOCSTRING, -s MIN_DOCSTRING
                        Minimum number of characters for a docstring
  --no-grandfather, -n  Disable the grandfather list
  --report, -r          Print a report on all classes and defs
  --write-grandfather, -w
                        Rewrite the grandfather list
```

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145834
Approved by: https://github.com/amjames, https://github.com/eellison
2025-04-09 21:38:36 +00:00
357814c85c [AOTI] Remove typedef for half and bfloat16 (#150657)
Summary: typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the libtorch-free codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150657
Approved by: https://github.com/malfet
2025-04-09 21:21:17 +00:00
d751698a36 Support negative values for fill with uint tensors (#144458)
Fixes https://github.com/pytorch/pytorch/issues/144188
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144458
Approved by: https://github.com/amjames, https://github.com/eellison
2025-04-09 21:08:06 +00:00
860765d621 update benchamark result due to <1% regression (#150937)
<img width="1503" alt="Screenshot 2025-04-09 at 9 07 13 AM" src="https://github.com/user-attachments/assets/e16f31b0-c5dc-4dd6-8adb-aac11ed988db" />

PR https://hud.pytorch.org/pr/148104
which is acceptable, but we have to update this to avoid flakiness in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150937
Approved by: https://github.com/zou3519
2025-04-09 20:25:48 +00:00
2b9d8a5633 Fix -Wmissing-braces in a few files (#150802)
Test Plan: Sandcastle

Reviewed By: wenxin0319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150802
Approved by: https://github.com/Skylion007
2025-04-09 20:15:34 +00:00
ea0cbba1fc [export] Refine draft-export CVE with Dim.AUTO (#150876)
Instead of using refine_dynamic_shapes_from_suggested_fixes to fix ConstraintViolationErrors in draft-export, we can just convert the dims to Dim.AUTO, which is less error-prone.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150876
Approved by: https://github.com/pianpwk
2025-04-09 19:44:30 +00:00
f237ee54bf ProcessGroupGloo: support lazy_init (#150801)
This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)`

This is still a draft PR as there's one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first. Depends on https://github.com/facebookincubator/gloo/pull/427 landing first

This also updates the gloo submodule to include the required changes.
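A hedged sketch of opting in via the environment variable named above (single-process setup purely for illustration):

```python
import os
import torch.distributed as dist

os.environ["TORCH_GLOO_LAZY_INIT"] = "1"  # enable lazy device/context initialization
dist.init_process_group(backend="gloo",
                        init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)
dist.destroy_process_group()
```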

Test plan:

added lazy init test variants

```
pytest -v test/distributed/test_c10d_gloo.py -k Lazy
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150801
Approved by: https://github.com/fduwjj
2025-04-09 19:29:50 +00:00
a4545f09da [Codemod][AddExplicitStrictExportForTrainingInferenceArg] caffe2/test/export (#150884)
Differential Revision: D72667175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150884
Approved by: https://github.com/ydwu4
2025-04-09 19:18:33 +00:00
cfab04d01b Fix aten.div type promotion for FakeTensor (#150874)
Summary:
When we divide a FakeTensor by an integer using the fast op implementation, the type promotion should be `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` so we get a float when dividing an int FakeTensor by an integer.

```
FAST = get_fast_op_impls()
fast_div = FAST[torch.ops.aten.div.Tensor]
fast_div(fake_tensor, some_int)
```

Test Plan:
```
python test/test_fake_tensor.py -k test_fast_div
```

Differential Revision: D72667430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150874
Approved by: https://github.com/angelayi
2025-04-09 18:52:01 +00:00
d3a2872c67 Hipify global scrach defintion in AOTI codegen (#150893)
Summary: As titled. A refactor is much needed, I think, or at least we should unify the internal/external AOTI wrapper hipification methods.

Test Plan: P1780296121

Differential Revision: D72683568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150893
Approved by: https://github.com/davidberard98
2025-04-09 18:35:36 +00:00
d04a6ec021 add reduce_scatter to symm mem ops (#150813)
+ a few small fixes (don't error out on 0-element tensors, a few more checks for contiguous outputs, more threads for better perf).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150813
Approved by: https://github.com/xw285cornell
2025-04-09 17:59:17 +00:00
cc185c32e0 [aoti] Use generate_fake_kernels_from_real_mismatches config for draft exported programs (#150651)
Summary:
Sometimes we get `MetadataMismatchError` in aoti compilation because draft export uses the flag below to infer the fake kernel when there’s a mismatch, but aoti doesn’t have this flag turned on.

https://fburl.com/code/9qzytl6q
 torch._functorch.config.generate_fake_kernels_from_real_mismatches

If we set this flag to True, then aoti compilation would work.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts
```

Differential Revision: D72345085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150651
Approved by: https://github.com/angelayi
2025-04-09 17:28:29 +00:00
6fb089f2a2 [AO] fix per token block size calculation (#150890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150890
Approved by: https://github.com/jerryzh168
2025-04-09 17:07:31 +00:00
c59aaa03ff [DTensor] add _explicit_order_placements util (#150493)
The util converts a list of placements in the traditional DTensor format
(e.g. [_StridedShard(0), Shard(0)], where list position is mesh_dim and sharding
is always applied left-to-right (from dim 0 to higher dims))

to a more explicitly ordered format, also replacing '_StridedShard' with
simple 'Shard' placements in the process.
(e.g. the above becomes [(1, Shard(0)), (0, Shard(0))], where the first
item in the tuple is the mesh_dim and the ordering of the tuples is the
sharding order).

So far this is useful as a helper for fixing the local shape computation for
strided sharding in the uneven-shape case in the following PR, but it may
also be useful more broadly if we can use explicit orderings to simplify
other parts of DTensor logic.

This skips implementing some combinations of _StridedSharding that are
not currently used in the wild today, but could be supported easily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150493
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2025-04-09 16:55:24 +00:00
01568cb17a Revert "Refactor layout constraint selection logic (#148104)"
This reverts commit 2e7c9d33e7f933ac3b723cb3bb05b9c88432c25c.

Reverted https://github.com/pytorch/pytorch/pull/148104 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/14357056427/job/40251630946) [HUD commit link](2e7c9d33e7) ([comment](https://github.com/pytorch/pytorch/pull/148104#issuecomment-2790369493))
2025-04-09 16:49:48 +00:00
a0e796df03 Revert "Inductor respects exact strides on custom ops by default (#150511)"
This reverts commit a4bb2f106f8cc642539d4698b6d869a87adca92f.

Reverted https://github.com/pytorch/pytorch/pull/150511 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/14357056427/job/40251630946) [HUD commit link](2e7c9d33e7) ([comment](https://github.com/pytorch/pytorch/pull/148104#issuecomment-2790369493))
2025-04-09 16:49:48 +00:00
a4bb2f106f Inductor respects exact strides on custom ops by default (#150511)
If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.
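A hedged sketch of the affected case: a custom op registered without any layout-related tag, which inductor now treats as needing exact input strides (the decorator below is the standard torch.library API, not code from this PR):

```python
import torch

@torch.library.custom_op("mylib::clone_op", mutates_args=())
def clone_op(x: torch.Tensor) -> torch.Tensor:
    # No tag specified, so inductor assumes exact strides are required.
    return x.clone()

@clone_op.register_fake
def _(x):
    return torch.empty_like(x)
```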

Test Plan:
- tests + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #150495, #148104
2025-04-09 16:46:48 +00:00
c714d2fc0e [hop] support base_hop._gen_schema (#149688)
This PR creates two utils for generating a schema for hops from example inputs, using base hop as an example.
1. HopArgumentInfoGen creates an argument or an output schema with mutation information.
2. CFuncitonSchemaGen pieces together the argument info of inputs and outputs and produces a torch._C.FunctionSchema.

The is_write attribute of the argument info can be computed. Note that the is_write annotation only works when the inputs are flattened (e.g., it cannot support mutation inside a tuple). We need special handling for the case where we have tuple inputs, like cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149688
Approved by: https://github.com/zou3519
2025-04-09 16:42:55 +00:00
72755a4b7a Avoid circular imports in tracing_state_functions (#150325)
tracing_state_functions references some torch functions from submodules like `torch.onnx.is_in_onnx_export` that could trigger module initialization & circular imports. I turned the mapping into a function so that the dictionary is not initialized at torch import.

(discovered in https://github.com/pytorch/pytorch/pull/149646)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150325
Approved by: https://github.com/zou3519
2025-04-09 16:32:11 +00:00
8aaf296efc [c10d][fr] Refactor analysis script for modularization and reusing for coalesce collectives (#150881)
Trying to make the FR analysis code more reusable and modular, we split the core error-analysis logic into separate functions.

This PR mostly shuffles the code around a bit.

Differential Revision: [D72690120](https://our.internmc.facebook.com/intern/diff/D72690120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150881
Approved by: https://github.com/wz337
2025-04-09 16:10:19 +00:00
c8d37b9c85 [ez][c10d] Disable start event recording for coalesced col and improve profile title (#150863)
While looking at enabling FR analysis for coalesced collectives, I found that for slow-path coalescing (collectives which are not all-gather, all-reduce or reduce-scatter), we still record a start event. This is wrong and we should do the same thing as for endEvent recording.

I also made the profiler title more visible when we pass in the opType for coalesced all-gather and reduce-scatter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150863
Approved by: https://github.com/eqy, https://github.com/d4l3k, https://github.com/kwen2501
2025-04-09 16:09:56 +00:00
1a56609e75 [ONNX] Supporting different opset versions for torchlib registry (#149901)
- Allows opset_version to determine which onnx decomposition to choose
- Adds a cleanup function to modify the registry after it is built

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149901
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2025-04-09 16:03:46 +00:00
97a5e5c6b3 Added _fused_sdp_choice_stub dispatcher support for HPU device (#149512)
Currently, the HPU device has no support for the _fused_sdp_choice_stub dispatcher function, so `scaled_dot_product_attention` selects the `MATH` backend by default via `_fused_sdp_choice_stub` on HPU. With this PR we enable support for the `_fused_sdp_choice_stub` dispatcher function, so that any backend (for example math, flash_attention, efficient_attention, cudnn_attention, overrideable) can be invoked according to the user's choice on the HPU device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149512
Approved by: https://github.com/drisspg
2025-04-09 15:48:09 +00:00
d0e3482266 Update triton wheel build, setuptools pin (#150931)
Observing failure in release workflow:
https://github.com/pytorch/pytorch/actions/runs/14346340202/job/40216804374

```
Traceback (most recent call last):
  File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 11, in <module>
    from setuptools.command.bdist_wheel import bdist_wheel as bdist_wheel
ModuleNotFoundError: No module named 'setuptools.command.bdist_wheel'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/tmppwpqef_x/triton/python/setup.py", line 27, in <module>
    from wheel.bdist_wheel import bdist_wheel
  File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 13, in <module>
    raise ImportError(ERROR) from exc
ImportError: The 'wheel.bdist_wheel' module has been removed.
Please update your setuptools to v70.1 or later.
If you're explicitly importing 'wheel.bdist_wheel', please update your import to point to 'setuptools.command.bdist_wheel' instead.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150931
Approved by: https://github.com/Skylion007
2025-04-09 15:26:07 +00:00
5a422150c3 Add torch.triu_indices, torch.tril_indices dtype description (#150749)
Fixes #150675

## Test Result

![image](https://github.com/user-attachments/assets/f30a0de0-6475-4d07-b441-15fffd453ba1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150749
Approved by: https://github.com/bdhirsh
2025-04-09 15:03:24 +00:00
246f3b6530 [Quant][PT2E][X86] enable qconv1d-relu fusion (#150751)
**Summary**
As the title.
- The `conv1d - relu` pattern will be annotated by the `X86InductorQuantizer`.
- The pattern will be fused as `qconv_pointwise` during lowering.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv1d_relu_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150751
Approved by: https://github.com/jerryzh168, https://github.com/leslie-fang-intel
2025-04-09 14:42:02 +00:00
2299087220 [ROCm] Introduce AMD specific inductor gemm tuning (#147315)
Replaces https://github.com/pytorch/pytorch/pull/143286

Adds ROCm-specific MM configs for max-autotune, incorporating ROCm-specific Triton tuning kernargs such as waves_per_eu, kpack, and matrix_instr_nonkdim. This PR also introduces behavior to allow tuning GROUP_M in the Triton GEMM case.

Dynamo huggingface inference benchmarks:
`TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON" python huggingface.py --performance --inference --bfloat16 --backend=inductor`

GEOMEAN speedup (before): | 1.35x
GEOMEAN speedup (after): | 1.42x

name | Eager - abs latency | old - abs_latency | old - speedup | new - abs_latency | new - speedup
-- | -- | -- | -- | -- | --
AlbertForMaskedLM | 26.22 | 26.52 | 98.86% | 24.58 | 106.67%
AlbertForQuestionAnswering | 25.96 | 26.40 | 98.33% | 24.10 | 107.73%
AllenaiLongformerBase | 21.03 | 10.65 | 197.50% | 10.49 | 200.58%
BartForCausalLM | 7.77 | 9.76 | 79.63% | 8.79 | 88.46%
BartForConditionalGeneration | 14.44 | 12.86 | 112.26% | 11.96 | 120.70%
BertForMaskedLM | 8.10 | 8.82 | 91.89% | 8.57 | 94.53%
BertForQuestionAnswering | 6.82 | 7.32 | 93.20% | 7.10 | 96.18%
BlenderbotForCausalLM | 10.97 | 11.39 | 96.34% | 10.10 | 108.65%
BlenderbotSmallForCausalLM | 5.91 | 5.44 | 108.72% | 4.82 | 122.67%
BlenderbotSmallForConditionalGeneration | 12.64 | 9.65 | 130.94% | 9.11 | 138.83%
CamemBert | 8.35 | 9.15 | 91.24% | 8.86 | 94.27%
DebertaForMaskedLM | 10.92 | 6.09 | 179.44% | 5.90 | 185.05%
DebertaForQuestionAnswering | 14.29 | 7.70 | 185.59% | 7.26 | 196.75%
DebertaV2ForMaskedLM | 15.47 | 10.22 | 151.32% | 9.34 | 165.55%
DebertaV2ForQuestionAnswering | 14.98 | 6.11 | 245.28% | 6.28 | 238.40%
DistilBertForMaskedLM | 8.37 | 8.70 | 96.30% | 8.22 | 101.92%
DistilBertForQuestionAnswering | 10.21 | 10.54 | 96.88% | 10.39 | 98.36%
DistillGPT2 | 8.77 | 6.78 | 129.40% | 6.31 | 138.88%
ElectraForCausalLM | 10.32 | 4.70 | 219.45% | 4.60 | 224.29%
ElectraForQuestionAnswering | 11.48 | 5.62 | 204.20% | 5.44 | 210.95%
GPT2ForSequenceClassification | 6.21 | 5.72 | 108.50% | 5.58 | 111.26%
GoogleFnet | 26.51 | 20.81 | 127.37% | 19.91 | 133.11%
LayoutLMForMaskedLM | 12.09 | 7.99 | 151.28% | 7.66 | 157.80%
LayoutLMForSequenceClassification | 10.62 | 6.49 | 163.67% | 6.25 | 169.95%
M2M100ForConditionalGeneration | 14.98 | 10.20 | 146.79% | 9.89 | 151.42%
MBartForCausalLM | 7.67 | 9.78 | 78.44% | 8.87 | 86.55%
MBartForConditionalGeneration | 13.45 | 12.69 | 105.99% | 12.03 | 111.82%
MT5ForConditionalGeneration | 19.96 | 5.32 | 375.37% | 5.08 | 393.01%
MegatronBertForCausalLM | 13.22 | 7.86 | 168.07% | 7.18 | 184.01%
MegatronBertForQuestionAnswering | 15.62 | 11.81 | 132.21% | 11.02 | 141.68%
MobileBertForMaskedLM | 26.63 | 10.82 | 245.99% | 11.95 | 222.73%
MobileBertForQuestionAnswering | 23.53 | 7.55 | 311.51% | 9.53 | 247.03%
OPTForCausalLM | 7.33 | 7.64 | 95.93% | 7.56 | 96.90%
PLBartForCausalLM | 8.73 | 7.63 | 114.40% | 7.37 | 118.58%
PLBartForConditionalGeneration | 10.46 | 8.50 | 122.98% | 8.16 | 128.13%
PegasusForCausalLM | 7.18 | 7.37 | 97.42% | 6.64 | 108.22%
PegasusForConditionalGeneration | 16.47 | 16.66 | 98.87% | 14.18 | 116.13%
RobertaForCausalLM | 10.30 | 9.95 | 103.52% | 9.52 | 108.25%
RobertaForQuestionAnswering | 6.37 | 7.13 | 89.28% | 6.79 | 93.87%
T5ForConditionalGeneration | 12.40 | 6.72 | 184.51% | 6.48 | 191.16%
T5Small | 12.02 | 6.66 | 180.55% | 6.32 | 190.33%
TrOCRForCausalLM | 14.12 | 13.31 | 106.11% | 12.45 | 113.41%
XGLMForCausalLM | 16.48 | 6.23 | 264.52% | 6.35 | 259.51%
XLNetLMHeadModel | 74.87 | 62.23 | 120.32% | 57.95 | 129.19%
YituTechConvBert | 20.21 | 10.50 | 192.48% | 9.97 | 202.72%

We are also seeing a ~9% improvement on an internal addmm benchmark.

This PR will also slightly reduce compilation time for AMD max-autotune: before this change we assessed every config with matrix_instr_nonkdim in [0, 16], but with this update we remove that and use 16 for all configs.

There is currently no CI to test the max-autotune perf, but it will be enabled via https://github.com/pytorch/pytorch/pull/148672, after which we can investigate more tuning updates and config pruning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147315
Approved by: https://github.com/jansel, https://github.com/eellison
2025-04-09 14:34:30 +00:00
886d9acb0d [docs] Add 32-bit complex to the list of dtypes (#144590)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144590
Approved by: https://github.com/janeyx99
2025-04-09 13:10:21 +00:00
64ac41f68d [pytorch] add header docs for TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT (#150854)
Summary: Add header docs for the experimental TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT feature, and guard it behind C10_MOBILE.

Reviewed By: albanD

Differential Revision: D72572345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150854
Approved by: https://github.com/larryliu0820, https://github.com/zou3519
2025-04-09 12:59:24 +00:00
cyy
142f0f86ce Enable modernize-use-default-member-init (#149046)
``modernize-use-default-member-init`` prefers initialising class members at their declaration, which makes more ``= default`` constructors possible. Some violations of modernize rules have also been fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149046
Approved by: https://github.com/zou3519
2025-04-09 11:57:24 +00:00
81f60f3880 Expand allowed_getattr_types_for_subgm to torch.Tensor (#150867)
Summary: as titled.

A regular weight has the type torch.nn.parameter.Parameter; a buffer or tensor constant has the type torch.Tensor.

Both types are valid.
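
For illustration (not part of the PR), a minimal check showing why an isinstance-based allowlist accepts both: nn.Parameter is itself a subclass of torch.Tensor, so extending the allowed types to torch.Tensor also covers buffers and tensor constants. The names below are just for the example.

```python
import torch

weight = torch.nn.Parameter(torch.randn(2, 2))  # regular weight
buffer = torch.randn(2, 2)                      # buffer / tensor constant

allowed_getattr_types = (torch.nn.Parameter, torch.Tensor)
for name, obj in [("weight", weight), ("buffer", buffer)]:
    print(name, isinstance(obj, allowed_getattr_types))  # both print True
```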

Test Plan: CI

Differential Revision: D72657275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150867
Approved by: https://github.com/zhxchen17
2025-04-09 11:01:45 +00:00
604467de20 Code Clean: Remove specific bytecode support in dynamo for python3.8 (#150838)
Related bytecodes:
- CALL_FINALLY
- END_FINALLY
- POP_FINALLY

The bytecodes above were removed in Python 3.9; refer to [this](53908bd790/Misc/NEWS.d/3.9.0a2.rst) for more info.
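
For illustration (not part of the PR), a quick standard-library check that these opcodes no longer exist on Python 3.9+:

```python
import dis
import sys

removed = ["CALL_FINALLY", "END_FINALLY", "POP_FINALLY"]
still_present = [name for name in removed if name in dis.opmap]
print(sys.version.split()[0], still_present)  # on 3.9+ the list is empty
```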

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150838
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #150834
2025-04-09 07:16:52 +00:00
b01877aa13 Fix addbmm & addmv & baddbmm out dtype check (#148176)
----

- torch.addbmm
- torch.addmv
- torch.baddbmm

Related issue:
https://github.com/pytorch/pytorch/issues/138399
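
For illustration (not part of the PR), a minimal sketch of the kind of out-dtype mismatch these checks target; the exact error message may differ across versions.

```python
import torch

M = torch.randn(3, 5)
b1 = torch.randn(10, 3, 4)
b2 = torch.randn(10, 4, 5)
out = torch.empty(3, 5, dtype=torch.int64)  # wrong dtype for a float32 result

try:
    torch.addbmm(M, b1, b2, out=out)  # should be rejected by the out-dtype check
except RuntimeError as e:
    print("out dtype check:", e)
```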
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148176
Approved by: https://github.com/jansel
ghstack dependencies: #148174
2025-04-09 07:02:56 +00:00
4d6ff6ca5c Fill config2launcher with correct launchers during cache hit coordinate descent (#150860)
This bug was crazy hard to reproduce, so I can't seem to get a unit test written that isn't the internal one I used for debugging.

Here's a short TLDR of the bug:

- Due to D71983456(OSS: https://github.com/pytorch/pytorch/pull/149910), we cache CachingAutotuners in memory.
- Importantly: **Saving stuff in PyCodeCache in memory is not semantically equivalent to writing to disk**. By saving it in memory, CachingAutotuners do not reset global state.
- It's possible through recompiles for different dynamo frames to compile down to exactly the same inductor output code. This involves models that run multiple times but differ very subtly, or in ways that cause a dynamo guard failure but not a different inductor output code.
- Because of this, we reuse CachingAutotuners for a second compile (with different example inputs, just the same triton kernel code)
- CachingAutotuners have a Coordinate Descent class on them, which has a cache: https://fburl.com/code/4igrsams (OSS: aafc4b6188/torch/_inductor/runtime/coordinate_descent_tuner.py (L69))
- Because we are caching these in memory and not on disk, this cache is **not cleared** between runs.
- However, this variable is *not* saved on the class, and is reinitialized every time we do autotuning: https://fburl.com/code/n2o8tmje
(OSS: aafc4b6188/torch/_inductor/runtime/triton_heuristics.py (L933))
- `config2launcher` is populated when we call `benchmark_one_config`, but on a CoorDesc *cache hit*, we never call `benchmark_one_config`! So we end up returning None, and erroring with:

```
AttributeError: 'NoneType' object has no attribute 'store_cubin'
```

This fixes the problem for now by just recompiling the launcher. Technically, we might be able to save config2launcher on the class to avoid this, but I don't want to risk another weird cache safety bug here, so taking the simpler approach for now.
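
For illustration, a simplified, self-contained model of this failure mode (hypothetical names, not Inductor's real classes): the coordinate-descent cache lives on the long-lived object, while `config2launcher` is rebuilt per call and only populated when benchmarking actually runs, so a cache hit returns None.

```python
class Autotuner:
    def __init__(self):
        self.coordesc_cache = {}  # survives as long as the in-memory autotuner

    def benchmark_one_config(self, cfg):
        return f"launcher-for-{cfg}"

    def coordinate_descent_tuning(self, cfg):
        config2launcher = {}  # reinitialized on every call
        if cfg not in self.coordesc_cache:  # cache miss: benchmark and record the launcher
            config2launcher[cfg] = self.benchmark_one_config(cfg)
            self.coordesc_cache[cfg] = cfg
        # On a cache hit benchmark_one_config is never called, so the lookup returns None.
        return config2launcher.get(cfg)

tuner = Autotuner()
print(tuner.coordinate_descent_tuning("cfg0"))  # first compile: launcher-for-cfg0
print(tuner.coordinate_descent_tuning("cfg0"))  # reused in-memory autotuner: None
```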

Note that this error only reproduces if:
- None of AOTAutogradCache, FXgraphCache hit on the second entry: otherwise, the CachingAutotuner will go through a pickling and then not be saved in memory
- We haven't spawned parallel compile workers. If there are parallel compile workers, we pickle the autotuner on the way from the worker to the parent process, once again resetting the Autotuner.
- The autotune cache doesn't already have the best config stored in it

So it was extraordinarily hard to debug/reproduce. Because of this, I have a complicated internal unit test but no OSS test that can trigger the exact problem. I'll work on a separate test later, but this needs to go in to fix a sev, so we're landing it based on an internal test only.

Differential Revision: [D72655382](https://our.internmc.facebook.com/intern/diff/D72655382/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72655382/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150860
Approved by: https://github.com/oulgen
2025-04-09 04:39:37 +00:00
bc47d539fc [MPS] Support ArgumentBuffer bindings from C++/Python (#150780)
To work around the 32-argument-per-kernel limitation and to eventually be able to compile something like
```python
import torch

def foo(*args):
  rc = torch.zeros_like(args[0])  # zeros, not empty_like: empty_like returns uninitialized memory, so accumulating into it is undefined
  for arg in args:
      rc += arg
  return rc

tensors = torch.rand(100, 32, device='mps').unbind(0)
print(torch.compile(foo)(*tensors))
```

For now, introduce `at::native::metal::get_tensor_gpu_address` and use it from both the C++ test and compile_shader to convert a list of tensors into a list of pointers valid on the GPU.

Initially this binding was done via `id<MTLArgumentEncoder>`, but according to the [Improving CPU Performance by Using Argument Buffers](https://developer.apple.com/documentation/metal/improving-cpu-performance-by-using-argument-buffers?language=objc#Encode-Resources-into-Argument-Buffers) article, this is not necessary when targeting Tier2-only devices (which is true of all devices on macOS 13 or newer):
> To directly encode the argument buffer resources on these Tier 2 devices, write the [MTLBuffer](https://developer.apple.com/documentation/metal/mtlbuffer?language=objc).[gpuAddress](https://developer.apple.com/documentation/metal/mtlbuffer/gpuaddress?language=objc) property — and for other resource types (samplers, textures, and acceleration structures), the [gpuResourceID](https://developer.apple.com/documentation/metal/mtlcomputepipelinestate/gpuresourceid?language=objc) property — into the corresponding structure member. To encode offsets, treat these property values as uint64 types and add the offset to them.

Add both C++ and Python unit tests that validate that this works.
Please note that using either the ArgumentEncoder or directly encoding the data does not guarantee that the buffer will not be freed before shader execution is complete. On the other hand, this should already be guaranteed by the MPSCachingAllocator, which only frees the memory after all streams have completed execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150780
Approved by: https://github.com/dcci
2025-04-09 04:24:37 +00:00
2e7c9d33e7 Refactor layout constraint selection logic (#148104)
This PR:

- cleans up some existing comments that don't make sense anymore
- hooks the "custom_op_default_layout_constraint" back up (it seems to have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides

Test Plan:
- tests + CI

disable padding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #150495
2025-04-09 02:09:18 +00:00
44deb67830 Fix _del_library (#150495)
On library deletion, we need to clear fx's schema cache.

Test Plan:
- top PR in the stack, I don't have a good test case for this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150495
Approved by: https://github.com/eellison
2025-04-09 02:09:18 +00:00
5f18b7d877 [docs] remove --recursive flag from readme (#150785)
Fixes #150745

See https://github.com/pytorch/pytorch/issues/150745#issuecomment-2784216663

Cloning with `--recursive` as shown in the docs prevents users from checking out commits from before NCCL was removed as a submodule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150785
Approved by: https://github.com/atalman
2025-04-09 02:07:48 +00:00
d9f47c75de Revert "Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690)"
This reverts commit 91173ff89aab5f632d483c736d11d5dcf60decac.

Reverted https://github.com/pytorch/pytorch/pull/150690 on behalf of https://github.com/atalman due to failing internal test ([comment](https://github.com/pytorch/pytorch/pull/150690#issuecomment-2787905966))
2025-04-09 00:06:32 +00:00
27ded359a5 Fix inplacing with multiple, fused uses (#150845)
We had `can_inplace` defined on a single use. When that buffer has multiple uses inside a fused node, we need to check whether the other accesses use the same index. Otherwise, with inplacing, we may read memory that has already been overwritten.
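
For illustration (not part of the PR), a hedged sketch of the shape of that check using a hypothetical helper, not Inductor's real API: inplacing is allowed only when every access of the buffer inside the fused node uses the same index expression.

```python
def can_inplace(buffer_name, accesses):
    """accesses: list of (buffer_name, index_expr) pairs for one fused node."""
    indices = {idx for name, idx in accesses if name == buffer_name}
    return len(indices) <= 1

accesses = [("buf0", "i0"), ("buf0", "i0 + 1"), ("buf1", "i0")]
print(can_inplace("buf0", accesses))  # False: a read at a different index would see overwritten data
print(can_inplace("buf1", accesses))  # True
```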

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150845
Approved by: https://github.com/zou3519, https://github.com/exclamaforte, https://github.com/atalman, https://github.com/jansel
2025-04-09 00:05:07 +00:00
89505f4498 [AOTI] Always use oss schema for ExternKernelNodes serialization (#150197)
Summary: Added a `protocol` field to `ExternKernelNodes`; from now on, all lowering passes will always use the OSS schema to serialize external kernel nodes.

Test Plan: CI

Differential Revision: D72020444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150197
Approved by: https://github.com/zhxchen17
2025-04-08 22:35:28 +00:00
5744 changed files with 222972 additions and 90184 deletions

View File

@ -3,9 +3,7 @@ set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
if [[ "$GPU_ARCH_VERSION" == *"12.9"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0"
fi
@ -27,6 +25,7 @@ if [ "$DESIRED_CUDA" = "cpu" ]; then
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
else
echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"
export USE_SYSTEM_NCCL=1
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
fi

View File

@ -31,33 +31,47 @@ def build_ArmComputeLibrary() -> None:
"build=native",
]
acl_install_dir = "/acl"
acl_checkout_dir = "ComputeLibrary"
os.makedirs(acl_install_dir)
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
acl_checkout_dir = os.getenv("ACL_SOURCE_DIR", "ComputeLibrary")
if os.path.isdir(acl_install_dir):
shutil.rmtree(acl_install_dir)
if not os.path.isdir(acl_checkout_dir) or not len(os.listdir(acl_checkout_dir)):
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
check_call(
["scons", "Werror=1", "-j8", f"build_dir=/{acl_install_dir}/build"]
+ acl_build_flags,
["scons", "Werror=1", f"-j{os.cpu_count()}"] + acl_build_flags,
cwd=acl_checkout_dir,
)
for d in ["arm_compute", "include", "utils", "support", "src"]:
for d in ["arm_compute", "include", "utils", "support", "src", "build"]:
shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")
def update_wheel(wheel_path, desired_cuda) -> None:
def replace_tag(filename) -> None:
with open(filename) as f:
lines = f.readlines()
for i, line in enumerate(lines):
if line.startswith("Tag:"):
lines[i] = line.replace("-linux_", "-manylinux_2_28_")
print(f"Updated tag from {line} to {lines[i]}")
break
with open(filename, "w") as f:
f.writelines(lines)
def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"""
Update the cuda wheel libraries
Package the cuda wheel libraries
"""
folder = os.path.dirname(wheel_path)
wheelname = os.path.basename(wheel_path)
@ -74,7 +88,6 @@ def update_wheel(wheel_path, desired_cuda) -> None:
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnvToolsExt.so.1",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
@ -88,30 +101,19 @@ def update_wheel(wheel_path, desired_cuda) -> None:
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if enable_cuda:
if "129" in desired_cuda:
libs_to_copy += [
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if "126" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
elif "128" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
else:
libs_to_copy += [
"/opt/OpenBLAS/lib/libopenblas.so.0",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.9",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
# Copy libraries to unzipped_folder/a/lib
for lib_path in libs_to_copy:
lib_name = os.path.basename(lib_path)
@ -120,6 +122,13 @@ def update_wheel(wheel_path, desired_cuda) -> None:
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '$ORIGIN' --force-rpath {folder}/tmp/torch/lib/{lib_name}"
)
# Make sure the wheel is tagged with manylinux_2_28
for f in os.scandir(f"{folder}/tmp/"):
if f.is_dir() and f.name.endswith(".dist-info"):
replace_tag(f"{f.path}/WHEEL")
break
os.mkdir(f"{folder}/cuda_wheel")
os.system(f"cd {folder}/tmp/; zip -r {folder}/cuda_wheel/{wheelname} *")
shutil.move(
@ -194,8 +203,10 @@ if __name__ == "__main__":
).decode()
print("Building PyTorch wheel")
build_vars = "MAX_JOBS=5 CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
os.system("cd /pytorch; python setup.py clean")
build_vars = "CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
# MAX_JOB=5 is not required for CPU backend (see commit 465d98b)
if enable_cuda:
build_vars = "MAX_JOBS=5 " + build_vars
override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
desired_cuda = os.getenv("DESIRED_CUDA")
@ -242,6 +253,6 @@ if __name__ == "__main__":
print("Updating Cuda Dependency")
filename = os.listdir("/pytorch/dist/")
wheel_path = f"/pytorch/dist/{filename[0]}"
update_wheel(wheel_path, desired_cuda)
package_cuda_wheel(wheel_path, desired_cuda)
pytorch_wheel_name = complete_wheel("/pytorch/")
print(f"Build Complete. Created {pytorch_wheel_name}..")

View File

@ -10,5 +10,3 @@ example: `py2-cuda9.0-cudnn7-ubuntu16.04`. The Docker images that are
built on Jenkins and are used in triggered builds already have this
environment variable set in their manifest. Also see
`./docker/jenkins/*/Dockerfile` and search for `BUILD_ENVIRONMENT`.
Our Jenkins installation is located at https://ci.pytorch.org/jenkins/.

View File

@ -13,10 +13,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then
echo 'Skipping tests'
exit 0
fi
if [[ "${BUILD_ENVIRONMENT}" == *-rocm* ]]; then
# temporary to locate some kernel issues on the CI nodes
export HSAKMT_DEBUG_LEVEL=4
fi
# These additional packages are needed for circleci ROCm builds.
if [[ $BUILD_ENVIRONMENT == *rocm* ]]; then
# Need networkx 2.0 because bellmand_ford was moved in 2.1 . Scikit-image by

View File

@ -34,5 +34,5 @@ See `build.sh` for valid build environments (it's the giant switch).
./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
# Set flags (see build.sh) and build image
sudo bash -c 'PROTOBUF=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
sudo bash -c 'TRITON=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
```

View File

@ -1,6 +1,7 @@
ARG CUDA_VERSION=12.4
ARG CUDA_VERSION=12.6
ARG BASE_TARGET=cuda${CUDA_VERSION}
FROM amd64/almalinux:8 as base
ARG ROCM_IMAGE=rocm/dev-almalinux-8:6.3-complete
FROM amd64/almalinux:8.10-20250519 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
@ -8,12 +9,10 @@ ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=11
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum -y update
RUN yum -y install epel-release
# install glibc-langpack-en make sure en_US.UTF-8 locale is available
RUN yum -y install glibc-langpack-en
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel openssl-devel yum-utils autoconf automake make gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
@ -41,33 +40,36 @@ RUN bash ./install_conda.sh && rm install_conda.sh
# Install CUDA
FROM base as cuda
ARG CUDA_VERSION=12.4
ARG CUDA_VERSION=12.6
RUN rm -rf /usr/local/cuda-*
ADD ./common/install_cuda.sh install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
# Preserve CUDA_VERSION for the builds
ENV CUDA_VERSION=${CUDA_VERSION}
# Make things in our path by default
ENV PATH=/usr/local/cuda-${CUDA_VERSION}/bin:$PATH
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
ENV DESIRED_CUDA=11.8
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
ENV DESIRED_CUDA=12.1
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
ENV DESIRED_CUDA=12.4
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
ENV DESIRED_CUDA=12.6
FROM cuda as cuda12.8
RUN bash ./install_cuda.sh 12.8
ENV DESIRED_CUDA=12.8
FROM cuda as cuda12.9
RUN bash ./install_cuda.sh 12.9
ENV DESIRED_CUDA=12.9
FROM ${ROCM_IMAGE} as rocm
ENV PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
ENV MKLROOT /opt/intel
# Install MNIST test data
FROM base as mnist
ADD ./common/install_mnist.sh install_mnist.sh
@ -75,9 +77,9 @@ RUN bash ./install_mnist.sh
FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.1 /usr/local/cuda-12.1 /usr/local/cuda-12.1
COPY --from=cuda12.4 /usr/local/cuda-12.4 /usr/local/cuda-12.4
COPY --from=cuda12.6 /usr/local/cuda-12.6 /usr/local/cuda-12.6
COPY --from=cuda12.8 /usr/local/cuda-12.8 /usr/local/cuda-12.8
COPY --from=cuda12.9 /usr/local/cuda-12.9 /usr/local/cuda-12.9
# Final step
FROM ${BASE_TARGET} as final

View File

@ -1,82 +1,70 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
set -exou pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
echo "Usage: $0 IMAGENAME:ARCHTAG"
exit 1
fi
DOCKER_IMAGE_NAME="pytorch/${image}"
# Go from imagename:tag to tag
DOCKER_TAG_PREFIX=$(echo "${image}" | awk -F':' '{print $2}')
CUDA_VERSION=""
ROCM_VERSION=""
EXTRA_BUILD_ARGS=""
if [[ "${DOCKER_TAG_PREFIX}" == cuda* ]]; then
# extract cuda version from image name and tag. e.g. manylinux2_28-builder:cuda12.8 returns 12.8
CUDA_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'cuda' '{print $2}')
EXTRA_BUILD_ARGS="--build-arg CUDA_VERSION=${CUDA_VERSION}"
elif [[ "${DOCKER_TAG_PREFIX}" == rocm* ]]; then
# extract rocm version from image name and tag. e.g. manylinux2_28-builder:rocm6.2.4 returns 6.2.4
ROCM_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'rocm' '{print $2}')
EXTRA_BUILD_ARGS="--build-arg ROCM_IMAGE=rocm/dev-almalinux-8:${ROCM_VERSION}-complete"
fi
export DOCKER_BUILDKIT=1
TOPDIR=$(git rev-parse --show-toplevel)
CUDA_VERSION=${CUDA_VERSION:-12.1}
case ${CUDA_VERSION} in
case ${DOCKER_TAG_PREFIX} in
cpu)
BASE_TARGET=base
DOCKER_TAG=cpu
;;
all)
BASE_TARGET=all_cuda
DOCKER_TAG=latest
cuda*)
BASE_TARGET=cuda${CUDA_VERSION}
;;
rocm*)
BASE_TARGET=rocm
;;
*)
BASE_TARGET=cuda${CUDA_VERSION}
DOCKER_TAG=cuda${CUDA_VERSION}
echo "ERROR: Unknown docker tag ${DOCKER_TAG_PREFIX}"
exit 1
;;
esac
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
(
set -x
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
export DOCKER_BUILDKIT=1
TOPDIR=$(git rev-parse --show-toplevel)
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
docker build \
--target final \
--progress plain \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=11" \
-t ${DOCKER_IMAGE_NAME} \
$@ \
-f "${TOPDIR}/.ci/docker/almalinux/Dockerfile" \
${TOPDIR}/.ci/docker/
)
docker build \
--target final \
--progress plain \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
--build-arg "DEVTOOLSET_VERSION=11" \
${EXTRA_BUILD_ARGS} \
-t ${tmp_tag} \
$@ \
-f "${TOPDIR}/.ci/docker/almalinux/Dockerfile" \
${TOPDIR}/.ci/docker/
if [[ "${DOCKER_TAG}" =~ ^cuda* ]]; then
if [ -n "${CUDA_VERSION}" ]; then
# Test that we're using the right CUDA compiler
(
set -x
docker run --rm "${DOCKER_IMAGE_NAME}" nvcc --version | grep "cuda_${CUDA_VERSION}"
)
fi
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE_NAME}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE_NAME}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH:-}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE_NAME}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_BRANCH_TAG}
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_SHA_TAG}
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
docker run --rm "${tmp_tag}" nvcc --version | grep "cuda_${CUDA_VERSION}"
fi

View File

@ -50,30 +50,21 @@ if [[ "$image" == *xla* ]]; then
exit 0
fi
if [[ "$image" == *-focal* ]]; then
UBUNTU_VERSION=20.04
elif [[ "$image" == *-jammy* ]]; then
if [[ "$image" == *-jammy* ]]; then
UBUNTU_VERSION=22.04
elif [[ "$image" == *ubuntu* ]]; then
extract_version_from_image_name ubuntu UBUNTU_VERSION
elif [[ "$image" == *centos* ]]; then
extract_version_from_image_name centos CENTOS_VERSION
fi
if [ -n "${UBUNTU_VERSION}" ]; then
OS="ubuntu"
elif [ -n "${CENTOS_VERSION}" ]; then
OS="centos"
else
echo "Unable to derive operating system base..."
exit 1
fi
DOCKERFILE="${OS}/Dockerfile"
# When using ubuntu - 22.04, start from Ubuntu docker image, instead of nvidia/cuda docker image.
if [[ "$image" == *cuda* && "$UBUNTU_VERSION" != "22.04" ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *rocm* ]]; then
if [[ "$image" == *rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
elif [[ "$image" == *xpu* ]]; then
DOCKERFILE="${OS}-xpu/Dockerfile"
@ -85,9 +76,6 @@ elif [[ "$image" == *linter* ]]; then
DOCKERFILE="linter/Dockerfile"
fi
# CMake 3.18 is needed to support CUDA17 language variant
CMAKE_VERSION=3.18.5
_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb
_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
if [[ "$image" == *rocm* ]]; then
@ -95,260 +83,219 @@ if [[ "$image" == *rocm* ]]; then
_UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d
fi
tag=$(echo $image | awk -F':' '{print $2}')
# It's annoying to rename jobs every time you want to rewrite a
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$image" in
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11)
CUDA_VERSION=12.6.3
case "$tag" in
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
PROTOBUF=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
PROTOBUF=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9)
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
PROTOBUF=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3-clang10-onnx)
pytorch-linux-jammy-py3-clang12-onnx)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
CLANG_VERSION=12
VISION=yes
CONDA_CMAKE=yes
ONNX=yes
;;
pytorch-linux-focal-py3.9-clang10)
pytorch-linux-jammy-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
CLANG_VERSION=12
VISION=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3.11-clang10)
pytorch-linux-jammy-py3.11-clang12)
ANACONDA_PYTHON_VERSION=3.11
CLANG_VERSION=10
PROTOBUF=yes
CLANG_VERSION=12
VISION=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3.9-gcc9)
pytorch-linux-jammy-py3.9-gcc9)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=9
PROTOBUF=yes
VISION=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-1-py3)
pytorch-linux-jammy-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
PROTOBUF=yes
VISION=yes
ROCM_VERSION=6.2.4
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
PROTOBUF=yes
VISION=yes
ROCM_VERSION=6.3
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.9
pytorch-linux-jammy-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
PROTOBUF=yes
VISION=yes
XPU_VERSION=0.5
ROCM_VERSION=6.4
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-xpu-2025.0-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
VISION=yes
XPU_VERSION=2025.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2025.1-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.1
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
VISION=yes
KATEX=yes
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-clang12)
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
CLANG_VERSION=12
PROTOBUF=yes
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang12-asan)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=12
PROTOBUF=yes
VISION=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang15-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=15
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3-clang18-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=18
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
VISION=yes
KATEX=yes
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
UNINSTALL_DILL=yes
@ -356,14 +303,12 @@ case "$image" in
pytorch-linux-jammy-py3-clang12-executorch)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=12
CONDA_CMAKE=yes
EXECUTORCH=yes
;;
pytorch-linux-jammy-py3.12-halide)
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
HALIDE=yes
TRITON=yes
;;
@ -371,28 +316,23 @@ case "$image" in
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
TRITON_CPU=yes
;;
pytorch-linux-focal-linter)
pytorch-linux-jammy-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
# would be to upgrade mypy to 1.0.0 with Python 3.11
PYTHON_VERSION=3.9
PIP_CMAKE=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter)
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter)
PYTHON_VERSION=3.9
CUDA_VERSION=11.8
PIP_CMAKE=yes
CUDA_VERSION=12.8.1
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -401,9 +341,7 @@ case "$image" in
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -411,7 +349,6 @@ case "$image" in
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes
VISION=yes
echo "image '$image' did not match an existing build configuration"
if [[ "$image" == *py* ]]; then
@ -427,8 +364,7 @@ case "$image" in
TRITON=yes
# To ensure that any ROCm config will build using conda cmake
# and thus have LAPACK/MKL enabled
CONDA_CMAKE=yes
fi
fi
if [[ "$image" == *centos7* ]]; then
NINJA_VERSION=1.10.2
fi
@ -444,22 +380,11 @@ case "$image" in
if [[ "$image" == *glibc* ]]; then
extract_version_from_image_name glibc GLIBC_VERSION
fi
if [[ "$image" == *cmake* ]]; then
extract_version_from_image_name cmake CMAKE_VERSION
fi
;;
esac
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
#when using cudnn version 8 install it separately from cuda
if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
if [[ ${CUDNN_VERSION} == 9 ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
fi
fi
no_cache_flag=""
progress_flag=""
# Do not use cache and progress=plain when in CI
@ -473,11 +398,9 @@ docker build \
${no_cache_flag} \
${progress_flag} \
--build-arg "BUILD_ENVIRONMENT=${image}" \
--build-arg "PROTOBUF=${PROTOBUF:-}" \
--build-arg "LLVMDEV=${LLVMDEV:-}" \
--build-arg "VISION=${VISION:-}" \
--build-arg "UBUNTU_VERSION=${UBUNTU_VERSION}" \
--build-arg "CENTOS_VERSION=${CENTOS_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}" \
--build-arg "GLIBC_VERSION=${GLIBC_VERSION}" \
--build-arg "CLANG_VERSION=${CLANG_VERSION}" \
@ -488,7 +411,6 @@ docker build \
--build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
--build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
--build-arg "CMAKE_VERSION=${CMAKE_VERSION:-}" \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
@ -496,8 +418,6 @@ docker build \
--build-arg "IMAGE_NAME=${IMAGE_NAME}" \
--build-arg "UCX_COMMIT=${UCX_COMMIT}" \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \
--build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \
--build-arg "PIP_CMAKE=${PIP_CMAKE}" \
--build-arg "TRITON=${TRITON}" \
--build-arg "TRITON_CPU=${TRITON_CPU}" \
--build-arg "ONNX=${ONNX}" \
@ -506,6 +426,7 @@ docker build \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "HALIDE=${HALIDE}" \
--build-arg "XPU_VERSION=${XPU_VERSION}" \
--build-arg "UNINSTALL_DILL=${UNINSTALL_DILL}" \
--build-arg "ACL=${ACL:-}" \
--build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \
--build-arg "SKIP_LLVM_SRC_BUILD_INSTALL=${SKIP_LLVM_SRC_BUILD_INSTALL:-}" \
@ -582,3 +503,12 @@ elif [ "$HAS_TRITON" = "yes" ]; then
echo "expecting triton to not be installed, but it is"
exit 1
fi
# Sanity check cmake version. Executorch reinstalls cmake and I'm not sure if
# they support 4.0.0 yet, so exclude them from this check.
CMAKE_VERSION=$(drun cmake --version)
if [[ "$EXECUTORCH" != *yes* && "$CMAKE_VERSION" != *4.* ]]; then
echo "CMake version is not 4.0.0:"
drun cmake --version
exit 1
fi

View File

@ -17,9 +17,8 @@ RUN bash ./install_base.sh && rm install_base.sh
# Update CentOS git version
RUN yum -y remove git
RUN yum -y remove git-*
RUN yum -y install https://packages.endpoint.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm || \
(yum -y install https://packages.endpointdev.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm && \
sed -i "s/packages.endpoint/packages.endpointdev/" /etc/yum.repos.d/endpoint.repo)
RUN yum -y install https://packages.endpointdev.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm && \
sed -i 's/packages.endpoint/packages.endpointdev/' /etc/yum.repos.d/endpoint.repo
RUN yum install -y git
# Install devtoolset
@ -40,7 +39,7 @@ RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
@ -48,13 +47,6 @@ COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -82,12 +74,6 @@ ENV MAGMA_HOME /opt/rocm/magma
ENV LANG en_US.utf8
ENV LC_ALL en_US.utf8
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh

View File

@ -1 +1 @@
7e487c24e1c20c3f4606c2d8aca2778873b00b4c
56392aa978594cc155fa8af48cd949f5b5f1823a

View File

@ -1 +1 @@
v2.26.2-1
v2.27.3-1

View File

@ -1 +1 @@
0bcc8265e677e5321606a3311bf71470f14456a8
ae324eeac8e102a2b40370e341460f3791353398

View File

@ -1 +1 @@
96316ce50fade7e209553aba4898cd9b82aab83b
c8757738a7418249896224430ce84888e8ecdd79

View File

@ -30,18 +30,6 @@ install_ubuntu() {
maybe_libomp_dev=""
fi
# HACK: UCC testing relies on libnccl library from NVIDIA repo, and version 2.16 crashes
# See https://github.com/pytorch/pytorch/pull/105260#issuecomment-1673399729
# TODO: Eliminate this hack, we should not relay on apt-get installation
# See https://github.com/pytorch/pytorch/issues/144768
if [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "11.8"* ]]; then
maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"
elif [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "12.4"* ]]; then
maybe_libnccl_dev="libnccl2=2.26.2-1+cuda12.4 libnccl-dev=2.26.2-1+cuda12.4 --allow-downgrades --allow-change-held-packages"
else
maybe_libnccl_dev=""
fi
# Install common dependencies
apt-get update
# TODO: Some of these may not be necessary
@ -70,7 +58,6 @@ install_ubuntu() {
libasound2-dev \
libsndfile-dev \
${maybe_libomp_dev} \
${maybe_libnccl_dev} \
software-properties-common \
wget \
sudo \
@ -99,9 +86,6 @@ install_centos() {
ccache_deps="asciidoc docbook-dtds docbook-style-xsl libxslt"
numpy_deps="gcc-gfortran"
# Note: protobuf-c-{compiler,devel} on CentOS are too old to be used
# for Caffe2. That said, we still install them to make sure the build
# system opts to build/use protoc and libprotobuf from third-party.
yum install -y \
$ccache_deps \
$numpy_deps \

View File

@ -9,7 +9,7 @@ install_ubuntu() {
# Instead use lib and headers from OpenSSL1.1 installed in `install_openssl.sh``
apt-get install -y cargo
echo "Checking out sccache repo"
git clone https://github.com/mozilla/sccache -b v0.9.1
git clone https://github.com/mozilla/sccache -b v0.10.0
cd sccache
echo "Building sccache"
cargo build --release

View File

@ -1,31 +0,0 @@
#!/bin/bash
set -ex
[ -n "$CMAKE_VERSION" ]
# Remove system cmake install so it won't get used instead
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
apt-get remove cmake -y
;;
centos)
yum remove cmake -y
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
# Turn 3.6.3 into v3.6
path=$(echo "${CMAKE_VERSION}" | sed -e 's/\([0-9].[0-9]\+\).*/v\1/')
file="cmake-${CMAKE_VERSION}-Linux-x86_64.tar.gz"
# Download and install specific CMake version in /usr/local
pushd /tmp
curl -Os --retry 3 "https://cmake.org/files/${path}/${file}"
tar -C /usr/local --strip-components 1 --no-same-owner -zxf cmake-*.tar.gz
rm -f cmake-*.tar.gz
popd

View File

@ -6,8 +6,8 @@ set -ex
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]] || [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download" # @lint-ignore
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
fi
@ -64,6 +64,11 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# which is provided in libstdcxx 12 and up.
conda_install libstdcxx-ng=12.3.0 --update-deps -c conda-forge
# Miniforge installer doesn't install sqlite by default
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
conda_install sqlite
fi
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
conda_install "openblas==0.3.29=*openmp*"
@ -75,19 +80,11 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# and libpython-static for torch deploy
conda_install llvmdev=8.0.0 "libpython-static=${ANACONDA_PYTHON_VERSION}"
# Use conda cmake in some cases. Conda cmake will be newer than our supported
# min version (3.5 for xenial and 3.10 for bionic), so we only do it in those
# following builds that we know should use conda. Specifically, Ubuntu bionic
# and focal cannot find conda mkl with stock cmake, so we need a cmake from conda
if [ -n "${CONDA_CMAKE}" ]; then
conda_install cmake
fi
# Magma package names are concatenation of CUDA major and minor ignoring revision
# I.e. magma-cuda102 package corresponds to CUDA_VERSION=10.2 and CUDA_VERSION=10.2.89
# Magma is installed from a tarball in the ossci-linux bucket into the conda env
if [ -n "$CUDA_VERSION" ]; then
${SCRIPT_FOLDER}/install_magma_conda.sh $(cut -f1-2 -d'.' <<< ${CUDA_VERSION}) ${ANACONDA_PYTHON_VERSION}
conda_run ${SCRIPT_FOLDER}/install_magma_conda.sh $(cut -f1-2 -d'.' <<< ${CUDA_VERSION})
fi
# Install some other packages, including those needed for Python test reporting

View File

@ -3,7 +3,7 @@
set -uex -o pipefail
PYTHON_DOWNLOAD_URL=https://www.python.org/ftp/python
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads # @lint-ignore
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO

View File

@ -2,184 +2,72 @@
set -ex
CUDNN_VERSION=9.5.1.17
arch_path=''
targetarch=${TARGETARCH:-$(uname -m)}
if [ ${targetarch} = 'amd64' ] || [ "${targetarch}" = 'x86_64' ]; then
arch_path='x86_64'
else
arch_path='sbsa'
fi
function install_cusparselt_040 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
function install_cuda {
version=$1
runfile=$2
major_minor=${version%.*}
rm -rf /usr/local/cuda-${major_minor} /usr/local/cuda
if [[ ${arch_path} == 'sbsa' ]]; then
runfile="${runfile}_sbsa"
fi
runfile="${runfile}.run"
wget -q https://developer.download.nvidia.com/compute/cuda/${version}/local_installers/${runfile} -O ${runfile}
chmod +x ${runfile}
./${runfile} --toolkit --silent
rm -f ${runfile}
rm -f /usr/local/cuda && ln -s /usr/local/cuda-${major_minor} /usr/local/cuda
}
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_118 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.4.0"
rm -rf /usr/local/cuda-11.8 /usr/local/cuda
# install CUDA 11.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
chmod +x cuda_11.8.0_520.61.05_linux.run
./cuda_11.8.0_520.61.05_linux.run --toolkit --silent
rm -f cuda_11.8.0_520.61.05_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-11.8 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
CUDA_VERSION=11.8 bash install_nccl.sh
install_cusparselt_040
ldconfig
}
function install_124 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
chmod +x cuda_12.4.1_550.54.15_linux.run
./cuda_12.4.1_550.54.15_linux.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
function install_cudnn {
cuda_major_version=$1
cudnn_version=$2
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
filepath="cudnn-linux-${arch_path}-${cudnn_version}_cuda${cuda_major_version}-archive"
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-${arch_path}/${filepath}.tar.xz
tar xf ${filepath}.tar.xz
cp -a ${filepath}/include/* /usr/local/cuda/include/
cp -a ${filepath}/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
CUDA_VERSION=12.4 bash install_nccl.sh
install_cusparselt_062
ldconfig
}
function install_126 {
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.6 /usr/local/cuda
# install CUDA 12.6.3 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
chmod +x cuda_12.6.3_560.35.05_linux.run
./cuda_12.6.3_560.35.05_linux.run --toolkit --silent
rm -f cuda_12.6.3_560.35.05_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
CUDNN_VERSION=9.10.2.21
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
install_cuda 12.6.3 cuda_12.6.3_560.35.05_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
install_cudnn 12 $CUDNN_VERSION
CUDA_VERSION=12.6 bash install_nccl.sh
install_cusparselt_063
CUDA_VERSION=12.6 bash install_cusparselt.sh
ldconfig
}
function prune_118 {
echo "Pruning CUDA 11.8 and cuDNN"
#####################################################################################
# CUDA 11.8 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-11.8/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-11.8/lib64"
function install_129 {
CUDNN_VERSION=9.10.2.21
echo "Installing CUDA 12.9.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
# install CUDA 12.9.1 in the same container
install_cuda 12.9.1 cuda_12.9.1_575.57.08_linux
export GENCODE="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 12 $CUDNN_VERSION
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
CUDA_VERSION=12.9 bash install_nccl.sh
# all CUDA libs except CuDNN and CuBLAS (cudnn and cublas need arch 3.7 included)
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
CUDA_VERSION=12.9 bash install_cusparselt.sh
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 11.8 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-11.8/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
ldconfig
}
function prune_126 {
@ -218,27 +106,16 @@ function prune_126 {
function install_128 {
CUDNN_VERSION=9.8.0.87
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.8 /usr/local/cuda
# install CUDA 12.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run
chmod +x cuda_12.8.0_570.86.10_linux.run
./cuda_12.8.0_570.86.10_linux.run --toolkit --silent
rm -f cuda_12.8.0_570.86.10_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda
echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
# install CUDA 12.8.1 in the same container
install_cuda 12.8.1 cuda_12.8.1_570.124.06_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
install_cudnn 12 $CUDNN_VERSION
CUDA_VERSION=12.8 bash install_nccl.sh
install_cusparselt_063
CUDA_VERSION=12.8 bash install_cusparselt.sh
ldconfig
}
@ -247,13 +124,11 @@ function install_128 {
while test $# -gt 0
do
case "$1" in
11.8) install_118; prune_118
12.6|12.6.*) install_126; prune_126
;;
12.4) install_124; prune_124
12.8|12.8.*) install_128;
;;
12.6) install_126; prune_126
;;
12.8) install_128;
12.9|12.9.*) install_129;
;;
*) echo "bad argument $1"; exit 1
;;

View File

@ -1,55 +0,0 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
CUDNN_VERSION=9.8.0.87
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_128 {
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.8 /usr/local/cuda
# install CUDA 12.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux_sbsa.run
chmod +x cuda_12.8.0_570.86.10_linux_sbsa.run
./cuda_12.8.0_570.86.10_linux_sbsa.run --toolkit --silent
rm -f cuda_12.8.0_570.86.10_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
CUDA_VERSION=12.8 bash install_nccl.sh
install_cusparselt_063
ldconfig
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
12.8) install_128;
;;
*) echo "bad argument $1"; exit 1
;;
esac
shift
done

View File

@ -4,12 +4,10 @@ if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.8.0.87_cuda12-archive"
if [[ ${CUDA_VERSION:0:4} == "12.9" || ${CUDA_VERSION:0:4} == "12.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"
elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.5.1.17_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"
CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"
else

View File

@ -5,25 +5,14 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-8]$ ]]; then
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-9]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.3.2-archive"
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.7.1.0-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
else
echo "Not sure which libcusparselt version to install for this ${CUDA_VERSION}"
fi
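
For reference, a hedged sketch of how the `${CUDA_VERSION:0:4}` prefix plus the widened `^12\.[5-9]$` regex selects the 0.7.1 archive; the default CUDA version and the echoed output are illustrative only:

```
#!/bin/bash
# Sketch: map CUDA_VERSION / TARGETARCH to a cuSparseLt archive name.
CUDA_VERSION=${CUDA_VERSION:-12.8.1}        # illustrative default
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ "${TARGETARCH}" = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
    arch_path='x86_64'
fi
# ${CUDA_VERSION:0:4} keeps only "12.x", so 12.8 and 12.8.1 take the same branch.
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-9]$ ]]; then
    echo "would fetch libcusparse_lt-linux-${arch_path}-0.7.1.0-archive"
else
    echo "Not sure which libcusparselt version to install for this ${CUDA_VERSION}"
fi
```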

View File

@ -13,7 +13,7 @@ clone_executorch() {
# and fetch the target commit
pushd executorch
git checkout "${EXECUTORCH_PINNED_COMMIT}"
git submodule update --init
git submodule update --init --recursive
popd
chown -R jenkins executorch

View File

@ -17,7 +17,7 @@ if [ -n "${UBUNTU_VERSION}" ];then
libopenblas-dev libeigen3-dev libatlas-base-dev libzstd-dev
fi
conda_install numpy scipy imageio cmake ninja
pip_install numpy scipy imageio cmake ninja
git clone --depth 1 --branch release/16.x --recursive https://github.com/llvm/llvm-project.git
cmake -DCMAKE_BUILD_TYPE=Release \

View File

@ -14,16 +14,9 @@ function install_timm() {
local commit
commit=$(get_pinned_commit timm)
# TODO (huydhn): There is no torchvision release on 3.13 when I write this, so
# I'm using nightly here instead. We just need to package to be able to install
# TIMM. Removing this once vision has a release on 3.13
if [[ "${ANACONDA_PYTHON_VERSION}" == "3.13" ]]; then
pip_install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
fi
pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"
# Clean up
conda_run pip uninstall -y cmake torch torchvision triton
conda_run pip uninstall -y torch torchvision triton
}
# Pango is needed for weasyprint which is needed for doctr

View File

@ -1,26 +1,23 @@
#!/usr/bin/env bash
# Script that replaces the magma install from a conda package
# Script that installs magma from tarball inside conda environment.
# It replaces anaconda magma-cuda package which is no longer published.
# Execute it inside active conda environment.
# See issue: https://github.com/pytorch/pytorch/issues/138506
set -eou pipefail
function do_install() {
cuda_version_nodot=${1/./}
anaconda_python_version=$2
cuda_version_nodot=${1/./}
anaconda_dir=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
anaconda_dir="/opt/conda/envs/py_${anaconda_python_version}"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mv include/* "${anaconda_dir}/include/"
mv lib/* "${anaconda_dir}/lib"
popd
)
}
do_install $1 $2
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mv include/* "${anaconda_dir}/include/"
mv lib/* "${anaconda_dir}/lib"
popd
)
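
A small sketch of the new tarball-based flow, assuming conda is on PATH or an environment is active; the version argument default is illustrative:

```
#!/bin/bash
# Sketch: archive name and install prefix used by the tarball-based magma install.
cuda_version=${1:-12.8}                     # illustrative default
cuda_version_nodot=${cuda_version//./}
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
# Prefer the active conda environment; otherwise fall back to the prefix
# one level above the conda binary.
anaconda_dir=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}
echo "would fetch ${magma_archive} and unpack include/ and lib/ into ${anaconda_dir}"
```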

View File

@ -8,16 +8,6 @@ retry () {
"$@" || (sleep 10 && "$@") || (sleep 20 && "$@") || (sleep 40 && "$@")
}
# A bunch of custom pip dependencies for ONNX
pip_install \
beartype==0.15.0 \
filelock==3.9.0 \
flatbuffers==2.0 \
mock==5.0.1 \
ninja==1.10.2 \
networkx==2.5 \
numpy==1.24.2
# ONNXRuntime should be installed before installing
# onnx-weekly. Otherwise, onnx-weekly could be
# overwritten by onnx.
@ -29,12 +19,8 @@ pip_install \
transformers==4.36.2
pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.17.0
pip_install onnxscript==0.2.2 --no-deps
# required by onnxscript
pip_install ml_dtypes
pip_install onnxscript==0.3.0
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

View File

@ -4,8 +4,7 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.29 --depth 1 --shallow-submodules
git clone https://github.com/OpenMathLib/OpenBLAS.git -b "${OPENBLAS_VERSION:-v0.3.29}" --depth 1 --shallow-submodules
OPENBLAS_BUILD_FLAGS="
NUM_THREADS=128

View File

@ -1,19 +0,0 @@
#!/bin/bash
set -ex
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
# On the nvidia/cuda:9-cudnn7-devel-centos7 image we need this symlink or
# else it will fail with
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
tar -xvz --no-same-owner -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
NPROC=$[$(nproc) - 2]
pushd "$pb_dir" && ./configure && make -j${NPROC} && make -j${NPROC} check && sudo make -j${NPROC} install && sudo ldconfig
popd
rm -rf $pb_dir

View File

@ -13,6 +13,3 @@ source /var/lib/jenkins/ci_env/bin/activate
python -mpip install --upgrade pip
python -mpip install -r /opt/requirements-ci.txt
if [ -n "${PIP_CMAKE}" ]; then
python -mpip install cmake==3.31.6
fi

View File

@ -19,6 +19,18 @@ install_ubuntu() {
apt-get install -y libc++1
apt-get install -y libc++abi1
# Make sure rocm packages from repo.radeon.com have highest priority
cat << EOF > /etc/apt/preferences.d/rocm-pin-600
Package: *
Pin: release o=repo.radeon.com
Pin-Priority: 600
EOF
# we want the patch version of 6.4 instead
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
ROCM_VERSION="${ROCM_VERSION}.1"
fi
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
@ -59,17 +71,29 @@ install_ubuntu() {
done
# ROCm 6.3 had a regression where initializing static code objects had significant overhead
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
# ROCm 6.4 did not yet fix the regression, also HIP branch names are different
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.3) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
VER_PATCH=.1
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
fi
# clr build needs CppHeaderParser but can only find it using conda's python
/opt/conda/bin/python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b rocm-6.3.x
git clone https://github.com/ROCm/HIP -b $HIP_BRANCH
HIP_COMMON_DIR=$(readlink -f HIP)
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-6.3-statco-hotfix
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}${VER_PATCH}-statco-hotfix
mkdir -p clr/build
pushd clr/build
cmake .. -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR
make -j
cp hipamd/lib/libamdhip64.so.6.3.* /opt/rocm/lib/libamdhip64.so.6.3.*
cp hipamd/lib/libamdhip64.so.${VER_STR}.* /opt/rocm/lib/libamdhip64.so.${VER_STR}.*
popd
rm -rf HIP clr
fi
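
A sketch of the widened version gate and branch selection; `ver()` below is an assumed stand-in for the helper defined elsewhere in the script, and the clones are only echoed:

```
#!/bin/bash
# Sketch: numeric version compare plus HIP/clr branch selection for 6.3 <= ROCm < 7.0.
ver() { printf "%d%03d%03d" $(echo "$1" | tr '.' ' '); }   # assumed helper
ROCM_VERSION=${ROCM_VERSION:-6.4.1}
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.3) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
    VER_PATCH=""
    if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
        HIP_BRANCH=release/rocm-rel-6.4; VER_STR=6.4; VER_PATCH=.1
    elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
        HIP_BRANCH=release/rocm-rel-6.4; VER_STR=6.4
    elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
        HIP_BRANCH=rocm-6.3.x; VER_STR=6.3
    fi
    echo "git clone https://github.com/ROCm/HIP -b ${HIP_BRANCH}"
    echo "git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}${VER_PATCH}-statco-hotfix"
fi
```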

View File

@ -25,9 +25,7 @@ python3 -m pip install meson ninja
###########################
### clone repo
###########################
# TEMPORARY FIX: https://gitlab.freedesktop.org/mesa/drm.git is down until 2025/03/22
# GIT_SSL_NO_VERIFY=true git clone https://gitlab.freedesktop.org/mesa/drm.git
GIT_SSL_NO_VERIFY=true git clone git://anongit.freedesktop.org/mesa/drm
GIT_SSL_NO_VERIFY=true git clone https://gitlab.freedesktop.org/mesa/drm.git
pushd drm
###########################

View File

@ -17,10 +17,14 @@ function do_install() {
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mkdir -p "${rocm_dir}/magma"
mv include "${rocm_dir}/magma/include"
mv lib "${rocm_dir}/magma/lib"
if tar -xvf "${magma_archive}"
then
mkdir -p "${rocm_dir}/magma"
mv include "${rocm_dir}/magma/include"
mv lib "${rocm_dir}/magma/lib"
else
echo "${magma_archive} not found, skipping magma install"
fi
popd
)
}
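
The change above wraps the extraction in an `if` so a missing tarball degrades to a warning rather than aborting the build when the caller runs with `set -e`; a stripped-down sketch with a made-up archive name:

```
#!/bin/bash
# Sketch: skip the magma install gracefully when the archive is absent.
set -e
rocm_dir="/opt/rocm"                        # illustrative target
magma_archive="magma-rocm-example.tar.bz2"  # hypothetical archive name
if tar -xvf "${magma_archive}" 2>/dev/null
then
    mkdir -p "${rocm_dir}/magma"
    mv include "${rocm_dir}/magma/include"
    mv lib "${rocm_dir}/magma/lib"
else
    echo "${magma_archive} not found, skipping magma install"
fi
```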

View File

@ -10,12 +10,8 @@ fi
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
get_conda_version() {
as_jenkins conda list -n py_$ANACONDA_PYTHON_VERSION | grep -w $* | head -n 1 | awk '{print $2}'
}
conda_reinstall() {
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y --force-reinstall $*
get_pip_version() {
conda_run pip list | grep -w $* | head -n 1 | awk '{print $2}'
}
if [ -n "${XPU_VERSION}" ]; then
@ -37,11 +33,9 @@ if [ -n "${UBUNTU_VERSION}" ];then
apt-get install -y gpg-agent
fi
if [ -n "${CONDA_CMAKE}" ]; then
# Keep the current cmake and numpy version here, so we can reinstall them later
CMAKE_VERSION=$(get_conda_version cmake)
NUMPY_VERSION=$(get_conda_version numpy)
fi
# Keep the current cmake and numpy version here, so we can reinstall them later
CMAKE_VERSION=$(get_pip_version cmake)
NUMPY_VERSION=$(get_pip_version numpy)
if [ -z "${MAX_JOBS}" ]; then
export MAX_JOBS=$(nproc)
@ -57,7 +51,12 @@ as_jenkins git clone --recursive ${TRITON_REPO} triton
cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
as_jenkins git submodule update --init --recursive
cd python
# Old versions of python have setup.py in ./python; newer versions have it in ./
if [ ! -f setup.py ]; then
cd python
fi
pip_install pybind11==2.13.6
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
@ -83,17 +82,22 @@ cp dist/*.whl /opt/triton
# Install the wheel for docker builds that don't use multi stage
pip_install dist/*.whl
if [ -n "${CONDA_CMAKE}" ]; then
# TODO: This is to make sure that the same cmake and numpy version from install conda
# script is used. Without this step, the newer cmake version (3.25.2) downloaded by
# triton build step via pip will fail to detect conda MKL. Once that issue is fixed,
# this can be removed.
#
# The correct numpy version also needs to be set here because conda claims that it
# causes inconsistent environment. Without this, conda will attempt to install the
# latest numpy version, which fails ASAN tests with the following import error: Numba
# needs NumPy 1.20 or less.
conda_reinstall cmake="${CMAKE_VERSION}"
# Note that we install numpy with pip as conda might not have the version we want
pip_install --force-reinstall numpy=="${NUMPY_VERSION}"
# TODO: This is to make sure that the same cmake and numpy version from install conda
# script is used. Without this step, the newer cmake version (3.25.2) downloaded by
# triton build step via pip will fail to detect conda MKL. Once that issue is fixed,
# this can be removed.
#
# The correct numpy version also needs to be set here because conda claims that it
# causes inconsistent environment. Without this, conda will attempt to install the
# latest numpy version, which fails ASAN tests with the following import error: Numba
# needs NumPy 1.20 or less.
# Note that we install numpy with pip as conda might not have the version we want
if [ -n "${CMAKE_VERSION}" ]; then
pip_install "cmake==${CMAKE_VERSION}"
fi
if [ -n "${NUMPY_VERSION}" ]; then
pip_install "numpy==${NUMPY_VERSION}"
fi
if [[ "$ANACONDA_PYTHON_VERSION" != 3.9* ]]; then
pip_install helion
fi
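
A sketch of the capture-then-repin idea now used for cmake and numpy: record what pip currently has, run the build, and force the versions back only if something was recorded (plain `pip` stands in for the script's wrappers, and the build step is a placeholder):

```
#!/bin/bash
# Sketch: re-pin cmake/numpy only when a prior version was actually installed.
get_pip_version() {
    pip list | grep -w "$1" | head -n 1 | awk '{print $2}'
}
CMAKE_VERSION=$(get_pip_version cmake)
NUMPY_VERSION=$(get_pip_version numpy)

# ... build step that may drag in newer cmake/numpy goes here ...

if [ -n "${CMAKE_VERSION}" ]; then
    pip install "cmake==${CMAKE_VERSION}"
fi
if [ -n "${NUMPY_VERSION}" ]; then
    pip install "numpy==${NUMPY_VERSION}"
fi
```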

View File

@ -26,7 +26,7 @@ function install_ubuntu() {
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor > /usr/share/keyrings/oneapi-archive-keyring.gpg.gpg
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg.gpg] \
https://apt.repos.intel.com/${XPU_REPO_NAME} all main" \
https://apt.repos.intel.com/oneapi all main" \
| tee /etc/apt/sources.list.d/oneAPI.list
# Update the packages list and repository index
@ -74,7 +74,7 @@ function install_rhel() {
tee > /etc/yum.repos.d/oneAPI.repo << EOF
[oneAPI]
name=Intel for Pytorch GPU dev repository
baseurl=https://yum.repos.intel.com/${XPU_REPO_NAME}
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
@ -118,7 +118,7 @@ function install_sles() {
https://repositories.intel.com/gpu/sles/${VERSION_SP}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_SP}.repo
rpm --import https://repositories.intel.com/gpu/intel-graphics.key
# To add the online network package repository for the Intel Support Packages
zypper addrepo https://yum.repos.intel.com/${XPU_REPO_NAME} oneAPI
zypper addrepo https://yum.repos.intel.com/oneapi oneAPI
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
# The xpu-smi packages
@ -141,10 +141,10 @@ if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
XPU_DRIVER_VERSION=""
fi
XPU_REPO_NAME="intel-for-pytorch-gpu-dev"
XPU_PACKAGES="intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9"
if [[ "$XPU_VERSION" == "2025.0" ]]; then
XPU_REPO_NAME="oneapi"
# Default use Intel® oneAPI Deep Learning Essentials 2025.0
if [[ "$XPU_VERSION" == "2025.1" ]]; then
XPU_PACKAGES="intel-deep-learning-essentials-2025.1"
else
XPU_PACKAGES="intel-deep-learning-essentials-2025.0"
fi

View File

@ -51,18 +51,9 @@ ADD ./common/install_cuda.sh install_cuda.sh
ADD ./common/install_magma.sh install_magma.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
ENV CUDA_HOME /usr/local/cuda
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
RUN bash ./install_magma.sh 11.8
RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
RUN bash ./install_magma.sh 12.4
RUN ln -sf /usr/local/cuda-12.4 /usr/local/cuda
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
RUN bash ./install_magma.sh 12.6
@ -73,6 +64,11 @@ RUN bash ./install_cuda.sh 12.8
RUN bash ./install_magma.sh 12.8
RUN ln -sf /usr/local/cuda-12.8 /usr/local/cuda
FROM cuda as cuda12.9
RUN bash ./install_cuda.sh 12.9
RUN bash ./install_magma.sh 12.9
RUN ln -sf /usr/local/cuda-12.9 /usr/local/cuda
FROM cpu as rocm
ARG ROCM_VERSION
ARG PYTORCH_ROCM_ARCH
@ -89,7 +85,7 @@ ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
# gfortran and python needed for building magma from source for ROCm
RUN apt-get update -y && \
apt-get install gfortran -y && \
apt-get install python -y && \
apt-get install python3 python-is-python3 -y && \
apt-get clean
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh

View File

@ -1,83 +1,63 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
set -eoux pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
echo "Usage: $0 IMAGENAME:ARCHTAG"
exit 1
fi
DOCKER_IMAGE="pytorch/${image}"
TOPDIR=$(git rev-parse --show-toplevel)
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
WITH_PUSH=${WITH_PUSH:-}
DOCKER=${DOCKER:-docker}
case ${GPU_ARCH_TYPE} in
# Go from imagename:tag to tag
DOCKER_TAG_PREFIX=$(echo "${image}" | awk -F':' '{print $2}')
GPU_ARCH_VERSION=""
if [[ "${DOCKER_TAG_PREFIX}" == cuda* ]]; then
# extract cuda version from image name. e.g. manylinux2_28-builder:cuda12.8 returns 12.8
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'cuda' '{print $2}')
elif [[ "${DOCKER_TAG_PREFIX}" == rocm* ]]; then
# extract rocm version from image name. e.g. manylinux2_28-builder:rocm6.2.4 returns 6.2.4
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'rocm' '{print $2}')
fi
case ${DOCKER_TAG_PREFIX} in
cpu)
BASE_TARGET=cpu
DOCKER_TAG=cpu
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
cuda)
cuda*)
BASE_TARGET=cuda${GPU_ARCH_VERSION}
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
rocm)
rocm*)
BASE_TARGET=rocm
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg ROCM_VERSION=${GPU_ARCH_VERSION}"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
echo "ERROR: Unrecognized DOCKER_TAG_PREFIX: ${DOCKER_TAG_PREFIX}"
exit 1
;;
esac
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
(
set -x
DOCKER_BUILDKIT=1 ${DOCKER} build \
--target final \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
-t "${DOCKER_IMAGE}" \
$@ \
-f "${TOPDIR}/.ci/docker/libtorch/Dockerfile" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
${DOCKER} push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}
${DOCKER} push "${DOCKER_IMAGE_BRANCH_TAG}"
${DOCKER} push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi
DOCKER_BUILDKIT=1 ${DOCKER} build \
--target final \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
-t "${tmp_tag}" \
$@ \
-f "${TOPDIR}/.ci/docker/libtorch/Dockerfile" \
"${TOPDIR}/.ci/docker/"

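To make the new tag-driven flow concrete, a small sketch of parsing `IMAGENAME:ARCHTAG` into a GPU arch version and a throwaway local build tag; the image name reuses the example from the script's own comment:

```
#!/usr/bin/env bash
# Sketch: derive GPU_ARCH_VERSION and a temporary tag from IMAGENAME:ARCHTAG.
image="manylinux2_28-builder:cuda12.8"      # example argument
DOCKER_TAG_PREFIX=$(echo "${image}" | awk -F':' '{print $2}')                    # -> cuda12.8
GPU_ARCH_VERSION=""
if [[ "${DOCKER_TAG_PREFIX}" == cuda* ]]; then
    GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'cuda' '{print $2}')  # -> 12.8
elif [[ "${DOCKER_TAG_PREFIX}" == rocm* ]]; then
    GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'rocm' '{print $2}')
fi
# Local-only build tag; branch/SHA tagging and pushing were dropped from the script.
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
echo "prefix=${DOCKER_TAG_PREFIX} version=${GPU_ARCH_VERSION} tag=${tmp_tag}"
```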
View File

@ -32,7 +32,8 @@ ARG CUDA_VERSION
COPY ./common/install_cuda.sh install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu*
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu* install_cusparselt.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH

View File

@ -16,7 +16,6 @@ RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG PYTHON_VERSION
ARG PIP_CMAKE
ENV PATH /var/lib/jenkins/ci_env/bin:$PATH
ENV VIRTUAL_ENV /var/lib/jenkins/ci_env
COPY requirements-ci.txt /opt/requirements-ci.txt

View File

@ -1,202 +0,0 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=11.8
ARG GPU_IMAGE=centos:7
FROM centos:7 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
# Note: This patch is required since CentOS has reached EOL,
# otherwise any yum install step will fail
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
# Note: After running yum-config-manager --enable rhel-server-rhscl-7-rpms
# patch is required once again. Somehow this step adds mirror.centos.org
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN yum --enablerepo=extras install -y epel-release
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
RUN yum install -y autoconf aclocal automake make sudo
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# EPEL for cmake
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM patchelf as python
# build python
COPY manywheel/build_scripts /build_scripts
ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
RUN bash build_scripts/build.sh && rm -r build_scripts
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu*
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.1
ARG DEVTOOLSET_VERSION=9
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
ENV PATH /opt/conda/bin:$PATH
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake is already installed inside the rocm base image, so remove if present
RUN rpm -e cmake || true
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
# ninja
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
FROM cpu_final as rocm_final
ARG ROCM_VERSION=3.7
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
# Adding ROCM_PATH env var so that LoadHip.cmake (even with logic updated for ROCm6.0)
# find HIP works for ROCm5.7. Not needed for ROCm6.0 and above.
# Remove below when ROCm5.7 is not in support matrix anymore.
ENV ROCM_PATH /opt/rocm
ENV MKLROOT /opt/intel
# No need to install ROCm as base docker image should have full ROCm install
#ADD ./common/install_rocm.sh install_rocm.sh
#RUN ROCM_VERSION=${ROCM_VERSION} bash ./install_rocm.sh && rm install_rocm.sh
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
# cmake3 is needed for the MIOpen build
RUN ln -sf /usr/local/bin/cmake /usr/bin/cmake3
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION} && rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

View File

@ -7,8 +7,8 @@ ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=11
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel yum-utils gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
ARG DEVTOOLSET_VERSION=13
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel yum-utils gcc-toolset-${DEVTOOLSET_VERSION}-gcc gcc-toolset-${DEVTOOLSET_VERSION}-gcc-c++ gcc-toolset-${DEVTOOLSET_VERSION}-gcc-gfortran gcc-toolset-${DEVTOOLSET_VERSION}-gdb
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
@ -26,19 +26,20 @@ ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=11.8
ARG BASE_CUDA_VERSION=12.6
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh install_nccl.sh ci_commit_pins/nccl-cu*
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh install_nccl.sh ci_commit_pins/nccl-cu* install_cusparselt.sh
FROM base as intel
# MKL
@ -46,7 +47,7 @@ ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
ARG BASE_CUDA_VERSION=12.6
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
@ -63,7 +64,7 @@ ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
ARG DEVTOOLSET_VERSION=11
ARG DEVTOOLSET_VERSION=13
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
@ -86,13 +87,12 @@ RUN yum install -y \
wget \
which \
xz \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain \
glibc-langpack-en
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
glibc-langpack-en \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-c++ \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
gcc-toolset-${DEVTOOLSET_VERSION}-gdb
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
@ -103,6 +103,7 @@ ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /usr/local/lib/ /usr/local/lib/
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
@ -116,8 +117,8 @@ COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=11.8
ARG DEVTOOLSET_VERSION=11
ARG BASE_CUDA_VERSION=12.6
ARG DEVTOOLSET_VERSION=13
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
@ -156,8 +157,11 @@ ENV ROCM_PATH /opt/rocm
# and avoid 3.21.0 cmake+ninja issues with ninja inserting "-Wl,--no-as-needed" in LINK_FLAGS for static linker
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
# replace the libdrm in /opt/amdgpu with custom amdgpu.ids lookup path
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
# ROCm 6.4 rocm-smi depends on system drm.h header
RUN yum install -y libdrm-devel
ENV MKLROOT /opt/intel
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION} && rm install_rocm_magma.sh
@ -171,6 +175,6 @@ ENV XPU_DRIVER_TYPE ROLLING
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
ADD ./common/install_xpu.sh install_xpu.sh
ENV XPU_VERSION 2025.0
ENV XPU_VERSION 2025.1
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd

View File

@ -1,9 +1,8 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Graviton needs GCC 10 or above for the build. GCC12 is the default version in almalinux-8.
ARG GCCTOOLSET_VERSION=11
ARG GCCTOOLSET_VERSION=13
# Language variabes
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
@ -36,7 +35,10 @@ RUN yum install -y \
yasm \
zstd \
sudo \
gcc-toolset-${GCCTOOLSET_VERSION}-toolchain
gcc-toolset-${GCCTOOLSET_VERSION}-gcc \
gcc-toolset-${GCCTOOLSET_VERSION}-gcc-c++ \
gcc-toolset-${GCCTOOLSET_VERSION}-gcc-gfortran \
gcc-toolset-${GCCTOOLSET_VERSION}-gdb
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
@ -56,12 +58,13 @@ RUN git config --global --add safe.directory "*"
FROM base as openblas
# Install openblas
ARG OPENBLAS_VERSION
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM base as final
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6

View File

@ -1,94 +0,0 @@
FROM quay.io/pypa/manylinux2014_aarch64 as base
# Graviton needs GCC 10 for the build
ARG DEVTOOLSET_VERSION=10
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Installed needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
less \
zstd \
libgomp \
sudo \
devtoolset-${DEVTOOLSET_VERSION}-gcc \
devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ \
devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
devtoolset-${DEVTOOLSET_VERSION}-binutils
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
###############################################################################
# libglfortran.a hack
#
# libgfortran.a from quay.io/pypa/manylinux2014_aarch64 is not compiled with -fPIC.
# This causes __stack_chk_guard@@GLIBC_2.17 on pytorch build. To solve, get
# ubuntu's libgfortran.a which is compiled with -fPIC
# NOTE: Need a better way to get this library as Ubuntu's package can be removed by the vendor, or changed
###############################################################################
RUN cd ~/ \
&& curl -L -o ~/libgfortran-10-dev.deb http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-4ubuntu2_arm64.deb \
&& ar x ~/libgfortran-10-dev.deb \
&& tar --use-compress-program=unzstd -xvf data.tar.zst -C ~/ \
&& cp -f ~/usr/lib/gcc/aarch64-linux-gnu/10/libgfortran.a /opt/rh/devtoolset-10/root/usr/lib/gcc/aarch64-redhat-linux/10/
# install cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM openssl as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH

View File

@ -1,7 +1,7 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Cuda ARM build needs gcc 11
ARG DEVTOOLSET_VERSION=11
ARG DEVTOOLSET_VERSION=13
# Language variables
ENV LC_ALL=en_US.UTF-8
@ -34,7 +34,10 @@ RUN yum install -y \
zstd \
libgomp \
sudo \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
gcc-toolset-${DEVTOOLSET_VERSION}-gcc \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-c++ \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
gcc-toolset-${DEVTOOLSET_VERSION}-gdb
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
@ -57,7 +60,7 @@ RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM openssl as final
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
@ -66,10 +69,11 @@ RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION
# Install CUDA
ADD ./common/install_cuda_aarch64.sh install_cuda_aarch64.sh
ADD ./common/install_cuda.sh install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./common/install_cusparselt.sh install_cusparselt.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash ./install_cuda_aarch64.sh ${BASE_CUDA_VERSION} && rm install_cuda_aarch64.sh install_nccl.sh ci_commit_pins/nccl-cu*
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh install_nccl.sh ci_commit_pins/nccl-cu* install_cusparselt.sh
FROM base as magma
ARG BASE_CUDA_VERSION

View File

@ -5,7 +5,9 @@ ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV LANGUAGE=C.UTF-8
ARG DEVTOOLSET_VERSION=13
# there is a bugfix in gcc >= 14 for precompiled headers and s390x vectorization interaction.
# with earlier gcc versions test/inductor/test_cpu_cpp_wrapper.py will fail.
ARG DEVTOOLSET_VERSION=14
# Installed needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
@ -58,7 +60,8 @@ RUN yum install -y \
libxslt-devel \
libxml2-devel \
openssl-devel \
valgrind
valgrind \
ninja-build
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
@ -103,9 +106,6 @@ CMD ["/bin/bash"]
# install test dependencies:
# - grpcio requires system openssl, bundled crypto fails to build
RUN dnf install -y \
protobuf-devel \
protobuf-c-devel \
protobuf-lite-devel \
hdf5-devel \
python3-h5py \
git
@ -120,15 +120,20 @@ RUN python3 -mpip install cmake==3.28.0
# so just build it from upstream repository.
# h5py is dependency of onnxruntime_training.
# h5py==3.11.0 builds with hdf5-devel 1.10.5 from repository.
# h5py 3.11.0 doesn't build with numpy >= 2.3.0.
# install newest flatbuffers version first:
# for some reason old version is getting pulled in otherwise.
# packaging package is required for onnxruntime wheel build.
RUN pip3 install flatbuffers && \
pip3 install h5py==3.11.0 && \
pip3 install cython 'pkgconfig>=1.5.5' 'setuptools>=77' 'numpy<2.3.0' && \
pip3 install --no-build-isolation h5py==3.11.0 && \
pip3 install packaging && \
git clone https://github.com/microsoft/onnxruntime && \
cd onnxruntime && git checkout v1.21.0 && \
git submodule update --init --recursive && \
./build.sh --config Release --parallel 0 --enable_pybind --build_wheel --enable_training --enable_training_apis --enable_training_ops --skip_tests --allow_running_as_root && \
./build.sh --config Release --parallel 0 --enable_pybind \
--build_wheel --enable_training --enable_training_apis \
--enable_training_ops --skip_tests --allow_running_as_root \
--compile_no_warning_as_error && \
pip3 install ./build/Linux/Release/dist/onnxruntime_training-*.whl && \
cd .. && /bin/rm -rf ./onnxruntime

View File

@ -1,7 +1,7 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
set -exou pipefail
TOPDIR=$(git rev-parse --show-toplevel)
@ -9,152 +9,111 @@ image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
echo "Usage: $0 IMAGE:ARCHTAG"
exit 1
fi
DOCKER_IMAGE="pytorch/${image}"
# Go from imagename:tag to tag
DOCKER_TAG_PREFIX=$(echo "${image}" | awk -F':' '{print $2}')
DOCKER_REGISTRY="${DOCKER_REGISTRY:-docker.io}"
GPU_ARCH_VERSION=""
if [[ "${DOCKER_TAG_PREFIX}" == cuda* ]]; then
# extract cuda version from image name. e.g. manylinux2_28-builder:cuda12.8 returns 12.8
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'cuda' '{print $2}')
elif [[ "${DOCKER_TAG_PREFIX}" == rocm* ]]; then
# extract rocm version from image name. e.g. manylinux2_28-builder:rocm6.2.4 returns 6.2.4
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'rocm' '{print $2}')
fi
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}
DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}
WITH_PUSH=${WITH_PUSH:-}
OPENBLAS_VERSION=${OPENBLAS_VERSION:-}
case ${GPU_ARCH_TYPE} in
cpu)
case ${image} in
manylinux2_28-builder:cpu)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
;;
cpu-manylinux_2_28)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="2_28"
;;
cpu-aarch64)
manylinux2_28_aarch64-builder:cpu-aarch64)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=10"
MANY_LINUX_VERSION="aarch64"
;;
cpu-aarch64-2_28)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11 --build-arg NINJA_VERSION=1.12.1"
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
OPENBLAS_VERSION="v0.3.29"
;;
cpu-cxx11-abi)
manylinuxcxx11-abi-builder:cpu-cxx11-abi)
TARGET=final
DOCKER_TAG=cpu-cxx11-abi
GPU_IMAGE=""
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
MANY_LINUX_VERSION="cxx11-abi"
;;
cpu-s390x)
manylinuxs390x-builder:cpu-s390x)
TARGET=final
DOCKER_TAG=cpu-s390x
GPU_IMAGE=s390x/almalinux:8
DOCKER_GPU_BUILD_ARG=""
MANY_LINUX_VERSION="s390x"
;;
cuda)
manylinux2_28-builder:cuda11*)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
# Keep this up to date with the minimum version of CUDA we currently support
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=9"
;;
cuda-manylinux_2_28)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
cuda-aarch64)
manylinux2_28-builder:cuda12*)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="2_28"
;;
manylinuxaarch64-builder:cuda*)
TARGET=cuda_final
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="aarch64"
DOCKERFILE_SUFFIX="_cuda_aarch64"
;;
rocm|rocm-manylinux_2_28)
manylinux2_28-builder:rocm*)
TARGET=rocm_final
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-centos-7:${GPU_ARCH_VERSION}-complete
DEVTOOLSET_VERSION="9"
if [ ${GPU_ARCH_TYPE} == "rocm-manylinux_2_28" ]; then
MANY_LINUX_VERSION="2_28"
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
fi
MANY_LINUX_VERSION="2_28"
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
xpu)
manylinux2_28-builder:xpu)
TARGET=xpu_final
DOCKER_TAG=xpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
echo "ERROR: Unrecognized image name: ${image}"
exit 1
;;
esac
IMAGES=''
if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then
DOCKERFILE_SUFFIX=_${MANY_LINUX_VERSION}
fi
(
set -x
# Only activate this if in CI
if [ "$(uname -m)" != "s390x" ] && [ -v CI ]; then
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
fi
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \
-t "${DOCKER_IMAGE}" \
$@ \
-f "${TOPDIR}/.ci/docker/manywheel/Dockerfile${DOCKERFILE_SUFFIX}" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-"dev"}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}
docker tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
# Only activate this if in CI
if [ "$(uname -m)" != "s390x" ] && [ -v CI ]; then
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
fi
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "OPENBLAS_VERSION=${OPENBLAS_VERSION}" \
--target "${TARGET}" \
-t "${tmp_tag}" \
$@ \
-f "${TOPDIR}/.ci/docker/manywheel/Dockerfile${DOCKERFILE_SUFFIX}" \
"${TOPDIR}/.ci/docker/"

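And the matching dispatch in the manywheel script now keys directly off the full image string with glob patterns; a reduced sketch that echoes the routing instead of building, with example image names taken from the cases above:

```
#!/usr/bin/env bash
# Sketch: glob-pattern dispatch over IMAGE:ARCHTAG (echoes instead of building).
for image in "manylinux2_28-builder:cuda12.8" \
             "manylinux2_28_aarch64-builder:cpu-aarch64" \
             "manylinux2_28-builder:rocm6.4" \
             "manylinux2_28-builder:unknown"; do
    case ${image} in
        manylinux2_28-builder:cpu)
            echo "${image} -> TARGET=cpu_final, gcc-toolset-13" ;;
        manylinux2_28-builder:cuda12*)
            echo "${image} -> TARGET=cuda_final, almalinux:8, gcc-toolset-13" ;;
        manylinux2_28_aarch64-builder:cpu-aarch64)
            echo "${image} -> TARGET=final, MANY_LINUX_VERSION=2_28_aarch64" ;;
        manylinux2_28-builder:rocm*)
            echo "${image} -> TARGET=rocm_final, dev-almalinux-8 base" ;;
        *)
            echo "ERROR: Unrecognized image name: ${image}" ;;
    esac
done
```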
View File

@ -97,7 +97,7 @@ find /opt/_internal -type f -print0 \
| xargs -0 -n1 strip --strip-unneeded 2>/dev/null || true
# We do not need the Python test suites, or indeed the precompiled .pyc and
# .pyo files. Partially cribbed from:
# https://github.com/docker-library/python/blob/master/3.4/slim/Dockerfile
# https://github.com/docker-library/python/blob/master/3.4/slim/Dockerfile # @lint-ignore
find /opt/_internal \
\( -type d -a -name test -o -name tests \) \
-o \( -type f -a -name '*.pyc' -o -name '*.pyo' \) \

View File

@ -2,7 +2,7 @@
# Helper utilities for build
# Script used only in CD pipeline
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/ # @lint-ignore
CURL_DOWNLOAD_URL=https://curl.se/download
AUTOCONF_DOWNLOAD_URL=https://ftp.gnu.org/gnu/autoconf

View File

@ -41,14 +41,11 @@ fbscribelogger==0.1.7
#Pinned versions: 0.1.6
#test that import:
flatbuffers==2.0 ; platform_machine != "s390x"
flatbuffers==24.12.23
#Description: cross platform serialization library
#Pinned versions: 2.0
#Pinned versions: 24.12.23
#test that import:
flatbuffers ; platform_machine == "s390x"
#Description: cross platform serialization library; Newer version is required on s390x for new python version
hypothesis==5.35.1
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
#Description: advanced library for generating parametrized tests
@ -93,10 +90,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.14.0
mypy==1.16.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.14.0
#Pinned versions: 1.16.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -166,10 +163,10 @@ pillow==11.0.0
#Pinned versions: 10.3.0
#test that import:
protobuf==3.20.2
#Description: Googles data interchange format
#Pinned versions: 3.20.1
#test that import: test_tensorboard.py
protobuf==5.29.4
#Description: Google's data interchange format
#Pinned versions: 5.29.4
#test that import: test_tensorboard.py, test/onnx/*
psutil
#Description: information on running processes and system utilization
@ -337,12 +334,12 @@ sympy==1.13.3
#Pinned versions:
#test that import:
onnx==1.17.0
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
onnx==1.18.0
#Description: Required by onnx tests, and mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
onnxscript==0.2.2
onnxscript==0.2.6
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -379,3 +376,13 @@ dataclasses_json==0.6.7
#Description: required for data pipeline and scripts under tools/stats
#Pinned versions: 0.6.7
#test that import:
cmake==4.0.0
#Description: required for building
tlparse==0.3.30
#Description: required for log parsing
cuda-bindings>=12.0,<13.0
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
#test that import: test_cuda.py

View File

@ -1,15 +1,24 @@
sphinx==5.3.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@pytorch_sphinx_theme2#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought is probably
# something related to Docker setup. We can investigate this later
sphinxcontrib.katex==0.8.6
#Description: This is used to generate PyTorch docs
#Pinned versions: 0.8.6
sphinxext-opengraph==0.9.1
#Description: This is used to generate PyTorch docs
#Pinned versions: 0.9.1
sphinx_sitemap==2.6.0
#Description: This is used to generate sitemap for PyTorch docs
#Pinned versions: 2.6.0
matplotlib==3.5.3
#Description: This is used to generate PyTorch docs
#Pinned versions: 3.5.3
@ -46,5 +55,6 @@ myst-nb==0.17.2
# The following are required to build torch.distributed.elastic.rendezvous.etcd* docs
python-etcd==0.4.5
sphinx-copybutton==0.5.0
sphinx-panels==0.4.1
sphinx-design==0.4.0
sphinxcontrib-mermaid==1.0.0
myst-parser==0.18.1

View File

@ -1 +1 @@
3.3.0
3.3.1

View File

@ -0,0 +1 @@
3.4.0

View File

@ -1,184 +0,0 @@
ARG UBUNTU_VERSION
ARG CUDA_VERSION
ARG IMAGE_NAME
FROM ${IMAGE_NAME} as base
ARG UBUNTU_VERSION
ARG CUDA_VERSION
ENV DEBIAN_FRONTEND noninteractive
# Install common dependencies (so that this step can be cached separately)
COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install katex
ARG KATEX
COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ARG CONDA_CMAKE
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install gcc
ARG GCC_VERSION
COPY ./common/install_gcc.sh install_gcc.sh
RUN bash ./install_gcc.sh && rm install_gcc.sh
# Install clang
ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install UCC
ARG UCX_COMMIT
ARG UCC_COMMIT
ENV UCX_COMMIT $UCX_COMMIT
ENV UCC_COMMIT $UCC_COMMIT
ENV UCX_HOME /usr
ENV UCC_HOME /usr
ADD ./common/install_ucc.sh install_ucc.sh
RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi
RUN rm install_ucc.sh
COPY ./common/install_openssl.sh install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
ARG INDUCTOR_BENCHMARKS
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
ARG TRITON
FROM base as triton-builder
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access to
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN bash ./install_triton.sh
FROM base as final
COPY --from=triton-builder /opt/triton /opt/triton
RUN if [ -n "${TRITON}" ]; then pip install /opt/triton/*.whl; chown -R jenkins:jenkins /opt/conda; fi
RUN rm -rf /opt/triton
ARG HALIDE
# Build and install halide
COPY ./common/install_halide.sh install_halide.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/halide.txt halide.txt
RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi
RUN rm install_halide.sh common_utils.sh halide.txt
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
# See https://github.com/pytorch/pytorch/issues/82174
# TODO(sdym@fb.com):
# check if this is still needed after the full migration off Xenial
ENV CARGO_NET_GIT_FETCH_WITH_CLI true
RUN bash ./install_cache.sh && rm install_cache.sh
ENV CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache
# Add jni.h for java host build
COPY ./common/install_jni.sh install_jni.sh
COPY ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
# Install Open MPI for CUDA
COPY ./common/install_openmpi.sh install_openmpi.sh
RUN if [ -n "${CUDA_VERSION}" ]; then bash install_openmpi.sh; fi
RUN rm install_openmpi.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# AWS specific CUDA build guidance
ENV TORCH_CUDA_ARCH_LIST Maxwell
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
ENV CUDA_PATH /usr/local/cuda
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
# Install CUDNN
ARG CUDNN_VERSION
ARG CUDA_VERSION
COPY ./common/install_cudnn.sh install_cudnn.sh
RUN if [ -n "${CUDNN_VERSION}" ]; then bash install_cudnn.sh; fi
RUN rm install_cudnn.sh
# Install CUSPARSELT
ARG CUDA_VERSION
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash install_cusparselt.sh
RUN rm install_cusparselt.sh
# Install NCCL
ARG CUDA_VERSION
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash install_nccl.sh
RUN rm install_nccl.sh /ci_commit_pins/nccl-cu*
ENV USE_SYSTEM_NCCL=1
ENV NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# Install CUDSS
ARG CUDA_VERSION
COPY ./common/install_cudss.sh install_cudss.sh
RUN bash install_cudss.sh
RUN rm install_cudss.sh
# Delete /usr/local/cuda-11.X/cuda-11.X symlinks
RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi
RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi
RUN if [ -h /usr/local/cuda-12.1/cuda-12.1 ]; then rm /usr/local/cuda-12.1/cuda-12.1; fi
RUN if [ -h /usr/local/cuda-12.4/cuda-12.4 ]; then rm /usr/local/cuda-12.4/cuda-12.4; fi
USER jenkins
CMD ["bash"]

View File

@ -25,9 +25,9 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ARG CONDA_CMAKE
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
@ -43,13 +43,6 @@ ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -108,12 +101,6 @@ COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh

View File

@ -28,7 +28,6 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
@ -73,7 +72,7 @@ ARG TRITON
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-xpu.txt triton-xpu.txt
COPY triton_version.txt triton_version.txt
COPY triton_xpu_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-xpu.txt triton_version.txt
@ -84,12 +83,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh

View File

@ -28,7 +28,6 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
@ -54,7 +53,8 @@ ARG CUDA_VERSION
COPY ./common/install_cuda.sh install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu*
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu* install_cusparselt.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH
# No effect if cuda not installed
@ -74,13 +74,6 @@ ADD ./common/install_ucc.sh install_ucc.sh
RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi
RUN rm install_ucc.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -88,12 +81,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh

View File

@ -1,7 +1,7 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_ROCM ?= 6.3
DESIRED_ROCM ?= 6.4
DESIRED_ROCM_SHORT = $(subst .,,$(DESIRED_ROCM))
PACKAGE_NAME = magma-rocm
# inherit this from underlying docker image, do not pass this env var to docker
@ -12,24 +12,24 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-w /builder \
-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_ROCM_SHORT} \
-e DESIRED_ROCM=${DESIRED_ROCM} \
"pytorch/manylinux2_28-builder:rocm${DESIRED_ROCM}-main" \
"pytorch/almalinux-builder:rocm${DESIRED_ROCM}" \
magma-rocm/build_magma.sh
.PHONY: all
all: magma-rocm64
all: magma-rocm63
all: magma-rocm624
.PHONY:
clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-rocm64
magma-rocm64: DESIRED_ROCM := 6.4
magma-rocm64:
$(DOCKER_RUN)
.PHONY: magma-rocm63
magma-rocm63: DESIRED_ROCM := 6.3
magma-rocm63:
$(DOCKER_RUN)
.PHONY: magma-rocm624
magma-rocm624: DESIRED_ROCM := 6.2.4
magma-rocm624:
$(DOCKER_RUN)

View File

@ -1,7 +1,7 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_CUDA ?= 11.8
DESIRED_CUDA ?= 12.8
DESIRED_CUDA_SHORT = $(subst .,,$(DESIRED_CUDA))
PACKAGE_NAME = magma-cuda
CUDA_ARCH_LIST ?= -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90
@ -12,20 +12,25 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_CUDA_SHORT} \
-e DESIRED_CUDA=${DESIRED_CUDA} \
-e CUDA_ARCH_LIST="${CUDA_ARCH_LIST}" \
"pytorch/manylinux2_28-builder:cuda${DESIRED_CUDA}-main" \
"pytorch/almalinux-builder:cuda${DESIRED_CUDA}-main" \
magma/build_magma.sh
.PHONY: all
all: magma-cuda129
all: magma-cuda128
all: magma-cuda126
all: magma-cuda124
all: magma-cuda118
.PHONY:
clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-cuda129
magma-cuda129: DESIRED_CUDA := 12.9
magma-cuda129: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
magma-cuda129:
$(DOCKER_RUN)
.PHONY: magma-cuda128
magma-cuda128: DESIRED_CUDA := 12.8
magma-cuda128: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
@ -36,14 +41,3 @@ magma-cuda128:
magma-cuda126: DESIRED_CUDA := 12.6
magma-cuda126:
$(DOCKER_RUN)
.PHONY: magma-cuda124
magma-cuda124: DESIRED_CUDA := 12.4
magma-cuda124:
$(DOCKER_RUN)
.PHONY: magma-cuda118
magma-cuda118: DESIRED_CUDA := 11.8
magma-cuda118: CUDA_ARCH_LIST += -gencode arch=compute_37,code=sm_37
magma-cuda118:
$(DOCKER_RUN)

View File

@ -18,12 +18,10 @@ retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
PLATFORM="manylinux2014_x86_64"
PLATFORM=""
# TODO move this into the Docker images
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
PLATFORM="manylinux_2_28_x86_64"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
@ -33,9 +31,11 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
# Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968
# shellcheck disable=SC2046
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
# We use the package name to test the package by passing this to 'pip install'
@ -79,8 +79,6 @@ if [[ -e /opt/openssl ]]; then
export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH
fi
mkdir -p /tmp/$WHEELHOUSE_DIR
export PATCHELF_BIN=/usr/local/bin/patchelf
@ -99,6 +97,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -q cmake
python setup.py clean
retry pip install -qr requirements.txt
case ${DESIRED_PYTHON} in
@ -152,7 +151,7 @@ if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR --cmake
CMAKE_FRESH=1 python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
@ -321,8 +320,8 @@ for pkg in /$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/torch*linux*.w
# ROCm workaround for roctracer dlopens
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
patchedpath=$(fname_without_so_number $destpath)
# Keep the so number for XPU dependencies
elif [[ "$DESIRED_CUDA" == *"xpu"* ]]; then
# Keep the so number for XPU dependencies and libgomp.so.1 to avoid twice load
elif [[ "$DESIRED_CUDA" == *"xpu"* || "$filename" == "libgomp.so.1" ]]; then
patchedpath=$destpath
else
patchedpath=$(fname_with_sha256 $destpath)

View File

@ -15,6 +15,9 @@ export INSTALL_TEST=0 # dont install test binaries into site-packages
export USE_CUPTI_SO=0
export USE_CUSPARSELT=${USE_CUSPARSELT:-1} # Enable if not disabled by libtorch build
export USE_CUFILE=${USE_CUFILE:-1}
export USE_SYSTEM_NCCL=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
@ -36,10 +39,8 @@ if [[ -n "$DESIRED_CUDA" ]]; then
if [[ ${DESIRED_CUDA} =~ ^[0-9]+\.[0-9]+$ ]]; then
CUDA_VERSION=${DESIRED_CUDA}
else
# cu90, cu92, cu100, cu101
if [[ ${#DESIRED_CUDA} -eq 4 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:1}.${DESIRED_CUDA:3:1}"
elif [[ ${#DESIRED_CUDA} -eq 5 ]]; then
# cu126, cu128 etc...
if [[ ${#DESIRED_CUDA} -eq 5 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4:1}"
fi
fi
@ -53,22 +54,18 @@ cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
12.8)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8 and will be removed in future releases
12.8|12.9)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
# WAR to resolve the ld error in libtorch build with CUDA 12.9
if [[ "$DESIRED_CUDA" == "cu129" && "$PACKAGE_TYPE" == "libtorch" ]]; then
TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"
fi
;;
12.6)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.4)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
11.8)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};3.7;9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
*)
echo "unknown cuda version $CUDA_VERSION"
exit 1
@ -91,14 +88,15 @@ fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
DEPS_LIST=(
@ -108,31 +106,12 @@ DEPS_SONAME=(
"libgomp.so.1"
)
# CUDA 11.8 have to ship the libcusparseLt.so.0 with the binary
# since nvidia-cusparselt-cu11 is not available in PYPI
if [[ $USE_CUSPARSELT == "1" && $CUDA_VERSION == "11.8" ]]; then
DEPS_SONAME+=(
"libcusparseLt.so.0"
)
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcusparseLt.so.0"
)
fi
# Turn USE_CUFILE off for CUDA 11.8, 12.4 since nvidia-cufile-cu11 and 1.9.0.20 are
# not available in PYPI
if [[ $CUDA_VERSION == "11.8" || $CUDA_VERSION == "12.4" ]]; then
export USE_CUFILE=0
fi
# CUDA_VERSION 12.4, 12.6, 12.8
# CUDA_VERSION 12.6, 12.8, 12.9
if [[ $CUDA_VERSION == 12* ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
@ -148,9 +127,10 @@ if [[ $CUDA_VERSION == 12* ]]; then
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcusparseLt.so.0"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/lib64/libnvrtc-builtins.so"
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
@ -165,19 +145,15 @@ if [[ $CUDA_VERSION == 12* ]]; then
"libcublasLt.so.12"
"libcusparseLt.so.0"
"libcudart.so.12"
"libnvToolsExt.so.1"
"libnvrtc.so.12"
"libnvrtc-builtins.so"
"libcufile.so.0"
"libcufile_rdma.so.1"
)
if [[ $USE_CUFILE == 1 ]]; then
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
)
DEPS_SONAME+=(
"libcufile.so.0"
"libcufile_rdma.so.1"
)
# Add libnvToolsExt only if CUDA version is not 12.9
if [[ $CUDA_VERSION != 12.9* ]]; then
DEPS_LIST+=("/usr/local/cuda/lib64/libnvToolsExt.so.1")
DEPS_SONAME+=("libnvToolsExt.so.1")
fi
else
echo "Using nvidia libs from pypi."
@ -191,94 +167,21 @@ if [[ $CUDA_VERSION == 12* ]]; then
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../nvidia/cusparselt/lib'
'$ORIGIN/../../cusparselt/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvshmem/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
)
if [[ $USE_CUFILE == 1 ]]; then
CUDA_RPATHS+=(
'$ORIGIN/../../nvidia/cufile/lib'
)
fi
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
elif [[ $CUDA_VERSION == "11.8" ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
# Bundle ptxas into the wheel, see https://github.com/pytorch/pytorch/pull/119750
export BUILD_BUNDLE_PTXAS=1
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcudnn_adv.so.9"
"/usr/local/cuda/lib64/libcudnn_cnn.so.9"
"/usr/local/cuda/lib64/libcudnn_graph.so.9"
"/usr/local/cuda/lib64/libcudnn_ops.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9"
"/usr/local/cuda/lib64/libcudnn.so.9"
"/usr/local/cuda/lib64/libcublas.so.11"
"/usr/local/cuda/lib64/libcublasLt.so.11"
"/usr/local/cuda/lib64/libcudart.so.11.0"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.11.2" # this is not a mistake, it links to more specific cuda version
"/usr/local/cuda/lib64/libnvrtc-builtins.so.11.8"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
"libcudnn_cnn.so.9"
"libcudnn_graph.so.9"
"libcudnn_ops.so.9"
"libcudnn_engines_runtime_compiled.so.9"
"libcudnn_engines_precompiled.so.9"
"libcudnn_heuristic.so.9"
"libcudnn.so.9"
"libcublas.so.11"
"libcublasLt.so.11"
"libcudart.so.11.0"
"libnvToolsExt.so.1"
"libnvrtc.so.11.2"
"libnvrtc-builtins.so.11.8"
)
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cudnn/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
'$ORIGIN/../../nvidia/cufile/lib'
)
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
else
echo "Unknown cuda version $CUDA_VERSION"

View File

@ -22,9 +22,7 @@ retry () {
# TODO move this into the Docker images
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
@ -35,6 +33,9 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if
@ -91,6 +92,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -q cmake
python setup.py clean
retry pip install -qr requirements.txt
retry pip install -q numpy==2.0.1

View File

@ -95,6 +95,7 @@ ROCM_SO_FILES=(
"libroctracer64.so"
"libroctx64.so"
"libhipblaslt.so"
"libhipsparselt.so"
"libhiprtc.so"
)
@ -186,20 +187,28 @@ do
OS_SO_FILES[${#OS_SO_FILES[@]}]=$file_name # Append lib to array
done
ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Replace ; separated arch list to bar for grep
# rocBLAS library files
ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library
ROCBLAS_LIB_DST=lib/rocblas/library
ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Replace ; separated arch list to bar for grep
ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
ROCBLAS_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
ROCBLAS_ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
ROCBLAS_OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $ROCBLAS_OTHER_FILES)
# hipblaslt library files
HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library
HIPBLASLT_LIB_DST=lib/hipblaslt/library
ARCH_SPECIFIC_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -v gfx)
HIPBLASLT_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
HIPBLASLT_ARCH_SPECIFIC_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -E $ARCH)
HIPBLASLT_OTHER_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -v gfx)
HIPBLASLT_LIB_FILES=($HIPBLASLT_ARCH_SPECIFIC_FILES $HIPBLASLT_OTHER_FILES)
# hipsparselt library files
HIPSPARSELT_LIB_SRC=$ROCM_HOME/lib/hipsparselt/library
HIPSPARSELT_LIB_DST=lib/hipsparselt/library
HIPSPARSELT_ARCH_SPECIFIC_FILES=$(ls $HIPSPARSELT_LIB_SRC | grep -E $ARCH)
#HIPSPARSELT_OTHER_FILES=$(ls $HIPSPARSELT_LIB_SRC | grep -v gfx)
HIPSPARSELT_LIB_FILES=($HIPSPARSELT_ARCH_SPECIFIC_FILES $HIPSPARSELT_OTHER_FILES)
# ROCm library files
ROCM_SO_PATHS=()
@ -234,12 +243,14 @@ DEPS_SONAME=(
DEPS_AUX_SRCLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_SRC/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_SRC/}"
"${HIPSPARSELT_LIB_FILES[@]/#/$HIPSPARSELT_LIB_SRC/}"
"/opt/amdgpu/share/libdrm/amdgpu.ids"
)
DEPS_AUX_DSTLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_DST/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_DST/}"
"${HIPSPARSELT_LIB_FILES[@]/#/$HIPSPARSELT_LIB_DST/}"
"share/libdrm/amdgpu.ids"
)

View File

@ -20,7 +20,11 @@ fi
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/pti/latest/env/vars.sh
source /opt/intel/oneapi/umf/latest/env/vars.sh
source /opt/intel/oneapi/ccl/latest/env/vars.sh
source /opt/intel/oneapi/mpi/latest/env/vars.sh
export USE_STATIC_MKL=1
export USE_ONEMKL=1
export USE_XCCL=1
WHEELHOUSE_DIR="wheelhousexpu"
LIBTORCH_HOUSE_DIR="libtorch_housexpu"

View File

@ -10,5 +10,3 @@ example: `py2-cuda9.0-cudnn7-ubuntu16.04`. The Docker images that are
built on Jenkins and are used in triggered builds already have this
environment variable set in their manifest. Also see
`./docker/jenkins/*/Dockerfile` and search for `BUILD_ENVIRONMENT`.
Our Jenkins installation is located at https://ci.pytorch.org/jenkins/.

View File

@ -27,6 +27,12 @@ cmake --version
echo "Environment variables:"
env
# The sccache wrapped version of nvcc gets put in /opt/cache/lib in docker since
# there are some issues if it is always wrapped, so we need to add it to PATH
# during CI builds.
# https://github.com/pytorch/pytorch/blob/0b6c0898e6c352c8ea93daec854e704b41485375/.ci/docker/common/install_cache.sh#L97
export PATH="/opt/cache/lib:$PATH"
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
# Use jemalloc during compilation to mitigate https://github.com/pytorch/pytorch/issues/116289
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
@ -52,12 +58,6 @@ fi
export USE_LLVM=/opt/llvm
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
if [[ "$BUILD_ENVIRONMENT" == *executorch* ]]; then
# To build test_edge_op_registration
export BUILD_EXECUTORCH=ON
export USE_CUDA=0
fi
if ! which conda; then
# In ROCm CIs, we are doing cross compilation on build machines with
# intel cpu and later run tests on machines with amd cpu.
@ -171,6 +171,12 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# shellcheck disable=SC1091
source /opt/intel/oneapi/ccl/latest/env/vars.sh
# shellcheck disable=SC1091
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# Enable XCCL build
export USE_XCCL=1
# XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA
export USE_KINETO=0
export TORCH_XPU_ARCH_LIST=pvc
@ -301,6 +307,18 @@ else
fi
pip_install_whl "$(echo dist/*.whl)"
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
echo "Checking that xpu is compiled"
pushd dist/
if python -c 'import torch; exit(0 if torch.xpu._is_compiled() else 1)'; then
echo "XPU support is compiled in."
else
echo "XPU support is NOT compiled in."
exit 1
fi
popd
fi
# TODO: I'm not sure why, but somehow we lose verbose commands
set -x

View File

@ -216,6 +216,14 @@ else
fi
fi
###############################################################################
# Check XPU configured correctly
###############################################################################
if [[ "$DESIRED_CUDA" == 'xpu' && "$PACKAGE_TYPE" != 'libtorch' ]]; then
echo "Checking that xpu is compiled"
python -c 'import torch; exit(0 if torch.xpu._is_compiled() else 1)'
fi
###############################################################################
# Check CUDA configured correctly
###############################################################################
@ -294,19 +302,22 @@ except RuntimeError as e:
fi
###############################################################################
# Check for C++ ABI compatibility to GCC-11
# Check for C++ ABI compatibility to GCC-11 - GCC 13
###############################################################################
if [[ "$(uname)" == 'Linux' && "$PACKAGE_TYPE" == 'manywheel' ]]; then
pushd /tmp
# Per https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html gcc-11 is ABI16
# Though manylinux_2.28 should have been build with gcc-14, per
# https://github.com/pypa/manylinux?tab=readme-ov-file#manylinux_2_28-almalinux-8-based
# On s390x gcc 14 is used because it contains fix for interaction
# between precompiled headers and vectorization builtins.
# This fix is not available in earlier gcc versions.
# gcc-14 uses ABI19.
if [[ "$(uname -m)" != "s390x" ]]; then
python -c "import torch; exit(0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi1016' else 1)"
# Per https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html
# gcc-11 is ABI16, gcc-13 is ABI18, gcc-14 is ABI19
# gcc 11 - CUDA 11.8, xpu, rocm
# gcc 13 - CUDA 12.6, 12.8 and cpu
# Please see issue for reference: https://github.com/pytorch/pytorch/issues/152426
if [[ "$(uname -m)" == "s390x" ]]; then
cxx_abi="19"
elif [[ "$DESIRED_CUDA" != 'xpu' && "$DESIRED_CUDA" != 'rocm'* ]]; then
cxx_abi="18"
else
cxx_abi="16"
fi
python -c "import torch; exit(0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi10${cxx_abi}' else 1)"
popd
fi
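The rewritten check above is a plain string comparison: pick the expected pybind11 ABI tag from the toolchain used to build the wheel (gcc-11 -> ABI16, gcc-13 -> ABI18, gcc-14 -> ABI19, per the comments) and compare it with the tag torch reports at runtime. A minimal Python sketch of that logic (the DESIRED_CUDA handling is simplified and the default value is only an assumption for illustration):
import os
import torch

# gcc-11 -> ABI16, gcc-13 -> ABI18, gcc-14 -> ABI19 (see the comments above)
desired_cuda = os.environ.get("DESIRED_CUDA", "cu128")  # assumed default, for illustration only
if os.uname().machine == "s390x":
    cxx_abi = "19"
elif desired_cuda != "xpu" and not desired_cuda.startswith("rocm"):
    cxx_abi = "18"
else:
    cxx_abi = "16"
# torch._C._PYBIND11_BUILD_ABI is the same attribute the script itself queries
assert torch._C._PYBIND11_BUILD_ABI == f"_cxxabi10{cxx_abi}"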

View File

@ -13,12 +13,8 @@ if [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
# HIP_PLATFORM is auto-detected by hipcc; unset to avoid build errors
unset HIP_PLATFORM
export PYTORCH_TEST_WITH_ROCM=1
# temporary to locate some kernel issues on the CI nodes
export HSAKMT_DEBUG_LEVEL=4
# improve rccl performance for distributed tests
export HSA_FORCE_FINE_GRAIN_PCIE=1
fi
# TODO: Renable libtorch testing for MacOS, see https://github.com/pytorch/pytorch/issues/62598
# TODO: Reenable libtorch testing for MacOS, see https://github.com/pytorch/pytorch/issues/62598
# shellcheck disable=SC2034
BUILD_TEST_LIBTORCH=0

View File

@ -159,11 +159,6 @@ function install_torchvision() {
fi
}
function install_tlparse() {
pip_install --user "tlparse==0.3.30"
PATH="$(python -m site --user-base)/bin:$PATH"
}
function install_torchrec_and_fbgemm() {
local torchrec_commit
torchrec_commit=$(get_pinned_commit torchrec)

View File

@ -1,31 +1,50 @@
#!/bin/bash
# Script for installing sccache on the xla build job, which uses xla's docker
# image and doesn't have sccache installed on it. This is mostly copied from
# .ci/docker/install_cache.sh. Changes are: removing checks that will always
# return the same thing, ex checks for for rocm, CUDA, and changing the path
# where sccache is installed, and not changing /etc/environment.
# image, which has sccache installed but doesn't write the stubs. This is
# mostly copied from .ci/docker/install_cache.sh. Changes are: removing checks
# that will always return the same thing, ex checks for rocm, CUDA, changing
# the path where sccache is installed, not changing /etc/environment, and not
# installing/downloading sccache as it is already in the docker image.
set -ex -o pipefail
install_binary() {
echo "Downloading sccache binary from S3 repo"
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /tmp/cache/bin/sccache
}
mkdir -p /tmp/cache/bin
mkdir -p /tmp/cache/lib
export PATH="/tmp/cache/bin:$PATH"
install_binary
chmod a+x /tmp/cache/bin/sccache
function write_sccache_stub() {
# Unset LD_PRELOAD for ps because of asan + ps issues
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589
# shellcheck disable=SC2086
# shellcheck disable=SC2059
printf "#!/bin/sh\nif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then\n exec sccache $(which $1) \"\$@\"\nelse\n exec $(which $1) \"\$@\"\nfi" > "/tmp/cache/bin/$1"
if [ "$1" == "gcc" ]; then
# Do not call sccache recursively when dumping preprocessor argument
# For some reason it's very important for the first cached nvcc invocation
cat >"/tmp/cache/bin/$1" <<EOF
#!/bin/sh
# sccache does not support -E flag, so we need to call the original compiler directly in order to avoid calling this wrapper recursively
for arg in "\$@"; do
if [ "\$arg" = "-E" ]; then
exec $(which "$1") "\$@"
fi
done
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache $(which "$1") "\$@"
else
exec $(which "$1") "\$@"
fi
EOF
else
cat >"/tmp/cache/bin/$1" <<EOF
#!/bin/sh
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache $(which "$1") "\$@"
else
exec $(which "$1") "\$@"
fi
EOF
fi
chmod a+x "/tmp/cache/bin/$1"
}

View File

@ -34,11 +34,14 @@ if which sccache > /dev/null; then
fi
print_cmake_info
# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
if [[ ${BUILD_ENVIRONMENT} == *"distributed"* ]]; then
# Needed for inductor benchmarks, as lots of HF networks make `torch.distributed` calls
USE_DISTRIBUTED=1 USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
else
# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel --plat-name macosx_11_0_arm64
fi
if which sccache > /dev/null; then
print_sccache_stats
fi

View File

@ -20,14 +20,4 @@ print_cmake_info() {
CONDA_INSTALLATION_DIR=$(dirname "$CMAKE_EXEC")
# Print all libraries under cmake rpath for debugging
ls -la "$CONDA_INSTALLATION_DIR/../lib"
export CMAKE_EXEC
# Explicitly add conda env lib folder to cmake rpath to address the flaky issue
# where cmake dependencies couldn't be found. This seems to point to how conda
# links $CMAKE_EXEC to its package cache when cloning a new environment
install_name_tool -add_rpath @executable_path/../lib "${CMAKE_EXEC}" || true
# Adding the rpath will invalidate cmake signature, so signing it again here
# to trust the executable. EXC_BAD_ACCESS (SIGKILL (Code Signature Invalid))
# with an exit code 137 otherwise
codesign -f -s - "${CMAKE_EXEC}" || true
}

View File

@ -42,6 +42,16 @@ test_python_all() {
assert_git_not_dirty
}
test_python_mps() {
setup_test_python
time python test/run_test.py --verbose --mps
MTL_CAPTURE_ENABLED=1 ${CONDA_RUN} python3 test/test_mps.py --verbose -k test_metal_capture
assert_git_not_dirty
}
test_python_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
echo "NUM_TEST_SHARDS must be defined to run a Python test shard"
@ -155,6 +165,7 @@ test_jit_hooks() {
torchbench_setup_macos() {
git clone --recursive https://github.com/pytorch/vision torchvision
git clone --recursive https://github.com/pytorch/audio torchaudio
brew install jpeg-turbo libpng
pushd torchvision
git fetch
@ -169,7 +180,8 @@ torchbench_setup_macos() {
git checkout "$(cat ../.github/ci_commit_pins/audio.txt)"
git submodule update --init --recursive
python setup.py clean
python setup.py develop
#TODO: Remove me when we figure out how to make TorchAudio find brew-installed openmp
USE_OPENMP=0 python setup.py develop
popd
# Shellcheck doesn't like it when you pass no arguments to a function that can take args. See https://www.shellcheck.net/wiki/SC2120
@ -177,9 +189,8 @@ torchbench_setup_macos() {
checkout_install_torchbench
}
conda_benchmark_deps() {
conda install -y astunparse numpy scipy ninja pyyaml setuptools cmake typing-extensions requests protobuf numba cython scikit-learn
conda install -y -c conda-forge librosa
pip_benchmark_deps() {
python -mpip install --no-input astunparse requests cython scikit-learn
}
@ -187,7 +198,7 @@ test_torchbench_perf() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -214,32 +225,61 @@ test_torchbench_smoketest() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
pip_benchmark_deps
# shellcheck disable=SC2119,SC2120
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local backend=eager
local dtype=notset
local device=mps
local dtypes=(undefined float16 bfloat16 notset)
local dtype=${dtypes[$1]}
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor timm_resnet timm_vovnet vgg16)
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
for backend in eager inductor; do
echo "Setup complete, launching torchbench training performance run"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
done
echo "Launching torchbench inference performance run for backend ${backend} and dtype ${dtype}"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
dtype_arg="--float32"
fi
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv" || true
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--performance --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--accuracy --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_accuracy.csv" || true
fi
if [ "$dtype" == notset ]; then
for dtype_ in notset amp; do
echo "Launching torchbench training performance run for backend ${backend} and dtype ${dtype_}"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype_}_training_${device}_performance.csv"
local dtype_arg="--${dtype_}"
if [ "$dtype_" == notset ]; then
dtype_arg="--float32"
fi
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype_}_training_${device}_performance.csv" || true
done
done
fi
echo "Launching torchbench inference performance run"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
done
echo "Pytorch benchmark on mps device completed"
@ -249,7 +289,7 @@ test_hf_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
echo "Launching HuggingFace training perf run"
@ -265,7 +305,7 @@ test_timm_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
echo "Launching timm training perf run"
@ -277,8 +317,6 @@ test_timm_perf() {
echo "timm benchmark on mps device completed"
}
install_tlparse
if [[ $TEST_CONFIG == *"perf_all"* ]]; then
test_torchbench_perf
test_hf_perf
@ -290,7 +328,9 @@ elif [[ $TEST_CONFIG == *"perf_hf"* ]]; then
elif [[ $TEST_CONFIG == *"perf_timm"* ]]; then
test_timm_perf
elif [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then
test_torchbench_smoketest
test_torchbench_smoketest "${SHARD_NUMBER}"
elif [[ $TEST_CONFIG == *"mps"* ]]; then
test_python_mps
elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_python_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then

View File

@ -1,22 +0,0 @@
#!/bin/bash
set -e
run_test () {
rm -rf test_tmp/ && mkdir test_tmp/ && cd test_tmp/
"$@"
cd .. && rm -rf test_tmp/
}
get_runtime_of_command () {
TIMEFORMAT=%R
# runtime=$( { time ($@ &> /dev/null); } 2>&1 1>/dev/null)
runtime=$( { time "$@"; } 2>&1 1>/dev/null)
if [[ $runtime == *"Error"* ]]; then
exit 1
fi
runtime=${runtime#+++ $@}
runtime=$(python -c "print($runtime)")
echo "$runtime"
}

View File

@ -1,91 +0,0 @@
import argparse
import json
import math
import sys
parser = argparse.ArgumentParser()
parser.add_argument(
"--test-name", dest="test_name", action="store", required=True, help="test name"
)
parser.add_argument(
"--sample-stats",
dest="sample_stats",
action="store",
required=True,
help="stats from sample",
)
parser.add_argument(
"--update",
action="store_true",
help="whether to update baseline using stats from sample",
)
args = parser.parse_args()
test_name = args.test_name
if "cpu" in test_name:
backend = "cpu"
elif "gpu" in test_name:
backend = "gpu"
data_file_path = f"../{backend}_runtime.json"
with open(data_file_path) as data_file:
data = json.load(data_file)
if test_name in data:
mean = float(data[test_name]["mean"])
sigma = float(data[test_name]["sigma"])
else:
# Let the test pass if baseline number doesn't exist
mean = sys.maxsize
sigma = 0.001
print("population mean: ", mean)
print("population sigma: ", sigma)
# Let the test pass if baseline number is NaN (which happened in
# the past when we didn't have logic for catching NaN numbers)
if math.isnan(mean) or math.isnan(sigma):
mean = sys.maxsize
sigma = 0.001
sample_stats_data = json.loads(args.sample_stats)
sample_mean = float(sample_stats_data["mean"])
sample_sigma = float(sample_stats_data["sigma"])
print("sample mean: ", sample_mean)
print("sample sigma: ", sample_sigma)
if math.isnan(sample_mean):
raise Exception("""Error: sample mean is NaN""") # noqa: TRY002
elif math.isnan(sample_sigma):
raise Exception("""Error: sample sigma is NaN""") # noqa: TRY002
z_value = (sample_mean - mean) / sigma
print("z-value: ", z_value)
if z_value >= 3:
raise Exception( # noqa: TRY002
f"""\n
z-value >= 3, there is high chance of perf regression.\n
To reproduce this regression, run
`cd .ci/pytorch/perf_test/ && bash {test_name}.sh` on your local machine
and compare the runtime before/after your code change.
"""
)
else:
print("z-value < 3, no perf regression detected.")
if args.update:
print("We will use these numbers as new baseline.")
new_data_file_path = f"../new_{backend}_runtime.json"
with open(new_data_file_path) as new_data_file:
new_data = json.load(new_data_file)
new_data[test_name] = {}
new_data[test_name]["mean"] = sample_mean
new_data[test_name]["sigma"] = max(sample_sigma, sample_mean * 0.1)
with open(new_data_file_path, "w") as new_data_file:
json.dump(new_data, new_data_file, indent=4)
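The removed comparison script above boils down to a one-sided z-test: a run fails when its mean sits three or more sigmas above the stored baseline mean, and NaN or missing baselines are treated as an automatic pass. A stripped-down sketch of that core logic (hypothetical numbers, not real CI baselines):
import math

def is_regression(sample_mean, baseline_mean, baseline_sigma, threshold=3.0):
    # NaN baselines are treated as "always pass", mirroring the removed script
    if math.isnan(baseline_mean) or math.isnan(baseline_sigma):
        return False
    z = (sample_mean - baseline_mean) / baseline_sigma
    return z >= threshold

# e.g. baseline 10.0 s +/- 0.2 s, new sample averages 10.8 s -> z = 4.0 -> regression
print(is_regression(10.8, 10.0, 0.2))  # True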

View File

@ -1,18 +0,0 @@
import json
import sys
import numpy
sample_data_list = sys.argv[1:]
sample_data_list = [float(v.strip()) for v in sample_data_list]
sample_mean = numpy.mean(sample_data_list)
sample_sigma = numpy.std(sample_data_list)
data = {
"mean": sample_mean,
"sigma": sample_sigma,
}
print(json.dumps(data))

View File

@ -1,43 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_cpu_speed_mini_sequence_labeler () {
echo "Testing: mini sequence labeler, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/benchmark.git
cd benchmark/
git checkout 726567a455edbfda6199445922a8cfee82535664
cd scripts/mini_sequence_labeler
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python main.py)
SAMPLE_ARRAY+=("${runtime}")
done
cd ../../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_mini_sequence_labeler "$@"
fi

View File

@ -1,45 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_cpu_speed_mnist () {
echo "Testing: MNIST, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/examples.git -b perftests
cd examples/mnist
conda install -c pytorch torchvision-cpu
# Download data
python main.py --epochs 0
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python main.py --epochs 1 --no-log)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_mnist "$@"
fi

View File

@ -1,29 +0,0 @@
#!/bin/bash
. ./common.sh
test_cpu_speed_torch () {
echo "Testing: torch.*, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/yf225/perf-tests.git
if [ "$1" == "compare_with_baseline" ]; then
export ARGS=(--compare ../cpu_runtime.json)
elif [ "$1" == "compare_and_update" ]; then
export ARGS=(--compare ../cpu_runtime.json --update ../new_cpu_runtime.json)
elif [ "$1" == "update_only" ]; then
export ARGS=(--update ../new_cpu_runtime.json)
fi
if ! python perf-tests/modules/test_cpu_torch.py "${ARGS[@]}"; then
echo "To reproduce this regression, run \`cd .ci/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change."
exit 1
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_torch "$@"
fi

View File

@ -1,29 +0,0 @@
#!/bin/bash
. ./common.sh
test_cpu_speed_torch_tensor () {
echo "Testing: torch.Tensor.*, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/yf225/perf-tests.git
if [ "$1" == "compare_with_baseline" ]; then
export ARGS=(--compare ../cpu_runtime.json)
elif [ "$1" == "compare_and_update" ]; then
export ARGS=(--compare ../cpu_runtime.json --update ../new_cpu_runtime.json)
elif [ "$1" == "update_only" ]; then
export ARGS=(--update ../new_cpu_runtime.json)
fi
if ! python perf-tests/modules/test_cpu_torch_tensor.py "${ARGS[@]}"; then
echo "To reproduce this regression, run \`cd .ci/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change."
exit 1
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_torch_tensor "$@"
fi

View File

@ -1,44 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_gpu_speed_cudnn_lstm () {
echo "Testing: CuDNN LSTM, GPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/benchmark.git
cd benchmark/
git checkout 43dfb2c0370e70ef37f249dc09aff9f0ccd2ddb0
cd scripts/
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python cudnn_lstm.py --skip-cpu-governor-check)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_gpu_speed_cudnn_lstm "$@"
fi

View File

@ -1,44 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_gpu_speed_lstm () {
echo "Testing: LSTM, GPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/benchmark.git
cd benchmark/
git checkout 43dfb2c0370e70ef37f249dc09aff9f0ccd2ddb0
cd scripts/
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python lstm.py --skip-cpu-governor-check)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_gpu_speed_lstm "$@"
fi

View File

@ -1,44 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_gpu_speed_mlstm () {
echo "Testing: MLSTM, GPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/benchmark.git
cd benchmark/
git checkout 43dfb2c0370e70ef37f249dc09aff9f0ccd2ddb0
cd scripts/
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python mlstm.py --skip-cpu-governor-check)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_gpu_speed_mlstm "$@"
fi

View File

@ -1,48 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_gpu_speed_mnist () {
echo "Testing: MNIST, GPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/examples.git -b perftests
cd examples/mnist
conda install -c pytorch torchvision
# Download data
python main.py --epochs 0
SAMPLE_ARRAY=()
NUM_RUNS=$1
# Needs warm up to get accurate number
python main.py --epochs 1 --no-log
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python main.py --epochs 1 --no-log)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_gpu_speed_mnist "$@"
fi

View File

@ -1,53 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_gpu_speed_word_language_model () {
echo "Testing: word language model on Wikitext-2, GPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/examples.git -b perftests
cd examples/word_language_model
cd data/wikitext-2
# Reduce dataset size, so that we can have more runs per test
sed -n '1,200p' test.txt > test_tmp.txt
sed -n '1,1000p' train.txt > train_tmp.txt
sed -n '1,200p' valid.txt > valid_tmp.txt
mv test_tmp.txt test.txt
mv train_tmp.txt train.txt
mv valid_tmp.txt valid.txt
cd ../..
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python main.py --cuda --epochs 1)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_gpu_speed_word_language_model "$@"
fi

View File

@ -1,14 +0,0 @@
import json
import sys
data_file_path = sys.argv[1]
commit_hash = sys.argv[2]
with open(data_file_path) as data_file:
data = json.load(data_file)
data["commit"] = commit_hash
with open(data_file_path, "w") as data_file:
json.dump(data, data_file)

View File

@ -119,12 +119,6 @@ popd
git rm -rf "$install_path" || true
mv "$pt_checkout/docs/build/html" "$install_path"
# Prevent Google from indexing $install_path/_modules. This folder contains
# generated source files.
# NB: the following only works on gnu sed. The sed shipped with mac os is different.
# One can `brew install gnu-sed` on a mac and then use "gsed" instead of "sed".
find "$install_path/_modules" -name "*.html" -print0 | xargs -0 sed -i '/<head>/a \ \ <meta name="robots" content="noindex">'
git add "$install_path" || true
git status
git config user.email "soumith+bot@pytorch.org"

View File

@ -76,7 +76,7 @@ fi
# Environment initialization
if [[ "$(uname)" == Darwin ]]; then
# Install the testing dependencies
retry conda install -yq future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest setuptools six typing_extensions pyyaml
retry pip install -q future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest setuptools six typing_extensions pyyaml
else
retry pip install -qr requirements.txt || true
retry pip install -q hypothesis protobuf pytest setuptools || true
@ -91,7 +91,6 @@ fi
echo "Testing with:"
pip freeze
conda list || true
##############################################################################
# Smoke tests

View File

@ -1,71 +0,0 @@
#!/bin/bash
SCRIPT_PARENT_DIR=$(dirname "${BASH_SOURCE[0]}")
# shellcheck source=.ci/pytorch/common.sh
source "$SCRIPT_PARENT_DIR/common.sh"
cd .ci/pytorch/perf_test
echo "Running CPU perf test for PyTorch..."
pip install -q awscli
# Set multipart_threshold to be sufficiently high, so that `aws s3 cp` is not a multipart read
# More info at https://github.com/aws/aws-cli/issues/2321
aws configure set default.s3.multipart_threshold 5GB
UPSTREAM_DEFAULT_BRANCH="$(git remote show https://github.com/pytorch/pytorch.git | awk '/HEAD branch/ {print $NF}')"
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# Get current default branch commit hash
DEFAULT_BRANCH_COMMIT_ID=$(git log --format="%H" -n 1)
export DEFAULT_BRANCH_COMMIT_ID
fi
# Find the default branch commit to test against
git remote add upstream https://github.com/pytorch/pytorch.git
git fetch upstream
IFS=$'\n'
while IFS='' read -r commit_id; do
if aws s3 ls s3://ossci-perf-test/pytorch/cpu_runtime/"${commit_id}".json; then
LATEST_TESTED_COMMIT=${commit_id}
break
fi
done < <(git rev-list upstream/"$UPSTREAM_DEFAULT_BRANCH")
aws s3 cp s3://ossci-perf-test/pytorch/cpu_runtime/"${LATEST_TESTED_COMMIT}".json cpu_runtime.json
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# Prepare new baseline file
cp cpu_runtime.json new_cpu_runtime.json
python update_commit_hash.py new_cpu_runtime.json "${DEFAULT_BRANCH_COMMIT_ID}"
fi
# Include tests
# shellcheck source=./perf_test/test_cpu_speed_mini_sequence_labeler.sh
. ./test_cpu_speed_mini_sequence_labeler.sh
# shellcheck source=./perf_test/test_cpu_speed_mnist.sh
. ./test_cpu_speed_mnist.sh
# shellcheck source=./perf_test/test_cpu_speed_torch.sh
. ./test_cpu_speed_torch.sh
# shellcheck source=./perf_test/test_cpu_speed_torch_tensor.sh
. ./test_cpu_speed_torch_tensor.sh
# Run tests
export TEST_MODE="compare_with_baseline"
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
export TEST_MODE="compare_and_update"
fi
# Operator tests
run_test test_cpu_speed_torch ${TEST_MODE}
run_test test_cpu_speed_torch_tensor ${TEST_MODE}
# Sample model tests
run_test test_cpu_speed_mini_sequence_labeler 20 ${TEST_MODE}
run_test test_cpu_speed_mnist 20 ${TEST_MODE}
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# This could cause race condition if we are testing the same default branch commit twice,
# but the chance of them executing this line at the same time is low.
aws s3 cp new_cpu_runtime.json s3://ossci-perf-test/pytorch/cpu_runtime/"${DEFAULT_BRANCH_COMMIT_ID}".json --acl public-read
fi

View File

@ -1,76 +0,0 @@
#!/bin/bash
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
pushd .ci/pytorch/perf_test
echo "Running GPU perf test for PyTorch..."
# Trying to uninstall PyYAML can cause problem. Workaround according to:
# https://github.com/pypa/pip/issues/5247#issuecomment-415571153
pip install -q awscli --ignore-installed PyYAML
# Set multipart_threshold to be sufficiently high, so that `aws s3 cp` is not a multipart read
# More info at https://github.com/aws/aws-cli/issues/2321
aws configure set default.s3.multipart_threshold 5GB
UPSTREAM_DEFAULT_BRANCH="$(git remote show https://github.com/pytorch/pytorch.git | awk '/HEAD branch/ {print $NF}')"
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# Get current default branch commit hash
DEFAULT_BRANCH_COMMIT_ID=$(git log --format="%H" -n 1)
export DEFAULT_BRANCH_COMMIT_ID
fi
# Find the default branch commit to test against
git remote add upstream https://github.com/pytorch/pytorch.git
git fetch upstream
IFS=$'\n'
while IFS='' read -r commit_id; do
if aws s3 ls s3://ossci-perf-test/pytorch/gpu_runtime/"${commit_id}".json; then
LATEST_TESTED_COMMIT=${commit_id}
break
fi
done < <(git rev-list upstream/"$UPSTREAM_DEFAULT_BRANCH")
aws s3 cp s3://ossci-perf-test/pytorch/gpu_runtime/"${LATEST_TESTED_COMMIT}".json gpu_runtime.json
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# Prepare new baseline file
cp gpu_runtime.json new_gpu_runtime.json
python update_commit_hash.py new_gpu_runtime.json "${DEFAULT_BRANCH_COMMIT_ID}"
fi
# Include tests
# shellcheck source=./perf_test/test_gpu_speed_mnist.sh
. ./test_gpu_speed_mnist.sh
# shellcheck source=./perf_test/test_gpu_speed_word_language_model.sh
. ./test_gpu_speed_word_language_model.sh
# shellcheck source=./perf_test/test_gpu_speed_cudnn_lstm.sh
. ./test_gpu_speed_cudnn_lstm.sh
# shellcheck source=./perf_test/test_gpu_speed_lstm.sh
. ./test_gpu_speed_lstm.sh
# shellcheck source=./perf_test/test_gpu_speed_mlstm.sh
. ./test_gpu_speed_mlstm.sh
# Run tests
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
run_test test_gpu_speed_mnist 20 compare_and_update
run_test test_gpu_speed_word_language_model 20 compare_and_update
run_test test_gpu_speed_cudnn_lstm 20 compare_and_update
run_test test_gpu_speed_lstm 20 compare_and_update
run_test test_gpu_speed_mlstm 20 compare_and_update
else
run_test test_gpu_speed_mnist 20 compare_with_baseline
run_test test_gpu_speed_word_language_model 20 compare_with_baseline
run_test test_gpu_speed_cudnn_lstm 20 compare_with_baseline
run_test test_gpu_speed_lstm 20 compare_with_baseline
run_test test_gpu_speed_mlstm 20 compare_with_baseline
fi
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# This could cause race condition if we are testing the same default branch commit twice,
# but the chance of them executing this line at the same time is low.
aws s3 cp new_gpu_runtime.json s3://ossci-perf-test/pytorch/gpu_runtime/"${DEFAULT_BRANCH_COMMIT_ID}".json --acl public-read
fi
popd
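
On the default branch the tests above run in compare_and_update mode; everywhere else they run in compare_with_baseline. A rough sketch of what such a baseline comparison can look like, illustrative only: the mean-plus-N-sigma tolerance and the JSON layout are assumptions here, and the real policy lives in the .ci/pytorch/perf_test helpers:

import json
import statistics

NUM_SIGMA = 3.0  # assumed tolerance; the actual perf_test scripts define their own


def compare_with_baseline(test_name, runtimes, baseline_path, update=False):
    """Fail if the new mean runtime regresses past mean + NUM_SIGMA * sigma of the baseline."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    old = baseline[test_name]  # assumed schema: {"mean": ..., "sigma": ...}
    threshold = old["mean"] + NUM_SIGMA * old["sigma"]
    new_mean = statistics.mean(runtimes)
    if new_mean > threshold:
        raise RuntimeError(
            f"{test_name} regressed: {new_mean:.3f}s > allowed {threshold:.3f}s"
        )
    if update:  # compare_and_update mode: refresh the baseline for the next run
        baseline[test_name] = {
            "mean": new_mean,
            "sigma": statistics.stdev(runtimes) if len(runtimes) > 1 else 0.0,
        }
        with open(baseline_path, "w") as f:
            json.dump(baseline, f, indent=2)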

View File

@ -93,7 +93,7 @@ def check_lib_symbols_for_abi_correctness(lib: str) -> None:
f"Found pre-cxx11 symbols, but there shouldn't be any, see: {pre_cxx11_symbols[:100]}"
)
if num_cxx11_symbols < 100:
raise RuntimeError("Didn't find enought cxx11 symbols")
raise RuntimeError("Didn't find enough cxx11 symbols")
def main() -> None:
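
check_lib_symbols_for_abi_correctness asserts that the built library exposes a healthy number of cxx11-ABI symbols and no pre-cxx11 ones. A rough sketch of how such counts can be gathered with nm, assuming the usual std::__cxx11 mangling convention; the actual checker may enumerate and classify symbols differently:

import subprocess


def count_string_abi_symbols(lib: str):
    """Return (cxx11, pre_cxx11) std::string symbol counts in a shared library via `nm`."""
    out = subprocess.check_output(["nm", "-D", "-C", lib], text=True)
    lines = out.splitlines()
    cxx11 = sum("__cxx11" in line for line in lines)
    # Pre-cxx11 std::string symbols demangle to std::basic_string without the __cxx11 tag.
    pre_cxx11 = sum(
        "std::basic_string" in line and "__cxx11" not in line for line in lines
    )
    return cxx11, pre_cxx11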

View File

@ -0,0 +1,77 @@
import ctypes
import os
import sys
from pathlib import Path
def get_gomp_thread():
"""
Retrieves the maximum number of OpenMP threads after loading the `libgomp.so.1` library
and the `libtorch_cpu.so` library. It then queries the
maximum number of threads available for OpenMP parallel regions using the
`omp_get_max_threads` function.
Returns:
int: The maximum number of OpenMP threads available.
Notes:
- The function assumes the default path for `libgomp.so.1` on AlmaLinux OS.
- The path to `libtorch_cpu.so` is constructed based on the Python executable's
installation directory.
- This function is specific to environments where PyTorch and OpenMP are used
together and may require adjustments for other setups.
"""
python_path = Path(sys.executable).resolve()
python_prefix = (
python_path.parent.parent
) # Typically goes to the Python installation root
# Get the additional ABI flags (if any); it may be an empty string.
abiflags = getattr(sys, "abiflags", "")
# Construct the Python directory name correctly (e.g., "python3.13t").
python_version = (
f"python{sys.version_info.major}.{sys.version_info.minor}{abiflags}"
)
libtorch_cpu_path = (
python_prefix
/ "lib"
/ python_version
/ "site-packages"
/ "torch"
/ "lib"
/ "libtorch_cpu.so"
)
# use the default gomp path of AlmaLinux OS
libgomp_path = "/usr/lib64/libgomp.so.1"
# if it does not exist, try Ubuntu path
if not os.path.exists(libgomp_path):
libgomp_path = f"/usr/lib/{os.uname().machine}-linux-gnu/libgomp.so.1"
os.environ["GOMP_CPU_AFFINITY"] = "0-3"
libgomp = ctypes.CDLL(libgomp_path)
libgomp = ctypes.CDLL(libtorch_cpu_path)
libgomp.omp_get_max_threads.restype = ctypes.c_int
libgomp.omp_get_max_threads.argtypes = []
omp_max_threads = libgomp.omp_get_max_threads()
return omp_max_threads
def main():
omp_max_threads = get_gomp_thread()
print(
f"omp_max_threads after loading libgomp.so and libtorch_cpu.so: {omp_max_threads}"
)
if omp_max_threads == 1:
raise RuntimeError(
"omp_max_threads is 1. Check whether libgomp.so is loaded twice."
)
if __name__ == "__main__":
main()
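
The new check above loads libgomp.so.1 and libtorch_cpu.so into one process and fails if omp_get_max_threads() collapses to 1, the usual symptom of two OpenMP runtimes being initialized. A small companion sketch (not part of this change) that lists which libgomp builds are actually mapped into the process, which is a quick way to confirm a double load on Linux:

def loaded_gomp_libraries():
    """Return the distinct libgomp paths currently mapped into this process (Linux only)."""
    paths = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            parts = line.split()
            if parts and "libgomp" in parts[-1]:
                paths.add(parts[-1])
    return sorted(paths)


if __name__ == "__main__":
    for path in loaded_gomp_libraries():
        print(path)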

View File

@ -276,7 +276,7 @@ def smoke_test_cuda(
torch_nccl_version = ".".join(str(v) for v in torch.cuda.nccl.version())
print(f"Torch nccl; version: {torch_nccl_version}")
# Pypi dependencies are installed on linux ony and nccl is availbale only on Linux.
# Pypi dependencies are installed on linux only and nccl is available only on Linux.
if pypi_pkg_check == "enabled" and sys.platform in ["linux", "linux2"]:
compare_pypi_to_torch_versions(
"cudnn", find_pypi_package_version("nvidia-cudnn"), torch_cudnn_version

View File

@ -191,8 +191,12 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/umf/latest/env/vars.sh
fi
# shellcheck disable=SC1091
source /opt/intel/oneapi/ccl/latest/env/vars.sh
# shellcheck disable=SC1091
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# Check XPU status before testing
xpu-smi discovery
timeout 30 xpu-smi discovery || true
fi
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
@ -208,8 +212,6 @@ if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export VALGRIND=OFF
fi
install_tlparse
# DANGER WILL ROBINSON. The LD_PRELOAD here could cause you problems
# if you're not careful. Check this if you made some changes and the
# ASAN test is not working
@ -222,7 +224,7 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
export PYTORCH_TEST_WITH_ASAN=1
export PYTORCH_TEST_WITH_UBSAN=1
# TODO: Figure out how to avoid hard-coding these paths
export ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-15/bin/llvm-symbolizer
export ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-18/bin/llvm-symbolizer
export TORCH_USE_RTLD_GLOBAL=1
# NB: We load libtorch.so with RTLD_GLOBAL for UBSAN, unlike our
# default behavior.
@ -314,6 +316,20 @@ test_python() {
assert_git_not_dirty
}
test_python_smoke() {
# Smoke tests for H100
time python test/run_test.py --include test_matmul_cuda inductor/test_fp8 inductor/test_max_autotune $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_h100_distributed() {
# Distributed tests at H100
time python test/run_test.py --include distributed/_composable/test_composability/test_pp_composability.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
# This test requires multicast support
time python test/run_test.py --include distributed/_composable/fsdp/test_fully_shard_comm.py -k TestFullyShardAllocFromPG $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_lazy_tensor_meta_reference_disabled() {
export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1
echo "Testing lazy tensor operations without meta reference"
@ -398,8 +414,15 @@ test_inductor_aoti() {
# We need to hipify before building again
python3 tools/amd_build/build_amd.py
fi
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
if [[ "$BUILD_ENVIRONMENT" == *sm86* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 TORCH_CUDA_ARCH_LIST=8.6 USE_FLASH_ATTENTION=OFF python setup.py develop
# TODO: Replace me completely, as one should not use conda libstdc++, nor need special path to TORCH_LIB
LD_LIBRARY_PATH=/opt/conda/envs/py_3.10/lib/:${TORCH_LIB_DIR}:$LD_LIBRARY_PATH
CPP_TESTS_DIR="${BUILD_BIN_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference -dist=loadfile
else
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference -dist=loadfile
fi
}
test_inductor_cpp_wrapper_shard() {
@ -414,10 +437,11 @@ test_inductor_cpp_wrapper_shard() {
if [[ "$1" -eq "2" ]]; then
# For now, manually put the opinfo tests in shard 2, and all other tests in
# shard 1. Test specific things triggering past bugs, for now.
# shard 1. Run all CPU tests, as well as specific GPU tests triggering past
# bugs, for now.
python test/run_test.py \
--include inductor/test_torchinductor_opinfo \
-k 'linalg or to_sparse' \
-k 'linalg or to_sparse or TestInductorOpInfoCPU' \
--verbose
exit
fi
@ -572,7 +596,9 @@ test_perf_for_dashboard() {
local device=cuda
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
if [[ "${TEST_CONFIG}" == *zen_cpu_x86* ]]; then
device=zen_cpu_x86
elif [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
device=cpu_x86
elif [[ "${TEST_CONFIG}" == *cpu_aarch64* ]]; then
device=cpu_aarch64
@ -802,16 +828,7 @@ test_inductor_torchbench_smoketest_perf() {
done
}
test_inductor_get_core_number() {
if [[ "${TEST_CONFIG}" == *aarch64* ]]; then
echo "$(($(lscpu | grep 'Cluster(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per cluster:' | awk '{print $4}')))"
else
echo "$(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))"
fi
}
test_inductor_set_cpu_affinity(){
#set jemalloc
JEMALLOC_LIB="$(find /usr/lib -name libjemalloc.so.2)"
export LD_PRELOAD="$JEMALLOC_LIB":"$LD_PRELOAD"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
@ -823,14 +840,23 @@ test_inductor_set_cpu_affinity(){
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
fi
cores=$(test_inductor_get_core_number)
# Set number of cores to 16 on Aarch64 for performance runs.
# Use nproc here instead of lscpu because it takes into account cgroups slice
cpus=$(nproc)
thread_per_core=$(lscpu | grep 'Thread(s) per core:' | awk '{print $4}')
cores=$((cpus / thread_per_core))
# Set number of cores to 16 on aarch64 for performance runs
if [[ "${TEST_CONFIG}" == *aarch64* && $cores -gt 16 ]]; then
cores=16
fi
export OMP_NUM_THREADS=$cores
end_core=$((cores-1))
export TASKSET="taskset -c 0-$end_core"
# Handle cgroups slice start and end CPU
start_cpu=$(python -c 'import os; print(min(os.sched_getaffinity(0)))')
# Leaving one physical CPU for other tasks
end_cpu=$(($(python -c 'import os; print(max(os.sched_getaffinity(0)))') - thread_per_core))
export TASKSET="taskset -c $start_cpu-$end_cpu"
}
test_inductor_torchbench_cpu_smoketest_perf(){
@ -1113,6 +1139,12 @@ test_custom_backend() {
test_custom_script_ops() {
echo "Testing custom script operators"
if [[ "$BUILD_ENVIRONMENT" == *s390x* ]]; then
echo "Skipping custom script operators until it's fixed"
return 0
fi
CUSTOM_OP_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/custom-op-build"
pushd test/custom_operator
cp -a "$CUSTOM_OP_BUILD" build
@ -1175,7 +1207,6 @@ build_xla() {
# These functions are defined in .circleci/common.sh in pytorch/xla repo
retry install_pre_deps_pytorch_xla $XLA_DIR $USE_CACHE
CMAKE_PREFIX_PATH="${SITE_PACKAGES}/torch:${CMAKE_PREFIX_PATH}" XLA_SANDBOX_BUILD=1 build_torch_xla $XLA_DIR
retry install_post_deps_pytorch_xla
assert_git_not_dirty
}
@ -1477,8 +1508,6 @@ test_executorch() {
export PYTHON_EXECUTABLE=python
export CMAKE_ARGS="-DEXECUTORCH_BUILD_PYBIND=ON -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
# For llama3
bash examples/models/llama3_2_vision/install_requirements.sh
# NB: We need to rebuild ExecuTorch runner here because it depends on PyTorch
# from the PR
bash .ci/scripts/setup-linux.sh --build-tool cmake
@ -1505,7 +1534,7 @@ test_executorch() {
test_linux_aarch64() {
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops test_cpp_extensions_open_device_registration \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Dynamo tests
@ -1521,7 +1550,7 @@ test_linux_aarch64() {
inductor/test_inplacing_pass inductor/test_kernel_benchmark inductor/test_layout_optim \
inductor/test_max_autotune inductor/test_memory_planning inductor/test_metrics inductor/test_multi_kernel inductor/test_pad_mm \
inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \
inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \
inductor/test_split_cat_fx_passes inductor/test_compile inductor/test_torchinductor \
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes inductor/test_memory \
inductor/test_triton_cpu_backend inductor/test_triton_extension_backend inductor/test_mkldnn_pattern_matcher inductor/test_cpu_cpp_wrapper \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
@ -1539,7 +1568,8 @@ test_operator_benchmark() {
cd "${TEST_DIR}"/benchmarks/operator_benchmark
$TASKSET python -m benchmark_all_test --device "$1" --tag-filter "$2" \
--output-dir "${TEST_REPORTS_DIR}/operator_benchmark_eager_float32_cpu.csv"
--output-csv "${TEST_REPORTS_DIR}/operator_benchmark_eager_float32_cpu.csv" \
--output-json-for-dashboard "${TEST_REPORTS_DIR}/operator_benchmark_eager_float32_cpu.json" \
pip_install pandas
python check_perf_csv.py \
@ -1624,7 +1654,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
install_torchaudio cuda
fi
install_torchvision
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install git+https://github.com/pytorch/ao.git
TORCH_CUDA_ARCH_LIST="8.0;8.6" install_torchao
id=$((SHARD_NUMBER-1))
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
@ -1707,6 +1737,10 @@ elif [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
test_python
test_aten
test_xpu_bin
elif [[ "${TEST_CONFIG}" == smoke ]]; then
test_python_smoke
elif [[ "${TEST_CONFIG}" == h100_distributed ]]; then
test_h100_distributed
else
install_torchvision
install_monkeytype
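
One detail worth calling out from the test.sh changes above: test_inductor_set_cpu_affinity now derives the core count from nproc (which honors cgroup limits) divided by threads per core, and builds the taskset range from os.sched_getaffinity while leaving one physical core free. A standalone sketch of that arithmetic, Linux only, mirroring the inline python calls in the diff:

import os
import subprocess


def physical_core_budget():
    """Return (num_cores, start_cpu, end_cpu) the way the CI affinity helper computes them."""
    cpus = len(os.sched_getaffinity(0))  # like `nproc`: honors the cgroup/affinity mask
    threads_per_core = 1
    for line in subprocess.check_output(["lscpu"], text=True).splitlines():
        if line.startswith("Thread(s) per core:"):
            threads_per_core = int(line.split()[-1])
    cores = cpus // threads_per_core
    affinity = os.sched_getaffinity(0)
    start_cpu = min(affinity)
    # Leave one physical core's worth of CPUs for other tasks, as the script does.
    end_cpu = max(affinity) - threads_per_core
    return cores, start_cpu, end_cpu


if __name__ == "__main__":
    cores, start, end = physical_core_budget()
    print(f"OMP_NUM_THREADS={cores}  taskset -c {start}-{end}")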

View File

@ -31,7 +31,7 @@ PYLONG_API_CHECK=$?
if [[ $PYLONG_API_CHECK == 0 ]]; then
echo "Usage of PyLong_{From,As}{Unsigned}Long API may lead to overflow errors on Windows"
echo "because \`sizeof(long) == 4\` and \`sizeof(unsigned long) == 4\`."
echo "Please include \"torch/csrc/utils/python_numbers.h\" and use the correspoding APIs instead."
echo "Please include \"torch/csrc/utils/python_numbers.h\" and use the corresponding APIs instead."
echo "PyLong_FromLong -> THPUtils_packInt32 / THPUtils_packInt64"
echo "PyLong_AsLong -> THPUtils_unpackInt (32-bit) / THPUtils_unpackLong (64-bit)"
echo "PyLong_FromUnsignedLong -> THPUtils_packUInt32 / THPUtils_packUInt64"
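
The warning above exists because long is 32-bit on Windows, so raw PyLong_{From,As}{Unsigned}Long calls silently truncate 64-bit values; the THPUtils_* helpers in torch/csrc/utils/python_numbers.h pick the correct width. A hypothetical sketch of the kind of source scan that could back such a check (the real CI check may be a simple grep with a different scope):

import re
import sys
from pathlib import Path

# Hypothetical scan: flag raw long-based PyLong calls under torch/csrc.
# The \b keeps 64-bit PyLong_{From,As}LongLong variants out of the match.
BANNED = re.compile(r"PyLong_(From|As)(Unsigned)?Long\b")


def find_banned_pylong_calls(root="torch/csrc"):
    hits = []
    for pattern in ("*.cpp", "*.h"):
        for path in Path(root).rglob(pattern):
            for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
                if BANNED.search(line):
                    hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits


if __name__ == "__main__":
    hits = find_banned_pylong_calls()
    print("\n".join(hits))
    sys.exit(1 if hits else 0)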

View File

@ -10,7 +10,7 @@ set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocol
:: able to see what our cl.exe commands are (since you can actually
:: just copy-paste them into a local Windows setup to just rebuild a
:: single file.)
:: log sizes are too long, but leaving this here incase someone wants to use it locally
:: log sizes are too long, but leaving this here in case someone wants to use it locally
:: set CMAKE_VERBOSE_MAKEFILE=1
@ -37,6 +37,11 @@ call %INSTALLER_DIR%\activate_miniconda3.bat
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
:: Update CMake
call choco upgrade -y cmake --no-progress --installargs 'ADD_CMAKE_TO_PATH=System' --apply-install-arguments-to-dependencies --version=3.27.9
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
call pip install mkl-include==2021.4.0 mkl-devel==2021.4.0
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
@ -88,7 +93,7 @@ set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
:cuda_build_end
set DISTUTILS_USE_SDK=1
set PATH=%TMP_DIR_WIN%\bin;%PATH%
set PATH=%TMP_DIR_WIN%\bin;C:\Program Files\CMake\bin;%PATH%
:: The latest Windows CUDA test is running on AWS G5 runner with A10G GPU
if "%TORCH_CUDA_ARCH_LIST%" == "" set TORCH_CUDA_ARCH_LIST=8.6

View File

@ -24,7 +24,7 @@ if "%CUDA_SUFFIX%" == "" (
if "%REBUILD%"=="" (
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --output %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --output %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z & REM @lint-ignore
) else (
aws s3 cp s3://ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --quiet
)

View File

@ -52,7 +52,7 @@ if __name__ == "__main__":
if os.path.exists(debugger):
command_args = [debugger, "-o", "-c", "~*g; q"] + command_args
command_string = " ".join(command_args)
print("Reruning with traceback enabled")
print("Rerunning with traceback enabled")
print("Command:", command_string)
subprocess.run(command_args, check=False)
sys.exit(e.returncode)
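
The helper above reruns a failed test command under cdb on Windows (-o -c "~*g; q" prints every thread's stack and quits) so native crashes come with a traceback. For reference, a hedged Linux-flavored equivalent of the same pattern using gdb in batch mode; this is not what the script does, just the analogous idea:

import shutil
import subprocess
import sys


def rerun_with_backtrace(command_args):
    """Rerun a crashed command under gdb (if available) and print all thread backtraces."""
    gdb = shutil.which("gdb")
    if gdb:
        command_args = [gdb, "-batch", "-ex", "run",
                        "-ex", "thread apply all bt", "--args"] + command_args
    print("Rerunning with traceback enabled")
    print("Command:", " ".join(command_args))
    return subprocess.run(command_args, check=False).returncode


if __name__ == "__main__":
    sys.exit(rerun_with_backtrace(sys.argv[1:]))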

View File

@ -38,7 +38,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
fi
# TODO: Move both of them to Windows AMI
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 pytest-subtests==0.13.1
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.12.2.0

View File

@ -7,7 +7,7 @@ if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
cd %DEPENDENCIES_DIR%

View File

@ -7,7 +7,7 @@ if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
:: Clone OpenBLAS

Some files were not shown because too many files have changed in this diff.