Commit Graph

93017 Commits

Author SHA1 Message Date
b3ad8f4a9c [BUG] Fix nonzero_static crash on CUDA when the input is a empty tensor (#162578)
Fixes #162473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162578
Approved by: https://github.com/ngimel
2025-09-15 05:44:15 +00:00
755cf90672 Redirect all use of filesystem to c10/utils/FileSystem.h (#162914)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162914
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/cyyever
2025-09-15 04:30:41 +00:00
76e5df3866 [BE] Use fmt::format to define Conv key (#162925)
Also use `getArrayRefString` instead of having separate cases for 2D and 3D Conv
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162925
Approved by: https://github.com/Skylion007
ghstack dependencies: #162921
2025-09-15 02:44:12 +00:00
7fe1f5ea49 [BE] Delete [Ventura|Sonoma]Ops header (#162921)
Was a temp solution to make PyTorch+MPS buildable on MacOS-12, but it's no longer needed, as in 2.9+ MPS is only supported on MacOS Sonoma+
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162921
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-09-15 02:44:12 +00:00
e156a07171 [Precompile] [RFC] Implement aot_compile_module (#162171)
This PR adds a new interface _aot_compile to `OptimizedModule`, so that the following is possible:

```
mod = SimpleLinearModule()
inputs = [
            ModelInput(
                args=(torch.randn(3, 3),),
                kwargs={},
                contexts=[torch.no_grad(), eval_mode(model)],
            ),
            ModelInput(
                args=(torch.randn(3, 3),), kwargs={}, contexts=[train_mode(model)]
            ),
        ]
        assert isinstance(model, torch._dynamo.eval_frame.OptimizedModule)
        model._aot_compile(
            inputs,
        )
```

After this PR, you can AOT precompile NanoGPT and use it to train directly. I'll share my fork of the repo to make this work.

## ModelInput
The `ModelInput` API is a work in progress; for now it represents a set of inputs and contexts to instruct the compiler to compile. Most commonly, this is "compile an eval mode with no grad, and  a training mode with grad", but also contains things like autocasting contexts, etc.

## Dispatch
Dispatching is super simple here, we just iterate through all the precompiled fullgraphs and check guards for each one until there's one htat passes. I'm a bit worried that having this in python code is going to be too expensive. The guard checks are happening in C++ anyway, though, so the only python bottlenecked step here is just the for loop, so perhaps the overhead will not be high. I'll work on measuring this, though.

## TODOs

This PR does not support `mod.compile()`, only `torch.compile(mod)`. In order to support `mod.compile()`, we'll need to update torch.nn.Module with an updated implementation — I can add that frontend later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162171
Approved by: https://github.com/zhxchen17
2025-09-14 23:32:28 +00:00
ba5ca31676 [MPS] sparse mps any (#162885)
Add SparseMPS key for any op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162885
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-09-14 18:57:53 +00:00
8e1db46493 [MPS] enable empty like and unsqueeze for SparseMPS (#162910)
Enable empty like and unsqueeze for SparseMPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162910
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-09-14 17:47:06 +00:00
aff2438554 QoL: add pip to requirements-build.txt (#162896)
uv venvs by default don't come with pip, but for example setup.py assumes it is available.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162896
Approved by: https://github.com/Skylion007
2025-09-14 17:08:05 +00:00
3f8a2e62ea Fix rebind_unbacked in torch.fx.experimental.symbolic_shapes (#162788)
## Description
Fix a float type handling in `torch.fx.experimental.symbolic_shapes` function. [#162480](https://github.com/pytorch/pytorch/issues/162480)

## Issue
When I use AOTInductor to compile the YOLOv10, I encounter the bug `'float' object has no attribute 'node'`.
[Torch AOTInductor Ahead-Of-Time Compilation Fail](https://github.com/opendatalab/DocLayout-YOLO/issues/177)

The problem is due to missing float type handling.
https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/symbolic_shapes.py#L597
```
            if isinstance(u1, int):
                log.info(
                    "rebind_unbacked: discard %s %s %s -> %s",
                    n.target,
                    raw_u0,
                    path,
                    u1,
                )
                continue
```

## Solution
Change the code `if isinstance(u1, float)` to `if isinstance(u1, (int,float))`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162788
Approved by: https://github.com/ezyang
2025-09-14 17:07:14 +00:00
6d64bc3990 [data foundation][vizard] Prevent checking the device type of numpy object in Tensorboard logger (#162888)
Summary:
The check is introduced in D82262053
- `scalar_value` could be a numpy object
  - Move the check of `device.type` into `make_np` method where it happens only when it's a `torch.Tensor`.

Test Plan:
```
vizard launch -j 1x8 --launch=flow --config-path=pkg://vizard_projects.image_classification.configs --config-name=resnet50 ++flow.secure_group=ml_sensors ++flow.entitlement=ai_frameworks_pnb ++max_train_steps_per_epoch=10 ++max_epochs=5 ++log_every_n_steps=10 ++profiler=null ++max_eval_steps_per_epoch=10
```

Rollback Plan:

Differential Revision: D82383428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162888
Approved by: https://github.com/xush6528
2025-09-14 08:09:08 +00:00
972140b7e9 [benchmark] Add HF LLM benchmarks (#156967)
Results in https://docs.google.com/spreadsheets/d/1xXOPg9JjEmPx0zc5QBNdyXQq8-K2_r4ybHaiS-q7pZ0/edit?gid=88695043#gid=88695043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156967
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-14 07:41:06 +00:00
84186c39ed [NVRTC] Enable compiling templated kernels (#162875)
Per NVRTC doc - https://docs.nvidia.com/cuda/nvrtc/index.html#accessing-lowered-names, we can compile a templated kernel (e.g. `kernel<float>`) with the following steps

NVRTC side
- (new) `nvrtcAddNameExpression` -> C++ template e.g. `f<float>`
- `nvrtcCompileProgram`
- (new) `nvrtcGetLoweredName` -> get mangled name. need to do a copy since later this string is freed after NVRTC program is destroyed
- `nvrtcDestroyProgram`

CUDA side
- use mangled name instead of normal name -> profit
- `extern "C"` is not even needed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162875
Approved by: https://github.com/msaroufim
2025-09-14 06:17:36 +00:00
74a35c6344 [Triton] [Inductor] Enable TMA store for TMA mm templates (#160480)
Summary:
Adds support for TMA store in all TMA matmul templates (notably persistent_tma including addmm and scaled_mm). This works by requiring a template be registered with `tma_store=True` and when met constructs indices/range_trees to hook into the existing code base's TMA store support.

This also includes a couple notable changes:
- Adds support in the TMA template support for checking the output layout.
- Adds support for "hoisting" the tensor descriptor to the top of the kernel. This will currently only be used by template code right now, but in principle it can be generalized to other implementation.
- Supports considering multiple indices as the "contiguous" index. This is handled with support for transposing the input data when the alignment is no longer consistent. In general since the TMA support is derived from the index it doesn't seems reasonable that the 1D index math forces a certain alignment depending on index ordering so long as the layout matches.

Test Plan:
Tested with test_max_autotune.py unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160480
Approved by: https://github.com/NikhilAPatel
2025-09-14 04:56:49 +00:00
d2f6daf6a7 [audio hash update] update the pinned audio hash (#162892)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162892
Approved by: https://github.com/pytorchbot
2025-09-14 04:27:37 +00:00
e74b21d66a [vllm hash update] update the pinned vllm hash (#162891)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162891
Approved by: https://github.com/pytorchbot
2025-09-14 04:27:35 +00:00
f01bf0f64b Do not use // but use CleanDiv or FloorDiv instead (#162869)
Summary:
When rewriting sympy expressions in the compiler codebase we want to generate
FloorDiv(a, b) CleanDiv(a, b) directly and not a//b. since the later become floor(a*pow(b, -1))

For symnodes we automatically handle that conversions in the symnode op dispatch.
I will follow up with an issue to track all other usages of //.
Block internal Model.

Test Plan:
add test
run existing tests.
dakechen1993 testing on the model.

Rollback Plan:

Differential Revision: D82362241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162869
Approved by: https://github.com/ezyang
2025-09-14 01:30:33 +00:00
886699bc5c Port shared_ptr optimization in std::shared_ptr to intrusive_ptr (#162784)
Summary:
Please see D21021645 for details about the optimization and why it's beneficial.

A similar change has been added to libstdc++ as well, see dbf8bd3c2f

Rollback Plan:

Reviewed By: yfeldblum

Differential Revision: D81960754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162784
Approved by: https://github.com/swolchok
2025-09-13 21:01:00 +00:00
72b5159782 [flatbuffer] Fix compile error due to discarded result (#162767)
Summary:
One of our builds fails because the return value of fread is discarded. Explicit cast to void fixes the build.

```log
In file included from fbcode/caffe2/torch/csrc/jit/mobile/import.cpp:15:
fbcode/caffe2/torch/csrc/jit/mobile/file_format.h:156:3: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result]
  156 |   fread(data.get(), size, 1, f);
      |   ^~~~~ ~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
...
BUILD FAILED
Failed to build 'fbcode//caffe2:libtorch (cfg:opt-linux-x86_64-clang19-no-san-opt-by-default#fef256f7ee896871)'
```

Test Plan:
No runtime behavior change. CI.

Rollback Plan:

Differential Revision: D82265002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162767
Approved by: https://github.com/Skylion007
2025-09-13 20:24:43 +00:00
f37eaebed1 Add missing tags parameter to custom_op overload signatures (#162047)
It appears to be an omission in #149782.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162047
Approved by: https://github.com/zou3519, https://github.com/BoyuanFeng

Co-authored-by: Boyuan Feng <fby.1994@gmail.com>
2025-09-13 19:57:23 +00:00
5b9114bf19 Revert "[ROCm/Windows] Support aotriton for scaled_dot_product_attention on Windows. (#162330)"
This reverts commit 62843c14bbf694f5722fd6e1075da4792507fe42.

Reverted https://github.com/pytorch/pytorch/pull/162330 on behalf of https://github.com/atalman due to Sorry reverting looks like broke windows nightlies see https://github.com/pytorch/pytorch/issues/162881 ([comment](https://github.com/pytorch/pytorch/pull/162330#issuecomment-3288544921))
2025-09-13 15:43:50 +00:00
deb7ebe0a3 Revert "[Reland] Use std::string_view in torchgen (#158625)"
This reverts commit 972e409829343cc2062aeee0994a9c1c735d216a.

Reverted https://github.com/pytorch/pytorch/pull/158625 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break a couple of ExecuTorch tests for Vulkan backend ([comment](https://github.com/pytorch/pytorch/pull/158625#issuecomment-3287754275))
2025-09-13 07:52:50 +00:00
9c93dc8123 Revert "Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#160532)"
This reverts commit a956c4ab1cb13079203a8f07eb26218724f54dc8.

Reverted https://github.com/pytorch/pytorch/pull/160532 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160532#issuecomment-3287745165))
2025-09-13 07:42:12 +00:00
31040b6357 Revert "port some distributed tensor test files for Intel GPU (#161703)"
This reverts commit 179f10621b418427fc6e92f58ea2b0bbe4cc9c52.

Reverted https://github.com/pytorch/pytorch/pull/161703 on behalf of https://github.com/huydhn due to Sorry for reverting your change but these tests are failing internally ([comment](https://github.com/pytorch/pytorch/pull/161703#issuecomment-3287720713))
2025-09-13 07:22:14 +00:00
aa41d3e49c Claude loves making these files in top level, ignore them for sanity. (#162806)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162806
Approved by: https://github.com/albanD
2025-09-13 04:59:00 +00:00
f0fcf436c5 [audio hash update] update the pinned audio hash (#162864)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162864
Approved by: https://github.com/pytorchbot
2025-09-13 04:17:21 +00:00
5663910472 [vllm hash update] update the pinned vllm hash (#162751)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162751
Approved by: https://github.com/pytorchbot
2025-09-13 04:16:51 +00:00
da669d51bf fusion of large accumulated reads only at ir level (#161978)
This is to revert some of the changes in https://github.com/pytorch/pytorch/pull/158667

In particular, we only disallow fusion of large accumulate read at IR level and not at scheduler level, as users can create their own custom fusion logics for the scheduler level.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161978
Approved by: https://github.com/yf225
2025-09-13 04:07:25 +00:00
783985e9fe kjt pytree registration (#161114)
Differential Revision: D80656182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161114
Approved by: https://github.com/henryoier
2025-09-13 03:57:43 +00:00
49d30f9a23 Fix boxcox to return same result for same input in one batch (#162772)
Summary:
The SIMD path is using SLEEF version of `pow` which is slightly different from `std::pow`.  The fix is to use the same vectorized code (with partial load and store) for the trailing data as well to ensure consistency between results.

Rollback Plan:

Differential Revision: D82265247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162772
Approved by: https://github.com/swolchok
2025-09-13 03:57:35 +00:00
66133b1ab7 Build vLLM aarch64 nightly wheels (#162664)
PyTorch has published its aarch64 nightly wheels for all CUDA version after https://github.com/pytorch/pytorch/pull/162364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162664
Approved by: https://github.com/atalman
2025-09-13 03:43:55 +00:00
543d50db2b Fix torch export with dict input nested in args (#162618)
Investigated together with @pyemma and @taotaohuang001

## Problem
when calling exported module with dict nested in the args tuple, it will make following complaits
```
Traceback (most recent call last):
  File "/home/chzhu/infinitrain/test_torch_export.py", line 32, in <module>
    print(exported_model({"a2": torch.randn(10), "a1": torch.randn(10)}))
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 848, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 424, in __call__
    raise e
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 411, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1879, in _call_impl
    return inner()
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1806, in inner
    args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
    return fn(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_unlift.py", line 81, in _check_input_constraints_pre_hook
    flat_args_with_path = _check_inputs_match(args, kwargs, self._in_spec)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_unlift.py", line 64, in _check_inputs_match
    raise ValueError(  # noqa: B904
ValueError: Trying to flatten user inputs with exported input tree spec:
TreeSpec(tuple, None, [TreeSpec(tuple, None, [TreeSpec(dict, ['a1', 'a2'], [*,
      *])]),
  TreeSpec(dict, [], [])])
but actually got inputs with tree spec of:
TreeSpec(tuple, None, [TreeSpec(tuple, None, [TreeSpec(dict, ['a2', 'a1'], [*,
      *])]),
  TreeSpec(dict, [], [])]).
Please check that the inputs have the same number and type of args and kwargs as the ones you used when tracing.

```

## How to reproduce the issue
```python
import torch

# create a nn.Module with data_batch as input and output as output
class MyModel(torch.nn.Module):
   def __init__(self):
       super(MyModel, self).__init__()
       self.linear = torch.nn.Linear(10, 1)

   def forward(self, data_batch):
       h1 = self.linear(data_batch["a1"])
       h2 = self.linear(data_batch["a2"])
       return h1 + h2

# torch export this module
model = MyModel()
example_args_forward = (
   {
       "a1": torch.randn(10),
       "a2": torch.randn(10),
   },
)
exported_model = torch.export.export(model, example_args_forward, strict=True)

# save the exported model
torch.export.save(exported_model, "exported_model.pt2")

# load the exported model
exported_model = torch.export.load("exported_model.pt2").module()

# run the exported model
print(exported_model({"a2": torch.randn(10), "a1": torch.randn(10)}))

```

## Root Cause
Input spec is encoded as [TreeSpec](582d278983/torch/utils/_pytree.py (L1059)) in torch export. With (args, kwargs) at the top level. When we call the exported model, it has a pre-execution [hook](582d278983/torch/export/_unlift.py (L66)) to check the input TreeSpec matches the received TreeSpec, where in Treespec, the dict key order is preserved. Something like

TreeSpec(dict, ['a2', 'a1'], [*,*])

To workaround this, the input check reorders [kwargs](582d278983/torch/export/_unlift.py (L67)), that is why kwargs can be out of order. But the dict nested in the args is not re-ordered, so any re-ordering of the keys will throw errors.

## Solution
Update eq_spec to handle the dict case, where we only guarantee that key set is the same without ordering constraints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162618
Approved by: https://github.com/angelayi
2025-09-13 03:24:30 +00:00
7dd5f7b125 Revert "python fastpath for DTensor detach(), confirm that aliasing DTensorSpec is ok (#160580)"
This reverts commit 4b2d297eec425475a82934a52e0edd96805524a1.

Reverted https://github.com/pytorch/pytorch/pull/160580 on behalf of https://github.com/bdhirsh due to this broke shampoo, yanking ([comment](https://github.com/pytorch/pytorch/pull/160580#issuecomment-3287372891))
2025-09-13 02:04:36 +00:00
a956c4ab1c Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#160532)
Summary:

To support exporting a cuda model on a CPU-only machine under fake tensor mode.
User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call.
This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```

Test Plan:
CI

Rollback Plan:

Differential Revision: D80181887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160532
Approved by: https://github.com/henryoier, https://github.com/ezyang
2025-09-13 01:50:51 +00:00
0925c644ed [DCP] Decrease checkpoint background process Gloo pg init timeout (#162760)
Summary:
Sometimes checkpoint background process creation times out during gloo pg init.
Attempting to destroy the process during that time can block the trainer thread until the timeout completes.

This diff reduces the pg init timeout from 30m -> 10m to reduce the cleanup time.

Test Plan:
CI

Rollback Plan:

Differential Revision: D81724668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162760
Approved by: https://github.com/meetv18
2025-09-13 01:50:40 +00:00
b2553a6ec4 [AOTI] raise PyTorchStreamWriter open failed error code on windows (#162799)
When I debug AOTI UT: `TestAOTInductorPackage_cpu::test_add`.  I found it didn't output the verbose error code, when PyTorchStreamWriter open failed.

This PR add the verbose error code output for debug. Local test shows as below:
<img width="1124" height="653" alt="image" src="https://github.com/user-attachments/assets/01cb1a51-2982-4106-8b5b-c608ac26a075" />

The error code is 32, we can check the Windows error code 32 at https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
```
ERROR_SHARING_VIOLATION
32 (0x20)
The process cannot access the file because it is being used by another process.
```

This issue is caused by the file is opened by another process. I fixed same issue in zip open as PR: https://github.com/pytorch/pytorch/pull/162617 But still no idea how to open file with shared access in `std::ofstream`. I will continue to researching it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162799
Approved by: https://github.com/jansel
2025-09-13 01:41:14 +00:00
a749c40342 [Bilinear] move check to reset_parameters (#160952)
Fixes #160407

### Summary:
Moved the check to reset_parameters to make `Bilinear` module lazy. Lazy modules have in_features initialized to 0 and a pre forward hook that initializes these to the appropriate shape, then calls reset parameters,

### Impact:
module: nn, linear.py

### Test:

<img width="903" height="182" alt="Screenshot From 2025-08-19 13-27-12" src="https://github.com/user-attachments/assets/bc04b0d6-5174-4dc9-8b21-9e019b3822a5" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160952
Approved by: https://github.com/mikaylagawarecki
2025-09-13 01:17:10 +00:00
595e13feb7 [BE] [Inductor] Update NoValidChoicesError logic (#162814)
Summary: Updates the NoValidChoicesError logic to include some additional context for if not choices exists or if no choices compiled.

Test Plan:
NFC. Depending on CI.

Rollback Plan:

Differential Revision: D82312035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162814
Approved by: https://github.com/mlazos
2025-09-13 00:45:50 +00:00
ddc5107601 An improved heuristic for operator reordering for peak memory + debugging logs (#161810)
Revisiting the idea in https://github.com/pytorch/pytorch/pull/140195

For the lpmf algorithm in the memory reorder pass, in some cases, when all the nodes that can be scheduled are quite large, it is beneficial to switch the scheduling strategy. So instead of using size as the criterion, we choose a node that can unlock more nodes to become schedulable by analyzing their successor nodes.

For an internal use case, we observe up to 20 GiB memory difference and here are the before and after memory snapshot. More information can be found in [D81270682](https://www.internalfb.com/diff/D81270682) (internal only).

<img width="348" height="227" alt="image" src="https://github.com/user-attachments/assets/fb71e840-1508-44ed-bc9d-5eb4d364607d" />

In addition, add the functionality to upload the graph to tlparse for offline debugging. The format of the json is in consistency with the simulator [here](https://fburl.com/code/3l3d3qi4) (internal only).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161810
Approved by: https://github.com/yf225
2025-09-13 00:42:32 +00:00
a94ddd9b00 [OpenReg] Fix the docs of Accelerator Intergration (#162826)
----

- Fixed the redirect link about step 1
- Formatted the autoload and added necessary links
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162826
Approved by: https://github.com/albanD
ghstack dependencies: #161917, #161918, #160101
2025-09-12 23:53:17 +00:00
29f84b0f61 [OpenReg] Improve the Event and Stream capabilities of DeviceGuardImplInterface (#160101)
**Changes:**

- Based on `OpenRegStream` and `OpenRegEvent`, we improve the implementation of Device Guard for `OpenReg`
- Add some related testcases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160101
Approved by: https://github.com/albanD
ghstack dependencies: #161917, #161918
2025-09-12 23:53:17 +00:00
27daa6af6a [OpenReg] Strengthen Openreg's execution limits to minimize the waste of computing resources (#161918)
Currently, OpenReg supports Linux, Windows, and OS X, ensuring stability and ease of integration with third-party devices across all three platforms. It also doesn't rely on any other accelerators (such as CUDA or MPS).

Therefore, to minimize computational resource usage, `test_openreg` can be added to certain BLOCKLISTS to prevent its execution, limiting OpenReg's execution to only necessary scenarios.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161918
Approved by: https://github.com/albanD
ghstack dependencies: #161917
2025-09-12 23:53:17 +00:00
9b429846e8 [OpenReg] Migrate OpenReg Tests from tests/test_openreg.py into torch_openreg/tests (#161917)
**Background:**

Almost all the tests in `test/test_openreg.py` are designed for `torch_openreg`, so placing these testcases in the test directory is not a good idea. Instead, they should be moved to the `tests` directory under `torch_openreg`, coordinating these tests with their corresponding functional logic.

**How to do:**

So how do we verify the quality of the third-party device integration mechanism?
We will maintain a `test_openreg` entrypoint in `test/run_test.py`.

This entrypoint will install `torch_openreg` and run all the testcases located in `torch_openreg`. As long as all testcases pass, we can guarantee that the out-of-tree backend integration mechanism is available.

**Next:**

We will also improve `torch_openreg's` test coverage in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161917
Approved by: https://github.com/albanD
2025-09-12 23:53:17 +00:00
cdfa298a3b Revert "[MTIA Runtime] Add foreach_div ops to native_functions.yaml (#162732)"
This reverts commit a3f01f6418667f791f36d928f7e912eb89be2e67.

Reverted https://github.com/pytorch/pytorch/pull/162732 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162732#issuecomment-3287163750))
2025-09-12 23:52:43 +00:00
d25c35d2b2 [MPS] Fix [nan]median output for empty tensors (#162846)
It should be `NaN` rather than 0

Added respective checks to `test_empty_tensor`

Fixes https://github.com/pytorch/pytorch/issues/162798
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162846
Approved by: https://github.com/dcci
2025-09-12 22:26:29 +00:00
ee53ad2dd0 xpu: test py_limited_api with SyclExtension (#162546)
Commit extends existing CUDA test to cover XPU SyclExtension case for the same feature - `py_limited_api`. Commit required a fix for xpu to install some Aten header files (#145902) which got resolved after the merge of #159621.

See: https://github.com/pytorch/pytorch/issues/145902
Requires: https://github.com/pytorch/pytorch/pull/159621
Requires: https://github.com/intel/torch-xpu-ops/pull/1743

CC: @guangyey, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162546
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/janeyx99
2025-09-12 21:57:01 +00:00
0dcd9304aa fix high=0 bug in nll_loss test (#162763)
Minor bug fix for the `nll_loss` test.
Before this PR, it runs `torch.randint(high=0)`, which will fail because it would try to generate a number that >= low and < high, i.e. x>=0 and x<0.

The test did not fail because that line is not run when testing on CPU because it failed earlier because of a unsupported dtype.
However, as we support TPUs at Google, this line is reached first before the dtype check, which triggers the bug.

To my understanding, these OpInfo should be general enough to support different hardware.
Fixing this obvious bug would make it more general cross different hardware.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162763
Approved by: https://github.com/soulitzer
2025-09-12 21:48:18 +00:00
25f1a5d8d1 [inductor][ez] add src_hash property for Templates (#161468)
# why

enable caching/overriding/filtering based on src hash later

# what

- KernelTemplate has a src_hash that is None by default
- sha256 on TritonTemplate of the template src code
- None on ExternKernelChoice to have same API

# testing

n/a (not in use in this change)

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)

Differential Revision: [D81821149](https://our.internmc.facebook.com/intern/diff/D81821149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161468
Approved by: https://github.com/eellison
ghstack dependencies: #161351, #161350, #162293
2025-09-12 21:10:45 +00:00
269c9907a0 [inductor][choices] rename get_mm_configs to get_template_configs (#162293)
# why

- eventually we want all templates to go through this
- we're exposing this through diode as a sort of interface/API
- avoid later renaming

# what

- rename get_mm_configs to get_template_configs
- rename _finalize_mm_configs to _finalize_template_configs

# testing

- lintrunner
- ci

Differential Revision: [D81820641](https://our.internmc.facebook.com/intern/diff/D81820641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162293
Approved by: https://github.com/eellison
ghstack dependencies: #161351, #161350
2025-09-12 21:10:45 +00:00
a326ef37e6 [inductor] leverage template stacking in V.choices.get_mm_configs (#161350)
# why

- now everything is in place to just gather templates and run
  the V.choices.get_mm_configs once per op
- enables any overrides inside V.choices.get_mm_configs to
  have a full view of the options for an op, not just for
  one template

# what

- replace multiple calls to V.choices.get_mm_configs with
  calls to gather the active templates, and then using those
  in a single call

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520571](https://our.internmc.facebook.com/intern/diff/D81520571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161350
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161351
2025-09-12 21:10:38 +00:00
cdb2d1838a [inductor] FlexibleLayout for ExternKernelChoice for mms (#161351)
# why

- if we only use ExternKernelChoice we're not doing any codegen
- if we're not doing any codegen, we can use a FlexibleLayout
  here, and provide deeper passes more chances to change it

# what

- if all the kernel template choices (KTC) are with a ExternKernelChoice
  template, we switch to a FlexibleLayout before generating the choice
- add a test to make sure that works as intended (FlexibleLayout for
  only extern, and FixedLayout if Triton is involved)

- caveats:
    - because CPP, CUTLASS, and CK are not using
       V.choices.get_mm_configs yet, we turn off the optimization
       if either of those backends are in use. This will be relaxed
       once they support this too
    - because Triton templates are still using their own calls
       (not a single call) to get_mm_configs, it's also turned
       off there. The next diff unifies Triton + ATEN to a single
       call to get_mm_configs and that in turn allows the optimization
       there too

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520584](https://our.internmc.facebook.com/intern/diff/D81520584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161351
Approved by: https://github.com/eellison, https://github.com/jansel
2025-09-12 21:10:31 +00:00