2592 Commits

Author SHA1 Message Date
85012fe167 Remove unnecessary list comprehensions (#164103)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164103
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-09-30 03:56:54 +00:00
7c7ae86991 [Fix] Adding missing f prefixes to formatted strings [2/N] (#164066)
As stated in the title.

* #164068
* #164067
* __->__ #164066
* #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164066
Approved by: https://github.com/Skylion007
2025-09-29 04:40:44 +00:00
3cda34ebde [2/N] Apply ruff UP035 check in torch files (#164054)
This is the result of applying the ruff `UP035` check.
`Callable` is imported from `collections.abc` instead of `typing`.
`TypeAlias` is also imported from `typing`.
This PR is the follow-up of #163947.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164054
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-09-29 03:35:32 +00:00
b1a4efc302 [amd] Add cudaHostFn_t to cuda_to_hip_mappings (#164007)
Summary: See title

Test Plan:
```
buck build --flagfile fbcode//mode/opt-amd-gpu fbcode//comms/ctran/algos/common/tests:ctran_algo_gpe_kernel_sync_test
```
After fix: https://www.internalfb.com/buck2/362ff91e-53f2-4b82-9536-cb84c91384a2

Before fix: failed in D83294731 (version 1):
https://www.internalfb.com/sandcastle/workflow/1792432651703947243

Differential Revision: D83375414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164007
Approved by: https://github.com/llxxee
2025-09-27 06:09:50 +00:00
8d474bdc14 Change python grid calc for MTIA back to python mode (#163601)
Differential Revision: D83000165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163601
Approved by: https://github.com/blaine-rister
2025-09-27 00:12:53 +00:00
5daa79fd6e Remove dataclass_slots (#163623)
`dataclass` now has `slots` kwarg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163623
Approved by: https://github.com/Skylion007
2025-09-26 00:54:42 +00:00
f9821b1be7 DebugMode supports_higher_order_operators=True (#163824)
Make DebugMode supports HOP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163824
Approved by: https://github.com/ydwu4
2025-09-25 17:11:43 +00:00
ad869c58f5 remove allow-untyped-defs from ./torch/utils/benchmark/op_fuzzers/sparse_unary.py (#163476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163476
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #163478, #163475, #163471
2025-09-25 06:48:44 +00:00
cde5c9aebd fix pickling for BitwiseFn (#163571)
Summary:
ran into AttributeError: Can't get local object 'make_opaque_bitwise_fn.<locals>.BitwiseFn'

looks like it was fixed for UnaryFn but not BitwiseFn in https://github.com/pytorch/pytorch/pull/138395

Fixes #147841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163571
Approved by: https://github.com/jamesjwu
2025-09-25 04:52:11 +00:00
4c2c401ccf Record redistribute_local_tensor in DebugMode (#163704)
Explicit redistribute_local_tensor API call could also results in communication, record it!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163704
Approved by: https://github.com/ezyang
2025-09-24 16:11:26 +00:00
a635505a99 [Code Clean] Remove deadcodes about Python3.9 [6/N] (#163645)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163645
Approved by: https://github.com/albanD
ghstack dependencies: #163626, #163627, #163629, #163643, #163644
2025-09-24 07:30:50 +00:00
ec0cd81c38 [Code Clean] Remove deadcodes about Python3.9 [4/N] (#163643)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163643
Approved by: https://github.com/albanD
ghstack dependencies: #163626, #163627, #163629
2025-09-24 07:30:50 +00:00
f9fa138a39 [BE] Delete all pre py-3.10 checks (#163653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163653
Approved by: https://github.com/jansel
ghstack dependencies: #163648, #163649
2025-09-23 23:22:53 +00:00
e3b392bdfd [BC breaking] Remove deprecated imports for torch.utils.data.datapipes.iter.grouping (#163438)
This PR removes import tricks of `SHARDING_PRIORITIES` and  `ShardingFilterIterDataPipe` from `torch.utils.data.datapipes.iter.grouping`. They are declared to be removed in PyTorch 2.1 but not.
Before change:
```
import torch.utils.data.datapipes.iter.grouping.SHARDING_PRIORITIES
import torch.utils.data.datapipes.iter.grouping.ShardingFilterIterDataPipe
```
works
After change:
there is an import error exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163438
Approved by: https://github.com/janeyx99
2025-09-23 05:02:06 +00:00
95ac7d724e Rename to _debug_mode.py to make it private (#163534)
rename debug_mode.py to _debug_mode.py to make it private, per @alban's request.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163534
Approved by: https://github.com/albanD
2025-09-23 04:27:10 +00:00
6a48f57d2f [1/N] Remove 'type: ignore' suppressions (#163468)
Remove some unnecessary 'type: ignore' suppressions from python code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163468
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-09-23 03:53:11 +00:00
46e1b7d70b remove allow-untyped-defs from ./torch/utils/data/datapipes/iter/fileopener.py (#163469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163469
Approved by: https://github.com/aorenste, https://github.com/Skylion007
ghstack dependencies: #163246
2025-09-22 20:29:09 +00:00
f591bb5056 Remove data_source argument from Sampler (#163134)
`data_source` is declared being removed in PT 2.2 but not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163134
Approved by: https://github.com/ezyang
2025-09-21 05:44:41 +00:00
2887f3fde4 [BE] Slight improvements to documentation in python_dispatch (#162963)
I was briefly confused which way I should iterate stack, here's the
comments I wanted.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162963
Approved by: https://github.com/albanD, https://github.com/SherlockNoMad
2025-09-21 01:45:46 +00:00
a87aea03f7 Update RandomSampler docstring. data_source must be Sized not Dataset (#158857)
Fixes #158631

The docstring said data_source was a Dataset, but RandomSampler only needs something that implements __len__. This updates the docstring to use Sized instead, which matches the actual type used in the constructor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158857
Approved by: https://github.com/divyanshk
2025-09-20 04:05:25 +00:00
a1df0b42ce Lazy import to avoid circular import issue for DebugMode (#163381)
as title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163381
Approved by: https://github.com/dolpm
2025-09-20 01:54:57 +00:00
04842ac2b0 Change DebugMode record_torchfunction default to False (#163293)
Result is too noisy with `record_torchfunction = True`. Change it to False, to make it clean.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163293
Approved by: https://github.com/zpcore
2025-09-19 00:30:53 +00:00
c43ccfbc2d [BE] Remove bottleneck (#163210)
Some cleanup related to this RFC: https://github.com/pytorch/pytorch/issues/68742
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163210
Approved by: https://github.com/ezyang
2025-09-18 12:08:13 +00:00
df4ebddbe0 DisableTorchFunction in debug_string (#163096)
debug_string() invokes some torch functions under the hood.
Use DisableTorchFunction() to avoid re-invoking __torch_function__ when calling debug_sting() inside DebugMode()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163096
Approved by: https://github.com/zpcore
2025-09-17 00:19:49 +00:00
559e8d1c20 [doc]: Small typos (#162982)
Small typo fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-16 17:42:19 +00:00
f8d379d29e [DTensor] Introduce DebugMode (#162665)
Introduce a lightweight TorchDispatchMode for understanding the magic behind DTensor.

- Tracks redistribution, see `redistribute_input(input_idx, from_placement, to_placement)`
- Optionally tracks torch-level functions, via `__torch_function__`
- Optionally tracks FakeTensor operations, which was needed for propagating tensor meta as a step of sharding propagation
- Optionally tracks real tensor operations, including functional c10d op, and regular ops
- Calls are shown in the hierarchical structure!
- shorthand representation
  - dt: DTesnor, ft: FakeTensor, t: Tensor
  - DM(2, 2) == DeviceMesh(shape = [2, 2])
  - [R, P, S(0)] == Placement[Replicate, Partial, Shard(0)]
  - f32[8,8] == float32 with shape[8, 8]

```
  debug_mode = DTensorDebugMode(record_faketensor=False, record_realtensor=True)
  with debug_mode:
      torch.mm(x_dtensor, y_dtensor)
  print(debug_mode.debug_string())
```
produces:
```
  torch.mm(dt: f32[8, 8][S(0)], dt: f32[8, 32][S(0)])
    aten::mm(dt: f32[8, 8][S(0)], dt: f32[8, 32][S(0)])
      redistribute_input(1, [S(0)], [R])
        _c10d_functional::all_gather_into_tensor(t: f32[1, 32], 8, 0)
        _c10d_functional::wait_tensor(t: f32[8, 32])
      aten::mm(t: f32[1, 8], t: f32[8, 32])
```

Another example, for torch.einsum
```
  torch.functional.einsum(bld,dnh->blnh, dt: f32[16, 6, 8][P, R], dt: f32[8, 4, 4][R, P])
    aten::unsqueeze(dt: f32[16, 6, 8][P, R], 3)
      aten::unsqueeze(t: f32[16, 6, 8], 3)
    aten::unsqueeze(dt: f32[16, 6, 8, 1][P, R], 4)
      aten::unsqueeze(t: f32[16, 6, 8, 1], 4)
    aten::permute(dt: f32[16, 6, 8, 1, 1][P, R], [0, 1, 3, 4, 2])
      aten::permute(t: f32[16, 6, 8, 1, 1], [0, 1, 3, 4, 2])
    aten::unsqueeze(dt: f32[8, 4, 4][R, P], 3)
      aten::unsqueeze(t: f32[8, 4, 4], 3)
    aten::unsqueeze(dt: f32[8, 4, 4, 1][R, P], 4)
      aten::unsqueeze(t: f32[8, 4, 4, 1], 4)
    aten::permute(dt: f32[8, 4, 4, 1, 1][R, P], [3, 4, 1, 2, 0])
      aten::permute(t: f32[8, 4, 4, 1, 1], [3, 4, 1, 2, 0])
    aten::permute(dt: f32[16, 6, 1, 1, 8][P, R], [0, 1, 4, 2, 3])
      aten::permute(t: f32[16, 6, 1, 1, 8], [0, 1, 4, 2, 3])
    aten::view(dt: f32[16, 6, 8, 1, 1][P, R], [1, 96, 8])
      aten::view(t: f32[16, 6, 8, 1, 1], [1, 96, 8])
    aten::permute(dt: f32[1, 1, 4, 4, 8][R, P], [4, 2, 3, 0, 1])
      aten::permute(t: f32[1, 1, 4, 4, 8], [4, 2, 3, 0, 1])
    aten::view(dt: f32[8, 4, 4, 1, 1][R, P], [1, 8, 16])
      aten::view(t: f32[8, 4, 4, 1, 1], [1, 8, 16])
    aten::bmm(dt: f32[1, 96, 8][P, R], dt: f32[1, 8, 16][R, P])
      redistribute_input(0, [P, R], [S(2), S(2)])
        aten::chunk(t: f32[1, 96, 8], 4, 2)
        aten::cat(['t: f32[1, 96, 2]', 't: f32[1, 96, 2]', 't: f32[1, 96, 2]', 't: f32[1, 96, 2]'])
        _c10d_functional::reduce_scatter_tensor(t: f32[4, 96, 2], sum, 4, 2)
        aten::clone(t: f32[1, 96, 1])
      redistribute_input(1, [R, P], [S(1), S(1)])
        aten::chunk(t: f32[1, 8, 16], 4, 1)
        aten::clone(t: f32[1, 2, 16])
        aten::chunk(t: f32[1, 2, 16], 2, 1)
        aten::cat(['t: f32[1, 1, 16]', 't: f32[1, 1, 16]'])
        _c10d_functional::reduce_scatter_tensor(t: f32[2, 1, 16], sum, 2, 3)
        _c10d_functional::wait_tensor(t: f32[1, 1, 16])
      aten::bmm(t: f32[1, 96, 1], t: f32[1, 1, 16])
    aten::view(dt: f32[1, 96, 16][P, P], [16, 6, 1, 4, 4])
      aten::view(t: f32[1, 96, 16], [16, 6, 1, 4, 4])
    aten::permute(dt: f32[16, 6, 1, 4, 4][P, P], [0, 1, 3, 4, 2])
      aten::permute(t: f32[16, 6, 1, 4, 4], [0, 1, 3, 4, 2])
    aten::view(dt: f32[16, 6, 4, 4, 1][P, P], [16, 6, 4, 4])
      aten::view(t: f32[16, 6, 4, 4, 1], [16, 6, 4, 4])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162665
Approved by: https://github.com/ezyang
2025-09-16 07:30:05 +00:00
b68a5115a4 Workaround for mtia double init issue in has_triton (#162974)
Summary:
This change adds a new environment variable (`TORCHINDUCTOR_TRITON_DISABLE_DEVICE_DETECTION`) and configuration in `torch._inductor.config` which can be set to `"1"` to allow a user to disable triton's device detection logic in [torch/utils/_triton.py:has_triton()](c9e57d7e9f/torch/utils/_triton.py (L128)). This function is used at import scope in several places but the function has a side effect of initializing the mtia device if it is available which is causing some of our autotuning workflows to crash.

Worth noting that when enabled this configuration disables all device detection not just mtia and this is because the logic in has_triton will initialize the mtia device as a side effect even when checking for a cuda or other device via the [get_interface_for_device()](c9e57d7e9f/torch/_dynamo/device_interface.py (L570)) function.

I've tagged it `topic: not user facing` since I don't anticipate any outside of meta users making use of this, however this is my first PR here, so please indicate if it should be handled differently.

Test Plan: This has been tested in the context of internal workflows.

Differential Revision: D82347853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162974
Approved by: https://github.com/xmfan
2025-09-16 04:46:11 +00:00
955e195c7d [Triton] [Inductor] Add a Blackwell specific Template for persistent matmul (#162916)
Summary:
This adds the Triton Tutorial Matmul persistent matmul with device side TMA for Blackwell and adds it as a template option for blackwell. This uses newer Triton features such as automatic warp specialization and loop flattening, which while still containing flaws can improve performance on blackwell. This does not include the Epilogue subtiling section, as that will be a followup PR.

This PR doesn't include any tuning. I am doing a larger benchmarking run to determine the best initial configs for tuning and will open a followup PR with better defaults soon.

Test Plan:
Tested on a Blackwell machine with test_max_autotune.py and confirmed the new tests pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162916
Approved by: https://github.com/NikhilAPatel
2025-09-15 23:23:04 +00:00
6d64bc3990 [data foundation][vizard] Prevent checking the device type of numpy object in Tensorboard logger (#162888)
Summary:
The check is introduced in D82262053
- `scalar_value` could be a numpy object
  - Move the check of `device.type` into `make_np` method where it happens only when it's a `torch.Tensor`.

Test Plan:
```
vizard launch -j 1x8 --launch=flow --config-path=pkg://vizard_projects.image_classification.configs --config-name=resnet50 ++flow.secure_group=ml_sensors ++flow.entitlement=ai_frameworks_pnb ++max_train_steps_per_epoch=10 ++max_epochs=5 ++log_every_n_steps=10 ++profiler=null ++max_eval_steps_per_epoch=10
```

Rollback Plan:

Differential Revision: D82383428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162888
Approved by: https://github.com/xush6528
2025-09-14 08:09:08 +00:00
52af91e4c1 [ROCm/Windows] Support load_inline on windows (#162577)
Supports `torch.utils.cpp_extension.load_inline` on Windows with ROCm.
Tested on Windows with gfx1201.

Note that it currently only works when CC and CXX are set to `clang-cl`. This is also needed when building extensions via. `setuptools` due to linker errors when using `cl` directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162577
Approved by: https://github.com/ezyang
2025-09-12 08:10:07 +00:00
636a511084 [aoti] add config for libtorch free so (#162655)
Users can specify the following to get a libtorch_free `.so`.

"aot_inductor.use_libtorch": False,

The following config is only used for torchnative (see https://github.com/meta-pytorch/torchnative/pull/110). It's not intended to be used by executorch. The reason we need it for torchnative is because a lot of the symbol definitions in torchnative repo is only in header files.

"aot_inductor.libtorch_free_header": "/data/users/shangdiy/torchnative/standalone,/data/users/shangdiy/torchnative/" (or their custom headers)

The main motivating use case is for executorch to produce a libtorch free `.so`.

TODO for follow-up PR: this flag should be consolidated with the `compile_standalone` flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162655
Approved by: https://github.com/angelayi
2025-09-12 07:31:04 +00:00
f7e8321961 fix cpp extension distributed warning spew (#162764)
With the new change we only log the warning if we're running non distributed code or if we're in rank 0. Unit testing that certain messages get printed on certain ranks only feels kinda jank so test plan is below instead

Test plan

```python
# torchrun --nproc_per_node=2 demo_fix.py

import os
import logging

logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)

import torch
if 'RANK' in os.environ:
    torch.distributed.init_process_group('nccl')

from torch.utils.cpp_extension import _get_cuda_arch_flags
_get_cuda_arch_flags()

print(f"Rank {os.environ.get('RANK', '0')} done")
```

Logs showing how how `TORCH_CUDA_ARCH_LIST`only shows up once if we explicitly set the the logging level to `logging.DEBUG`. It also improves the debug message to explain what the actual behavior will be

```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py

W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
[rank0]:V0911 18:30:18.921000 1316753 pytorch/torch/utils/cpp_extension.py:2444] TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='10.0+PTX' for visible GPU architectures. Set os.environ['TORCH_CUDA_ARCH_LIST'] to override.
Rank 0 done
Rank 1 done
```

But if we just use the default and comment out `logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)`

Then we get

```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
Rank 0 done
Rank 1 done
(source) [marksaroufim@devgpu005]~%
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162764
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-12 06:12:46 +00:00
189a054cfb Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. [attempt2] (#160869)
[relanding again after fixing internal build]
Summary:
This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous()
but want to find those call sites to handle this properly by calling  is_contiguous_or_false() and not is_contiguous() explitly when appropriate.
I had to fix one issue after removing the implicit size oblivious reasoning. here is context

we defined in this https://github.com/pytorch/pytorch/pull/157472 sym_is_contiguous to be the function computing contiguity for dynamic shapes in c++. It returns a symbolic expression that represents contiguity and guaranteed not to throw a DDE.

when people call is_contiguous we do sym_is_contiguous().guard_bool()
when people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false()

one issue not handled well was this path
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }

  return sym_is_contiguous_default(memory_format);
}
```
namely if we call sym_is_contiguous_custom but we have matches_python_custom(SizesStridesPolicy::CustomStrides) return true , then we used to call is_contiguous(this, memory_format);

This used to go through the load_pyobj_interpreter and end up calling the python is_contiguous call which used implicit size oblivious reasoning.
once we removed that implicit size oblivious reasoning, the right thing we want is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get DDE even if the caller is doing sym_is_contiguous.

so I had to define it for pyinterpreter, and then I had to override it for nested tensors.

Approved by: https://github.com/ezyang

Test Plan:
contbuild & OSS CI, see e444cd24d4

Rollback Plan:

Differential Revision: D80435179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
2025-09-08 22:59:13 +00:00
0c0e056a9e [CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)
## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records an DAG (we call it **capturing graph**) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the lastest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all marker marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)).

In the case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Then, we must wait for the subsequent join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel, https://github.com/eqy

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-04 17:21:26 +00:00
98efc9e93d [ROCm] Bump AOTriton to 0.11b (#161754)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:

* Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements
  - AITER ASM kernels deliver over 500TFLOPS training performance. See
    [AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more
    details.
* Now returns natural based `logsumexp` tensor, matching CUDA's behavior
  - PR #156903 is reverted in this PR as well since it is not needed anymore.
* Enables `CausalVariant.LOWER_RIGHT`

The build system changes drastically along with new packaging scheme of
AOTriton 0.11

* AOTriton 0.11 packs GPU images separately from AOTriton runtime
* `aotriton.cmake` now selectively downloads image packs according to
  `PYTORCH_ROCM_ARCH`
* `aotriton.cmake` now only use pre-compiled runtime library that exactly
  matches the ROCM in the build environment. For PyTorch builds with ROCm
  versions not listed in the file, the build process will build AOTriton
  runtime without GPU images from source
  - This avoids any further ABI breaks like ROCM 6.4 -> 7.0
  - recursive git clone is disabled since building AOTriton runtime does not
    require submodules.

Bug fixes:

* Fix a kernel bug introduced when implementing SWA

Known Problems:

* gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status
  due to accuracy issues. Triton compiler fixes are needed to restore the
  support status.
* Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0.
  This issue is under investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161754
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-09-03 20:45:44 +00:00
850e1382a9 [hipify] Replace cudaStreamCaptureStatusNone (#161992)
Replacing additional cuda symbols to hip symbols

Differential Revision: D81420086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161992
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
2025-09-03 20:23:32 +00:00
789d494212 Defer loading hipify until it is needed (#160824)
Saves a few milliseconds when running a test case:

Before:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 1.497s
```

After:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 0.909s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824
Approved by: https://github.com/zou3519
2025-09-02 15:27:37 +00:00
13b65196db Revert "Defer loading hipify until it is needed (#160824)"
This reverts commit 403a3a393cda7e60f503f3b04b8805a845dcf45d.

Reverted https://github.com/pytorch/pytorch/pull/160824 on behalf of https://github.com/atalman due to Broke slow tests test_utils.py::TestHipifyTrie::test_special_char_export_trie_to_regex [GH job link](https://github.com/pytorch/pytorch/actions/runs/17387051351/job/49355619371) [HUD commit link](403a3a393c) ([comment](https://github.com/pytorch/pytorch/pull/160824#issuecomment-3243281628))
2025-09-01 21:34:13 +00:00
403a3a393c Defer loading hipify until it is needed (#160824)
Saves a few milliseconds when running a test case:

Before:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 1.497s
```

After:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 0.909s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824
Approved by: https://github.com/zou3519
2025-09-01 20:57:41 +00:00
63a9c23fe9 Revert "[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)"
This reverts commit 190c391a28845a14df26abb228d26aa813efb20c.

Reverted https://github.com/pytorch/pytorch/pull/158352 on behalf of https://github.com/atalman due to Broke cuda 13.0 nightly builds https://github.com/pytorch/pytorch/actions/runs/17382188549/job/49341981474 ([comment](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3242871629))
2025-09-01 16:27:03 +00:00
190c391a28 [CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)
## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records an DAG (we call it **capturing graph**) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the lastest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all marker marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)).

In the case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Then, we must wait for the subsequent join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-01 09:25:01 +00:00
302d860157 Improve assert perf in _python_dispatch._correct_storage_aliasing (#161317)
This assertion was expensive because of is_traceable_wrapper_subclass. Finding a cheap check to run first that's likely to let us skip the rest seems to improve things significantly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161317
Approved by: https://github.com/ezyang, https://github.com/XilunWu, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304, #161308, #161315
2025-08-30 06:55:42 +00:00
0c459f2921 Fix pybind enum efficiency issue in return_and_correct_aliasing (#161315)
Scanning a list of pybind enums with `in` is slow. See NOTE in code for full explanation.

This is a significant optimization; will be updating the torchdispatch/return_and_correct_aliasing portion of this stack with benchmark and results soonish.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161315
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304, #161308
2025-08-30 06:55:42 +00:00
b96bcb9fdb Optimize _python_dispatch.return_and_correct_aliasing.get_write_alias (#161308)
- Empty containers are Falsey
- Hoist cheap checks first
- Microbenchmarked single-element set access method

Benchmark code:
```
import timeit

to_test = [
    ('list(x)', 'x = set([3])'),
    ('x[0]', 'x = [3]'),
    ('list(x)[0]', 'x = set([3])'),
    ('next(iter(x))', 'x = set([3])'),
]

for (stmt, setup) in to_test:
    res = timeit.timeit(stmt=stmt, setup=setup)
    print(f"Time for `{stmt}`: {res}")
```

Result with Python 3.13 on Mac (with excess digits manually trimmed; directionally matches result on Linux)
```
Time for `list(x)`: 0.03418
Time for `x[0]`: 0.00852
Time for `list(x)[0]`: 0.03561
Time for `next(iter(x))`: 0.02278
```

FWIW, I was surprised by this result, so I guess I'm glad I wrote the benchmark!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161308
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304
2025-08-30 06:55:42 +00:00
2089ed3d5e Use is, not ==, to check exact type matches in _python_dispatch (#161304)
`is` checks object identity and is more efficient. Google seems to confirm it is the correct way to do an exact type check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161304
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292
2025-08-30 06:55:42 +00:00
1a64bf2636 Stop accessing func._schema in _python_dispatch.correct_storage_aliasing (#161292)
func._schema is a pybind, accessing the arguments/returns is expensive, we have no reason to do it anyway, and even though #161301 makes accessing the arguments/returns less expensive, this still seems to improve performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161292
Approved by: https://github.com/wconstab, https://github.com/malfet, https://github.com/bdhirsh
ghstack dependencies: #161301
2025-08-30 06:55:42 +00:00
3519969e4f [Intel GPU] Enable tensor memory descriptor in triton template for XPU. (#161600)
As Intel Triton now supports tensor descriptor, this PR updates the pinned Intel Triton version and introduces support for Triton MM template with tensor descriptor on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161600
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-08-28 12:39:58 +00:00
5edc3d814f Add option for TorchDispatchMode to ignore torch.compile internals (#161648)
If TorchDispatchMode.ignore_compile_internals() is True, then we turn
off the TorchDispatchMode during the compilation process, instead
turning it back on during runtime of the compiled artifact.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161648
Approved by: https://github.com/bdhirsh
2025-08-28 02:41:33 +00:00
cbc53b7696 Update pybind11 submodule to 3.0.1 (#160754)
Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling.

Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent.

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754
Approved by: https://github.com/Skylion007
2025-08-27 21:15:01 +00:00
80bf883d21 Replace manual cache in _python_dispatch.get_alias_info with functools.cache (#161286)
In addition to being more code, the manual cache was doing an extra dictionary lookup on each cache hit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161286
Approved by: https://github.com/wconstab
2025-08-27 00:17:51 +00:00