Compare commits

..

1513 Commits

Author SHA1 Message Date
bc67bce2e5 Working setup with runnable PyTorch on Codex.
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: 132668d46021090fe3ef197fb25ba762ce42667c
Pull-Request: https://github.com/pytorch/pytorch/pull/159968
2025-08-06 14:56:40 -07:00
79eca4677b [precompile] Skip serializing unnecesssary objects for guards. (#158926)
Summary:
The following type of objects don't need to be serialized for precompile:
1. PyCapsule because we don't guard on C binding objects in meaningful ways.
2. Code object because we only id matching on these but id matches will always be dropped for precompile.
3. Nested function objects since we also ban CLOSURE_MATCH.

Test Plan:
buck run mode/opt test/dynamo:test_dynamo -- -k test_skipped_objects

Rollback Plan:

Differential Revision: D78816888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158926
Approved by: https://github.com/jamesjwu
2025-08-06 15:00:28 +00:00
2855688a1d Revert "Replace C array with std::array in formatSockAddr (#159812)"
This reverts commit e7feedf6a9bb346ad205796aa4084c8dcfb18072.

Reverted https://github.com/pytorch/pytorch/pull/159812 on behalf of https://github.com/malfet due to Looks like it broke distribtued tests, see 2231c3ca3a/1 ([comment](https://github.com/pytorch/pytorch/pull/159812#issuecomment-3160513656))
2025-08-06 14:55:48 +00:00
2231c3ca3a [CI][CD] Fix install_nvshem function (#159907)
When one builds CD docker, all CUDA dependencies must be installed into `/usr/local/cuda/` folder

Test plan: Looks at the binary build logs, for example [here](https://github.com/pytorch/pytorch/actions/runs/16768141521/job/47477380147?pr=159907):
```
2025-08-06T05:58:00.7347471Z -- NVSHMEM_HOME set to:  ''
2025-08-06T05:58:00.7348378Z -- NVSHMEM wheel installed at:  ''
2025-08-06T05:58:00.7392528Z -- NVSHMEM_HOST_LIB:  '/usr/local/cuda/lib64/libnvshmem_host.so'
2025-08-06T05:58:00.7393251Z -- NVSHMEM_DEVICE_LIB:  '/usr/local/cuda/lib64/libnvshmem_device.a'
2025-08-06T05:58:00.7393792Z -- NVSHMEM_INCLUDE_DIR:  '/usr/local/cuda/include'
2025-08-06T05:58:00.7394252Z -- NVSHMEM found, building with NVSHMEM support
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159907
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-08-06 14:44:37 +00:00
c03a734ba1 [OpenReg] Disable automatic inclusion of data files (#159845)
# Background

After I built torch_openreg, I noticed that the wheel package contained the stub.c file under the csrc directory, which was not used in the runtime.

# Motivation

This PR aims to remove the stub.c file and any unused file when running torch_openreg.

**Changes:**

- Setting **include_package_data** keyword to false in the setup function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159845
Approved by: https://github.com/albanD
2025-08-06 10:35:13 +00:00
98316e5896 [WOQ] Add CUDA kernel for _weight_int8pack_mm (#159325)
**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849

**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads

**Implementation**
- Implement a Triton kernel for:
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]

where:
x: [B, K] float32
w: [N, K] int8
scale: [N] float32
out: [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path

Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769

```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/

Benchmark Results:
```
**[Shape B=256, K=1024, N=512]**
CPU and CUDA outputs match
Max abs diff: 2.59e-04, max rel diff: 0.75
CPU: 144.14 ms, CUDA: 303.67 µs
Speedup: ×474.6

**[Shape B=512, K=2048, N=1024]**
CPU and CUDA outputs match
Max abs diff: 5.49e-04, max rel diff: 0.15
CPU: 1173.27 ms, CUDA: 2.40 ms
Speedup: ×488.5
```
Rollback Plan:

Differential Revision: D79042656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325
Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168
2025-08-06 10:28:08 +00:00
23cf241039 [aoti][mps] Initialize mps kernels first (#159753)
In some cases we have mps kernels which are reused across higher-order-op subgraphs and the toplevel code. However, currently we initialize the variable for the mps kernel the first time we use it, which runs into an issue if we run into the mps kernel within a subgraph since the kernel will only be initialized within the subgraph scope. For instance:
```
if ...
    auto mps_lib_0_func = ...
    mps_lib_0_func->run()

// since we already used mps_lib_0 once, we don't re-initialize it
mps_lib_0_func->run()  // error, mps_lib_0_func not initialized
```

So the solution we took here is to initialize all the kernels at the beginning:
```
const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() {
    static const auto func = mps_lib_0.getKernelFunction("generated_kernel");
    return func;
}
AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
    static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get());
    return handle;
}
...
if ...
    get_mps_lib_0()->run()

get_mps_lib_0()->run()  // success
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159753
Approved by: https://github.com/malfet
ghstack dependencies: #159456, #159695
2025-08-06 07:54:29 +00:00
e7feedf6a9 Replace C array with std::array in formatSockAddr (#159812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159812
Approved by: https://github.com/Skylion007
2025-08-06 07:44:29 +00:00
dad2a05bec [DTensor] Set up DTensorContinuousTestBase (#159885)
Also migrate `test_common_rules.py` since it was a short file

`python test/distributed/tensor/test_common_rules.py`

Before:
Ran 10 tests in 91.516s
After:
Ran 10 tests in 5.604s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159885
Approved by: https://github.com/ezyang
2025-08-06 07:40:31 +00:00
0495cab545 Wire in pt2_triton_builds (#159897)
Summary:
This allows us to start seeing the failure rate on these models (and
potentially alert on it).

Test Plan:
```
FORCE_LOG_TRITON_BUILDS_TO_PROD=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run @//mode/opt :compile 2>&1 | tee out
```
P1889607054

Waiting for scuba table to generate, but manual logging show it should show up at https://fburl.com/scuba/pt2_triton_builds_inc_archive/7852kt8h soon.

Rollback Plan:

Reviewed By: masnesral

Differential Revision: D79308333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159897
Approved by: https://github.com/masnesral
2025-08-06 07:39:51 +00:00
abfe403981 [AIDIR] Internal util function to insert MLHub debugging insight for dynamic shape (#159391)
Summary:
This feature is Meta internal only
Add a util function to put dynamic shape-related suggestion to MLHubDebugInsightService, which will then be surfaced to users in the MLHub .

The rollout will be controlled by JK.

Test Plan:

MAST job aps-omnifmv3_dev_baseline_test-a34fdccf21

 {F1980593060}

* If you're not able to see the insight, please add yourself to this gk 'mlhub_debugging_insights_dev_visibility'
* The URL link should route to a new Job Inspector page that will provide details and straight forward instructions of how to config the ds. The page is currently still in development so here we use the general PT2 compile JI page.
* Test fails because of the export checks. I'll export after addressing all the comments from reviewers.

Rollback Plan:

Reviewed By: pianpwk

Differential Revision: D78526522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159391
Approved by: https://github.com/jingsh
2025-08-06 07:39:39 +00:00
1690c0c3a0 [Reland] Migrate ScalarType to headeronly (#159911)
The non ghstack version of #159416, to make sure we don't get reverted again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159911
Approved by: https://github.com/mikaylagawarecki
2025-08-06 07:36:37 +00:00
e9d27aa8fd [CUDA 13] CMake/Dependencies: no need to call find_package(CUB) (#159854)
CUB library is the part of CCCL of the CUDA Toolkit 13. If CUDA Found, CUB is found as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159854
Approved by: https://github.com/eqy
2025-08-06 06:03:58 +00:00
2457e62c90 Revert "Set PYTHONHOME for inductor subprocesses using torch (#159382)"
This reverts commit fe8984a9f43bde10d1956abe7cb40710ed7ceed2.

Reverted https://github.com/pytorch/pytorch/pull/159382 on behalf of https://github.com/malfet due to Broke MacOS testing see d0fccbc99c/1 ([comment](https://github.com/pytorch/pytorch/pull/159382#issuecomment-3157455367))
2025-08-06 05:30:20 +00:00
d0fccbc99c [CI] Delete sm86 tests from pull (#159903)
And delete sm89+cuda12.4 builds from periodic (as sm86+legacy driver should be enough)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159903
Approved by: https://github.com/huydhn
2025-08-06 05:16:55 +00:00
3461988a4b [audio hash update] update the pinned audio hash (#159823)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159823
Approved by: https://github.com/pytorchbot
2025-08-06 05:02:35 +00:00
9764981116 Pass fw/bw compilers to aot_export_joint_with_descriptors (#159814)
Allow overriding nop compilers with real ones when using this flow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159814
Approved by: https://github.com/fmassa
2025-08-06 04:50:56 +00:00
704594eb23 [Dynamo] make HOPs hashable (#159910)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159910
Approved by: https://github.com/yf225
2025-08-06 04:02:17 +00:00
eqy
bfc27cf468 [Distributed] Fix @parametrize on unordered iterable in distributed test (#159793)
seems to fix https://github.com/pytorch/pytorch/issues/145807

sets aren't ordered so `@parametrize` can cause two processes to spawn with different settings

originally debugged thanks to @k-artem, see https://github.com/pytorch/pytorch/issues/145807#issuecomment-2971009451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159793
Approved by: https://github.com/Skylion007, https://github.com/wconstab
2025-08-06 03:51:42 +00:00
311f74089a remove print (#159917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159917
Approved by: https://github.com/laithsakka
2025-08-06 03:48:23 +00:00
14c7358c64 Enable fr_trace to read local traces from multiple hosts. (#159490)
Summary: For training jobs particularly from GenAI, NCCL trace dumps are generated in the format of `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces. The current prefix matching logic can't handle this case.

Test Plan:
Create a local folder `dumps` and several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1` inside it. Then run
```
buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps
```

Before this diff, fr_trace cannot locate any trace files, giving the following assertion error:
```
AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_
```

After this diff, fr_trace is able to locate the trace files, resulting in the exceptions like
```
    dump = pickle.load(infile)
           ^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
```
(since the trace files are fake and empty).

Rollback Plan:

Differential Revision: D79224727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490
Approved by: https://github.com/fduwjj
2025-08-06 03:15:34 +00:00
8ce81bcee1 [Torch Package] Make get names of OrderedImporters support fallback to importers (#155743)
Summary:
OrderedImporters is supposed to be an importer which tries out every single importer in self._importers. However the get_name API does not follow this behavior and only uses the get_name from the basic Importer class.
This change is to update the OrderedImporters get_name API so that it tries the get_name API of every single importers.

Differential Revision: D76463252

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155743
Approved by: https://github.com/jcwchen, https://github.com/jingsh
2025-08-06 02:26:10 +00:00
4604f0482c Add UT for torch.accelerator memory-related API (#155200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155200
Approved by: https://github.com/albanD
ghstack dependencies: #138222, #152932
2025-08-06 02:22:18 +00:00
15f1173e5d Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following API will be put under torch.accelerator
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-08-06 02:22:18 +00:00
e16c48ae97 [BE] Fix type hint in AOTIRunnerUtil (#159577)
Not sure why it was labelled as list in the first place. In test_aot_inductor.py, I scanned a few use cases and they are tuple as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159577
Approved by: https://github.com/Skylion007
2025-08-06 01:20:45 +00:00
f7a66da5f9 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with [RFC] [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories, such as HuggingFace [so many if-else conditional code](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under torch.accelerator namespace to generalize these user cases.

<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>

```python
torch.xxx.empty_cache
```

</td>
<td>

```python
torch.accelerator.empty_cache
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_peak_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_peak_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_accumulated_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_accumulated_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_stats
```

</td>
<td>

```python
torch.accelerator.memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_allocated
```

</td>
<td>

```python
torch.accelerator.memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_allocated
```

</td>
<td>

```python
torch.accelerator.max_memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_reserved
```

</td>
<td>

```python
torch.accelerator.memory_reserved
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_reserved
```

</td>
<td>

```python
torch.accelerator.max_memory_reserved
```

</td>
</tr>

</table>
</div>

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD, https://github.com/Camyll
2025-08-06 00:40:29 +00:00
3eb3da9b4b [dynamo][guards] Skip ID_MATCH guard on self.__class__.__closure__ (#159888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159888
Approved by: https://github.com/williamwen42
2025-08-06 00:36:43 +00:00
3ddfd46bd2 Cut a version of TORCH_ERROR_CODE_CHECK in headeronly from AOTI (#159604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159604
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-08-06 00:29:56 +00:00
6a82da392e [export] Fix generated schema for C++20/23 (#159871)
Summary: Fixing the issue from https://github.com/pytorch/pytorch/issues/159838

Test Plan:
buck run caffe2/:export_update_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/

Rollback Plan:

Differential Revision: D79647167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159871
Approved by: https://github.com/malfet
2025-08-06 00:23:05 +00:00
22bedc429f Extract some HOP utils to be importable (#159705)
Useful helper function for stage 1 export -> manual partitioner -> stage 2 compile users

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159705
Approved by: https://github.com/zou3519
ghstack dependencies: #159134
2025-08-05 23:59:47 +00:00
49abc0e3f8 [Take 2] Setup TorchBench in Docker (#159300)
Fix and reland https://github.com/pytorch/pytorch/pull/158613, I keep `checkout_install_torchbench` in `.ci/pytorch/macos-test.sh` script because it's still used there, and there is no Docker.

### Testing

MacOS perf nightly run https://github.com/pytorch/pytorch/actions/runs/16580798470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159300
Approved by: https://github.com/ZainRizvi
2025-08-05 23:47:42 +00:00
1052604acd fix logging setup issue for Windows.. (#159887)
When we setup logging config as guide: https://docs.pytorch.org/docs/stable/logging.html
Such as:
    TORCH_LOGS="+schedule,+inductor,+output_code"
On Linux, it shows as:
```cmd
declare -x SSH_TTY="/dev/pts/0"
declare -x TERM="xterm"
declare -x TORCH_LOGS="+schedule,+inductor,+output_code"
declare -x USER="xu"
```
On Windows, it shows as:
```cmd
TORCHINDUCTOR_WINDOWS_TESTS=1
TORCH_LOGS="+schedule,+inductor,+output_code"
UCRTVersion=10.0.22000.0
```
For Linux, it shows quotes by default, And Windows is not shows quotes.
Besides that, Windows would auto assemble quotes when env var processing.

On Linux, we will get variable: "+schedule,+inductor,+output_code"
On Windows, we will get variable: '"+schedule,+inductor,+output_code"'

So, we need remove the outer quotes for Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159887
Approved by: https://github.com/angelayi
2025-08-05 23:44:38 +00:00
fe8984a9f4 Set PYTHONHOME for inductor subprocesses using torch (#159382)
Summary:
This is needed for subprocesses that are trying to call back into torch
functionality, i.e. anything that's also setting `PYTHONPATH`.  There are more
`sys.executable` subprocesses in torch/ but it seems like they're fine.

Test Plan: Local inference runs.

Reviewed By: aorenste

Differential Revision: D79124705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159382
Approved by: https://github.com/aorenste
2025-08-05 23:32:48 +00:00
74a754aae9 Add meta kernel for sdpa_math_for_mps (#159695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159695
Approved by: https://github.com/malfet
ghstack dependencies: #159456
2025-08-05 22:27:06 +00:00
b1ec088113 [mps] Turn on inductor dynamic shapes tests (#159456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-08-05 22:27:06 +00:00
fb35a9ea4a [export] Improve error messages (#159881)
Originally, if the PT2 errored when loading, we would try to load using the old loader to fit BC issues. However this hides the error messages for if an up-to-date PT2 is erroring when loading due to some other reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159881
Approved by: https://github.com/yushangdi
2025-08-05 22:26:48 +00:00
8034b2a732 [inductor] Add TLParse artifact for logging runtime of collective and compute ops (#159730)
Summary:

- debug.py: Added log_runtime_estimates() function to dump runtime estimation data as structured tlparse artifacts in JSON format
- test_structured_trace.py: Added comprehensive test coverage with testing compute and collective ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159730
Approved by: https://github.com/yushangdi
ghstack dependencies: #159190
2025-08-05 22:06:32 +00:00
64cc6f06b1 [Inductor] Revert minimal changes to avoid internal test failures (#159809)
The diff/PR https://github.com/pytorch/pytorch/pull/159211 caused a bunch of test failures for graph compiler(T232684410). But I couldn't figure out a forward fix so far. So with this diff/PR, I'm proposing to revert the minimal changes to resolve the test failures.

I'll continue the debugging, and re-land the reverted changes once we find out a forward fix.

Differential Revision: [D79221721](https://our.internmc.facebook.com/intern/diff/D79221721/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159809
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2025-08-05 22:05:26 +00:00
410812763b Revert "[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777)"
This reverts commit bbc0df1094b5a4dcd2cce83f8402127b07913231.

Reverted https://github.com/pytorch/pytorch/pull/159777 on behalf of https://github.com/izaitsevfb due to breaking inductor test on ROCm ([comment](https://github.com/pytorch/pytorch/pull/159777#issuecomment-3156770098))
2025-08-05 22:00:24 +00:00
bdb07a2bc5 [Cutlass] Allow offsets to be passed as arguments to kernel (#159761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159761
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #159760
2025-08-05 21:59:07 +00:00
8085edc8f9 [autograd] torch._C._set_view_replay_enabled state leaking into other tests (#159840)
This was causing view_fns to pop up in tests that ran after `TestAutograd.test_view_replay_enabled` where it isn't used as a context manager. It is unclear to me why we would want `_force_original_view_tracking` to mutate global state on __init__ rather than on __enter__, that could be an alternative fix.

FIXES https://github.com/pytorch/pytorch/issues/156306 https://github.com/pytorch/pytorch/issues/156289 https://github.com/pytorch/pytorch/issues/156265 https://github.com/pytorch/pytorch/issues/156209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159840
Approved by: https://github.com/albanD
2025-08-05 21:57:49 +00:00
882d50c5bf [C10] Add Scalar::isUnsigned() method (#159877)
That returns true if Scalar hold unsigned integral value

With the implications of `Tag::HAS_u` semantic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159877
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-08-05 21:43:21 +00:00
b52a4d0821 [ez][CI] Remove some unused docker images (#159171)
Removes unused docker images from the docker build workflow
Then removes unused definitions in build.sh

The only one I left is the vllm one because I'm pretty sure it's going to be used in the future

I assume everything not mentioned is old and we forgot to remove them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159171
Approved by: https://github.com/yangw-dev
2025-08-05 21:31:53 +00:00
a45a840926 [CI] Disable check-labels and check_mergeability (#159900)
See https://github.com/pytorch/pytorch/issues/159825
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159900
Approved by: https://github.com/clee2000
2025-08-05 21:16:12 +00:00
9b953bb3fb [BE] Update TensorPipe pin (#159834)
No functional changes, just:
- Update C++ standard to C++17
- Update `cmake` min version to 3.18
- Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10)
- Replace boost optional implementation with `std::optional` wrapper
- Make it compilable with gcc-14.x plus by including `cstddef` in few headers
-  Avoid using deprecated enums for MacOS builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834
Approved by: https://github.com/Skylion007
2025-08-05 20:45:09 +00:00
eb25a95a6e Fix inductor memory estimation when a single buf has multiple mutations. Add runtime verification of mem tracking (#159569)
With fsdp, we sometimes have multiple, non-overlapping views of a single buffer which are all mutated. Previously we considered the original buffer as an allocation, and make the mutated buffer the deallocation. With multiple mutations of the same buffer, we need to consider the original buffer as deallocated only when all of its aliases die (and avoid double counting the input buffer size). See comment inline:

```
    When an operation mutates a buffer in-place, the scheduler creates a new buffer name
    to track the "before" and "after" states, even though they share the same memory.
    The mutated buffer represents a rename with zero allocation and deallocation cost.
    During dependency tracking, we transfer dependencies from the mutated name back to
    the original buffer, ensuring the original memory is only freed when all aliases
    are done.
    This handles cases where a buffer has multiple non-overlapping aliases - rather than
    trying to assign free costs to individual aliases, we forward all alias dependencies
    to the original buffer.
    Consider:
        buf0 = op0()
        buf1 = mutation_op_(buf0)
        del buf0
        ...
        op(buf1)
        del buf1
    The only memory events are the creation prior to op0, and the deletion following buf1.
```

As @IvanKobzarev 's logs in https://github.com/pytorch/pytorch/pull/158361/files#diff-e173a1d52aff49959c9f6d17ecc09946d8a616fc5909df884e62a15e1ebd1d41R1776-R1807 show, it can a bit of a pain to pinpoint which part of our memory calculation is incorrect.

This pr also adds a runtime verifier `config.test_configs.track_memory_lifecycle` which tracks buffer allocation and deallocation, and errors if their lifetime does not match our expectations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159569
Approved by: https://github.com/IvanKobzarev
2025-08-05 19:58:11 +00:00
eqy
9884d0351e [CUDA] Decrease launch bounds of CTCLoss backward for blackwell (#159522)
Otherwise we see `CUDA error: too many resources requested for launch`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159522
Approved by: https://github.com/janeyx99
2025-08-05 19:26:25 +00:00
d7c83972d5 tools: Add mode to find python automatically (#159820)
Add support for automatically finding Python interpreters in manylinux
environments to our wheel building script. Scaffolding for sequential builds

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159820
Approved by: https://github.com/malfet
2025-08-05 19:19:22 +00:00
e06b110f73 [Testing] Add MPS to NATIVE_DEVICES (#153835)
This would allow me to enable more opinfo tests against MPS device eventually and supposed to be a very simple test, but actually required minor adjustments to lots of test files, namely:
- Introduce `all_mps_types_and` that is very similar to `all_types_and`, but skips `float64`
- Decorate lots of tests with `@dtypesIfMPS(*all_mps_types())`
- Skip `test_from_dlpack_noncontinguous` as it currently crashes (need to be fixed)
- Add lots of `expectedFailureIfMPS`
- Delete all `@onlyNativeDeviceTypesAnd("mps")`

&lt;sarcasm&gt; I love how well documented this variable are &lt;/sarcasm&gt;

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153835
Approved by: https://github.com/Skylion007
2025-08-05 18:57:35 +00:00
0ba09a6d34 fix link for tutorial of inductor on windows (#159853)
fix link issue from https://docs.pytorch.org/tutorials/prototype/inductor_windows.html to https://docs.pytorch.org/tutorials/unstable/inductor_windows.html due to structure change with pr https://github.com/pytorch/tutorials/pull/3489
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159853
Approved by: https://github.com/sekyondaMeta

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
Co-authored-by: Zesheng Zong <zesheng.zong@outlook.com>
2025-08-05 18:37:47 +00:00
aeb5321b63 Allow controlling PG backend and options via init_device_mesh (#159371)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159371
Approved by: https://github.com/wconstab, https://github.com/fduwjj, https://github.com/wanchaol
2025-08-05 12:44:14 +00:00
625108ede2 [inductor] consolidate common GEMM triton param retrieval (#159383)
\# Why

- Make loop iteration simpler
- Have a common spot where to make modifications that affect
  all the GEMM Triton templates, avoiding missed spots

\# What

- pull out commong logic of taking the BaseConfig objects
  and turning them into kwargs to feed into maybe_append_choice
  for Triton GEMM templates

Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383
Approved by: https://github.com/jansel
2025-08-05 11:42:25 +00:00
09e5a93fcb Improve graph output alias with subclass error message (#159619)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159619
Approved by: https://github.com/albanD
2025-08-05 06:47:31 +00:00
908c5cc4c0 Generalize torch._C._set_allocator_settings to be generic (#156175)
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #159629, #150312, #156165
2025-08-05 04:08:42 +00:00
c1145852a5 Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #159629, #150312
2025-08-05 04:08:42 +00:00
ae1a706444 Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We would deprecate those option that overleap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #159629
2025-08-05 04:08:04 +00:00
56d19a5ced Fix AllocatorConfig potential SIO issue (#159629)
# Motivation
As @ScottTodd identified in this [comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3141524874), using STL containers like `std::string` and `std::unordered_set` at static init time can cause static initialization order issues. This PR is based on and modified from his original PR: https://github.com/pytorch/pytorch/pull/159607. I’m stacking this PR here to help facilitate the landing and validation process.

Co-authored-by: @ScottTodd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159629
Approved by: https://github.com/ScottTodd, https://github.com/albanD
2025-08-05 04:07:51 +00:00
b6c53383fe [Dynamo][Better Engineering] Type annotation for torch/_dynamo/output_graph.py (#159602)
As part of better engineering effort, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to `torch/_dynamo/output_graph.py`

Running
```
mypy torch/_dynamo/output_graph.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2163 | 4792 | 45.14% | 121 | 268 | 45.15% |
| This PR | 4818 | 4818 | 100.00% | 268 | 268 | 100.00% |
| Delta    | +2655 | +26 | +54.84% | +147 | 0 | +54.85% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159602
Approved by: https://github.com/Skylion007
2025-08-05 03:50:54 +00:00
4fd5fabee9 skip XPU for dataloader CPU only unit test (#159811)
Fixes [#159802](https://github.com/pytorch/pytorch/issues/159802)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159811
Approved by: https://github.com/izaitsevfb
2025-08-05 03:44:01 +00:00
bbc0df1094 [Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777)
Summary: Inductor's 3.4 Triton release is the most common used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs.

Test Plan:
Relying on CI. Should be a NFC.

Rollback Plan:

Reviewed By: davidberard98

Differential Revision: D79378792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159777
Approved by: https://github.com/davidberard98
2025-08-05 03:29:13 +00:00
33ec6e3e9a Remove pin on libuv from instructions (#159504)
This package doesn't exist at conda-forge and causes some confusion for users.
see https://anaconda.org/conda-forge/libuv/files?version=1.39.0

libuv is quite stable, so the newer versions should be fine. we build with them anyway at conda-forge.

see: https://github.com/conda-forge/libuv-feedstock/issues/80

Hopefully this can help future users.

Fixes https://github.com/conda-forge/libuv-feedstock/issues/80

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159504
Approved by: https://github.com/seemethere
2025-08-05 03:18:42 +00:00
efc4b460b3 Add cascade sum support for Inductor CPP backend (#156296)
Fixes #154703

Add cascade summation support for Inductor CPP backend to improve precision for large size summation.

Currently, Inductor CPP directly do reduction for sum. As shown in #154703, when the size of the sum is large and the number of parallel is small, direct reduction will cause an intolerable precision loss:
```
extern "C"  void kernel(float* in_out_ptr0,
                       const float* in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    {
        {
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(3000000000L); x0+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(3000000000L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                        tmp_acc0_vec = tmp_acc0_vec + tmp0;
                    }
                }
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float, 1>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<int64_t>(0L)] = static_cast<float>(tmp_acc0);
        }
    }
    {
        {
            {
                auto tmp0 = out_ptr0[static_cast<int64_t>(0L)];
                auto tmp1 = static_cast<float>(3000000000.0);
                auto tmp2 = tmp0 / tmp1;
                in_out_ptr0[static_cast<int64_t>(0L)] = tmp2;
            }
        }
    }
}
```

After adding cascade sum support:

```
extern "C"  void kernel(float* in_out_ptr0,
                       const float* in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    {
        {
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            at::vec::Vectorized<float> masked_tmp_acc0_vec = at::vec::Vectorized<float>(0);
            CascadeSumHelper<float, 65536> scalar_cascade_helper0(static_cast<int64_t>(3000000000L));
            CascadeSumHelper<at::vec::Vectorized<float>, 65536> cascade_helper0(static_cast<int64_t>(187500000L));
            CascadeSumHelper<at::vec::Vectorized<float>, 65536> masked_cascade_helper0(static_cast<int64_t>(0L));
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(3000000000L); x0+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(3000000000L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                        tmp_acc0_vec = cascade_sum_combine(tmp0, &cascade_helper0);
                    }
                }
            }
            tmp_acc0 = cascade_sum_final(&scalar_cascade_helper0);
            tmp_acc0_vec = cascade_sum_final(&cascade_helper0);
            masked_tmp_acc0_vec = cascade_sum_final(&masked_cascade_helper0);
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float, 1>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec + masked_tmp_acc0_vec);
            out_ptr0[static_cast<int64_t>(0L)] = static_cast<float>(tmp_acc0);
        }
    }
    {
        {
            {
                auto tmp0 = out_ptr0[static_cast<int64_t>(0L)];
                auto tmp1 = static_cast<float>(3000000000.0);
                auto tmp2 = tmp0 / tmp1;
                in_out_ptr0[static_cast<int64_t>(0L)] = tmp2;
            }
        }
    }
}
```
This will inevitably reduce performance when cascade sum is turned on.
For the case shown in #154703: performance reduced by ~3%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156296
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-08-05 02:54:32 +00:00
1ca8388442 [BE][MPS] Remove unused size12 variable (#159832)
Fixes following compilation warning
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:433:8: warning: unused variable 'size12' [-Wunused-variable]
  auto size12 = input_sizes[1] * input_sizes[2];
       ^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159832
Approved by: https://github.com/dcci
2025-08-05 02:32:06 +00:00
b69497351d [nativert] force resize to zero. (#159683)
Summary:
this was quite a miserable bug. there are a few kernels that don't explicitly resize outputs to zero, which led to some weird UB.

Rollback Plan:

Differential Revision: D79476454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159683
Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier
2025-08-05 02:25:31 +00:00
482f069c41 [C10D] fix slow init due to repeated dns resolution failure (#159596)
It can be be very slow to repeatedly hit DNS resolution failure, but
its very helpful to have DNS names in logs by default. So we try to use DNS
but if we hit a transient failure we just disable it for the remainder of the
job, logging IP addresses instead.

Fixes #159007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159596
Approved by: https://github.com/d4l3k
2025-08-05 02:15:26 +00:00
85d931f29e Use uppercase OR when checking for system XNNPACK (#159527)
This PR fixes `cmake/Dependencies.cmake` to work when compiling with `USE_SYSTEM_XNNPACK=ON` by changing a lowercase `or` to an uppercase `OR`.

---

For a personal project, I was building pytorch with a customized build of XNNPACK. When trying to do so I encountered the following error:

```
CMake Error at cmake/Dependencies.cmake:566 (if):
  if given arguments:

    "NOT" "XNNPACK_LIBRARY" "or" "NOT" "microkernels-prod_LIBRARY"

  Unknown arguments specified
Call Stack (most recent call first):
  CMakeLists.txt:868 (include)
```

Upon making the change in this PR (changing `or` to `OR`), the process continued as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159527
Approved by: https://github.com/janeyx99
2025-08-05 02:10:53 +00:00
8a2f53c523 Recursively sync fbgemm submodules before build (#159477)
ROCm inductor benchmark builds failing fbgemm build stage https://ossci-raw-job-status.s3.amazonaws.com/log/46800456622
```
2025-07-27T08:00:32.3443858Z /var/lib/jenkins/pytorch/fbgemm/src/RowWiseSparseAdagradFused.cc:389:18: error: no matching function for call to ‘asmjit::v1_17::x86::Vec::Vec(uint32_t)’
2025-07-27T08:00:32.3444080Z   389 |         x86::Xmm partial_sum_xmm(partial_sum_vreg.id());
```

It looks like asmjit fails to build, this seems to be due to submodules of fbgemm not being updated after checking out to new commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159477
Approved by: https://github.com/pruthvistony, https://github.com/eqy
2025-08-05 02:00:54 +00:00
b59b61a099 Add avg_pool3d backward pass for MPS (#159089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159089
Approved by: https://github.com/malfet
2025-08-05 01:55:38 +00:00
57ab39f7e4 Update torch-xpu-ops commit pin (#159621)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1f7a57](1f7a57f507) includes:

- Add Template Parameter to the function `gpu_kernel` for Controlling Broadcasting Vectorization
- Add optional NaN checks to XCCL
- Fix NllLossForwardReduce2DKernelFunctor accuracy
- Extend the existing communication logging to include the reduction operation for collective calls
- [Reland] Install xpu codegen header to torch/include
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159621
Approved by: https://github.com/EikanWang
2025-08-05 01:46:15 +00:00
182975e01a [Dynamo] Enable torch function dispatch on HOPs (#159708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159708
Approved by: https://github.com/zou3519, https://github.com/XilunWu
ghstack dependencies: #159707
2025-08-05 01:43:22 +00:00
9f8cfe7476 [Dynamo] Fix arg ordering in tf modes (#159707)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159707
Approved by: https://github.com/zou3519
2025-08-05 01:43:21 +00:00
e273ff028a Fix failing test (#159800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159800
Approved by: https://github.com/aorenste
2025-08-05 00:28:51 +00:00
5e0fc2c9a9 [AOTI] don't allow int32 indices if {non-inf, > int32_max} upper bound is provided (#159433)
**Motivation / Context**: (what I _think_ is happening here)

In "eager"/just-in-time PT2 usage, dynamo/inductor will guard on whether indices fit in int32 or not. So it's generally safe in Inductor code to rely on the example values for symbolic ints in order to determine whether indices fit in int32, because the indices will be guarded on anyway; and if the inputs ever increase to `>int32_max`, dynamo will cause a recompilation.

But with AOTI, those int32 guards aren't respected; so if the example input is `< int32_max` but can be `> int32_max` during future execution, then the future execution might fail / IMA.

**Solution space**

Export allows users to specify which dimension are dynamic, and to provide **ranges of valid sizes**.

One solution idea is to always respect the upper bound of the dynamic shape range when doing AOTI; if the index's range includes values `>int32_max`, then don't use the hint and assume that this index doesn't fit in int32.

However, the problem with this is that many users may specify dynamism without specifying a range of values - the upper bound of the range will be set to the default of `inf`. Such use cases could potentially experience a perf regression if we implemented the idea above.

To prevent any such regressions, this implementation will rely solely on the specified range only if the upper bound of the range isn't inf. In other words, we'll ignore the hints/example values for AOTI (and rely only on the specified range) only if the upper bound of the range isn't inf - if users explicitly specify a range that extends past int32, we can be fairly sure that they actually do need values `>int32_max`.

If we continue to see correctness issues even with this implementation, we could consider more aggressively relying on the ranges.

Differential Revision: [D79220301](https://our.internmc.facebook.com/intern/diff/D79220301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159433
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
2025-08-05 00:17:09 +00:00
bc4b04e058 DeviceCopy should have the same layout as input (#159615)
Summary: Fix https://github.com/pytorch/pytorch/issues/159612

- Fix the meta implementation of `nan_to_num`, it should preserve the stride of the input
- The DeviceCopy IR node should always preserve the input's layout, so we don't end up with a contiguous call during device copy

Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_d2h_copy
```

Rollback Plan:

Differential Revision: D79411407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159615
Approved by: https://github.com/eellison
2025-08-04 23:56:58 +00:00
6b414f56a4 Revert "[inductor] add lowering for repeat_interleave.Tensor with output size specified (#147160) (#158462)" (#159798)
This reverts commit 305a03727672de42870f956ddf4ad9fa424443e1.

Reason: causes device-side assertion failures when running with this repro (a minimized version of a failure seen in a real model)

```
import torch
def ri(inp, repeats, output_size):
    return torch.repeat_interleave(inp, repeats, output_size=output_size)
inp = torch.arange(0, 4, device="cuda").reshape(-1, 1)
x = torch.tensor([1, 2, 3, 4], device="cuda")
ri_c = torch.compile(ri)
print(ri(inp, x, 10))
print(ri_c(inp, x, 10))
```

which leads to errors like

```
/tmp/torchinductor_dberard/3h/c3hlb22fpptebupstsuhl6kexa6z3upgbnyxln7c24gfcr5747iu.py:30: unknown: block: [0,0,0], thread: [10,0,0] Assertion `index out of bounds: 0 <= tmp5 < 4` failed.
```

Differential Revision: [D79591561](https://our.internmc.facebook.com/intern/diff/D79591561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159798
Approved by: https://github.com/danzimm
2025-08-04 23:39:20 +00:00
fb8f32ef52 Revert "[mps] Turn on inductor dynamic shapes tests (#159456)"
This reverts commit 19f1f9960db7f29f2110a7f49f06a1a23c651ecf.

Reverted https://github.com/pytorch/pytorch/pull/159456 on behalf of https://github.com/davidberard98 due to Sorry - this causes a merge conflict with https://github.com/pytorch/pytorch/pull/159798, which I'm trying to land with co-dev to resolve a sev ([comment](https://github.com/pytorch/pytorch/pull/159456#issuecomment-3152751821))
2025-08-04 23:11:05 +00:00
7ba996bbaa [Cutlass] Fix wrapper code generation breakage (#159760)
Fixes issues introduced by https://github.com/pytorch/pytorch/pull/159355

The issue got past OSS CI because the H100 tag wasn't added, not sure how to prevent these kinds of issues in the future, perhaps we should run H100 on Inductor PRs?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159760
Approved by: https://github.com/angelayi
2025-08-04 23:03:03 +00:00
ddbdcdc710 [cutlass backend][test] Expand FP8 tests to FP16 (#159538)
Differential Revision: [D79317343](https://our.internmc.facebook.com/intern/diff/D79317343/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159538
Approved by: https://github.com/mlazos
2025-08-04 23:01:55 +00:00
19f1f9960d [mps] Turn on inductor dynamic shapes tests (#159456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-08-04 22:44:31 +00:00
fd6655a0f5 Feature: Implement support for cudnn_batch_norm_out kernel to replace the autogen approach. (#123020)
Fixes #115611

Autogen kernel may cause redundant copy, so we develop the kernel to improve efficiency.

Test Case:

```c++
#include <torch/torch.h>
#include <iostream>
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>

int main() {
    auto input = torch::rand({2, 3, 4, 4}, torch::device(torch::kCUDA));
    auto weight = torch::randn({3}, torch::device(torch::kCUDA));
    auto bias = torch::randn({3}, torch::device(torch::kCUDA));
    auto running_mean = torch::zeros({3}, torch::device(torch::kCUDA));
    auto running_var = torch::ones({3}, torch::device(torch::kCUDA));

    bool training = true;
    double exponential_average_factor = 0.1;
    double epsilon = 1e-5;

    auto output = torch::empty_like(input);
    auto save_mean = torch::empty({3}, torch::device(torch::kCUDA));
    auto save_var = torch::empty({3}, torch::device(torch::kCUDA));
    auto reserve = torch::empty({0}, torch::device(torch::kCUDA)); // empty place-holder

    at::native::cudnn_batch_norm_out(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon, output, save_mean, save_var, reserve);
    auto outputs = at::native::cudnn_batch_norm(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon);

    bool is_close_output = torch::allclose(output, std::get<0>(outputs));
    bool is_close_save_mean = torch::allclose(save_mean, std::get<1>(outputs));
    bool is_close_save_var = torch::allclose(save_var, std::get<2>(outputs));
    bool is_close_reserve = torch::allclose(reserve, std::get<3>(outputs));

    std::cout << "Is output close: " << is_close_output << std::endl;
    std::cout << "Is save_mean close: " << is_close_save_mean << std::endl;
    std::cout << "Is save_var close: " << is_close_save_var << std::endl;
    std::cout << "Is reserve close: " << is_close_reserve << std::endl;

    return 0;
}
```

Please CC @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123020
Approved by: https://github.com/andrewor14, https://github.com/eqy, https://github.com/albanD
2025-08-04 22:40:33 +00:00
a7f3bdf550 [Dynamo][Better Engineering] Type coverage for torch/_dynamo/utils.py (#159580)
As part of better engineering effort, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to `torch/_dynamo/utils.py`

Running
```
mypy torch/_dynamo/utils.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2163 | 4792 | 45.14% | 121 | 268 | 45.15% |
| This PR | 4818 | 4818 | 100.00% | 268 | 268 | 100.00% |
| Delta    | +2655 | +26 | +54.84% | +147 | 0 | +54.85% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159580
Approved by: https://github.com/williamwen42
2025-08-04 21:51:53 +00:00
510e8b4ae0 [inductor] use writable temp file on windows (#159738)
Use `WritableTempFile` on Windows, reference to: https://github.com/pytorch/pytorch/pull/159342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159738
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-08-04 21:51:02 +00:00
83ba3f1101 Revert "[inductor] allocate non-blocking copy destinations in pinned memory (#155121) (#158758)"
This reverts commit 6085bf7565fec0d2ed26e8590001f09c05adbbe4.

Reverted https://github.com/pytorch/pytorch/pull/158758 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371))
2025-08-04 21:47:11 +00:00
1fad16aacb Revert "[inductor] move all cpu scalars using pinned memory for graph partition (#155360) (#158983)"
This reverts commit 444e2381d07a14cb501c00d11f9e63a3f1d2c86e.

Reverted https://github.com/pytorch/pytorch/pull/158983 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371))
2025-08-04 21:47:11 +00:00
444e2381d0 [inductor] move all cpu scalars using pinned memory for graph partition (#155360) (#158983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158983
Approved by: https://github.com/eellison
ghstack dependencies: #158758
2025-08-04 21:42:05 +00:00
6085bf7565 [inductor] allocate non-blocking copy destinations in pinned memory (#155121) (#158758)
Fixes #155121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758
Approved by: https://github.com/EikanWang, https://github.com/eellison
2025-08-04 21:22:11 +00:00
8201dbf4bc check driver to be >=12.4 to use fabric handles (#159697)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159697
Approved by: https://github.com/malfet
2025-08-04 21:05:39 +00:00
26d045bb60 Linux py 3.14 wheel builds (#157559)
Related to https://github.com/pytorch/pytorch/issues/156856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157559
Approved by: https://github.com/malfet, https://github.com/albanD
2025-08-04 20:55:19 +00:00
356ac3103a Revert "Stop parsing command line arguments every time common_utils is imported. (#156703)"
This reverts commit 310f901a71e53688866b14bb2f2b4c8eef9979b3.

Reverted https://github.com/pytorch/pytorch/pull/156703 on behalf of https://github.com/izaitsevfb due to breaking tests internally with `assert common_utils.SEED is not None` ([comment](https://github.com/pytorch/pytorch/pull/156703#issuecomment-3152337518))
2025-08-04 20:37:39 +00:00
d4109a0f99 [MPS] Add max_unpool1d/2d/3d (#159789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159789
Approved by: https://github.com/malfet
2025-08-04 20:00:59 +00:00
7ea789ccfb Revert #156868: Bring back symint check for sharding propagation cache (#159671)
Fixes #159601

Unfortunately #156868 introduced a couple regressions (see #159590 and #159601). This reverts the commit while I am working on a permanent fix. This means the `in_compiled_autograd_initial_trace` global flag will be removed and the `_are_we_tracing()` will instead be replaced with the symint preprocessing step during sharding prop post init.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159671
Approved by: https://github.com/xmfan
2025-08-04 19:58:48 +00:00
7e8197e34d Revert "Migrate ScalarType to headeronly (#159416)"
This reverts commit 1371a98b0e727f8a8916dd473b6dd0cff78c0449.

Reverted https://github.com/pytorch/pytorch/pull/159416 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D79452481 ([comment](https://github.com/pytorch/pytorch/pull/159416#issuecomment-3152138508))
2025-08-04 19:55:09 +00:00
50eac811a6 [typing] Constrain OrderedSet generic to be Hashable (#159684)
Ran across this typing bug while creating an OrderedSet from a type I didn't realize wasn't hashable, which failed at runtime. With this constraint, typing would've failed pre-runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159684
Approved by: https://github.com/Skylion007
2025-08-04 18:08:01 +00:00
4e0f179d0b Update the signature and test of torch.hamming_window() (#152682)
Fixes #146590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152682
Approved by: https://github.com/albanD
2025-08-04 17:50:42 +00:00
36e59d9b12 [c10d][nvshmem] fix missing override compilation error for nvshmem symmetric code (#159557)
Summary:
Fix error when compiling nvshmem code section `NVSHMEMSymmetricMemory.cu` with BUCK

```
fbcode/caffe2/torch/csrc/distributed/c10d/symm_mem/NVSHMEMSymmetricMemory.cu:154:20: error: 'get_buffer' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  154 | virtual at::Tensor get_buffer(int
      |                    ^
fbcode/caffe2/torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp:56:20: note: overridden virtual function is here
   56 | virtual at::Tensor get_buffer(int rank, c10::IntArrayRef sizes, c10::ScalarType dtype, int64_t storage_offset) = 0;
```

Test Plan:
Build test + CI

Rollback Plan:

Differential Revision: D78813586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159557
Approved by: https://github.com/kwen2501
2025-08-04 17:46:30 +00:00
fc340d0ca3 [export] Allow comparing device w/o index with device w/ index (#159665)
In the case where we have expected device "cuda" and given device "cuda:0" I think we should succeed?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159665
Approved by: https://github.com/yushangdi
2025-08-04 17:00:07 +00:00
53e47af0f7 [dynamo][guards] Read the attr name from GetAttrGuardAccessor (#159754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159754
Approved by: https://github.com/jansel
ghstack dependencies: #159752
2025-08-04 16:51:27 +00:00
66ad881fc7 [dynamo][guards][refactor] Simplify type extraction from GuardManager (#159752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159752
Approved by: https://github.com/jansel
2025-08-04 16:51:27 +00:00
1d3eef27ac [ROCm CI] Migrate to MI325 Capacity (#159649)
Migrate mi300s to gfx942.

Related to https://github.com/pytorch/pytorch/pull/159059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159649
Approved by: https://github.com/huydhn
2025-08-04 16:48:12 +00:00
dd95900cec [AOTI] normalize_path_separator file path for Windows. (#159726)
`normalize_path_separator` file path for Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159726
Approved by: https://github.com/angelayi, https://github.com/jansel
2025-08-04 15:57:19 +00:00
1cdd665526 fix test_verbose_logs_dynamic_shapes with MSVC (#159573)
Operator `typeid` have different outputs in different compiler. There is a good example in [cppreference](https://www.en.cppreference.com/w/cpp/language/typeid.html).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159573
Approved by: https://github.com/angelayi, https://github.com/jansel
2025-08-04 15:56:53 +00:00
7cb2dcd2dd [c10d][nvshmem] modify is_nvshmem_available runtime check to work with static-linked library (#159558) (#159561)
Summary:

Currently this function rely on the logic that we load `libnvshmem_device.a` statically and load `libnvshmem_host.so` at runtime. For loading `libnvshmem.a` (the combine 2 thing together) statically this will fail. Add a section to check if the symbol from host API exist at runtime to check if nvshmem is loaded statically

Test Plan:
CI + sample run

Rollback Plan:

Differential Revision: D79177525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159561
Approved by: https://github.com/kwen2501
2025-08-04 15:40:29 +00:00
e5a81aa7ba Fix conversion of values in libtorch agnostic tests (#155115)
Due to different byteorder,
when copying data, it has to be put into last bytes to ensure that int32_t converted to int64_t keeps same value. Same has to be done when it's converted back.

This change fixes test
TestLibtorchAgnosticCPU::test_my_ones_like_cpu
from
cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py on s390x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155115
Approved by: https://github.com/huydhn
2025-08-04 13:40:22 +00:00
3e2aa4b0e3 Update pin to include Python 3.14 support (#159725)
Update Triton Pin to top of rel/3.4 branch : https://github.com/triton-lang/triton/tree/rel/3.4 . This is the same as release/3.4.x branch but also includes Python 3.14 support

This should unblock enablement of Python 3.14 support in this PR: https://github.com/pytorch/pytorch/pull/157559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159725
Approved by: https://github.com/davidberard98
2025-08-04 13:30:12 +00:00
6646461764 S390X: fix detection of magic number placeholder in inductor (#157784)
This change fixes multiple tests in
test/inductor/test_aot_inductor_arrayref.py
such as
test_cond_with_parameters_cpu_with_stack_allocation,
test_issue_140766_cpu_with_stack_allocation,
test_model_modified_weights_cpu_with_stack_allocation,
test_nested_tensor_from_jagged_cpu_with_stack_allocation.

Enable tests in test/inductor/test_aot_inductor_arrayref.py

This change is split off from https://github.com/pytorch/pytorch/pull/150116

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157784
Approved by: https://github.com/huydhn
2025-08-04 12:42:31 +00:00
f74da2a136 [xla hash update] update the pinned xla hash (#159758)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159758
Approved by: https://github.com/pytorchbot
2025-08-04 11:21:45 +00:00
eqy
d35b27dde5 [CUDA] Add some more missing @serialTest decorators (#159672)
Seems to fix #159663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159672
Approved by: https://github.com/Skylion007
2025-08-04 07:44:35 +00:00
a9dc1566d4 [MTIA Aten Backend] Migrate arange.start_out (#159540)
Differential Revision: [D79317519](https://our.internmc.facebook.com/intern/diff/D79317519/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159540
Approved by: https://github.com/malfet, https://github.com/nautsimon
2025-08-04 07:38:05 +00:00
33a1996714 Fix perf downgrad by reverting template use in use_mkldnn_matmul (#159024)
This PR is to fix the performance downgrad by reverting template use in `use_mkldnn_matmul` in #157520 . Fix https://github.com/pytorch/pytorch/issues/159031 and https://github.com/pytorch/pytorch/issues/159551.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159024
Approved by: https://github.com/mingfeima
2025-08-04 05:49:46 +00:00
ee62177c19 [dynamo] Be consistent with storing func source for UserMethodVariable (#159696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159696
Approved by: https://github.com/jansel
ghstack dependencies: #159534
2025-08-04 05:12:44 +00:00
64cbaa876c [dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159534
Approved by: https://github.com/jansel
2025-08-04 05:12:44 +00:00
4516c59f5f [dynamo][source] Add special source for __code__ and __closure__ (#159722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159722
Approved by: https://github.com/jansel
2025-08-04 05:02:05 +00:00
8bc843a9ec [vllm hash update] update the pinned vllm hash (#159610)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159610
Approved by: https://github.com/pytorchbot
2025-08-04 04:06:09 +00:00
e39a62c70d Fix warnings in triton_helpers.py (#159719)
```
  /home/jansel/pytorch/torch/_inductor/runtime/triton_helpers.py:152: UserWarning: Logical operators 'and' and 'or' are deprecated for non-scalar tensors; please use '&' or '|' instead
    equal |= a_isnan and b_isnan
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159719
Approved by: https://github.com/Skylion007
2025-08-04 03:21:09 +00:00
978e3a9142 refresh expected results (#159727)
Just regular update due to recent <10% changes CI is stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159727
Approved by: https://github.com/anijain2305
2025-08-03 22:47:50 +00:00
e2a5c42e7e [BE][MPS] Build metal kernels of MacOS-14+ (#159733)
Which makes `#if __METAL_VERSION__ >= 310` guards for `bfloat` use support unnecessary.
Rename `kernels_bfloat.metallib` into `kernels_basic` and remove custom build/selection logic.

Part of https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159733
Approved by: https://github.com/dcci
ghstack dependencies: #159731, #159732
2025-08-03 20:53:58 +00:00
5116c49b52 [BE] Remove macos-13 guard from bench_mps_ops (#159732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159732
Approved by: https://github.com/dcci
ghstack dependencies: #159731
2025-08-03 20:53:58 +00:00
fecdebe385 [CI][MPS] Fix compile benchmark correctness (#159731)
By passing `fullgraph=True` attribute and increasing cache size limit to 2**16

Otherwise, compiler might decide not to fall back to eager to avoid recompilations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159731
Approved by: https://github.com/dcci
2025-08-03 20:53:50 +00:00
e136a9175b [BE] Fix dev warning in Dependencies.cmake (#159702)
Namely
```
CMake Warning (dev) in cmake/Dependencies.cmake:
  A logical block opening on the line

    /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:261 (if)

  closes on the line

    /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:263 (endif)

  with mis-matching arguments.
```

Introduced by https://github.com/pytorch/pytorch/pull/143846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159702
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-08-03 18:45:07 +00:00
9a680e14b7 [bucketing] Reduce CPU overhead for reduce_scatter_merge_fn_to_trace (#159723)
The previous implementation was creating `n_gpu * n_tensors` intermediate tensors, which was adding a lot of CPU overhead, specially given that inductor was generating a number of individual tensor copy kernels for `torch.cat` .

This PR changes the implementation so that only `n_tensors` are created, making the CPU overhead proportional to the number of tensors being bucketed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159723
Approved by: https://github.com/IvanKobzarev
2025-08-03 09:16:55 +00:00
805a102beb Revert "[dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534)"
This reverts commit 1616777cd2a3170ff76afa3e7860b0969420c445.

Reverted https://github.com/pytorch/pytorch/pull/159534 on behalf of https://github.com/malfet due to Broke some inductor test and lint among other things, see 9c18901bfd/1 ([comment](https://github.com/pytorch/pytorch/pull/159534#issuecomment-3146983186))
2025-08-03 04:58:32 +00:00
6e8d705a22 Revert "[dynamo] Be consistent with storing func source for UserMethodVariable (#159696)"
This reverts commit be71000ff5292293d1976f313218e2df4d5046d3.

Reverted https://github.com/pytorch/pytorch/pull/159696 on behalf of https://github.com/malfet due to Broke some inductor test and lint among other things, see 9c18901bfd/1 ([comment](https://github.com/pytorch/pytorch/pull/159534#issuecomment-3146983186))
2025-08-03 04:58:32 +00:00
9c18901bfd [MTIA Aten Backend] Migrate all.out (#159539)
Differential Revision: [D79317033](https://our.internmc.facebook.com/intern/diff/D79317033/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159539
Approved by: https://github.com/malfet
ghstack dependencies: #159098
2025-08-03 02:08:35 +00:00
a29ed5e1ac Add torch compile force disable caches alias (#158072)
Bunch of people keep thinking current alias only disables inductor cache because it has the name inductor in it. lets globalize the name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072
Approved by: https://github.com/ezyang
2025-08-02 23:23:17 +00:00
d2792f51b2 [bucketing] Use max of input/output size for bucketing (#159717)
The output of a reduce_scatter is n_gpu times smaller than its input, while the output of an all_gather is n_gpu times larger than its input. This means that in the current heuristic for bucketing reduce_scatter, we would need to use a bucket size which is n_gpu times larger than the bucket for all_gather, making it gpu-dependent and less intuitive. This PRs propose to use instead the max between the input and output sizes, so that one can use the same bucket_size value for both passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159717
Approved by: https://github.com/wconstab
2025-08-02 22:42:22 +00:00
be71000ff5 [dynamo] Be consistent with storing func source for UserMethodVariable (#159696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159696
Approved by: https://github.com/jansel
ghstack dependencies: #159186, #159534
2025-08-02 21:40:38 +00:00
3f86076775 gc before warming up benchmarking (#159670)
#158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark is leaving around before the benchmark starts - so GC before warming up the model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670
Approved by: https://github.com/oulgen
2025-08-02 19:37:24 +00:00
1616777cd2 [dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159534
Approved by: https://github.com/jansel
ghstack dependencies: #159186
2025-08-02 18:04:35 +00:00
38895c0ac2 Update RuntimeError message in is_nonzero(input) method from bool to Boolean (#159712)
RuntimeError message updated in is_nonzero(input) method from bool to Boolean.

**Case 1:**
t = torch.tensor([])
torch.is_nonzero(t)

**Case 2:**
t = torch.tensor([1,2])
torch.is_nonzero(t)

**Existing Error message in documentation:**

for case 1: RuntimeError: bool value of Tensor with no values is ambiguous
for case 2: RuntimeError: bool value of Tensor with more than one value is ambiguous

**Proposed Error message in documentation:**

for case 1: RuntimeError: Boolean value of Tensor with no values is ambiguous
for case 2: RuntimeError: Boolean value of Tensor with more than one value is ambiguous

Fixes #159710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159712
Approved by: https://github.com/malfet
2025-08-02 17:23:45 +00:00
310f901a71 Stop parsing command line arguments every time common_utils is imported. (#156703)
Last PR in the series to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs:

https://github.com/pytorch/pytorch/pull/154612
https://github.com/pytorch/pytorch/pull/154628
https://github.com/pytorch/pytorch/pull/154715
https://github.com/pytorch/pytorch/pull/154716
https://github.com/pytorch/pytorch/pull/154725
https://github.com/pytorch/pytorch/pull/154728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156703
Approved by: https://github.com/clee2000
2025-08-02 16:38:54 +00:00
e11b1cd97e [ROCm] fix nightly wheel due to rocBLAS environment variable (#159570)
Fixes #159070

The TunableOp failure is due to missing rocBLAS files in our manywheels packaging. This bug has been present since June 7-8 time frame. It was caused by a typo in the rocBLAS environment variable that stores the list of files. It was introduced in this PR: https://github.com/pytorch/pytorch/pull/155388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159570
Approved by: https://github.com/malfet
2025-08-02 06:54:43 +00:00
b599d91738 Log autotune choices and benchmark result to scuba/chrome trace (#159496)
Summary:
Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels.

**Example**

Event: mm_template_autotuning
Column: autotune_choices

```json
{
  "num_choices": 52,
  "num_triton_choices": 19,
  "best_kernel": "cutlass_f6c25cf2",
  "best_kernel_desc": "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8",
  "best_time": 0.6283040046691895,
  "best_triton_pos": 26,
  "best_triton_time": 0.6832960247993469,
  "best_triton_kernel": "triton_mm_17",
  "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"
}
```

Test Plan:
```
TORCHINDUCTOR_MAX_AUTOTUNE_REPORT_CHOICES_STATS =1 buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
```

Rollback Plan:

Differential Revision: D79235037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159496
Approved by: https://github.com/masnesral
2025-08-02 05:34:17 +00:00
fd6a6658c3 Enable _int_mm on Intel GPU (#157769)
# Moativation

This PR is used to enable _int_mm on Intel GPU. And _int_mm is used by int8 quantization on torchao.

# Model Test Result:
We run meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8-dynamic-quantization. The model configs as below:
Precision : torch.bfloat16
quantization configuration : Int8DynamicActivationInt8WeightConfig
dataset : wikitext

Result:
The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157769
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-08-02 05:16:01 +00:00
04973496a8 [audio hash update] update the pinned audio hash (#159611)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159611
Approved by: https://github.com/pytorchbot
2025-08-02 05:15:47 +00:00
1548b011ea Fix rand_like decomposition to preserve strides (#159294)
Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898.

Test Plan: New unit test (fails before this PR; but fixed after)

Differential Revision: [D79472604](https://our.internmc.facebook.com/intern/diff/D79472604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294
Approved by: https://github.com/eellison
2025-08-02 03:54:41 +00:00
e57a92734d [export] Fix nn_module_stack of assert_tensor_metadata nodes (#159625)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159625
Approved by: https://github.com/yushangdi
2025-08-02 02:52:42 +00:00
79ff3b320b Back out "[ez] get rid of unused var" (#159677)
Summary: turns out i added this to reduce the frequency we'd call try_update_max_size_at_index when a new maximum is found before the replan is called. oops.

Test Plan:
backout

Rollback Plan:

Differential Revision: D79474114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159677
Approved by: https://github.com/georgiaphillips
2025-08-02 01:50:16 +00:00
426f249f20 Fix launch grid calculation (#159497)
Summary:

The launch grid calculation code is using a python trick to achieve CeilDiv() through negative integer division with FloorDiv(). This is language dependent behaviour that doesn't apply to all languages.

In the FXIR backend we negate this behaviour and replace the experssion with CeilDiv() operation so the computation is correct regardless of language used. Not directly directly changing the orginal computation as it leads to a performance degredation.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79275534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159497
Approved by: https://github.com/blaine-rister
2025-08-02 01:12:58 +00:00
d33a484763 Use boxed_nop_preserve_node_meta for aot_export_joint_with_descriptors (#159545)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159545
Approved by: https://github.com/xmfan, https://github.com/wconstab
ghstack dependencies: #159336, #159337
2025-08-02 00:33:41 +00:00
a81ffbc5f5 improve shape checks for grouped_mm (#159666)
Check that contraction dimension matches between tensors if it's known, and do device-side checks for correct offsets
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159666
Approved by: https://github.com/danielvegamyhre, https://github.com/eqy
2025-08-02 00:12:25 +00:00
465fe4d9f7 Enable sample nightly PT2 benchmark on B200 (#158011)
Per the discussion with @nWEIdia, this resumes the work on https://github.com/pytorch/pytorch/pull/157870 to enable PT2 benchmark on B200

### Testing

https://github.com/pytorch/pytorch/actions/runs/16615101382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158011
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-08-01 23:47:44 +00:00
9477af1063 fix compilation on cuda < 12.3 (#159657)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159657
Approved by: https://github.com/kwen2501
2025-08-01 23:40:55 +00:00
dcc36e38bb [Graph Breaks] Remove unsupported Additional Info field (#159658)
Race condition when landing PR#158800 caused us to add this field when it is deprecated, so remove it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159658
Approved by: https://github.com/williamwen42
2025-08-01 23:25:50 +00:00
efd78584a8 [EZ] Add linux-aarch64.yml workflow to the viable/strict blocking set (#159668)
Since it's required to be run on every PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159668
Approved by: https://github.com/malfet
2025-08-01 23:19:08 +00:00
135762ea20 Unpin helion (#159579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159579
Approved by: https://github.com/jansel
2025-08-01 23:08:06 +00:00
e2ee9cfaa2 [NativeRT] Turn on enableStaticCPUKernels by default (#159422)
Summary: As title.

Test Plan:
Need to manual test on production models.

Rollback Plan:

Differential Revision: D78747742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159422
Approved by: https://github.com/dolpm
2025-08-01 22:27:07 +00:00
06d28de17a Update CK Kernel generation and update ck submodule (#157964)
changes required to reduce the number of ck kernels generated. This change depends on https://github.com/ROCm/composable_kernel/pull/2480 to be merged first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157964
Approved by: https://github.com/842974287
2025-08-01 22:24:27 +00:00
df9720b8b5 [MTIA Aten Backend] Migrate all foreach ops (#159098)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate all foreach operators to in-tree, including:
  - _foreach_abs
  - _foreach_abs_
  - _foreach_add.List
  - _foreach_add_.List
  - _foreach_add_.Scalar
  - _foreach_add_.Tensor
  - _foreach_addcmul.Scalar
  - _foreach_addcmul_.Scalar
  - _foreach_copy
  - _foreach_copy_
  - _foreach_mul.List
  - _foreach_mul_.List
  - _foreach_mul_.Scalar
  - _foreach_mul.Tensor
  - _foreach_mul_.Tensor
  - _foreach_norm.Scalar
  - _foreach_sqrt_

Differential Revision: [D78913847](https://our.internmc.facebook.com/intern/diff/D78913847/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159098
Approved by: https://github.com/malfet
2025-08-01 22:10:12 +00:00
85e74d5ace [inductor] Add logging for distributed collective ops for multi‑rank diagnostics (#159190)
This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs.

- Iterates over scheduler.nodes, filters for _CollectiveKernel nodes
- Extracts each op’s python_kernel_name
- Emits a structured JSON payload under the inductor_collective_schedule artifact name
- Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact
- Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190
Approved by: https://github.com/yushangdi, https://github.com/xmfan
2025-08-01 21:51:42 +00:00
0450f05658 Output tensor meta data for FX graph node (#159311)
FX graph segment in CompiledFxGraph does not include tensor meta data, for example, tensor shape, tensor stride, tensor data type, tensor device. AI system co-design team requested to include these information in FX graph segment so they can use FX graph segment to project the performance on different hardware.
This DIFF is to modify the Graph::Node::format_node to include tensor meta data.
Before this DIFF, the triton kernel FX graph segment looks like the following:
```
# %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm]
# %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1]
# %sin : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {})
# %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {})
# %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 1111), kwargs = {})
# %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {})
# %cos : cuda:0"[num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%add,), kwargs = {})
# return %cos
After this DIFF:
# %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm]
# %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1]
# %sin : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {})
# %permute_1 : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {})
# %mul : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 1111), kwargs = {})
# %add : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {})
# %cos : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%add,), kwargs = {})
# return %cos
```
If format_node can not be changed, I can copy the code to caffe2/torch/_inductor/utils.py.

Differential Revision: D77973076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159311
Approved by: https://github.com/angelayi
2025-08-01 21:40:29 +00:00
595a65f5c2 [dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/script_object.py (#159343)
Fixes part of #147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159343
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-08-01 21:30:41 +00:00
8c6c2e40eb Edit a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend (#159542)
As suggested in the pull request #158903 by @H-huang, this pull request edits a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159542
Approved by: https://github.com/d4l3k, https://github.com/H-Huang
2025-08-01 21:20:25 +00:00
32840d19f9 [cutlass backend] skip stream k if shape is dynamic (#159442)
Differential Revision: [D79229210](https://our.internmc.facebook.com/intern/diff/D79229210/)

Motivation is workspace size is hard to determine, and varies for different shape. What I observed is sometimes the shape got smaller, but the workspace can increase. So it is hard to upper bound it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159442
Approved by: https://github.com/ColinPeppler
2025-08-01 20:42:24 +00:00
2040f00112 [BE][Easy] respect os.environ in subprocess calls in tools/nightly.py (#159572)
Respect parent shell's envvars, such as `UV_INDEX_STRATEGY`, `http{,s}_proxy`, etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159572
Approved by: https://github.com/Skylion007
2025-08-01 20:40:31 +00:00
c137f9da0b [Dynamo][Better Engineering] Add type coverage to dynamo/compiled_autograd.py (#159518)
As part of better engineering effort, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to `torch/_dynamo/compiled_autograd.py`

Running
```
mypy torch/_dynamo/compiled_autograd.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  425 | 1553 | 27.37% | 17 | 62 | 27.42% |
| This PR | 1623 | 1623 | 100.00% | 62 | 62 | 100.00% |
| Delta    | +1198| +0 | +72.63% | +45 | 0 | +72.58% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159518
Approved by: https://github.com/xmfan
2025-08-01 20:24:58 +00:00
5e8b95605f [PP] Support OVERLAP_F_B computation type (#158978)
Some changes to validation code and visualizer to support a new computation type that will be used in DualPipeV (see https://github.com/pytorch/pytorch/pull/159591)

The IR looks like:

```
[0F0, 0F1, 0F2, 0F3, 0F4, 0F5, 0F6, 7F0, 7I0, 7W0, 7F1, 7I1, 7W1, 7F2, 7I2, 7W2, 7F3, (0F7;7B3)OVERLAP_F_B, (7F4;0B0)OVERLAP_F_B, (0F8;7B4)OVERLAP_F_B, (7F5;0B1)OVERLAP_F_B, (0F9;7B5)OVERLAP_F_B, (7F6;0B2)OVERLAP_F_B, 7B6, (7F7;0B3)OVERLAP_F_B, 7B7, (7F8;0B4)OVERLAP_F_B, 7B8, (7F9;0B5)OVERLAP_F_B, 7B9, 0I6, 0W6, 0I7, 0W7, 0I8, 0W8, 0I9, 0W9]
[1F0, 1F1, 1F2, 1F3, 1F4, 6F0, 1F5, 6F1, 6I0, 6W0, 6F2, 6I1, 6W1, 6F3, (1F6;6B2)OVERLAP_F_B, (6F4;1B0)OVERLAP_F_B, (1F7;6B3)OVERLAP_F_B, (6F5;1B1)OVERLAP_F_B, (1F8;6B4)OVERLAP_F_B, (6F6;1B2)OVERLAP_F_B, (1F9;6B5)OVERLAP_F_B, (6F7;1B3)OVERLAP_F_B, 6B6, (6F8;1B4)OVERLAP_F_B, 6B7, (6F9;1B5)OVERLAP_F_B, 6B8, 1B6, 6I9, 1I7, 6W9, 1I8, 1W7, 1I9, 1W8, 1W9]
[2F0, 2F1, 2F2, 5F0, 2F3, 5F1, 2F4, 5F2, 5I0, 5W0, 5F3, (2F5;5B1)OVERLAP_F_B, (5F4;2B0)OVERLAP_F_B, (2F6;5B2)OVERLAP_F_B, (5F5;2B1)OVERLAP_F_B, (2F7;5B3)OVERLAP_F_B, (5F6;2B2)OVERLAP_F_B, (2F8;5B4)OVERLAP_F_B, (5F7;2B3)OVERLAP_F_B, (2F9;5B5)OVERLAP_F_B, (5F8;2B4)OVERLAP_F_B, 5B6, (5F9;2B5)OVERLAP_F_B, 5B7, 2B6, 5B8, 2I7, 5I9, 2I8, 2W7, 2I9, 5W9, 2W8, 2W9]
[3F0, 4F0, 3F1, 4F1, 3F2, 4F2, 3F3, 4F3, 3F4, 4B0, (4F4;3B0)OVERLAP_F_B, (3F5;4B1)OVERLAP_F_B, (4F5;3B1)OVERLAP_F_B, (3F6;4B2)OVERLAP_F_B, (4F6;3B2)OVERLAP_F_B, (3F7;4B3)OVERLAP_F_B, (4F7;3B3)OVERLAP_F_B, (3F8;4B4)OVERLAP_F_B, (4F8;3B4)OVERLAP_F_B, (3F9;4B5)OVERLAP_F_B, (4F9;3B5)OVERLAP_F_B, 4B6, 3B6, 4B7, 3B7, 4I8, 3I8, 4I9, 3I9, 4W8, 3W8, 4W9, 3W9]
```

In this PR, the schedule execution will just treat the OVERLAP_F_B as two separate operations of F and B (so there is no actual overlap). The next step is to allow users to create a custom function to plug in what this operation does.

814629043a/torch/distributed/pipelining/schedules.py (L1205-L1216)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158978
Approved by: https://github.com/wconstab
2025-08-01 20:22:30 +00:00
8ea86a6e31 Actually test STD_TORCH_CHECK, add testfile to CMake (#159603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159603
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-08-01 19:53:41 +00:00
acad808545 Revert "[inductor] consolidate common GEMM triton param retrieval (#159383)"
This reverts commit e7cc42df58a86bee05944f6e80c535aa1d099443.

Reverted https://github.com/pytorch/pytorch/pull/159383 on behalf of https://github.com/jataylo due to sorry but rocm CI is broken due to this PR ([comment](https://github.com/pytorch/pytorch/pull/159383#issuecomment-3145604831))
2025-08-01 19:49:21 +00:00
c687446374 Revert "Fix rand_like decomposition to preserve strides (#159294)"
This reverts commit 2c46922ce4b33c39b1c48c302604805510a3f889.

Reverted https://github.com/pytorch/pytorch/pull/159294 on behalf of https://github.com/yangw-dev due to breaking internal test ([comment](https://github.com/pytorch/pytorch/pull/159294#issuecomment-3145541845))
2025-08-01 19:19:51 +00:00
dd22ba09b4 [C10D] Document barrier interaction with device_id (#159389)
Addresses #159262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159389
Approved by: https://github.com/malfet, https://github.com/H-Huang, https://github.com/kwen2501, https://github.com/fduwjj
2025-08-01 18:12:21 +00:00
c0e0126399 Remove unused input parameter in ExpandableSegment (#159356)
# Motivation
While refactoring the caching allocator, I noticed that the `ExpandableSegment` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.

# Additional Context
I noticed that `ExpandableSegment` is defined in cpp file, so it should be safe to make this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159356
Approved by: https://github.com/ngimel, https://github.com/albanD
ghstack dependencies: #159159
2025-08-01 17:47:51 +00:00
e4b123b5e4 Revert direct updates (#159654)
reverts:
```

commit 5711a8f06948eeee56ed5f53f171fa519f78491c (tag: trunk/5711a8f06948eeee56ed5f53f171fa519f78491c, origin/main, main)
Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com>
Date:   Fri Aug 1 09:32:52 2025 -0700

    Update test_utils.py

commit b4b71d011ed07a41c2086ff0dec2988a63662877 (tag: trunk/b4b71d011ed07a41c2086ff0dec2988a63662877)
Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com>
Date:   Fri Aug 1 09:27:54 2025 -0700

    Update utils.py

commit 52376b9b6fbf9fe24f5d82038dc520f0c64b6f8d (tag: trunk/52376b9b6fbf9fe24f5d82038dc520f0c64b6f8d)
Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com>
Date:   Fri Aug 1 09:26:05 2025 -0700
```

(commits pushed directly to main by mistake)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159654
Approved by: https://github.com/atalman
2025-08-01 16:54:51 +00:00
5711a8f069 Update test_utils.py 2025-08-01 09:32:52 -07:00
b4b71d011e Update utils.py 2025-08-01 09:27:54 -07:00
52376b9b6f Update convert_frame.py 2025-08-01 09:26:05 -07:00
1371a98b0e Migrate ScalarType to headeronly (#159416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159416
Approved by: https://github.com/albanD
ghstack dependencies: #159415, #159411
2025-08-01 16:07:01 +00:00
2a286cbdf4 Allow register_buffer with Tensor-like object (#159455)
As torch allows extending the tensor with `__torch_function__`, it would be desirable to allow registering it as a buffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159455
Approved by: https://github.com/mikaylagawarecki
2025-08-01 15:31:38 +00:00
7c37b8e1e0 [ROCm][Windows] Switch __builtin_clz ifdef from WIN32 to MSC_VER. (#159273)
PyTorch with ROCm on Windows is built with clang-cl and not MSVC. This code path is specific to the MSVC compiler so it should be checking for MSC_VER, not just WIN32. The change here is similar to https://github.com/pytorch/pytorch/pull/146606.

This fixes downstream build errors using clang-cl like https://github.com/ROCm/TheRock/actions/runs/16569646709/job/46858176812 (patched and tested downstream at https://github.com/ROCm/TheRock/pull/1140):
```
[7099/7147] Building CXX object functorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj
FAILED: functorch/CMakeFiles/functorch.dir/csrc/dim/dim.cpp.obj
C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\clang-cl.exe  /nologo -TP -DEXPORT_AOTI_FUNCTIONS -DFUNCTORCH_BUILD_MAIN_LIB -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNOMINMAX -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DROCM_ON_WINDOWS -DROCM_USE_FLOAT16 -DROCM_VERSION=70000 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -DTORCH_HIP_VERSION=700 -DUSE_EXTERNAL_MZCRC -DUSE_MIMALLOC -DUSE_PROF_API=1 -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_UCRT_LEGACY_INFINITY -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -Dfunctorch_EXPORTS -IB:\src\torch\build\aten\src -IB:\src\torch\aten\src -IB:\src\torch\build -IB:\src\torch -IB:\src\torch\nlohmann -IB:\src\torch\moodycamel -IB:\src\torch\third_party\mimalloc\include -IB:\src\torch\functorch -IB:\src\torch\torch\csrc\api -IB:\src\torch\torch\csrc\api\include -IB:\src\torch\c10\.. -IB:\src\torch\c10\hip\..\.. -IB:\src\torch\torch\.. -IB:\src\torch\torch\..\aten\src -IB:\src\torch\torch\..\aten\src\TH -IB:\src\torch\build\caffe2\aten\src -IB:\src\torch\build\third_party -IB:\src\torch\build\third_party\onnx -IB:\src\torch\torch\..\third_party\valgrind-headers -IB:\src\torch\torch\..\third_party\gloo -IB:\src\torch\torch\..\third_party\onnx -IB:\src\torch\torch\..\third_party\flatbuffers\include -IB:\src\torch\torch\..\third_party\kineto\libkineto\include -IB:\src\torch\torch\..\third_party\cpp-httplib -IB:\src\torch\torch\..\third_party\nlohmann\include -IB:\src\torch\torch\csrc -IB:\src\torch\torch\lib -IB:\src\torch\torch\standalone -IB:\src\torch\torch\lib\libshm_windows -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include -imsvcB:\src\torch\third_party\protobuf\src -imsvcB:\src\torch\third_party\XNNPACK\include -imsvcB:\src\torch\third_party\ittapi\include -imsvcB:\src\torch\cmake\..\third_party\eigen -imsvcB:\src\torch\third_party\ideep\mkl-dnn\include\oneapi\dnnl -imsvcB:\src\torch\third_party\ideep\include -imsvcB:\src\torch\INTERFACE -imsvcB:\src\torch\third_party\nlohmann\include -imsvcB:\src\torch\third_party\concurrentqueue -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\hiprand -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\rocrand -imsvcB:\src\torch\cmake\..\third_party\pybind11\include -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\include /DWIN32 /D_WINDOWS /EHsc /Zc:__cplusplus /bigobj /FS /utf-8 -DUSE_PTHREADPOOL -DNDEBUG -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE /wd4624 /wd4068 /wd4067 /wd4267 /wd4661 /wd4717 /wd4244 /wd4804 /wd4273 /O2 /Ob2 /DNDEBUG /bigobj -DNDEBUG -std:c++17 -MD -Z7 -Wmissing-prototypes -Werror=missing-prototypes /permissive- /d2implyavx512upperregs- /EHsc /bigobj -fms-runtime-lib=dll -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=700 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -DHIP_ENABLE_WARP_SYNC_BUILTINS -fms-extensions -Wno-ignored-attributes /showIncludes /Fofunctorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj /Fdfunctorch\CMakeFiles\functorch.dir\ -c -- B:\src\torch\functorch\csrc\dim\dim.cpp
clang-cl: warning: unknown argument ignored in clang-cl: '-std=c++17' [-Wunknown-argument]
clang-cl: warning: argument unused during compilation: '/d2implyavx512upperregs-' [-Wunused-command-line-argument]
In file included from B:\src\torch\functorch\csrc\dim\dim.cpp:36:
B:\src\torch\functorch\csrc\dim\arena.h(14,21): error: functions that differ only in their return type cannot be overloaded
   14 | inline unsigned int __builtin_clz(unsigned int x) {
      |        ~~~~~~~~~~~~ ^
C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\lib\clang\20\include\ia32intrin.h(60,15): note: '__builtin_clz' is a builtin with type 'int (unsigned int) noexcept'
   60 |   return 31 - __builtin_clz((unsigned int)__A);
      |               ^
1 error generated.
[7100/7147] Building CXX object caffe2\torch\CMakeFiles\torch_python.dir\csrc\utils\tensor_list.cpp.obj
```

> [!NOTE]
> I haven't been able to reproduce those errors locally, but we have CI jobs that consistently fail when building for Python 3.11 but not 3.12 or 3.13. I'm not sure what is different between those builds, but the code fix seems correct.

There are a few other variations on fixes to this floating around, such as:
* a97a957af0/lz4.c (L34-L43) (checking with `__has_builtin`)
* c98c55ec7e/lj92.c (L31-L46) (the same code as here, but with `_MSC_VER`)
* 2760e5a2bb/def.h (L23-L25) (using `__lzcnt` instead of a custom implementation)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159273
Approved by: https://github.com/Skylion007, https://github.com/m-gallus
2025-08-01 15:21:26 +00:00
ee2649219c Fix max_width computation in _tensor_str._Formatter (#126859)
Previous version of `torch._tensor_str._Formatter` was not using `PRINT_OPTS.sci_mode` for the `max_width` computation but was using it for the formatting of values leading to a weird discrepancy.

Now, the code first checks if it should be in sci_mode, then compute `max_width`

Here is an example to test the behavior:
```python
A = torch.tensor([10, 1e-1, 1e-2])
B = torch.tensor([10, 1e-1, 1e-1])

print("================= Default =================")
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")

print("================= sci_mode=False =================")
with torch._tensor_str.printoptions(sci_mode=False):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")

print("================= sci_mode=True =================")
with torch._tensor_str.printoptions(sci_mode=True):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
```

In the current version this prints:
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([   10.0000,     0.1000,     0.0100]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7
```

On can see that in `sci_mode=False`, the values of A are prefixed with unneeded 0 and does not have the same `max_width` as B (It keeps the `max_width` from `sci_mode = None`)

Also in `sci_mode = True`, for B, the `max_width` is 7 but each value takes 10 chars... (But it is fine as the code that uses `max_width` do not rely much on it, but still, this is missleading)

After this commit, this will print
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([10.0000,  0.1000,  0.0100]) Formatter max_width: 7
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10
```

This also allows to align A with B for `sci_mode=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859
Approved by: https://github.com/malfet
2025-08-01 15:05:41 +00:00
b0b3e6e48b [PP] Refactor test_schedule_multiproc (#158780)
This refactors the pipelining schedule tests since a lot of them have the same repeated code of:
1. Create pipelined model and reference model
2. Run reference model and pipelined model
3. compare gradients

So this refactors those parts above into helper methods and reduces ~300 LOC. Also adds a better gradient check to resolve flakiness (fixes https://github.com/pytorch/pytorch/issues/154408).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158780
Approved by: https://github.com/wconstab
2025-08-01 15:02:18 +00:00
3967dbedf4 [ContextParallel][FlexAttention] Prototype of supporting FlexAttention in Context Parallel (#158692)
**Summary**
This PR adds an all-gather based FlexAttention and uses TorchFunctionMode to dispatch
`FlexAttentionHOP.__call__` to it.

This PR makes the following changes:

- add a user-facing API `create_cp_block_mask` for creating CP-specific `BlockMask`
which masks over the attention result of Q shard and KV global.
- add `_ContextParallelGlobalVars` to store all necessary global vars that CP FlexAttention
requires. `torch_function_mode` is critical to maintain singleton mode to avoid dynamo
recompilations.
- add a dispatch path for `FlexAttentionForwardHOP.__call__` (TorchFunctionMode dispatch
won't work correctly without this line)

What's not in this PR:
- QKV load balancing
- Test on other masking besides `causal_mask`.
- Support on small attention (i.e. qkv size is smaller than 128) because the block mask
rewrite function requires `Q_BLOCK_SIZE == KV_BLOCK_SIZE == 128`.

**Test**
`pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention`

**Followup**
1. create an issue to reproduce the error in `create_fw_bw_graph()` when trying to call `create_block_mask`
to re-write `block_mask` in `FlexAttentionHOP` dispatch in `TorchFunctionMode`.
2. Merge `_ContextParallelGlobalVars` and `_cp_options`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158692
Approved by: https://github.com/drisspg
2025-08-01 06:49:01 +00:00
4396b15aa7 remove co_lnotab in favor of co_linetable (#159227)
Fixes #158833
DeprecationWarning: remove co_lnotab in favor of co_linetable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159227
Approved by: https://github.com/ezyang
2025-08-01 06:34:38 +00:00
bb6766053b fix strategy hashing arg mismatch (#159506)
Reland https://github.com/pytorch/pytorch/pull/159289.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159506
Approved by: https://github.com/XilunWu
2025-08-01 05:42:40 +00:00
a4fc051c9a Fix a bug of distributed 'gather' with noncontiguous tensors on the NCCL backend. (#159549)
Fixes #159548

* Throw an error message when the input tensors for the distributed `gather` are noncontiguous. This behaviour is consistent with the distributed `all_gather`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159549
Approved by: https://github.com/d4l3k
2025-08-01 03:26:06 +00:00
5cc6a0abc1 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)"
This reverts commit dfacf11f66d6512396382bdf5088f0ba9de00406.

Reverted https://github.com/pytorch/pytorch/pull/150312 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))
2025-08-01 03:24:54 +00:00
90f13f3b2a Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)"
This reverts commit 1fc010a9d8ea95bb74e54b31d17eba56ef16c27c.

Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))
2025-08-01 03:24:54 +00:00
cb9b74872b Revert "Generalize torch._C._set_allocator_settings to be generic (#156175)"
This reverts commit d3ce45012ed42cd1e13d5048b046b781f0feabe0.

Reverted https://github.com/pytorch/pytorch/pull/156175 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))
2025-08-01 03:24:54 +00:00
c964204829 [CI] Disable executorch jobs (#159595)
The current executorch pin needs to be updated

The next time the docker image gets rebuilt, the executorch docker build is going to fail like https://github.com/pytorch/pytorch/actions/runs/16626853655/job/47137807966

The failure is that the pin uses a version of the nightly that has been removed from the nightly index
```
#62 72.30 ERROR: Could not find a version that satisfies the requirement torch==2.8.0.dev20250601 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.1, 2.6.0, 2.7.0, 2.7.1, 2.8.0.dev20250602+cpu, 2.8.0.dev20250603+cpu, 2.8.0.dev20250604+cpu, 2.8.0.dev20250605+cpu, 2.8.0.dev20250606+cpu, 2.8.0.dev20250607+cpu, 2.8.0.dev20250608+cpu, 2.8.0.dev20250609+cpu, 2.8.0.dev20250610+cpu, 2.8.0.dev20250611+cpu, 2.8.0.dev20250612+cpu, 2.8.0.dev20250613+cpu, 2.8.0.dev20250614+cpu, 2.8.0.dev20250615+cpu, 2.8.0.dev20250616+cpu, 2.8.0.dev20250617+cpu, 2.8.0.dev20250618+cpu, 2.8.0.dev20250619+cpu, 2.8.0.dev20250620+cpu, 2.8.0.dev20250621+cpu, 2.8.0.dev20250622+cpu, 2.8.0.dev20250623+cpu, 2.8.0.dev20250624+cpu, 2.8.0.dev20250625+cpu, 2.8.0.dev20250626+cpu, 2.8.0.dev20250627+cpu, 2.9.0.dev20250628+cpu, 2.9.0.dev20250629+cpu, 2.9.0.dev20250630+cpu, 2.9.0.dev20250701+cpu, 2.9.0.dev20250702+cpu, 2.9.0.dev20250703+cpu, 2.9.0.dev20250704+cpu, 2.9.0.dev20250705+cpu, 2.9.0.dev20250706+cpu, 2.9.0.dev20250707+cpu, 2.9.0.dev20250708+cpu, 2.9.0.dev20250709+cpu, 2.9.0.dev20250710+cpu, 2.9.0.dev20250711+cpu, 2.9.0.dev20250712+cpu, 2.9.0.dev20250713+cpu, 2.9.0.dev20250714+cpu, 2.9.0.dev20250715+cpu, 2.9.0.dev20250716+cpu, 2.9.0.dev20250717+cpu, 2.9.0.dev20250718+cpu, 2.9.0.dev20250719+cpu, 2.9.0.dev20250720+cpu, 2.9.0.dev20250722+cpu, 2.9.0.dev20250723+cpu, 2.9.0.dev20250724+cpu, 2.9.0.dev20250725+cpu, 2.9.0.dev20250726+cpu, 2.9.0.dev20250727+cpu, 2.9.0.dev20250728+cpu, 2.9.0.dev20250729+cpu, 2.9.0.dev20250730+cpu, 2.9.0.dev20250731+cpu)
#62 72.30 ERROR: No matching distribution found for torch==2.8.0.dev20250601
```

The executorch hash update currently fails due to https://github.com/pytorch/pytorch/actions/runs/16636773244/job/47079169392
```
2025-07-31T01:56:57.0249165Z + echo 'expecting triton to not be installed, but it is'
2025-07-31T01:56:57.0249614Z expecting triton to not be installed, but it is
2025-07-31T01:56:57.0249969Z + exit 1
2025-07-31T01:58:27.6764352Z ##[error]Final attempt failed. Child_process exited with error code 1
```
I believe the cause is https://github.com/pytorch/executorch/pull/11653 where the nightly pytorch is installed from our index, but then requirements-examples installs timm from pypi, which reinstalls pytorch, except its the release build for cuda from pypi?  Which then causes triton to be installed.

I don't know what the intended behavior is so I'm disabling the executorch docker build, executorch build, and the nightly hash update, and apparently the test was already disabled because it was failing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159595
Approved by: https://github.com/malfet
2025-08-01 02:18:03 +00:00
2ac45c2752 Fix autocast context manager when there is exception (#159565)
Summary: When exception occurs inside context manager, we need to either return False OR properly propagage exceptions via __exit__(exc_type, exc_val). But previously while tracing, we don't actually run the exit node so we end up swallowing the exception in a very weird way as outlined in https://github.com/pytorch/pytorch/issues/153202. This PR fixes it

Test Plan:
new test case

Rollback Plan:

Differential Revision: D79348382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159565
Approved by: https://github.com/zou3519, https://github.com/yushangdi
2025-08-01 02:12:24 +00:00
83e2ea8135 [CPU] fix _weight_int8pack_mm with large output shape (#158341)
**Summary**
`_weight_int8pack_mm` on CPU may cause segmentation fault if output shape is large (i.e., M * N is large). It's because the kernel compute output buffer address by
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.

**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
2025-08-01 01:55:48 +00:00
d994027a41 [Doc fix] fix spelling of enough (#159587)
fixes typo in word `enought` to correct `enough` at 3 places in these files
```
aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu
aten/src/ATen/native/cuda/CuFFTPlanCache.h
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159587
Approved by: https://github.com/ezyang
2025-08-01 01:50:57 +00:00
cb4f41e125 Revert "[dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#157566)"
This reverts commit 8e07c9870d07c5a318ab21bb16b3fa27576851e6.

Reverted https://github.com/pytorch/pytorch/pull/157566 on behalf of https://github.com/yangw-dev due to failed an odd internal test, please reach out to metamate to fix it, D79112610 ([comment](https://github.com/pytorch/pytorch/pull/157566#issuecomment-3141840110))
2025-08-01 01:27:45 +00:00
690fc9cf88 [merge_rules] add some expected failure and skips (#159581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159581
Approved by: https://github.com/anijain2305
2025-08-01 01:18:40 +00:00
eb853e222b [cutlass upgrade] Ignore unused-but-set-variable for AsyncMM.cu (#159578)
Fixes inductor-perf-nightly-h100. This was caused by cutlass upgrade https://github.com/pytorch/pytorch/pull/158854. I missed it in https://github.com/pytorch/pytorch/pull/159276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159578
Approved by: https://github.com/Skylion007
2025-08-01 00:10:59 +00:00
06395276e4 Remove dynamo_timed from the CachingAutotuner.coordinate_descent_tuning() hot path. (#159588)
Summary: When coordinate_descent_tuning==True, CachingAutotuner.coordinate_descent_tuning() is called for every call of CachingAutotuner.run() (at least for Triton templates), but immediately returns the launcher. Move the dynamo_timed call after the check for triton template so we don't incur the context manager overhead on every call.

Fixes https://github.com/pytorch/pytorch/issues/159525

Test Plan: Used the repro in https://github.com/pytorch/pytorch/issues/159525 to make sure the overhead goes away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159588
Approved by: https://github.com/eellison
2025-07-31 23:33:10 +00:00
8becf646ef [dynamo] Make filter handle None as filter function (#159500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159500
Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519
ghstack dependencies: #158774, #159102
2025-07-31 23:28:57 +00:00
fa68216ca1 [itertools] Implement itertools.cycle with a polyfill (#159102)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159102
Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519
ghstack dependencies: #158774
2025-07-31 23:28:57 +00:00
25ef3d315d [aoti][mps] Dynamic reductions (#159355)
Dynamic kernel:
```cpp
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    constant long& r0_numel,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    int x0 = xindex;
    threadgroup float tmp_acc_0[32];
    float tmp_acc_1 = 0;
    for(auto r0_1_cnt = 0; r0_1_cnt < static_cast<int>(metal::floor(static_cast<float>(0.99902343750000000 + 0.00097656250000000000*r0_numel))); ++r0_1_cnt) {
        int r0_1 = 1024 * r0_1_cnt + r0_index;
        if (r0_1 >= r0_numel) break;
        auto tmp0 = in_ptr0[x0 + 5*r0_1];
        tmp_acc_1 += tmp0;
    }
    auto tmp1 = c10:🤘:threadgroup_sum(tmp_acc_0, tmp_acc_1, r0_index * 1, metal::min(static_cast<decltype(1024+r0_numel)>(1024), static_cast<decltype(1024+r0_numel)>(r0_numel)));
    if (r0_index == 0) out_ptr0[x0] = static_cast<float>(tmp1);
}

void AOTInductorModel::run_impl(...) {
    ...
    auto arg0_1_size = arg0_1.sizes();
    int64_t s77 = arg0_1_size[0];
    inputs.clear();
    [[maybe_unused]] auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
    static constexpr int64_t int_array_0[] = {5LL, };
    static constexpr int64_t int_array_1[] = {1LL, };
    AtenTensorHandle buf0_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_float32, cached_torch_device_type_mps, this->device_idx_, &buf0_handle));
    RAIIAtenTensorHandle buf0(buf0_handle);
    auto mps_lib_0_func = mps_lib_0.getKernelFunction("generated_kernel");
    auto mps_lib_0_func_handle = AOTIMetalKernelFunctionHandle(mps_lib_0_func.get());
    mps_lib_0_func->runCommandBlock([&] {
        mps_lib_0_func->startEncoding();
        aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 0, buf0);
        aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 1, arg0_1);
        aoti_torch_mps_set_arg_int(mps_lib_0_func_handle, 2, s77);
        mps_lib_0_func->dispatch({static_cast<uint64_t>(5LL), static_cast<uint64_t>(std::min(static_cast<int64_t>(1024LL), static_cast<int64_t>(s77)))}, {static_cast<uint64_t>(1), static_cast<uint64_t>(std::min(static_cast<int64_t>(1024LL), static_cast<int64_t>(s77)))});

    });
    arg0_1.reset();
    output_handles[0] = buf0.release();
} // AOTInductorModel::run_impl
```

Static kernel:
```cpp
kernel void generated_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
    int x0 = xindex;
    auto tmp0 = in_ptr0[x0];
    auto tmp1 = in_ptr0[5 + x0];
    auto tmp3 = in_ptr0[10 + x0];
    auto tmp5 = in_ptr0[15 + x0];
    auto tmp2 = tmp0 + tmp1;
    auto tmp4 = tmp2 + tmp3;
    auto tmp6 = tmp4 + tmp5;
    out_ptr0[x0] = static_cast<float>(tmp6);
}

void AOTInductorModel::run_impl(...) {
    ...
    static constexpr int64_t int_array_0[] = {5LL, };
    static constexpr int64_t int_array_1[] = {1LL, };
    AtenTensorHandle buf0_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_float32, cached_torch_device_type_mps, this->device_idx_, &buf0_handle));
    RAIIAtenTensorHandle buf0(buf0_handle);
    auto mps_lib_0_func = mps_lib_0.getKernelFunction("generated_kernel");
    auto mps_lib_0_func_handle = AOTIMetalKernelFunctionHandle(mps_lib_0_func.get());
    mps_lib_0_func->runCommandBlock([&] {
        mps_lib_0_func->startEncoding();
        aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 0, buf0);
        aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 1, arg0_1);
        mps_lib_0_func->dispatch({static_cast<uint64_t>(5LL)});

    });
    arg0_1.reset();
    output_handles[0] = buf0.release();
} // AOTInductorModel::run_impl
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159355
Approved by: https://github.com/malfet
2025-07-31 23:15:02 +00:00
7e00f2ec9d [AOTI] add zero size consts asm handler (#159225)
Add `get_zero_consts_asm_code` to handle zero size consts to object.
This function is used to handle zero consts situation. Because cpp standard does not allow zero size array:
https://stackoverflow.com/questions/9722632/what-happens-if-i-define-a-0-size-array-in-c-c
1. On Windows, MSVC will report error C2466:
https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2466?view=msvc-170
So, we can use assmbely compiler to handle this situation.
2. On Windows, why not use Win32 asm to handle all path? Because ml64 only supports up to align `16`, it is
not aligned to pytorch's `64`. Reference: https://learn.microsoft.com/en-us/cpp/assembler/masm/ml-and-ml64-command-line-reference?view=msvc-170
```
Packs structures on the specified byte boundary. The alignment can be 1, 2, 4, 8, or 16.
```
3. It function can handle zero size case on both Windows and Linux, as that:
    A. On Linux, we added `-pedantic` to disable zero size array on C++ compiler. 8e07c9870d/torch/_inductor/cpp_builder.py (L580)
    B. On Windows, msvc is not support zero size array by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159225
Approved by: https://github.com/desertfire
2025-07-31 22:46:33 +00:00
490cb3f1a4 Revert "[inductor] Add logging for distributed collective ops for multi‑rank diagnostics (#159190)"
This reverts commit bb62e1f769ef51e2ec149d7256c135d09425aaa0.

Reverted https://github.com/pytorch/pytorch/pull/159190 on behalf of https://github.com/clee2000 due to broke [GH job link](https://github.com/pytorch/pytorch/actions/runs/16658705097/job/47150840171) [HUD commit link](bb62e1f769) on mac ([comment](https://github.com/pytorch/pytorch/pull/159190#issuecomment-3141513921))
2025-07-31 22:22:13 +00:00
b95cf5c91d Move complex to headeronly (#159411)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159411
Approved by: https://github.com/albanD
ghstack dependencies: #159415
2025-07-31 22:05:43 +00:00
5e2ef2a465 Move Float8 variations to headeronly (#159415)
This PR is a big copy pasta from `c10/util/Float8*` -> `torch/headeronly/util/` which is why we are breaking PR sanity :C (sorry @albanD!).

Why is it not a clean copy paste?
- For BC reasons, we have to keep the old c10 file around so that OSS devs relying on those files can still get the same APIs
- Because we reexpose APIs that are headeronly through torch::headeronly, so there is an extra chunk of code in the new torch::headeronly files to do that.

Outside of the copy paste, I:
- changed the tests to call torch::headeronly instead of c10
- updated header_only_apis.txt
- added `// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)` to pass lint (which was previously skipped for -inl.h files)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159415
Approved by: https://github.com/albanD
2025-07-31 22:05:43 +00:00
9f753f8c0d [DTensor] Improve sort strategy (#159189)
- Sort strategy now supports sharding on non sorted dim.
~~- Fix histc xfail.~~
  - ~~Previously `python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32` will fail with `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18`. However, if we run `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18 python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32`, the test will pass. This kind of error is due to DTensor reuses the strategy schema hashing. It turns out that not only the strategy,  the result correctness also depends on `static_argnum` or the op will reuse the previous args from hashed schema and output wrong results. I updated the document also.~~ (fixed in https://github.com/pytorch/pytorch/pull/159289)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159189
Approved by: https://github.com/XilunWu
2025-07-31 21:52:42 +00:00
db437690d1 Add myself as a reviewer for when someone touches headeronly or stable (#159583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159583
Approved by: https://github.com/mikaylagawarecki
2025-07-31 21:30:05 +00:00
669009bcd1 [inductor] respect layout tags for ops with registered lowerings (#159134)
scaled_grouped_mm's kernel only supports column-major on the second operand. I -think- this is just for efficiency reasons. But inductor treats that buffer as flexible and may tweak the strides to be row-major instead, as seen in the issue.

~Tagging the op as "needs_fixed_stride_order"/"needs_exact_strides" does not work. Inductor only considers those tags for ops that don't have registered lowering (not sure if this is intended). scaled_grouped_mm does have a lowering, so we never check its tags.~ From discussion below, the op tags are expected to work.

FIXES https://github.com/pytorch/pytorch/issues/159097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159134
Approved by: https://github.com/eellison
2025-07-31 21:29:40 +00:00
e4e2701429 Add the RunLLM widget to the website (#152055)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152055
Approved by: https://github.com/albanD
2025-07-31 20:53:53 +00:00
64cc649275 [itertools] Fix accumulate (#158774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158774
Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519
2025-07-31 20:32:02 +00:00
b1fb552974 Revert "Fix ep deepcopy when there is python builitin name (#159478)"
This reverts commit de7376537f2a11783169fee2b3bc276d266898bf.

Reverted https://github.com/pytorch/pytorch/pull/159478 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/159478#issuecomment-3141228423))
2025-07-31 20:20:53 +00:00
bb62e1f769 [inductor] Add logging for distributed collective ops for multi‑rank diagnostics (#159190)
This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs.

- Iterates over scheduler.nodes, filters for _CollectiveKernel nodes
- Extracts each op’s python_kernel_name
- Emits a structured JSON payload under the inductor_collective_schedule artifact name
- Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact
- Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190
Approved by: https://github.com/yushangdi, https://github.com/xmfan
2025-07-31 19:58:07 +00:00
327e2ca580 [ez] get rid of unused var (#159571)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D79320299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159571
Approved by: https://github.com/houseroad, https://github.com/georgiaphillips
2025-07-31 19:11:57 +00:00
1ebcba4e1b Fix typo in link to torch memory_viz tool (#159214)
Fixes a small typo in the torch_cuda_memory docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159214
Approved by: https://github.com/yewentao256, https://github.com/HDCharles, https://github.com/Skylion007
2025-07-31 18:50:54 +00:00
5f7eae697d Deprecate DataLoader pin_memory_device param (#158323)
Build on top of https://github.com/pytorch/pytorch/pull/146821

- Moves enabling pin_memory back inside `_BaseDataLoaderIter`
  - This is required for `StatefulDataloader` which leveraged  `_BaseDataLoaderIter` directly and not the `Dataloader` class init
- Add a simple test for CPU only env where setting `pin_memory=True` is a no-op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158323
Approved by: https://github.com/ramanishsingh

Co-authored-by: zeshengzong <zesheng.zong@outlook.com>
2025-07-31 18:42:07 +00:00
c1722db0f7 [NativeRT] Make VariadicOpConverter and FuseListUnpackConverter for cpu nodes only (#159519)
Summary:
VariadicOpConverter and FuseListUnpackConverter would introduce ops that only have CPU kernels.

Currently, the graph passes are ran if static_dispatch is enabled.

As we plan to enable static_dispatch by default, this diff add the additional check for the graph pass to only work on the node that has all the inputs/outputs on CPU.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79295640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159519
Approved by: https://github.com/dolpm, https://github.com/henryoier
2025-07-31 18:17:21 +00:00
8a233d6000 Revert "[ContextParallel][FlexAttention] Prototype of supporting FlexAttention in Context Parallel (#158692)"
This reverts commit 07fad04181321d18963b71e9566d44f86a25c9f7.

Reverted https://github.com/pytorch/pytorch/pull/158692 on behalf of https://github.com/yangw-dev due to failed some internal testapf.metrics.tests.generate_graph_def_test.GenerateGraphDefTest: test_aps_generate_inference_graph_def_with_justknobs1) AssertionError: Expected 'check' to be called once. Called 3 times., please fix the internal test and reland it ([comment](https://github.com/pytorch/pytorch/pull/158692#issuecomment-3140873894))
2025-07-31 18:00:30 +00:00
bf3ebd7ad4 Fix grouped MM load along K when TMA loads are not used (#159485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159485
Approved by: https://github.com/ngimel
2025-07-31 17:58:02 +00:00
c07bb277a0 Revert "fix strategy hashing arg mismatch (#159506)"
This reverts commit 3a556762002ec0027b2120a7e6675182c0e50dbd.

Reverted https://github.com/pytorch/pytorch/pull/159506 on behalf of https://github.com/yangw-dev due to failed the internal tests test_get_bwd_hook (torch.equal(output * 2, input_tensor.grad)) ([comment](https://github.com/pytorch/pytorch/pull/159506#issuecomment-3140858905))
2025-07-31 17:54:29 +00:00
f89c28cc6b [inductor] add lowering for repeat_interleave.Tensor with output size specified (#147160) (#158462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158462
Approved by: https://github.com/eellison
2025-07-31 17:00:32 +00:00
8fedcfa59a [export] _ccode for PythonMod (#158851)
Summary: Adds ccode impl to PythonMod

Test Plan:
test_export

Rollback Plan:

Differential Revision: D76463347

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158851
Approved by: https://github.com/kalpit-meta-1
2025-07-31 16:46:51 +00:00
6662a76f59 [cutlass backend] Fix EVT tests post buf name change (#159541)
Differential Revision: [D79317791](https://our.internmc.facebook.com/intern/diff/D79317791/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159541
Approved by: https://github.com/mlazos
2025-07-31 16:39:49 +00:00
eqy
05aade1b6d [CUDA] Add serialTest decorator to largeTensorTest in test_cuda.py (#159271)
Hopefully helps with disabled tests due to OOM such as https://github.com/pytorch/pytorch/issues/159069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159271
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-07-31 16:27:16 +00:00
f946b25865 [MPS] Speedup argmax/argmin (#159524)
By using efficient `threadgroup_arg[max|min]` primitives.
- Fixed bug in `simd_argmax` when result of the `simd_ballot` were prematurely cast to `ushort` and adjusted unit test
- Fixed nan handling in compiled argmax, but can't reliably test it as MPS(eager) implementaiton of argmax is buggy

Now according to `bench_mps_ops.py` `max(x, dim=0)` is reliably faster than eager implementaiton:
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float16)  |      285.8      |       272.2       |       422.3       |        354.5        |       721.6       |        683.5        |       2224.0      |        1979.1
      max (torch.float32)  |      300.2      |       267.0       |       389.6       |        342.5        |       769.4       |        682.6        |       2995.7      |        2609.8
      max (torch.int32)    |      299.6      |       275.4       |       390.0       |        361.7        |       758.7       |        686.1        |       3103.4      |        2646.5
      max (torch.int64)    |      297.5      |       275.5       |       417.0       |        382.1        |       856.1       |        722.6        |       5467.7      |        3156.8

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159524
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #158990
2025-07-31 16:18:32 +00:00
d2e02585b8 [AOTI] Explicitly delete wait_tensor returned tensor (#159502)
Summary: In the Python wrapper codegen, the returned tensor from wait_tensor is not assigned or used anywhere, because wait_tensor always returns its input, see more discussion in https://github.com/pytorch/pytorch/issues/126773. Similarly, we should just immediately delete the returned tensor handle from aoti_torch_cpu__c10d_functional_wait_tensor in the cpp wrapper codegen, otherwise it may cause tensor's lifetime expansion and even cause OOM in some cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159502
Approved by: https://github.com/yushangdi, https://github.com/jingsh
ghstack dependencies: #159476, #159487
2025-07-31 15:33:36 +00:00
3dd7ebf418 [BE] Fix buf name mismatch in test_c10d_functional_native.py (#159487)
Summary: test_c10d_functional_native.py uses hard-coded buf names to check the generated code string. This is fragile given that Inductor can update its buffer naming implementation freely. Thus this PR uses name regex matching to find buffer names at the run time. This will solve issues like https://github.com/pytorch/pytorch/issues/147754. Currently we do name matching based on empty_strided_ calls. We can expand it later if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159487
Approved by: https://github.com/yushangdi
ghstack dependencies: #159476
2025-07-31 15:33:36 +00:00
8273ee0646 [BE] Fix global config leak in test_c10d_functional_native.py (#159476)
Summary: test_c10d_functional_native.py tests torch._inductor.config.cpp_wrapper as True and False. Currently torch._inductor.config.cpp_wrapper is set globally which can cause a problem when running the whole test file. This PR changes it to use patch context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159476
Approved by: https://github.com/yushangdi
2025-07-31 15:33:36 +00:00
c57382a493 Move BFloat16.h to headeronly (#159412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159412
Approved by: https://github.com/desertfire
2025-07-31 15:29:17 +00:00
e7cc42df58 [inductor] consolidate common GEMM triton param retrieval (#159383)
\# Why

- Make loop iteration simpler
- Have a common spot where to make modifications that affect
  all the GEMM Triton templates, avoiding missed spots

\# What

- pull out commong logic of taking the BaseConfig objects
  and turning them into kwargs to feed into maybe_append_choice
  for Triton GEMM templates

Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383
Approved by: https://github.com/jansel
2025-07-31 13:05:04 +00:00
cyy
72c69e731f set MSVC debug information only on debug builds (#159533)
Fixes: https://github.com/pytorch/pytorch/issues/159515
To reduce the binary size increment in release builds by removing debug information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159533
Approved by: https://github.com/atalman
2025-07-31 12:57:33 +00:00
78b9dea754 [inductor] Fix set_linter's handling of f-strings for Python 3.12 and up (fix #159056) (#159252)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159252
Approved by: https://github.com/Skylion007
2025-07-31 12:56:09 +00:00
838924436e update the baseline for nightly max_autotune tests (#154973)
Hi @desertfire, according to the latest test [results](https://github.com/pytorch/pytorch/actions/runs/15385952839) from the inductor nightly for max_autotune tests, we plan to update the baseline data:

In the latest nightly test, two models require baseline updates:

- vision_maskrcnn: This model shows improved graph breaks, so I’ve updated the baseline accordingly.
- detectron2_fcos_r_50_fpn: This model has a different number of graph breaks. However, since its accuracy result still shows fail_accuracy, so I skipped the graph break check for this model.

```
vision_maskrcnn                     IMPROVED:           graph_breaks=29, expected=30
Improvement: 1 models have fixed dynamo graph breaks:
    vision_maskrcnn
```

```
detectron2_fcos_r_50_fpn            XFAIL
detectron2_fcos_r_50_fpn            FAIL:               graph_breaks=24, expected=22
Error: 1 models have new dynamo graph breaks:
    detectron2_fcos_r_50_fpn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154973
Approved by: https://github.com/desertfire
2025-07-31 11:38:55 +00:00
2ffb510942 [Break XPU][Indutor UT] Fix failures introduced by community. (#159463)
Fixes #159000, Fixes #159335, Fixes #159334, Fixes #159332, Fixes #159331, Fixes #159330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159463
Approved by: https://github.com/jansel
2025-07-31 08:37:41 +00:00
20b5f694f8 [Dynamo] Make frozen dataclasses hashable (#159529)
Fixes https://github.com/pytorch/pytorch/issues/159424

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159529
Approved by: https://github.com/oulgen
ghstack dependencies: #159513
2025-07-31 07:03:01 +00:00
447e300d55 [Dynamo] Frozen dataclass attr access test (#159513)
Verifies https://github.com/pytorch/pytorch/issues/159424, but perhaps the issue is not fixed yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159513
Approved by: https://github.com/oulgen
2025-07-31 07:03:01 +00:00
5b2ad9279c [draft export] logging (#159004)
Summary: adds logging for draft export

Test Plan:
loggercli stage actualize-stage TorchDraftExportUsageLoggerConfig

Rollback Plan:

Differential Revision: D78308105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159004
Approved by: https://github.com/angelayi
2025-07-31 05:52:13 +00:00
78d7f0cdec disable execution frame cleanup (#159531)
Summary: Want to disable execution frame cleanup until fix in D78621408 is merged

Test Plan:
CI

Rollback Plan:

Differential Revision: D79306602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159531
Approved by: https://github.com/SherlockNoMad
2025-07-31 05:02:36 +00:00
d5c719ec3c [inductor] fix open temp file failed on Windows. (#159342)
Fix open temp file failed on Windows. Error message:
<img width="1181" height="239" alt="image" src="https://github.com/user-attachments/assets/e4a6f438-cb06-44c6-959b-0a6a49d2f44f" />

Here two option to fix this issue: https://stackoverflow.com/questions/66744497/python-tempfile-namedtemporaryfile-cant-use-generated-tempfile
1. `tempfile.NamedTemporaryFile` must setup `delete=False` on Windows
2. Use `WritableTempFile` to handle this case on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159342
Approved by: https://github.com/jansel
2025-07-31 04:58:02 +00:00
c44efc3755 [Refactor] Fix Compile Warning: possibly dangling reference to a temporary (#159517)
```bash
DEBUG pytorch/torch/csrc/dynamo/compiled_autograd.h:1388:25: warning: possibly dangling reference to a temporary [-Wdangling-reference]
DEBUG  1388 |     for (const at::IValue& elt : lst) {
DEBUG       |                         ^~~
DEBUG pytorch/torch/csrc/dynamo/compiled_autograd.h:1388:1: note: the temporary was destroyed at the end of the full expression ‘__for_begin .c10::impl::ListIterator<c10::IValue, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator*().c10::impl::ListElementReference<c10::IValue, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator std::conditional_t<true, const c10::IValue&, c10::IValue>()’
DEBUG  1388 |     for (const at::IValue& elt : lst) {
DEBUG       | ^
```

This PR fixes this warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159517
Approved by: https://github.com/xmfan
2025-07-31 04:49:43 +00:00
6b9473469f [Graph Partition] add log for graph partition reasons and #partitions (#159425)
Previously, we log `skipping cudagraphs due to [xxx reasons]` when there are cudagraph-unsafe ops. With graph partition, we will split off these ops and cudagraph remaining parts. But the log message is also skipped.

In this PR, we add logs for graph partition reasons and the number of partitions to better understand the workload.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159425
Approved by: https://github.com/eellison
2025-07-31 04:21:06 +00:00
7a4167a164 support fabric handles with symmetric memory (#159319)
enable fabric handles for symmetric memory

Enables handle exchange via CU_MEM_HANDLE_TYPE_FABRIC on the systems that support it. This is needed to enable symmetric memory on NVLS72 systems.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159319
Approved by: https://github.com/malfet, https://github.com/kwen2501
2025-07-31 04:16:20 +00:00
8e67a6ae89 [vllm hash update] update the pinned vllm hash (#159320)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159320
Approved by: https://github.com/pytorchbot
2025-07-31 04:08:14 +00:00
c68ad1bd6a [dynamo][guards] Always record user.stack for informative tlparse guards (#159526)
Before
<img width="1146" height="280" alt="image" src="https://github.com/user-attachments/assets/4ddb11b2-dec8-4010-a28d-63b3cd4a7929" />

After
<img width="1248" height="248" alt="image" src="https://github.com/user-attachments/assets/8aafc5be-92cd-4468-bb8f-ad966de8c717" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159526
Approved by: https://github.com/Lucaskabela
2025-07-31 03:18:33 +00:00
3e5e094615 Revert "Fix large_tensor_test skipping cpu (#158617)"
This reverts commit debc0591b888f211bfe846bdc7cfa0626a5f6f6a.

Reverted https://github.com/pytorch/pytorch/pull/158617 on behalf of https://github.com/ZainRizvi due to Sorry but this seems to be breaking trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/16631113381/job/47062415099) [HUD commit link](debc0591b8) ([comment](https://github.com/pytorch/pytorch/pull/158617#issuecomment-3138387762))
2025-07-31 02:57:22 +00:00
clr
c65efc8ea1 torch.compile: Record a pt2_compile_event for combo kernels (#159306)
This is off by default, but some jobs have it on. Having this show up in
perfetto and be globally queryable would be useful to see how expensive this
is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159306
Approved by: https://github.com/masnesral
2025-07-31 02:51:38 +00:00
a9049413e2 [dynamo] Turn on recursive dict tag optimization (#159186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159186
Approved by: https://github.com/jansel
2025-07-31 02:36:37 +00:00
d7a5ec9355 Fix the Doc of padding in avg_poolnd (#159142)
Fixes #159141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159142
Approved by: https://github.com/mikaylagawarecki
2025-07-31 02:02:48 +00:00
2c46922ce4 Fix rand_like decomposition to preserve strides (#159294)
Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898.

Test Plan: New unit test (fails before this PR; but fixed after)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294
Approved by: https://github.com/eellison
2025-07-31 01:36:50 +00:00
668d414ae7 [CPU] Fix bias dtype issue for FP8 qlinear (#159125)
Fixes
`RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Float`

With bf16 autocast, bias converted into BFloat16, but fp8_qlinear_onednn_ref not support bf16 bias.
In this pr, convert bias into bf16 on fp8_qlinear_onednn_ref.

Add this case into ut and reproduce:
`python test/test_quantization.py -k test_qlinear_fp8`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159125
Approved by: https://github.com/Xia-Weiwen, https://github.com/cyyever, https://github.com/CaoE
2025-07-31 01:26:45 +00:00
4541509237 [Triton] [Inductor] Fix an incorrect descriptor (#159407)
Summary: Fixes a clear template typo where `a_desc_ptr` was passed instead of `b_desc_ptr` to define `b_desc`.

Test Plan:
Found by inspection.

Rollback Plan:

Reviewed By: NoamPaz

Differential Revision: D79178538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159407
Approved by: https://github.com/NikhilAPatel
2025-07-31 00:34:19 +00:00
6c7f88c2c9 Check addmm dtypes (#159509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159509
Approved by: https://github.com/eqy
2025-07-31 00:15:46 +00:00
c400c8e2e0 [ROCm] Add FP8 rowwise support to _scaled_grouped_mm + Submodule update (#159075)
Summary:

In this PR we integrate the [FBGEMM AMD FP8 rowwise scaling grouped GEMM kernel](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped) to add support for the `_scaled_grouped_mm` API on AMD. `_scaled_grouped_mm` is [currently supported on Nvidia](9faef3d17c/aten/src/ATen/native/cuda/Blas.cpp (L1614)), this PR aims to bring parity to AMD. Related: [[RFC]: PyTorch Low-Precision GEMMs Public API](https://github.com/pytorch/pytorch/issues/157950#top) #157950.

The kernel is developed using the Composable Kernel framework. Only MI300X is currently supported. In the near future we plan to add support for MI350X as well. For data types we support FP8 e3m4.

The kernel support will be gated with the `USE_FBGEMM_GENAI` flag. We hope to enable this by default for relevant AMD builds.

Note we also update submodule `third_party/fbgemm` to 0adf62831 for the required updates from fbgemm.

Test Plan:

**Hipify & build**
```
python tools/amd_build/build_amd.py
USE_FBGEMM_GENAI=1 python setup.py develop
```

**Unit tests**
```
python test/test_matmul_cuda.py -- TestFP8MatmulCUDA
Ran 488 tests in 32.969s
OK (skipped=454)
```

**Performance Sample**
| G  | M | N | K | Runtime Ms | GB/S | TFLOPS |
| --  | -- | -- | -- | -- | -- | -- |
| 128 | 1 | 2048 | 5120 | 0.37| 3590 | 7.17 |
| 128 | 64 | 2048 | 5120 | 0.51| 2792 | 338.34 |
| 128 | 128 | 2048 | 5120 | 0.66| 2272 | 522.72 |
| 128 | 1 | 5120 | 1024 | 0.21| 3224 | 6.43 |
| 128 | 64 | 5120 | 1024 | 0.29| 2590 | 291.40 |
| 128 | 128 | 5120 | 1024 | 0.40| 2165 | 434.76 |
| 128 | 1 | 4096 | 4096 | 0.69| 3126 | 6.25 |
| 128 | 64 | 4096 | 4096 | 0.85| 2655 | 324.66 |
| 128 | 128 | 4096 | 4096 | 1.10| 2142 | 501.40 |
| 128 | 1 | 8192 | 8192 | 2.45| 3508 | 7.01 |
| 128 | 64 | 8192 | 8192 | 3.27| 2692 | 336.74 |
| 128 | 128 | 8192 | 8192 | 4.04| 2224 | 543.76 |
| 16 | 1 | 2048 | 5120 | 0.04| 3928 | 7.85 |
| 16 | 64 | 2048 | 5120 | 0.05| 3295 | 399.29 |
| 16 | 128 | 2048 | 5120 | 0.07| 2558 | 588.69 |
| 16 | 1 | 5120 | 1024 | 0.03| 3119 | 6.23 |
| 16 | 64 | 5120 | 1024 | 0.03| 2849 | 320.62 |
| 16 | 128 | 5120 | 1024 | 0.05| 2013 | 404.11 |
| 16 | 1 | 4096 | 4096 | 0.06| 4512 | 9.02 |
| 16 | 64 | 4096 | 4096 | 0.09| 3124 | 381.95 |
| 16 | 128 | 4096 | 4096 | 0.13| 2340 | 547.67 |
| 16 | 1 | 8192 | 8192 | 0.32| 3374 | 6.75 |
| 16 | 64 | 8192 | 8192 | 0.42| 2593 | 324.28 |
| 16 | 128 | 8192 | 8192 | 0.53| 2120 | 518.36 |

- Using ROCm 6.4.1
- Collected through `triton.testing.do_bench_cudagraph`

**Binary size with gfx942 arch**
Before: 116103856 Jul 23 14:12 build/lib/libtorch_hip.so
After:  118860960 Jul 23 14:29 build/lib/libtorch_hip.so
The difference is 2757104 bytes (~2.6 MiB).

Reviewers: @drisspg @ngimel @jwfromm @jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159075
Approved by: https://github.com/drisspg
2025-07-30 23:53:58 +00:00
25c3a7e317 [CUDA][CUDA Graphs] Move cuda graphs test to subprocess to avoid polluting mempool tests (#159305)
Otherwise mempool test will fail as the previous graph capture failed but doesn't have its state in the caching allocator fully cleaned up. See also #159301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159305
Approved by: https://github.com/eellison, https://github.com/BoyuanFeng, https://github.com/naromero77amd
2025-07-30 23:31:38 +00:00
de7376537f Fix ep deepcopy when there is python builitin name (#159478)
Summary: title

Test Plan:
CI

Rollback Plan:

Differential Revision: D79261007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159478
Approved by: https://github.com/pianpwk
2025-07-30 23:14:31 +00:00
fd2c64e286 Fix duplicated sources in inductor provenance tracking (#159484)
Summary:

The `replace_hook` is called once for each user of the replaced node. This fix avoids adding duplicated node sources.

This also means that if there are two nested pass like:

```
with GraphTransformObserver(gm, "outer"):
      with GraphTransformObserver(gm, "inner"):
              .....
```

We'll only see the outer pass's pass name recorded for the replaced node in the "from_node" node meta. I think this is fine. In practice, the outer pass usually contains a more meaningful name, e.g. `decompose_auto_functionalized`, and the inner pass name is just a default pass name like `pattern_matcher`.

Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer_replace
```

Rollback Plan:

Differential Revision: D79203058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159484
Approved by: https://github.com/angelayi
2025-07-30 23:03:11 +00:00
2b1ae29960 [Dynamo][Better Engineering] Add typing annotations to guard and source (#158397) (#159491)
Summary:
X-link: https://github.com/pytorch/executorch/pull/12986

As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py`

Running
```
mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1227 | 2208 | 55.57% | 207 | 362 | 57.18% |
| This PR | 2217 | 2217 | 100.00% | 362 | 362 | 100.00% |
| Delta    | +990 | +9 | +44.43% | +155 | 0 | +42.82% |

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 jerryzh168 voznesenskym penguinwu EikanWang Guobing-Chen zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

Rollback Plan:

Reviewed By: JacobSzwejbka, yangw-dev

Differential Revision: D79199389

Pulled By: Lucaskabela

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159491
Approved by: https://github.com/anijain2305, https://github.com/yangw-dev
2025-07-30 22:57:50 +00:00
1293405c8d [MPS] Add simd_[arg][max|min] (#158990)
And add eager tests for those.
Re-implement `threadgroup_[max|min]` using those function as they are significantly faster (though much slower than eager, due to the arg part) than before, which could be verified by running the following script
```python
import itertools
import timeit
import torch
from torch.utils.benchmark import Compare, Measurement, Timer

def bench_unary_op(func, x, label) -> Measurement:
    sync_cmd = "torch.mps.synchronize()" if "mps" in str(x.device) else ""
    t = Timer(
        stmt=f"f(x);{sync_cmd}",
        globals={"f": func, "x": x},
        language="python",
        timer=timeit.default_timer,
        sub_label=f"{func.__name__} ({str(x.dtype)})",
        description=label,
        env=torch.__version__,
    )
    return t.blocked_autorange()

def bench_reduction(
    reduction_func, device: str = "mps", dtype: torch.dtype = torch.float32
) -> list[Measurement]:
    rc = []

    # Bench 2D with reduction over dim=0
    def f(t):
        return reduction_func(t, dim=0)[0]

    f.__name__ = reduction_func.__name__
    f_c = torch.compile(f, dynamic=False, fullgraph=True)

    for size in (512, 1024, 2048, 4096):
        x = torch.testing.make_tensor(size, size, device=device, dtype=dtype)
        rc_c, rc_e = f(x), f_c(x)
        rc_c, rc_e = (rc_c[0], rc_e[0]) if isinstance(rc_c, tuple) else (rc_c, rc_e)
        rc.append(bench_unary_op(f, x, f"eager-{size}x{size}"))
        rc.append(bench_unary_op(f_c, x, f"compile-{size}x{size}"))
    return rc

def main() -> None:
    #dtypes = [torch.float16, torch.float32, torch.bfloat16, torch.int32, torch.int64]
    dtypes = [torch.float32, torch.int32, torch.int64]

    # Profile reduction ops
    rc = []
    for op, dtype in itertools.product([torch.max], dtypes):
        rc.extend(bench_reduction(op, dtype=dtype))
    Compare(rc).print()

if __name__ == "__main__":
    torch._dynamo.config.cache_size_limit = 2**16
    main()
```

Produces the following table before
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float32)  |      297.3      |       531.6       |       394.1       |        2550.5       |       773.0       |        4904.7       |       3647.2      |        9682.0
      max (torch.int32)    |      297.8      |       359.2       |       387.7       |        1179.4       |       768.2       |        2175.0       |       3677.1      |        4495.9
      max (torch.int64)    |      278.7      |       541.4       |       410.2       |        2873.3       |       858.9       |        5620.4       |       6107.2      |       11176.1

Times are in microseconds (us).
```
And after
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float32)  |      307.9      |       265.3       |       401.0       |        340.8        |       766.5       |        661.9        |       3463.5      |        2829.5
      max (torch.int32)    |      293.5      |       263.1       |       405.0       |        338.8        |       761.4       |        672.5        |       3050.0      |        2688.6
      max (torch.int64)    |      308.2      |       255.7       |       417.4       |        341.4        |       877.0       |        695.0        |       5812.2      |        5762.2

```

`argmax`/`argmin` are much tricker due to the nan-handling logic that need to be added there.

Also fixes `torch.max/min` compilation for half-precision types, added regression types for it.

This PR also introduces a bunch of helper functions, such as `simd_broadcast` that works for int64 and `c10:🤘:pair` template, which are used by `simd_argmax` to return both value and index

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158990
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-07-30 21:57:25 +00:00
3a65ff84b6 [dynamo, easy] add comment on skipping sys.monitoring frames (#159493)
Add a comment so we know why we're doing this code (followup to https://github.com/pytorch/pytorch/pull/159369)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159493
Approved by: https://github.com/azahed98, https://github.com/Lucaskabela, https://github.com/zou3519, https://github.com/jingsh
ghstack dependencies: #159369
2025-07-30 21:54:38 +00:00
acf13a9b75 Fix a bug of distributed 'gather' with uncontiguous tensors on the Gloo backend (#158903)
Fixes #158902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158903
Approved by: https://github.com/H-Huang
2025-07-30 21:44:29 +00:00
3a55676200 fix strategy hashing arg mismatch (#159506)
Reland https://github.com/pytorch/pytorch/pull/159289.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159506
Approved by: https://github.com/XilunWu
2025-07-30 21:37:13 +00:00
af39144a93 Don't use torch.backends.cuda.matmul.allow_tf32 in inductor cache key (#159480)
Summary: According to https://github.com/pytorch/pytorch/pull/158209, the API is deprecated and we should be using torch.backends.cuda.matmul.fp32_precision instead.

Fixes https://github.com/pytorch/pytorch/issues/159440

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159480
Approved by: https://github.com/xmfan, https://github.com/oulgen
2025-07-30 21:29:38 +00:00
25343b343e [ATen][CUDA][cuFFT] Guard against deprecated error codes (#159466)
This PR adds a guard based on CUDA version, per latest cuFFT [documentation](https://docs.nvidia.com/cuda/cufft/index.html#return-value-cufftresult):
>The following error codes are deprecated and will be removed in a future release: `CUFFT_INCOMPLETE_PARAMETER_LIST`, `CUFFT_PARSE_ERROR`, `CUFFT_LICENSE_ERROR`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159466
Approved by: https://github.com/albanD, https://github.com/eqy, https://github.com/Skylion007
2025-07-30 21:10:32 +00:00
07fad04181 [ContextParallel][FlexAttention] Prototype of supporting FlexAttention in Context Parallel (#158692)
**Summary**
This PR adds an all-gather based FlexAttention and uses TorchFunctionMode to dispatch
`FlexAttentionHOP.__call__` to it.

This PR makes the following changes:

- add a user-facing API `create_cp_block_mask` for creating CP-specific `BlockMask`
which masks over the attention result of Q shard and KV global.
- add `_ContextParallelGlobalVars` to store all necessary global vars that CP FlexAttention
requires. `torch_function_mode` is critical to maintain singleton mode to avoid dynamo
recompilations.
- add a dispatch path for `FlexAttentionForwardHOP.__call__` (TorchFunctionMode dispatch
won't work correctly without this line)

What's not in this PR:
- QKV load balancing
- Test on other masking besides `causal_mask`.
- Support on small attention (i.e. qkv size is smaller than 128) because the block mask
rewrite function requires `Q_BLOCK_SIZE == KV_BLOCK_SIZE == 128`.

**Test**
`pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention`

**Followup**
1. create an issue to reproduce the error in `create_fw_bw_graph()` when trying to call `create_block_mask`
to re-write `block_mask` in `FlexAttentionHOP` dispatch in `TorchFunctionMode`.
2. Merge `_ContextParallelGlobalVars` and `_cp_options`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158692
Approved by: https://github.com/drisspg
2025-07-30 21:01:53 +00:00
7ac70ac4cd Revert "Fix rand_like decomposition to preserve strides (#159294)"
This reverts commit a3a51282dbabe0220c2c3947a89f7d2ecc514d33.

Reverted https://github.com/pytorch/pytorch/pull/159294 on behalf of https://github.com/yangw-dev due to failed internal build Failed to load config ([comment](https://github.com/pytorch/pytorch/pull/159294#issuecomment-3137796767))
2025-07-30 20:59:19 +00:00
e221a1c853 [Code Motion]Restructure flex attention kernel into flex subdirectory (#159437)
Mostly code motion, updating relative paths, moving some imports that had to be lazy before to top level scope now that we are free from the curse.

This will make it easier to add newer templates and provide some organization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159437
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng, https://github.com/eellison, https://github.com/Skylion007
2025-07-30 20:12:35 +00:00
4defea1e2c [c10d] Fix setGroupName and setGroupDesc in group_split and merge_remote_group (#159429)
Summary:
We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it.

We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when split_group is used in DeviceMesh

Also ncclx needs to be aware of that its Option is a subclass of BackendOption

Test Plan:
CI

Rollback Plan:

Differential Revision: D79201132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429
Approved by: https://github.com/xunnanxu
2025-07-30 19:55:55 +00:00
53d68b95de [ROCm CI] Migrate to MI325 Capacity. (#159059)
This PR moves PyTorch CI capacity from mi300 to a new, larger mi325 cluster. Both of these GPUs are the same architecture gfx942 and our testing plans don't change within an architecture, so we pool them under the same label `linux.rocm.gpu.gfx942.<#gpus>` with this PR as well to reduce overhead and confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159059
Approved by: https://github.com/jithunnair-amd, https://github.com/atalman

Co-authored-by: deedongala <deekshitha.dongala@amd.com>
2025-07-30 19:47:59 +00:00
f74842d57f [PP] Fix zero bubble schedules for eval() (#159475)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159475
Approved by: https://github.com/tianyu-l, https://github.com/Skylion007
2025-07-30 19:46:10 +00:00
644fee2610 Fix TestAutogradFallback flaky tests under Dynamo: migrate to lib._destroy() (#159443)
under dynamo, the libraries couldn't properly be cleared unless we manually did `gc.collect()`, but that's slow. it also worked if we just used the _destroy() method to tear down

FIXES
#159398
#159349
#159254
#159237
#159153
#159114
#159040
#158910
#158841
#158763
#158735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159443
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2025-07-30 19:30:55 +00:00
7821fbc560 [BE] Clarify comment to not revert when command has been edited (#159495)
This is mostly a nit. I was a bit confused when I saw
<img width="1032" height="183" alt="image" src="https://github.com/user-attachments/assets/7a18f167-78c1-4c33-ba6f-3588914c642e" />
in https://github.com/pytorch/pytorch/pull/159172

So I decided I should clean up this message a bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159495
Approved by: https://github.com/yangw-dev, https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/malfet
2025-07-30 19:23:33 +00:00
73ee323380 [ONNX] RMS Norm (#159377)
- Implement rms norm using onnx RMSNormalization-23
- Use the correct eps for float32
  eaadd1282c/aten/src/ATen/native/cuda/layer_norm_kernel.cu (L1844-L1866)
  <img width="743" height="107" alt="image" src="https://github.com/user-attachments/assets/a6fd45aa-01d9-4667-924d-3012232cfcde" />

- Created facility to run tests with the reference runtime by extending ONNXProgram and assert_onnx_program.

Fix https://github.com/pytorch/pytorch/issues/159257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159377
Approved by: https://github.com/titaiwangms
2025-07-30 18:55:47 +00:00
176c6446f8 Update CODEOWNERS for ONNX (#159390)
Update CODEOWNERS for ONNX to reflect current maintainers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159390
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2025-07-30 18:54:25 +00:00
debc0591b8 Fix large_tensor_test skipping cpu (#158617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158617
Approved by: https://github.com/BoyuanFeng
2025-07-30 18:48:07 +00:00
0df78f0c11 Remove /d2implyavx512upperregs- flag (#159431)
And reopen https://github.com/pytorch/pytorch/issues/145702

As this flag is not documented anywhere, slows down sccache accelerated build and  per https://developercommunity.visualstudio.com/t/Invalid-code-gen-when-using-AVX2-and-SSE/10527298#T-N10562579 it does not workaround a compiler bug, but rather disables some optimizations of AVX512 instructions which are being invoked in AVX2 codepath

Fixes https://github.com/pytorch/pytorch/issues/159082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159431
Approved by: https://github.com/clee2000
2025-07-30 18:47:03 +00:00
d0e8a0ec4c Add CPython test for heapq (#159370)
Not used directly but used internally by `collections.Counter`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159370
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2025-07-30 18:43:06 +00:00
22492848b6 [BE]: Update CUTLASS submodule to 4.1.0 (#158854)
Update the CUTLASS submodule to the latest version with new supported architectures and new features we can use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158854
Approved by: https://github.com/henrylhtsang
2025-07-30 17:44:38 +00:00
5c14315b05 fixed typo error (#159451)
Fixes #159375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159451
Approved by: https://github.com/albanD
2025-07-30 17:41:30 +00:00
1b99c1859c [BE] Make PyObjectSlot use a global PyInterpreter and remove (#158427)
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there only is one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.

For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from  the `check_pyobj` call.

```
 mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
-    // fast case: tensor is live in python
-    std::optional<PyObject*> mb_obj =
-        t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
-    if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
-        return *mb_obj;
-    }
-    return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+  // fast case: tensor is live in python
+  std::optional<PyObject*> mb_obj =
+      t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+          /*ignore_hermetic_tls=*/false);
+  if (mb_obj.has_value() &&
+      !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+    return *mb_obj;
+  }
+  return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
2025-07-30 17:29:43 +00:00
435edbcb5d [Graph Partition] add graph partition doc (#159450)
This pr adds doc for graph partition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159450
Approved by: https://github.com/eellison
2025-07-30 17:01:10 +00:00
6c6e11c206 Revert "Fix max_width computation in _tensor_str._Formatter (#126859)"
This reverts commit 1465757959dd7e63715b7621650896eca977aefa.

Reverted https://github.com/pytorch/pytorch/pull/126859 on behalf of https://github.com/yangw-dev due to broke trunk with test  distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_reduce_single - RuntimeError: Expected to find buf7 = empty but did not find it ([comment](https://github.com/pytorch/pytorch/pull/126859#issuecomment-3137137030))
2025-07-30 16:56:32 +00:00
a775c8e73e [Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446)
Hi team,

Please help review this patch.

This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.

I found the root cause is not that we cannot get C function frames by `PyFrame_GetBack` when PythonTracer is filling start frames, but the c call event loss problem bug on Python 3.12.0-3.12.4. And that problem was fixed by 257c413cd1 on 3.12.5.

So I think the https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, this patch reverts the change of it.

There are solutions to fix the problem correctly, such as we can add a new monitoring callback to compensate call events of methods with C function or we can override the callback registered by `PyEval_SetProfile`.  These solutions may make the code hard to maintain.

~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
2025-07-30 16:35:51 +00:00
24d07b3a67 [inductor] Fix mm decomposition evaluating symints (#158998)
Fixes #154111

Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.

The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
2025-07-30 16:34:15 +00:00
90fd06be71 Various bugfixes for running NanoGPT training (#159166)
Fix various small bugs with running nanogpt on torchbenchmark in OSS under python 3.10. After these changes, the following now succeeds:

```
tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance  --training --backend inductor  --caching-precompile --warm-start-latency
```

Cold start: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp12LuZ5/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Warm start (we are invesigating the recompile):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpT5YTB2/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159166
Approved by: https://github.com/zhxchen17
2025-07-30 16:30:22 +00:00
002f18807e [DCP] Improve error handling for process based async checkpointing (#159374)
Summary:
### PR Context
- Kill background process only when PG init fails or there is an explicit `TERMINATE` signal from main process.
- When a checkpoint fails to save, log and return the error but continue the serving loop.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79177410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159374
Approved by: https://github.com/sibuachu
2025-07-30 16:25:28 +00:00
259e79e3ff Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 16:11:58 +00:00
ee343ce60c [RPC][TensorPipe] Fix import torch if compiled without TensorPipe (#159461)
This is a follow up on the PR #154382, as the issue still persists:
```
  File "/opt/pytorch/pytorch/torch/distributed/rpc/__init__.py", line 81, in <module>
    from . import api, backend_registry, functions
  File "/opt/pytorch/pytorch/torch/distributed/rpc/api.py", line 35, in <module>
    from .constants import DEFAULT_SHUTDOWN_TIMEOUT, UNSET_RPC_TIMEOUT
  File "/opt/pytorch/pytorch/torch/distributed/rpc/constants.py", line 3, in <module>
    from torch._C._distributed_rpc import (
ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159461
Approved by: https://github.com/lw
2025-07-30 16:04:02 +00:00
ea5369113a unflatten closure (#159418)
Summary: Sometimes the call history recorded in a `nn_module_stack` does not have the stack property, where each FQN is a prefix of the next FQN. This can cause errors during `unflatten`. Instead of erroring we now drop entries from such a `nn_module_stack` to restore the stack property. This effectively leads to less unflattening: the last FQN in the call history before the stack property was broken keeps the entire flat subgraph of its call.

Test Plan:
added test, updated another

Rollback Plan:

Differential Revision: D79204669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159418
Approved by: https://github.com/angelayi
2025-07-30 15:42:18 +00:00
b268f22ab2 Move Float4 to headeronly (#159414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159414
Approved by: https://github.com/desertfire
2025-07-30 15:34:01 +00:00
52a52d1b78 [dynamo][guards] Skip no tensor aliasing guard on inbuilt nn module buffers (#159453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159453
Approved by: https://github.com/jansel
2025-07-30 15:31:07 +00:00
eaadd1282c Revert "Move Half to headeronly (#159172)"
This reverts commit 6d0f4566e2b6e05369d8bb6c0d0e83a0eee982aa.

Reverted https://github.com/pytorch/pytorch/pull/159172 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/16613893793/job/47002486679) [HUD commit link](6d0f4566e2).  Note to self: why isn't Dr. CI updating ([comment](https://github.com/pytorch/pytorch/pull/159172#issuecomment-3136769493))
2025-07-30 15:10:26 +00:00
1465757959 Fix max_width computation in _tensor_str._Formatter (#126859)
Previous version of `torch._tensor_str._Formatter` was not using `PRINT_OPTS.sci_mode` for the `max_width` computation but was using it for the formatting of values leading to a weird discrepancy.

Now, the code first checks if it should be in sci_mode, then compute `max_width`

Here is an example to test the behavior:
```python
A = torch.tensor([10, 1e-1, 1e-2])
B = torch.tensor([10, 1e-1, 1e-1])

print("================= Default =================")
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")

print("================= sci_mode=False =================")
with torch._tensor_str.printoptions(sci_mode=False):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")

print("================= sci_mode=True =================")
with torch._tensor_str.printoptions(sci_mode=True):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
```

In the current version this prints:
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([   10.0000,     0.1000,     0.0100]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7
```

On can see that in `sci_mode=False`, the values of A are prefixed with unneeded 0 and does not have the same `max_width` as B (It keeps the `max_width` from `sci_mode = None`)

Also in `sci_mode = True`, for B, the `max_width` is 7 but each value takes 10 chars... (But it is fine as the code that uses `max_width` do not rely much on it, but still, this is missleading)

After this commit, this will print
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([10.0000,  0.1000,  0.0100]) Formatter max_width: 7
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10
```

This also allows to align A with B for `sci_mode=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859
Approved by: https://github.com/malfet
2025-07-30 14:01:00 +00:00
17b9c618dd [a2av] not returning out tensor from ops (#159435)
torch.compile of `all_to_all_vdev_2d` hits the following error:
```
torch._dynamo.exc.BackendCompilerFailed: backend='aot_eager' raised:
RuntimeError: Found a custom (non-ATen) operator whose output has alias annotations: symm_mem::all_to_all_vdev_2d(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name, int? major_align=None) -> Tensor(a!). We only support functionalizing operators whose outputs do not have alias annotations (e.g. 'Tensor(a)' is a Tensor with an alias annotation whereas 'Tensor' is a Tensor without. The '(a)' is the alias annotation). The alias annotation specifies that the output Tensor shares storage with an input that has the same annotation. Please check if (1) the output needs to be an output (if not, don't return it), (2) if the output doesn't share storage with any inputs, then delete the alias annotation. (3) if the output indeed shares storage with an input, then add a .clone() before returning it to prevent storage sharing and then delete the alias annotation. Otherwise, please file an issue on GitHub.
```

This PR selects option (1).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159435
Approved by: https://github.com/ngimel, https://github.com/xmfan
2025-07-30 08:30:25 +00:00
d3ce45012e Generalize torch._C._set_allocator_settings to be generic (#156175)
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
2025-07-30 06:37:15 +00:00
1fc010a9d8 Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
2025-07-30 06:37:15 +00:00
dfacf11f66 Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We would deprecate those option that overleap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
2025-07-30 06:37:06 +00:00
c8cf811995 Enable AcceleratorAllocatorConfig key check (#157908)
# Motivation
Add a mechanism to ensure raise the key if the key is unrecognized in allocator config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157908
Approved by: https://github.com/albanD
ghstack dependencies: #149601
2025-07-30 06:36:56 +00:00
914b1a3873 Introduce AcceleratorAllocatorConfig as the common class (#149601)
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The another name `AllocatorConfig` is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.

# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env variable config key-value pair

## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format: `key:value`. Use square brackets `[]` for list values Example: `key1:123, key2:[val1,val2]`

## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.

Differential Revision: [D79011786](https://our.internmc.facebook.com/intern/diff/D79011786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
2025-07-30 06:36:46 +00:00
7eb5fdb358 [dynamo][guards] Recursive dict tag optimization (#159183)
Design doc here - https://docs.google.com/document/d/1W29DrWID5miGWlZXspsQVN5U0zydE3kjZpziOXrhuaY/edit?tab=t.0#bookmark=id.sba04iw9sp68

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159183
Approved by: https://github.com/jansel
2025-07-30 06:01:32 +00:00
f1fb57d854 Add user annotation for FX graph cache key (#159318)
Summary: AI system co-design team requested to add user annotation for FX graph cache key in PyTorch Kineto trace and Execution trace. With this annotation, they can know the FX graph to which the kernels belong.

Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Rollback Plan:

Differential Revision: D79019069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159318
Approved by: https://github.com/sraikund16, https://github.com/jansel
2025-07-30 05:52:50 +00:00
6d0f4566e2 Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 05:02:13 +00:00
e785c087c5 [audio hash update] update the pinned audio hash (#159321)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159321
Approved by: https://github.com/pytorchbot
2025-07-30 04:35:01 +00:00
d214901133 Add a title to distributed._dist2.md (#159385)
Sphinx likes titles and complains about them when they are not there. So adding a title to address this Wartning in the build:
```
WARNING: toctree contains reference to document 'distributed._dist2' that doesn't have a title: no link will be generated
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159385
Approved by: https://github.com/d4l3k
2025-07-30 04:09:41 +00:00
96ac64d00c Migrate easy q(u)int/bits stuff to torch/headeronly (#159302)
Straightup copy pasta. Keeps APIs in c10 and reexposes them to torch::headeronly.

It is arguable that we should just get rid of some of these unused dtypes but that is outside the scope of this PR, which is meant to build up to ScalarType moving to headeronly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159302
Approved by: https://github.com/malfet, https://github.com/albanD
2025-07-30 03:41:27 +00:00
46d34d6766 (should_fold) gso to guard_or_false when checking folding whether to 3d bmm into 2d mm (#159184)
Switch from guard_size_oblivious to guard_or_false if you encounter a DDE, this would then avoid folding this 3d bmm into a mm.

806d9e3fe7/torch/_decomp/decompositions.py (L4506-L4512)

## DDE
```
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
    elif should_fold(tensor1, tensor2, is_out):
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4472, in should_fold
    if guard_size_oblivious(t1.numel() == 0):
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(12*((u0//2)), 0) (unhinted: Eq(12*((u0//2)), 0)).  (Size-like symbols: none)

Caused by: (_decomp/decompositions.py:4472 in should_fold)
```

```
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
    elif should_fold(tensor1, tensor2, is_out):
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4483, in should_fold
    return all(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*((u0//2)), 3) (unhinted: Eq(3*((u0//2)), 3)).  (Size-like symbols: none)

Caused by: (_decomp/decompositions.py:4483 in should_fold)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159184
Approved by: https://github.com/ezyang
ghstack dependencies: #158894
2025-07-30 03:12:14 +00:00
clr
880249adbc dynamo: handle AttributeErrors from nn_module when infer_paramaters throws. (#158501)
This only handles AttributeError, but in general, any exception coming from
here is a user exception. let me know if we prefer to catch all exceptions, and then reraise them as observed exceptions.

```
 File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 2200, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 1210, in call_function
    self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/lazy.py", line 201, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 472, in call_function
    initialize_lazy_module(tx, mod, args, kwargs)
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 104, in initialize_lazy_module
    mod._infer_parameters(mod, fake_args, fake_kwargs)
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/lazy.py", line 261, in _infer_parameters
    module.initialize_parameters(*args, **kwargs)
  ...,
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/module.py", line 1962, in __getattr__
    raise AttributeError(
torch._dynamo.exc.InternalTorchDynamoError: AttributeError: '...' object has no attribute '...'
```

Note that we crash with a sligthly different exception trace in the other test I added. Let me know if we want this to not throw directly to the end user.
```
======================================================================
ERROR: test_lazy_module_bad_params (__main__.NNModuleTests.test_lazy_module_bad_params)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/clr/pytorch/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
    ~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 1683, in test_lazy_module_bad_params
    exp_res = opt_m(x, y)
  File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 411, in __call__
    return super().__call__(*args, **kwargs)
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 473, in _call_lazy_check
    self._orig_mod._infer_parameters(self._orig_mod, args, kwargs)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/lazy.py", line 261, in _infer_parameters
    module.initialize_parameters(*args, **kwargs)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 711, in initialize_parameters
    self.foo += 1
    ^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1962, in __getattr__
    raise AttributeError(
        f"'{type(self).__name__}' object has no attribute '{name}'"
    )
AttributeError: 'LazyModuleBadInferParams' object has no attribute 'foo'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158501
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-07-30 02:41:41 +00:00
846ada4973 [AOTI] disable crashed AOTI UTs on Windows. (#159427)
disable crashed AOTI UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159427
Approved by: https://github.com/angelayi
2025-07-30 02:23:27 +00:00
badd0618e4 Remove unused paramter on CUDA AllocParams (#159159)
# Motivation
While refactoring the caching allocator, I noticed that the `AllocParams` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.

# Additional Context
I noticed that `AllocParams` is defined in cpp file, so it should be safe to make this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159159
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-07-30 02:05:25 +00:00
a753a72b14 [BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407)
This PR makes some less risky changes to PyObjectSlot as there is a lot of stuff we do not need since there is only one interpreter. Specifically `check_interpreter` and `has_pyobj_nonhermetic` are removed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158290, #158291
2025-07-30 01:36:03 +00:00
b57d1ef110 [BE] Remove __reduce_deploy__ (#158291)
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).

Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158290
2025-07-30 01:36:03 +00:00
dd7c996d5c [BE] Remove torch deploy | remove torch deploy specific files (#158290)
This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290
Approved by: https://github.com/albanD
2025-07-30 01:36:03 +00:00
70d2e9ba45 [MPS] Avoid outputing zeros from exponential_ for MPS (#159386)
Fixes #159103
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159386
Approved by: https://github.com/malfet
2025-07-30 00:20:31 +00:00
eqy
62f98dbb44 [CUDA][Convolution] Add tf32_on_and_off decorator to test_deconv_freezing_cuda (#159280)
Blackwell seems to select TF32 kernels for this case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159280
Approved by: https://github.com/zou3519, https://github.com/jingsh, https://github.com/Skylion007
2025-07-29 23:44:10 +00:00
e288c258f7 Revert "Remove tensorexpr tests (#158928)"
This reverts commit d742a2896c571a535003d5928fe80397325575a5.

Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/yangw-dev due to this breaks bunch of internal dependency since some tests are still using the deleted test files from this pr, the internal reviewer please help fix this using codev ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3134378616))
2025-07-29 23:32:07 +00:00
df58db8831 [dynamo, docs] add recompilation, observability, reporting issues docs (#159062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159062
Approved by: https://github.com/svekars, https://github.com/zou3519, https://github.com/anijain2305
2025-07-29 23:23:51 +00:00
15bb81ea4f [2/N][CI] Remove MacOS-13 workarounds from tests (#159304)
Part of https://github.com/pytorch/pytorch/issues/159275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159304
Approved by: https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #159277, #159278
2025-07-29 23:12:13 +00:00
8d37073bac [ROCm] Update jit_utils.cpp trait modification based on HIP version. (#159292)
The mi355 ci regression and hiprtc kernel compilation is failing due to duplicate definitions of traits leading to errors like `error: redefinition of 'integral_constant'`. This seems to be the culprit: https://github.com/pytorch/pytorch/pull/158868. Checking if using hip version instead of rocm version for the check would help with resolution here as rocm version and hip version aren't synced. ROCm 7.0 Alpha build used in CI is still on HIP 6.5.

Confirmed that this patch works here: https://github.com/pytorch/pytorch/actions/runs/16579227179?pr=159292

Also, this PR increases the frequency of this MI355 CI to twice a day so we can catch and identify regressions easier if they happen for now.

Jeff is on vacation, so Jithun asked me to reach out to y'all. Please help stamp and approve, so we can resolve the recent MI355 CI regression/timeout (https://github.com/pytorch/pytorch/actions/workflows/rocm-mi355.yml) :) @huydhn @malfet @atalman @seemethere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159292
Approved by: https://github.com/malfet
2025-07-29 22:45:27 +00:00
dc286aef61 Fused RMSNorm Housekeeping (#159317)
Small PR to address comments that were made from the original fused rmsnorm PR that were not landed

Changes:
- Warning message when input.dtype doesn't match weight.dtype
- Ensure default epsilon value is correct

Comments:
https://github.com/pytorch/pytorch/pull/153666#discussion_r2114735005
https://github.com/pytorch/pytorch/pull/153666#discussion_r2223518064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159317
Approved by: https://github.com/ngimel, https://github.com/Skylion007, https://github.com/eqy
2025-07-29 22:39:18 +00:00
b4619f0272 Pin Helion to 0.0.10 in PyTorch CI (#159420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159420
Approved by: https://github.com/aorenste, https://github.com/malfet
2025-07-29 22:06:50 +00:00
477c2273e1 [dynamo] better way to skip tracing sys.monitoring callables (#159369)
Better approach to https://github.com/pytorch/pytorch/pull/158171, according to https://github.com/python/cpython/issues/137178#issuecomment-3131617493.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159369
Approved by: https://github.com/Skylion007
2025-07-29 21:54:58 +00:00
2176d481c1 [DTensor] dispatch to sharding prop over decomps (#159324)
Fixes #159110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159324
Approved by: https://github.com/ezyang
2025-07-29 21:28:36 +00:00
b97274e8ac [iter] Raise TypeError if iter arg cannot be iterable (#158410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158410
Approved by: https://github.com/XuehaiPan, https://github.com/zou3519
ghstack dependencies: #156371, #156416, #156460
2025-07-29 21:24:21 +00:00
f9be65cea4 [iter] Wrap iter(..) call in a ObjectIteratorVariable (#156460)
This object keeps track when the iterator is exhausted (raise Stopiteration).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156460
Approved by: https://github.com/zou3519
ghstack dependencies: #156371, #156416
2025-07-29 21:24:20 +00:00
4e3e3dc0a7 [iter] support iter(callable, sentinel) (#156416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156416
Approved by: https://github.com/XuehaiPan, https://github.com/zou3519
ghstack dependencies: #156371
2025-07-29 21:24:20 +00:00
fcf59df2b6 [iter] Add support for sequence protocol in iter(..) (#156371)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156371
Approved by: https://github.com/zou3519
2025-07-29 21:24:20 +00:00
1bcb2f41e0 [BE] Eliminate workspace info in templates with new API (#159055)
Summary: Moves the workspace info calculations to the old TMA API.

Test Plan:
NFC

Rollback Plan:

Differential Revision: D78904434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159055
Approved by: https://github.com/NikhilAPatel
2025-07-29 21:22:36 +00:00
8460131087 [nativert] Add OSS version of ModelRunner (#159268)
Summary: Implement a ModelRunner from scratch with the minimum features for OSS only

Test Plan:
test_export -r NativeRT

Rollback Plan:

Differential Revision: D78979812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159268
Approved by: https://github.com/dolpm
2025-07-29 21:08:14 +00:00
c0c24b61ff Revert "Partitioner: Fix to align partition node order with original graph (#157892)"
This reverts commit 2d1e92307d3e67622f4fe8058d62e44fe4fa2f4e.

Reverted https://github.com/pytorch/pytorch/pull/157892 on behalf of https://github.com/yangw-dev due to fails internal tests : [executorch/backends/xnnpack/partition/xnnpack_partitioner.py:101:24] Incompatible parameter type [6]: In call `Partition.__init__`, for argument `nodes`, expected `Optional[Iterable[Tuple[Node, Optional[int]]]]` but got `dict_keys[Node, str]`. ([comment](https://github.com/pytorch/pytorch/pull/157892#issuecomment-3134004881))
2025-07-29 20:41:45 +00:00
4fac43b21f [BE] Move _freeze.py to torch/fb/utils (#159307)
Summary: We are trying to deprecate torch deploy externally. However a bunch of legacy stuff still uses it. This PR allows the legacy tests to still run if neccessary

Test Plan:
It's a targets change so CI should suffice

Rollback Plan:

Differential Revision: D78910653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159307
Approved by: https://github.com/albanD
2025-07-29 20:07:17 +00:00
b794e77b7b Disable cudagraph GCs by default (#158649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158649
Approved by: https://github.com/eellison
ghstack dependencies: #158193
2025-07-29 19:56:11 +00:00
d987a6f7f0 Revert "[Dynamo][Better Engineering] Add typing annotations to guard and source (#158397)"
This reverts commit abcb24f4de11f8fedf2c2c9ff53b6092ef42306d.

Reverted https://github.com/pytorch/pytorch/pull/158397 on behalf of https://github.com/yangw-dev due to Suggested to fix failing internal signals on D78911890 ([comment](https://github.com/pytorch/pytorch/pull/158397#issuecomment-3133823766))
2025-07-29 19:49:40 +00:00
5d93127c87 Revert "[HOP, map] Rework of map autograd to the new interface (#153343)"
This reverts commit 24b1f10ca13d682430725c511812e43a35fcd6a6.

Reverted https://github.com/pytorch/pytorch/pull/153343 on behalf of https://github.com/yangw-dev due to a older pr this pr dependes on needed to revert, rebase it after it's in ([comment](https://github.com/pytorch/pytorch/pull/153343#issuecomment-3133816812))
2025-07-29 19:46:42 +00:00
a3a51282db Fix rand_like decomposition to preserve strides (#159294)
Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898.

Test Plan: New unit test (fails before this PR; but fixed after)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294
Approved by: https://github.com/eellison
2025-07-29 19:26:20 +00:00
e557b3d5e5 Revert "[inductor] Fix mm decomposition evaluating symints (#158998)"
This reverts commit 52e180c3799a7638ee668b1291a711865ab8cfec.

Reverted https://github.com/pytorch/pytorch/pull/158998 on behalf of https://github.com/yangw-dev due to it broke trunk with pr_time_benchmark test  ([comment](https://github.com/pytorch/pytorch/pull/158998#issuecomment-3133696775))
2025-07-29 19:04:11 +00:00
f3a9e99036 Fix inductor cuda sort nan behavior (#159308)
Fix for https://github.com/pytorch/pytorch/issues/152423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159308
Approved by: https://github.com/isuruf
2025-07-29 19:02:45 +00:00
f7d6e9f500 [dynamo][guards] More small guard optimizations (#159345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159345
Approved by: https://github.com/williamwen42
ghstack dependencies: #159288
2025-07-29 18:36:49 +00:00
e43e09e6c1 [dynamo][guards] Use lambda guards for object aliasing to improve object aliasing guards (#159288)
# Note - On Lambda guarding of object aliasing
        # We previously installed object‑aliasing guards as relational guards,
        # but that undermined the recursive‑dict guard optimization: placing the
        # aliasing guard at a leaf prevented the parent dict node from
        # qualifying as a recursive‑dict guard root. Because aliasing guards are
        # rare, we now emit them as epilogue guards via a small Python lambda.
        # This repeats the access in Python—adding a bit of work—but the
        # overhead is outweighed by the gains from enabling recursive‑dict guard
        # optimization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159288
Approved by: https://github.com/StrongerXi
2025-07-29 18:36:49 +00:00
2004f8aa10 FXConverter handling of generic output in inductor fallback kernel (#159002) (#159297)
Summary:

A fallback kernel's output may be a non-list/tuple but a `MultiOutput` with empty indices. Allow the `FXConverter` to handle such case.

Test Plan:
Modified the fxir test for fallbacks, then ran `buck2 test mode/dev-nosan caffe2/test/inductor:fxir_backend -- test_fallback`.

Before this diff the modified test would fail with
```
File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 341, in generate
    line.codegen_fx(self)(line)
  File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 489, in _generate_multi_output
    inds = line.indices[0][1:]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
IndexError: list index out of range
```
 (Full error paste in P1878839403)

With this diff the error is no longer present.

Rollback Plan:

Differential Revision: [D79126619](https://our.internmc.facebook.com/intern/diff/D79126619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159297
Approved by: https://github.com/blaine-rister
2025-07-29 18:29:01 +00:00
31b3b38e3a Ensure export joint with descriptors + compile works (#159337)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159337
Approved by: https://github.com/wconstab
ghstack dependencies: #159336
2025-07-29 17:43:52 +00:00
2f0db0444e Track previous MetricsContext edits for ease of debugging. (#159336)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159336
Approved by: https://github.com/wconstab
2025-07-29 17:43:52 +00:00
6162e650b0 [BE] remove torch deploy - conditionals (#158288)
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always

Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
2025-07-29 17:40:49 +00:00
5d89634ca8 Graph break with error message (#158800)
Fixes #157452

Test with
```
python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks
```

### Release Notes

Change to nn.Parameter Constructor Behavior in Dynamo

Semantic change introduced in the nn.Parameter constructor; previously, if the constructor lacked a clean source, the system would attempt to infer arguments to construct a clone and lift this synthetic proxy in the computation graph. This approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source. Users are now suggested to manually move the constructor out of the graph in such cases. This change improves clarity and reduces complexity in graph construction and debugging.  Users can escape hatch to old semantics with `torch.dynamo.config.graph_break_on_nn_param_ctor=False` if this cannot be done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800
Approved by: https://github.com/anijain2305
2025-07-29 17:34:49 +00:00
52e180c379 [inductor] Fix mm decomposition evaluating symints (#158998)
Fixes #154111

Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.

The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
2025-07-29 17:29:38 +00:00
c55e72bea1 [Re-land][Inductor] Support native Inductor as backend for MTIA (#159211)
The previous [diff/PR] (https://github.com/pytorch/pytorch/pull/158526) was reverted due to this docstring lint error:
<img width="1736" height="722" alt="image" src="https://github.com/user-attachments/assets/216b1720-4002-48da-b5f3-32b5d48aaa54" />
I didn't add the docstring cause I thought I'm not supposed to add docstring for an EXISTING function.

So this diff/PR is an exactly copy of the previous one, except for adding the docstring.

-------------
This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code(triton kernel + python wrapper code) similar to CUDA. And the triton kernels can be launched eagerly.

The changes include:
- Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc.
- Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc.
- MTIA specific codegen logic, for example, loading MTIA dynamic_library.
- Other necessary changes to integrate with Inductor codegn, following other devices like CUDA, XPU.
- Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend.
- A change in Inductor runtime to avoid re-initialize MTIADriver.
- BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag.
- Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag.
- Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose.

Note:
- This approach(option 3) aims to provide a pytorch native approach of Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA specific graph optimization, and is limited to eagerly launch overhead.
- MTIA will support another approach(option 2) to provide best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, steam/event APIs, etc, especially as WrapperFxCodegen inherits PythonWrapperCodegen.

Internal:
References:
- [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/)
- [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb)
- [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w)
- [early prototying diff](https://www.internalfb.com/diff/D75110196)
- [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959)
- [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678)

Differential Revision: [D79040806](https://our.internmc.facebook.com/intern/diff/D79040806/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159211
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/jansel
2025-07-29 17:03:24 +00:00
750348b579 [NativeRT] Clean up use of TargetDevice in KernelFactory (#159298)
Summary:
Remove use of targetDevice in KernelFactory.

AOTI would infer device when creating AOTIDelegateExecutor.

Test Plan:
CI

Rollback Plan:

Reviewed By: dolpm

Differential Revision: D79007317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159298
Approved by: https://github.com/dolpm
2025-07-29 16:24:33 +00:00
52b9af163c Add avg_pool3d for MPS (#158877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158877
Approved by: https://github.com/malfet
2025-07-29 15:22:22 +00:00
f4bfac11c7 [Precompile] [easy] API For Editable PrecompileCacheArtifacts (#158586)
This adds an option for backend precompile artifacts to be *editable*, i.e. to not serialize them right away, but instead be able to apply a Callable edit_fn to them.

This allows us to support editing the precompile artifact with more updated autotune results at a later time in the next PR. The goal flow here is:
- User runs AOTAutograd -> Inductor -> Triton
- User saves to AOTAutogradCache the normal results
- User runs autotuning
- User calls serialize(), it takes the new autotuning results at runtime and saves only the necessary triton kernels.

This PR just implements the API for editing the cache artifacts. The next PR actually adds the autotuning saving support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158586
Approved by: https://github.com/zhxchen17
2025-07-29 14:53:21 +00:00
8d00833fdb [PP] Fix eval step under no_grad() (#159293)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159293
Approved by: https://github.com/tianyu-l, https://github.com/wconstab
2025-07-29 14:42:33 +00:00
de529ef002 [ONNX] onnx.md to simplify deprecated entities (#159312)
Simplify documentation of deprecated entities and remove the auto-generated page for JitScalarType
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159312
Approved by: https://github.com/titaiwangms
2025-07-29 14:24:17 +00:00
61aa2ae20f Revert "[CPU] fix _weight_int8pack_mm with large output shape (#158341)"
This reverts commit e469414b59ceeaae2860e36708de8852b9892776.

Reverted https://github.com/pytorch/pytorch/pull/158341 on behalf of https://github.com/albanD due to Breaks slowtest ([comment](https://github.com/pytorch/pytorch/pull/158341#issuecomment-3132641530))
2025-07-29 13:56:20 +00:00
9d32aa9789 Help fix numpy detection in cross compiled layouts (#137084)
We had trouble at conda-forge getting numpy to get detected on aarch64 due to our splayed layout and cross compilation needs.

see:
* https://github.com/conda-forge/pytorch-cpu-feedstock/pull/256
* https://github.com/conda-forge/pytorch-cpu-feedstock/issues/266
* https://github.com/conda-forge/pytorch-cpu-feedstock/pull/267

This is my attempt at making an "upstreamable patch" that tries to follow your structure.

It could introduce a new environment variable `Python_NumPy_INCLUDE_DIR` if you want, but CMake doesn't use it as an environment variable, so I feel like that would be weird.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137084
Approved by: https://github.com/atalman
2025-07-29 12:08:56 +00:00
5cf77a0ea2 Fix redistribution costs for slice_scatter (#159223)
We were previously assuming that the `input_strategy == src_strategy`, which is not true in all cases.

This should fix this.

On the side, I also realized that for `slice_scatter` some DTensorSpecs don't have TensorMeta, e.g., https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_tensor_ops.py#L524

It would be good to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159223
Approved by: https://github.com/ezyang, https://github.com/wconstab
2025-07-29 12:00:39 +00:00
efcf87654e [CI] update flake8 and mypy lint dependencies (#158720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158720
Approved by: https://github.com/Skylion007
2025-07-29 08:05:56 +00:00
2523e58781 unbacked handling for view_copy (#159244)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159244
Approved by: https://github.com/bobrenjc93
2025-07-29 07:10:46 +00:00
222fa451a2 Move some of vec into headeronly in preparation for Half.h (#158976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158976
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-29 05:43:53 +00:00
6de24135e5 Fix flaky test_inductor_multiple_specializations (#159264)
Summary: This test was using do_bench, so it was flaky performance is non-deterministic.

Test Plan:
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:compile_subprocess -- --exact 'caffe2/test/inductor:compile_subprocess - test_inductor_multiple_specializations_cuda (caffe2.test.inductor.test_compile_subprocess.GPUTests)' --run-disabled

Rollback Plan:

Differential Revision: D79098692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159264
Approved by: https://github.com/jingsh
2025-07-29 05:16:55 +00:00
27ae72036d [cutlass] Prep for cutlass upgrade by ignoring Wunused-but-set-variable (#159276)
Differential Revision: [D79106238](https://our.internmc.facebook.com/intern/diff/D79106238/)

This is in prep for cutlass upgrade.

More context: https://github.com/NVIDIA/cutlass/issues/2487

Tested in https://github.com/pytorch/pytorch/pull/159115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159276
Approved by: https://github.com/adamomainz, https://github.com/njriasan, https://github.com/Skylion007
2025-07-29 04:40:24 +00:00
e924df23a6 [NativeRT] Strengthen matcher check for StaticDispatch kernel (#159187)
Summary:
Strength matcher for StaticDispatch kernels: all input, output tensor must be on CPU, all Device-typed attribute must be CPU.

Previously, we only check output tensor on CPU. This will miss catching the case where we do DeviceToHost aten._to_copy.

Prepare for turning on static dispatch kernel by default.

Test Plan:
I should add some test before land.

Rollback Plan:

Differential Revision: D78747600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159187
Approved by: https://github.com/dolpm
2025-07-29 04:03:49 +00:00
67e68e0785 [c10d] Cleanup split_group logic using the newly built splitGroup (#158488)
with https://github.com/pytorch/pytorch/pull/157716 merged we want to further clean up the code on the python side for `split_group` API. We do need to keep some old global book keeping for bc. The rest of logic is now all in cpp. Regarding the change brought in https://github.com/pytorch/pytorch/pull/152175, we did clean up in https://github.com/pytorch/pytorch/pull/158790 (including internal changes) so that we can safely remove it.

Differential Revision: [D78777152](https://our.internmc.facebook.com/intern/diff/D78777152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158488
Approved by: https://github.com/d4l3k
2025-07-29 03:27:11 +00:00
775788f93b [BE][PYFMT] migrate PYFMT for test/[i-z]*/ to ruff format (#144556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144556
Approved by: https://github.com/ezyang
2025-07-29 03:26:09 +00:00
19ce1beb05 [AOTInductor] Add test for enabling CUDACachingAllocator for AOTInductor's Weight (#159279)
Summary:
Add test for enabling CUDACachingAllocator for AOTInductor's Weight.
Implementation TBD

Test Plan:
N/A, commit is adding a test.

Rollback Plan:

Differential Revision: D79107507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159279
Approved by: https://github.com/desertfire, https://github.com/jingsh
2025-07-29 02:52:10 +00:00
a91ddea61f Add CPython tests for collections module (#158950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158950
Approved by: https://github.com/zou3519
2025-07-29 02:24:27 +00:00
ffccb90ff4 [dynamo, docs] add fullgraph=False docs (#159050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159050
Approved by: https://github.com/svekars, https://github.com/anijain2305
ghstack dependencies: #157985, #158055, #158531
2025-07-29 01:53:47 +00:00
f916f34739 [dynamo, docs] non-strict programming model docs (#158531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158531
Approved by: https://github.com/AlannaBurke, https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #157985, #158055

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-07-29 01:53:47 +00:00
c32994ce4b [docs, dynamo] add fullgraph=True, common graph breaks docs (#158055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158055
Approved by: https://github.com/AlannaBurke, https://github.com/anijain2305
ghstack dependencies: #157985

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-07-29 01:53:41 +00:00
433e43cbec [dynamo, docs] programming model dynamo core concepts (#157985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157985
Approved by: https://github.com/svekars, https://github.com/anijain2305
2025-07-29 01:53:34 +00:00
e469414b59 [CPU] fix _weight_int8pack_mm with large output shape (#158341)
**Summary**
`_weight_int8pack_mm` on CPU may cause segmentation fault if output shape is large (i.e., M * N is large). It's because the kernel compute output buffer address by
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.

**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
2025-07-29 01:14:50 +00:00
657e5e9aa6 All custom operators go through Inductor's graph.call_function (#159174)
Fixes #158892

All custom operators should go through the graph.call_function path. The
other fallback path is for aten/prim operations that don't have support
for things (like torch.float8_e8m0fn).

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159174
Approved by: https://github.com/eellison
2025-07-29 00:31:57 +00:00
f02b783aae [1/N] Remove MacOS-13 MPS testing (#159278)
Starts addressing https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159278
Approved by: https://github.com/dcci
ghstack dependencies: #159277
2025-07-28 23:52:47 +00:00
8ad96a563c [inductor] normalize path of the code. (#159255)
Error stack:
<img width="1361" height="345" alt="image" src="https://github.com/user-attachments/assets/50fb2baa-34fd-4a48-a3e7-76e3185391d4" />

After fix:
<img width="1103" height="398" alt="image" src="https://github.com/user-attachments/assets/ece5a9ba-a085-46fe-b061-0c2ebda3a2df" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159255
Approved by: https://github.com/desertfire
2025-07-28 23:42:11 +00:00
59e261bbd8 Revert "[CI] update flake8 and mypy lint dependencies (#158720)"
This reverts commit f5130bf339f12ccf5c6296130c47685bdc4858e4.

Reverted https://github.com/pytorch/pytorch/pull/158720 on behalf of https://github.com/yangw-dev due to this pr failed internally when build torchgen due to rror: fail: Unknown PyPI project: pyyaml, it seems like this is caused by change PyYAML into  pyyaml, please fix it ([comment](https://github.com/pytorch/pytorch/pull/158720#issuecomment-3129995414))
2025-07-28 22:02:10 +00:00
08ea8fccaf [ez][docker] Remove some unused vars and scripts (#158680)
`CUDNN_VERSION` isn't used in any Dockerfiles, it's picked automatically based on the cuda version in `install_cuda.sh`

`install_cudnn.sh` isn't used anywhere, cudnn installation happens in `install_cuda.sh`

I didn't find any mentions of `GRADLE_VERSION` or `TENSORRT_VERSION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158680
Approved by: https://github.com/janeyx99, https://github.com/atalman, https://github.com/malfet
2025-07-28 21:44:47 +00:00
41754539be Add 3.14 triton wheel build (#159261)
Related to https://github.com/pytorch/pytorch/issues/156856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159261
Approved by: https://github.com/malfet, https://github.com/albanD
2025-07-28 20:34:16 +00:00
716d52779f [BE] Delete non-existing labels (#159277)
As no such runners has been online for last 2+ month
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159277
Approved by: https://github.com/clee2000
2025-07-28 20:28:57 +00:00
3bf41f26c8 [cutlass] rename EVT args within kernels for code caching (#159243)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159243
Approved by: https://github.com/henrylhtsang
2025-07-28 19:01:40 +00:00
19aa8eb4f5 [TF32][Flex Attention] Turn off TF32 for reference computation in test_flex_decoding (#158979)
Seems to avoid threshold (fudge factor) twiddling games as this causes the checks to go down the "very small ref error" path instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158979
Approved by: https://github.com/drisspg, https://github.com/BoyuanFeng, https://github.com/nWEIdia
2025-07-28 18:38:23 +00:00
8c0c5c58c7 [benchmarks] Set model name early to keep warmup and main model same (#159231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159231
Approved by: https://github.com/williamwen42
ghstack dependencies: #159209
2025-07-28 18:18:16 +00:00
2d1e92307d Partitioner: Fix to align partition node order with original graph (#157892)
Fixes #157891

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157892
Approved by: https://github.com/ezyang
2025-07-28 17:36:29 +00:00
399c89e15c fix torch/distributed contributing doc (#158934)
both pointers are pointing to a page of empty github issues. I'm moving this to point to all issues tagged with `pt_distributed_rampup`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158934
Approved by: https://github.com/d4l3k
2025-07-28 17:01:05 +00:00
14d67eec05 Revert "[dynamo][fsdp] Consistent behavior of int attributes (#157262)"
This reverts commit 9b4d938f04c95cebe0fbd96974f64c935567e039.

Reverted https://github.com/pytorch/pytorch/pull/157262 on behalf of https://github.com/ZainRizvi due to This was reverted internally. Somehow this PR didn't get reverted alongside it. See D78772867. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/157262#issuecomment-3128148475))
2025-07-28 16:58:27 +00:00
9ad7dd54f9 [fbgemm_gpu] Upgrade KernelLauncher kernelLaunchCheck to print help string (#158896)
Summary: - Upgrade KernelLauncher kernelLaunchCheck to print help string, following D78440016

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
```

Rollback Plan:

Differential break Revision: D78572009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158896
Approved by: https://github.com/atalman
2025-07-28 16:11:13 +00:00
387db86ef1 Name Inductor's Subproc pool threads. (#158815)
Differential hack Revision: D78710371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158815
Approved by: https://github.com/d4l3k
2025-07-28 16:08:08 +00:00
e5a1d839c5 [nativert] ensure planner once flag is class-local, not static. (#159116)
Summary: att - otherwise only one global planner will be made even though we need it to be per-model if models are colocated.

Differential hack Revision: D78939141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159116
Approved by: https://github.com/SherlockNoMad
2025-07-28 16:06:21 +00:00
c06164a9c5 [nativert][ez] Remove unused dist collectives ops. (#159220)
Removing dependency to c10d/ in ExecutionFrame.h. We don't need c10d::Work in the frame.

Differential Revision: [D79041618](https://our.internmc.facebook.com/intern/diff/D79041618/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159220
Approved by: https://github.com/SherlockNoMad, https://github.com/dolpm
2025-07-28 16:03:14 +00:00
c7586d4ed3 typo (#156560)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156560
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-07-28 15:40:06 +00:00
8e07c9870d [dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#157566)
inside torch.compile.disable function always triggers recompilation. because a user inside function decorated with torch._dynamo.disable would be used as an argument in the resume_in_xx function. In the current implementation,  it will always be a new object, resulting in the ID_MATCH guard always failing and triggering recompilation.

Fixes https://github.com/pytorch/pytorch/issues/157399
@xmfan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157566
Approved by: https://github.com/mlazos, https://github.com/anijain2305
2025-07-28 12:44:22 +00:00
a76147c9e0 [xla hash update] update the pinned xla hash (#158223)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158223
Approved by: https://github.com/pytorchbot
2025-07-28 11:19:05 +00:00
f3913ea641 [CUDA] fix nansum in non-JIT build (#158633)
This change fix crash of
```
import torch
a = torch.tensor([[1, 2]], dtype=torch.complex32).to('cuda')
b = torch.nansum(a, dim=0)
print(b)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158633
Approved by: https://github.com/ngimel
2025-07-28 08:11:32 +00:00
1abff80fae Reland D78841818 (#159216)
Summary: Relanding D78841818 with fixes

Test Plan:
Tested all failing tests

buck build --config fbcode.use_link_groups=true --flagfile fbcode//mode/dev-nosan fbcode//sigmoid/core/executor/memory/test:layout_planner_tests

buck test 'fbcode//mode/opt' fbcode//sigmoid/inference/test:test_passes

Rollback Plan:

Reviewed By: hl475

Differential Revision: D79038615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159216
Approved by: https://github.com/dolpm
2025-07-28 07:39:35 +00:00
799303f655 Fix atleast_{1,2,3}d() with no arguments description (#156042)
Fixes #130667

## Test Result

### Before
![image](https://github.com/user-attachments/assets/7e3a6764-872a-4573-8bec-e7219f920a15)
![image](https://github.com/user-attachments/assets/194be00c-9a29-44cf-b6bc-4d261a12d04e)
![image](https://github.com/user-attachments/assets/21cd6a4f-0793-44e3-9073-7b8b801f997c)

### After

![image](https://github.com/user-attachments/assets/fdbaa2ff-f13c-4fa9-bf52-0810faa698bd)
![image](https://github.com/user-attachments/assets/0374b474-4c6b-4b7d-abea-70e3df0c0a06)
![image](https://github.com/user-attachments/assets/9f9dc188-60e2-4c0f-9e23-36a39310008c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156042
Approved by: https://github.com/zou3519
2025-07-28 06:25:23 +00:00
d26ab281d2 Revert "Setup TorchBench in Docker (#158613)"
This reverts commit d72ebefe3fa7d3ee0e9c9b399f5c07611e790664.

Reverted https://github.com/pytorch/pytorch/pull/158613 on behalf of https://github.com/XuehaiPan due to checkout_install_torchbench function is removed but still referenced in trunk ([comment](https://github.com/pytorch/pytorch/pull/158613#issuecomment-3125695250))
2025-07-28 06:19:00 +00:00
1cffb217ef Revert "[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446)"
This reverts commit e88f804a2eecf967dbbf95c5643248352626dafd.

Reverted https://github.com/pytorch/pytorch/pull/155446 on behalf of https://github.com/XuehaiPan due to Breaks Windows wheels ([comment](https://github.com/pytorch/pytorch/pull/155446#issuecomment-3125566269))
2025-07-28 05:29:37 +00:00
c8342b7231 [vllm hash update] update the pinned vllm hash (#159235)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159235
Approved by: https://github.com/pytorchbot
2025-07-28 04:16:31 +00:00
f63673626d [dynamo][guards] Skip guards on constant func.__defaults__ elements (#159209)
Func.__defaults__ is a tuple. Therefore, we can skip guards on immutable elements. Mutable elements are still guarded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159209
Approved by: https://github.com/jansel
2025-07-27 22:46:17 +00:00
37638c303e Addressing some linter errors (#158670)
Summary: Addressing the linter errors reported in the changed files.

Test Plan:
```
buck test mode/opt deeplearning/fbgemm:QuantUtilsTest
```
https://www.internalfb.com/intern/testinfra/testrun/11821949118528688

```
buck test mode/opt caffe2/torch/fb/model_transform/splitting/tests:split_dispatcher_test
```
https://www.internalfb.com/intern/testinfra/testrun/7881299627525465

Rollback Plan:

Differential Revision: D78352311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158670
Approved by: https://github.com/excelle08, https://github.com/cyyever, https://github.com/digantdesai
2025-07-27 21:55:50 +00:00
ee2edf3d37 [ROCm][CK][Inductor] enable gfx950 for max autotune with CK (#159195)
+ update inductor config for new gfx arch
+ fixes in codegen for conv2d and ck-tile matmul
+ use appropriate fp8 dtypes
+ test cleanup

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159195
Approved by: https://github.com/chenyang78
2025-07-27 20:47:13 +00:00
51eb41a57e Enable dynamic shapes for foreach operations by default (#158985)
## Summary

This PR changes the default value of `combo_kernel_foreach_dynamic_shapes` from `False` to `True` in `torch/_inductor/config.py`.

## Context

The `combo_kernel_foreach_dynamic_shapes` configuration was introduced in PR #134477 (August 2024) to support dynamic shapes for foreach and combo kernels. It was initially disabled by default as a conservative approach to avoid disrupting production workflows.

## Why This Change?

After several months of the feature being available and stable, it's time to enable it by default. This improves the user experience for developers using `torch.compile(dynamic=True)` with foreach operations.

### Current behavior:
- Users must manually discover and enable `combo_kernel_foreach_dynamic_shapes`
- Without this flag, foreach operations may fail with dynamic shapes
- This creates friction and confusion

### With this change:
- Foreach operations work seamlessly with dynamic compilation
- No manual configuration needed
- Better "it just works" experience

## Testing

Extensive testing was performed with PyTorch 2.5.0+ and 2.7.1:
-  Various tensor sizes (8, 16, 32, 64, 128)
-  Multiple tensors in operations (tested up to 20)
-  Nested foreach operations
-  Mixed operations (foreach + standard operations)
-  Both CPU and CUDA devices
-  Symbolic shapes with dynamic compilation

## Impact Assessment

- **Performance**: No impact - this only affects compilation behavior
- **Backward Compatibility**: Fully maintained - users can still set to `False`
- **Risk**: Minimal - feature has been stable since August 2024

## References

- Original implementation: PR #134477 by @qchip
- This completes the feature rollout by making it available by default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158985
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-27 19:56:07 +00:00
ede6186c86 [PP] Allow intermediate nodes in ZB to have multiple grads (#159084)
Fixes a ZB regression (https://github.com/pytorch/torchtitan/actions/runs/16478292562/job/46585646792)

Previously we only allowed an intermediate node to have 1 gradient. Recently a torchtitan ZB test started failing and I tracked to back to FusedRMSNorm grad_fn having two values `(grad, None)` (see https://github.com/pytorch/pytorch/pull/153666) and it started breaking our ZB tests.

This PR allows `stage_backward_weight` intermediate nodes to have multiple grads (it sums them together or if the grad value is None, then ignores it). Here is an example where the backward would have two grad values (gI1, gI2):

```python
class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x, 2
    @staticmethod
    def backward(ctx, gI1, gI2):
        assert gI2 is None
        return gI1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159084
Approved by: https://github.com/tianyu-l
2025-07-27 19:16:51 +00:00
6d071bd65d Remove numpy dependency from onnx (#159177)
One should not expect numpy to be there during onnx import
Forward fix for : https://github.com/pytorch/pytorch/pull/157734
Added regression test to `test_without_numpy` function

Test plan: Run `python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch; import torch.onnx"` with/without this fix
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159177
Approved by: https://github.com/atalman, https://github.com/justinchuby, https://github.com/titaiwangms, https://github.com/cyyever, https://github.com/Skylion007, https://github.com/andrewboldi
2025-07-27 13:23:03 +00:00
cyy
d742a2896c Remove tensorexpr tests (#158928)
The tests are not maintained.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD, https://github.com/malfet
2025-07-27 07:13:27 +00:00
11d6559a58 [inductor] disable failed UTs of test_misc.py (#159210)
Disable failed UTs.

<img width="1195" height="118" alt="image" src="https://github.com/user-attachments/assets/da0933fb-3c4c-44c9-ba85-45971f03405f" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159210
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-07-27 05:41:44 +00:00
e7667e5702 [vllm hash update] update the pinned vllm hash (#159217)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159217
Approved by: https://github.com/pytorchbot
2025-07-27 04:16:35 +00:00
cyy
f6c89c1ef3 Detach tensor before clone in SGD optimiser and other code (#159204)
Reverse the pattern of tensor clone followed by detach in SGD and other code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159204
Approved by: https://github.com/Skylion007
2025-07-27 03:31:12 +00:00
d72ebefe3f Setup TorchBench in Docker (#158613)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-07-26 12:56:03 -07:00
46b925681c [inductor] Update to(tl.int8).to(tl.uint8) workaround from #94717 to handle entire range of torch.uint8 (#158567)
https://github.com/pytorch/pytorch/pull/94717/files#r2210265070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158567
Approved by: https://github.com/ngimel, https://github.com/jansel
2025-07-26 19:11:37 +00:00
fe0ff12dab Revert "[Inductor] Support native Inductor as backend for MTIA (#158526)"
This reverts commit cd68559d0451185f8521912c23e77b83d76b87cf.

Reverted https://github.com/pytorch/pytorch/pull/158526 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158526#issuecomment-3122186057))
2025-07-26 17:58:00 +00:00
7dafab6a93 Fix SDPA sharding when return_debug_mask is False (#159205)
If `return_debug_mask` is False (which is the default value for SDPA), the attention tensor returned is an empty tensor (which has 0 dimensions). This means that the shardings for the batch and CP case are that are passed can yield invalid dimensions.

This PR fixes it for `scaled_dot_product_flash_attention_strategy`.  Note that `scaled_dot_product_cudnn_attention_strategy` doen't have this issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159205
Approved by: https://github.com/wconstab
2025-07-26 17:41:42 +00:00
f5130bf339 [CI] update flake8 and mypy lint dependencies (#158720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158720
Approved by: https://github.com/Skylion007
2025-07-26 17:12:29 +00:00
f62772f365 Revert "Remove tensorexpr tests (#158928)"
This reverts commit 517eebc1dd4ae6430a95818b16c5f8b4b10fd1bc.

Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks trunk test_jit_fuser_te.py::TestNNCOpInfoCPU::test_nnc_correctness_frac_cpu_bfloat16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/16534544469/job/46768022799) [HUD commit link](517eebc1dd) ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3122158944))
2025-07-26 17:01:54 +00:00
e2b2685f84 [inductor] enable compiled autograd on CPU windows - v2 (#159185)
The first version: https://github.com/pytorch/pytorch/pull/158432
compiled autograd on windows is disabled in PR #144707 because cuda windows cannot compile this code.
However these code can be compiled on CPU. This PR enable these code on CPU windows.

But the first version changed ifdef block logical, and caused torch audio build fail: https://github.com/pytorch/audio/issues/3992

Here is the version two, which keep the original logical.

# Local test torch audio build pass:
<img width="874" height="1043" alt="image" src="https://github.com/user-attachments/assets/9657be86-04f7-4c66-b8c6-802ec2a7c5c8" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159185
Approved by: https://github.com/xmfan
2025-07-26 16:21:28 +00:00
3db8623dcb Revert "[NativeRT] Apply Device placement once when loading the graph (#158996)"
This reverts commit 28ee8be5bfeebb2e44daace6551462b52557e451.

Reverted https://github.com/pytorch/pytorch/pull/158996 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158996#issuecomment-3121540050))
2025-07-26 09:05:26 +00:00
cd68559d04 [Inductor] Support native Inductor as backend for MTIA (#158526)
This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code(triton kernel + python wrapper code) similar to CUDA. And the triton kernels can be launched eagerly.

The changes include:
- Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc.
- Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc.
- MTIA specific codegen logic, for example, loading MTIA dynamic_library.
- Other necessary changes to integrate with Inductor codegn, following other devices like CUDA, XPU.
- Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend.
- A change in Inductor runtime to avoid re-initialize MTIADriver.
- BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag.
- Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag.
- Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose.

Note:
- This approach(option 3) aims to provide a pytorch native approach of Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA specific graph optimization, and is limited to eagerly launch overhead.
- MTIA will support another approach(option 2) to provide best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, steam/event APIs, etc, especially as WrapperFxCodegen inherits PythonWrapperCodegen.

Internal:
References:
- [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/)
- [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb)
- [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w)
- [early prototying diff](https://www.internalfb.com/diff/D75110196)
- [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959)
- [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678)

Differential Revision: [D78458745](https://our.internmc.facebook.com/intern/diff/D78458745/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158526
Approved by: https://github.com/blaine-rister, https://github.com/jansel, https://github.com/eellison
2025-07-26 08:16:34 +00:00
62a49d929b [vllm hash update] update the pinned vllm hash (#159198)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159198
Approved by: https://github.com/pytorchbot
2025-07-26 04:44:38 +00:00
c6b479bc09 remove guard_or_x from allowlist_for_publicAPI (#159181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159181
Approved by: https://github.com/albanD
2025-07-26 01:22:17 +00:00
cyy
517eebc1dd Remove tensorexpr tests (#158928)
The tests are not maintained.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD, https://github.com/malfet
2025-07-26 01:21:01 +00:00
7f266020de add softmax_backward_strategy missing field (#159167)
Add input_specs in softmax_backward_strategy, as is needed by AutoParallel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159167
Approved by: https://github.com/XilunWu
2025-07-26 00:53:53 +00:00
e06798191b Split out C++ code from fused adagrad PR (#159008)
The original fused Adagrad pull request was: PR#153038

This PR contains only the c++ code of that original PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159008
Approved by: https://github.com/janeyx99
2025-07-26 00:36:59 +00:00
eqy
c89fa88acb [conv][cuDNN][64-bit indexing] reduce memory usage of depthwise conv 64-bit indexing test (#158981)
Use half instead for reduced memory usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158981
Approved by: https://github.com/soulitzer, https://github.com/Skylion007
2025-07-25 23:58:45 +00:00
f5cf05c983 Throw invalid_argument instead of RuntimeError when parameters exceed… (#158267)
Throw invalid_argument instead of RuntimeError when parameters exceed limits (for torch.int32 dtype)

Fixes #157707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158267
Approved by: https://github.com/albanD
2025-07-25 23:49:46 +00:00
21a95bdf7c [Inductor] [Triton] Enabling TMA for flex-attention for supported device types (#157822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157822
Approved by: https://github.com/drisspg
ghstack dependencies: #159123
2025-07-25 23:45:26 +00:00
fb029accb7 (is_non_overlapping_and_dense) gso to guard_or_false in when checking length 1 (#158894)
Switch from `guard_size_oblivious` to `guard_or_false` if you encounter a DDE, this would then fallback to computing elementwise strides.

2dccff7dcf/torch/_prims/__init__.py (L1919-L1923)

We think it's safe because Laith tested whether this fallback would fail any tests. It did not.
https://github.com/pytorch/pytorch/pull/158157

## Data-dependent exceptions (DDE)
```
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 2139, in _to_copy
    x_tensor = torch._prims.convert_element_type(x_tensor, dtype)
  ...
  File "/data/users/colinpeppler/pytorch/torch/_prims/__init__.py", line 1920, in _convert_element_type_meta
    if torch._prims_common.is_non_overlapping_and_dense(a):
  File "/data/users/colinpeppler/pytorch/torch/_prims_common/__init__.py", line 494, in is_non_overlapping_and_dense
    if guard_size_oblivious(length == 1):
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0 - 4, 1) (unhinted: Eq(u0 - 4, 1)).  (Size-like symbols: u0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158894
Approved by: https://github.com/pianpwk, https://github.com/laithsakka
2025-07-25 23:43:38 +00:00
26f4dd5160 Scaled MM Fix NVfp4 (#159170)
Fixes mm on B200:
Before:
```Shell
    def _addmm_nvfp4_dispatch(
        a: NVFP4Tensor, b: NVFP4Tensor, aten_op, bias: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Core implementation shared between nvfp4_mm, nvfp4_addmm, and nvfp4_linear.
        The only difference is whether bias is None or not.
        """
        assert a._data.is_contiguous()
        assert b._data.t().is_contiguous()
        assert a._block_size == 16, f"NVFP4 requires block_size=16, got {a._block_size}"
        assert b._block_size == 16, f"NVFP4 requires block_size=16, got {b._block_size}"

        M, K = a.shape[0], a.shape[1]
        N = b.shape[1]

        # Swizzle Dizzle
        if a._is_swizzled_scales:
            a_scale_blocked = a._scale_e4m3  # Already swizzled
        else:
            a_scale = a._scale_e4m3.view(M, K // a._block_size)
            a_scale_blocked = to_blocked(a_scale)

        if b._is_swizzled_scales:
            b_scale_blocked = b._scale_e4m3  # Already swizzled
        else:
            b_scale = b._scale_e4m3.view(N, K // b._block_size)
            b_scale_blocked = to_blocked(b_scale)

        # Merge double quant scales into 1 scale for Scale_In^D
        if a._per_tensor_scale is not None:
            assert b._per_tensor_scale is not None
            scale_result = a._per_tensor_scale * b._per_tensor_scale
        else:
            assert b._per_tensor_scale is None and a._per_tensor_scale is None
            scale_result = None

        # THIS IS A WORKAROUND:
        # RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling
        # When we have per-tensor scaling, we need to apply it before bias
        # since bias is not quantized
        should_add_bias_separately = (scale_result is not None) and (bias is not None)
        # should_add_bias_separately = bias is not None

>       result = torch._scaled_mm(
            a._data.view(torch.float4_e2m1fn_x2),
            b._data.view(torch.float4_e2m1fn_x2),
            a_scale_blocked.view(torch.float8_e4m3fn),
            b_scale_blocked.view(torch.float8_e4m3fn),
            bias=None if should_add_bias_separately else bias,
            out_dtype=a._orig_dtype,
            # scale_result=scale_result,  # Not supported yet
        )
E       RuntimeError: Invalid scaling configuration.
E       - For TensorWise scaling, a and b should be float8, scales should be float and singletons.
E       - For RowWise scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be contiguous.
E       - For BlockWise 1x128 scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be outer-dim-major.
E       - For BlockWise 128x128 scaling, a and b should be float8, scales should be float, scale_a should be (2, 1) and scale_b should be (1, 2), and both should be near-inner-dim-major (with 16-byte aligned strides).
E       - For Blockwise 1x32 scaling, a and b should be float8, scales should be float8_e8m0fnu, scale_a should have 1024 elements and scale_b should have 1024 elements, and both should be contiguous.
E       - For Blockwise 1x16 scaling, a and b should be float4 (packed 2x), scales should be float8_e4m3fn, scale_a should have 3072 elements and scale_b should have 3072 elements, and both should be contiguous.
E       Got a.dtype()=Float4_e2m1fn_x2, scale_a.dtype()=Float8_e4m3fn, scale_a.size()=[256, 12], scale_a.stride()=[12, 1], b.dtype()=Float4_e2m1fn_x2, scale_b.dtype()=Float8_e4m3fn, scale_b.size()=[256, 12] and scale_b.stride()=[12, 1]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159170
Approved by: https://github.com/ngimel
2025-07-25 23:34:03 +00:00
b9e3eb64a7 [Optimus] Support decompose mm with dynamic shapes (#158821)
Summary: The current implementation will not do the decompose for GEMM with dynamic shapes, thus we add one more option for users to enable this feature

Test Plan:
### how to enable

Step 1: Set decompose_mem_bound_mm = false
Step 2:
Add the decompose_mm_pass pattern to the post_grad_fusion_options
json config example:

"post_grad_fusion_options": {
            "decompose_mm_pass": {
              "min_first_dimension_decomposition": 10240, -> default value
              "max_other_dimention_decomposition": 32,  -> default value
             "skip_dynamic_shape_dim_check": true, -> default is false
            }
      },

yaml config example

```
 post_grad_fusion_options:
        decompose_mm_pass:
          skip_dynamic_shape_dim_check: true
```
Note that all these hyper-parameters can be set by the users, if nothing gives, a default value will be used

### unit test

```
buck2 test @mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm -- test_dynamic_shape_decompose_addmm
```

Buck UI: https://www.internalfb.com/buck2/a98eb4b3-da1d-4450-9e49-472ba98b2267
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6473924745731095
Network: Up: 86KiB  Down: 1.3MiB  (reSessionID-96cf35cc-5189-4372-8f25-1fc6a52a3963)
Executing actions. Remaining     0/3                                                       1.4s exec time total
Command: test.     Finished 2 local
Time elapsed: 2:00.6s
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0

### E2E

before: aps-DPA_new_v0_amd_20250716-e7927755df
after: aps-DPA_new_v0_amd_20250716_optimus-f2175fc9fb

tlparse:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-DPA_new_v0_amd_20250716_optimus-f2175fc9fb/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

### qps and NE

{F1980635506}
 {F1980635505}
- 12.5% qps improvement with NE neutral

### trace analysis
baseline:https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Faps-DPA_new_v0_amd_20250716-e7927755df%2F0%2Frank-1.Jul_22_22_28_01.4592.pt.trace.json.gz&bucket=aps_traces

{F1980633952}
proposal:https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Faps-DPA_new_v0_amd_20250716_optimus-f2175fc9fb%2F0%2Frank-1.Jul_24_14_37_59.4576.pt.trace.json.gz&bucket=aps_traces

{F1980633966}

```
        unsqueeze_default: "bf16[32*s54, 8, 1][8, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(constant_pad_nd_default_2, 2)
        unsqueeze_default_1: "bf16[1, 8, 8][64, 8, 1]cuda:0" = torch.ops.aten.unsqueeze.default(constant_pad_nd_default_3, 0);  constant_pad_nd_default_3 = None
        mul_tensor: "bf16[32*s54, 8, 8][64, 8, 1]cuda:0" = torch.ops.aten.mul.Tensor(unsqueeze_default, unsqueeze_default_1);  unsqueeze_default = unsqueeze_default_1 = None
```

### what have been decomposed
P1880443593

Rollback Plan:

Differential Revision: D78716034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158821
Approved by: https://github.com/Yuzhen11
2025-07-25 23:19:53 +00:00
69cc99525c [nn]: updated type alias for padddingmode in module/conv.py (#158843)
Fixes #152280

Changed type of `padding_mode` from `str` to `Literal["zeros", "reflect", "replicate", "circular"]`

**cc** @Skylion007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158843
Approved by: https://github.com/mikaylagawarecki
2025-07-25 23:05:02 +00:00
72af19dadf Add aot_autograd.fx_utils (#159005)
See docblock for details.  The API here has been validated by use
in autoparallel but I'm always open to suggestions for tweaks.  One
particular choice I made is to make most of the functions return dicts
by default; this isn't strictly necessary for inputs but it is very
convenient for outputs as the output desc lives on the output node,
not the argument that feeds into the node.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159005
Approved by: https://github.com/wconstab
2025-07-25 22:52:33 +00:00
8aebf01287 [bucketing] Rewrite all_gather, reduce_scatter passes via tracing merge_fn (#158663)
Rewriting bucketing of all_gather and reduce_scatter with defining of "merge graph" via torch function.
`all_gather_merge_fn_to_trace`
`reduce_scatter_merge_fn_to_trace`

(Instead of creating nodes and doing FakeTensor prop manually)
This allows to experiment with merge function.

Used foreach_copy_ in merging function for all_gather - added lowering for inductor for `foreach_copy_`

Adding topological sort after bucketing passes (comment in post_grad.py):
```
        # Fx collectives bucketing passes require topological sort for the cases:
        # when bucketed collectives have users before the last collective in the bucket
        # AND when inputs of bucketed collective have ancestors after the first collective in the bucket.
        #
        # In this case we can not manually pick the place for bucketed collective insertion.
        # But we are guaranteed by the bucketing (independent collectives in the bucket),
        # that it is possible to reorder nodes to satisfy all ordering requirements.
        #
        # --- before bucketing ---
        # in0 = ...
        # wait_ag0 = ag(in0)
        # user0(wait_ag0)
        # ...
        # pre_in1 = ...
        # in1 = transform(pre_in1)
        # wait_ag1 = ag(in1)
        # user1(wait_ag1)
        #
        # --- after bucketing ---
        #
        # in0 = ...
        # user(wait_ag0) <--- wait_ag0 is defined only after bucketed collective.
        #
        # pre_in1 = ...
        # in1 = transform(pre_in1)
        # ag_bucket(in0+in1)
        # wait_bucket
        # wait_ag0 = wait_bucket[0]
        # wait_ag1 = wait_bucket[1]
        # user1(wait_ag1)
````

Correctness of the passes verified by loss curve for llama3 8b for simple_fsdp and for autoparallel:

<img width="1364" height="495" alt="Screenshot 2025-07-22 at 14 27 28" src="https://github.com/user-attachments/assets/67b2cabb-3206-450b-b529-e23c24292fc6" />
<img width="1355" height="509" alt="Screenshot 2025-07-22 at 14 27 56" src="https://github.com/user-attachments/assets/4d0e6b25-2eb1-47b2-8d68-dcec185239c4" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158663
Approved by: https://github.com/wconstab
2025-07-25 22:49:51 +00:00
bc5dbbbb78 support scalar tensor for functional all_gather (#149913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149913
Approved by: https://github.com/H-Huang
ghstack dependencies: #149912
2025-07-25 22:38:08 +00:00
36cf8f1ed8 [BE] Use .md instead of .rst for nn.aliases doc (#158666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158666
Approved by: https://github.com/janeyx99
ghstack dependencies: #158491, #158654
2025-07-25 22:03:55 +00:00
1e79872f2e [BE] More torch.nn docs coverage test (except for torch.nn.parallel) (#158654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158654
Approved by: https://github.com/janeyx99
ghstack dependencies: #158491
2025-07-25 22:03:55 +00:00
9e8f27cc79 [BE] Make torch.nn.modules.* satisfy the docs coverage test (#158491)
Options to address the "undocumented python objects":

1. Reference the functions in the .rst via the torch.nn.modules namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. [Not an option] Monkeypatch `__module__` for these objects (broke several tests in CI due to `inspect.findsource` failing after this change)
3. Update the .rst files to also document the torch.nn.modules forms of these functions, duplicating docs.

#### [this is the docs page added](https://docs-preview.pytorch.org/pytorch/pytorch/158491/nn.aliases.html)
This PR takes option 3 by adding an rst page nn.aliases that documents the aliases in nested namespaces, removing all the torch.nn.modules.* entries from the coverage skiplist except
- NLLLoss2d (deprecated)
- Container (deprecated)
- CrossMapLRN2d (what is this?)
- NonDynamicallyQuantizableLinear

This mostly required adding docstrings to `forward`, `extra_repr` and `reset_parameters`. Since forward arguments are already part of the module docstrings I just added a very basic docstring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158491
Approved by: https://github.com/janeyx99
2025-07-25 22:03:55 +00:00
e65ab9a868 Enable generating generic c_shim that doesn't bypass dispatcher (#158974)
Adds `c_shim_aten.{h/cpp}` and use this for `fill_`

This is the generated `c_shim_aten.cpp` for reference

```cpp

// WARNING: THIS FILE IS AUTOGENERATED BY torchgen. DO NOT MODIFY BY HAND.
// See 7e86a7c015/torchgen/gen.py (L2424-L2436) for details

// This file corresponds to the aten_shimified_ops list in torchgen/aoti/fallback_ops.py

#include <torch/csrc/inductor/aoti_torch/generated/c_shim_aten.h>
#include <torch/csrc/inductor/aoti_torch/utils.h>

#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/CompositeExplicitAutogradFunctions.h>
#include <ATen/CompositeExplicitAutogradNonFunctionalFunctions.h>
#include <ATen/CompositeImplicitAutogradFunctions.h>
#else
#include <ATen/ops/fill.h>

#endif // AT_PER_OPERATOR_HEADERS

using namespace torch::aot_inductor;

AOTITorchError aoti_torch_aten_fill__Scalar(AtenTensorHandle self, double value) {
    AOTI_TORCH_CONVERT_EXCEPTION_TO_ERROR_CODE({
        at::fill_(
            *tensor_handle_to_tensor_pointer(self), value
        );
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158974
Approved by: https://github.com/albanD, https://github.com/janeyx99
2025-07-25 21:59:14 +00:00
bfe6765d6b [export] assert fix in serdes (#159060)
Summary: catch asserts on True

Test Plan:
T232064560

Rollback Plan:

Differential Revision: D78907485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159060
Approved by: https://github.com/yiming0416
2025-07-25 21:46:20 +00:00
e88f804a2e [Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446)
Hi team,

Please help review this patch.

This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.

I found the root cause is not that we cannot get C function frames by `PyFrame_GetBack` when PythonTracer is filling start frames, but the c call event loss problem bug on Python 3.12.0-3.12.4. And that problem was fixed by 257c413cd1 on 3.12.5.

So I think the https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, this patch reverts the change of it.

There are solutions to fix the problem correctly, such as we can add a new monitoring callback to compensate call events of methods with C function or we can override the callback registered by `PyEval_SetProfile`.  These solutions may make the code hard to maintain.

~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
2025-07-25 21:44:57 +00:00
7ef3c3357d NUMA binding integration with elastic agent and torchrun (#149334)
Implements #148689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149334
Approved by: https://github.com/d4l3k

Co-authored-by: Paul de Supinski <pdesupinski@gmail.com>
2025-07-25 21:19:49 +00:00
24b1f10ca1 [HOP, map] Rework of map autograd to the new interface (#153343)
This PR reworks the current autograd implementation of map to the new interface.

@pytorchbot label "topic: not user facing"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153343
Approved by: https://github.com/ydwu4
2025-07-25 21:17:06 +00:00
0006dd5c43 [test][torchbind] don't allow set torchbind attr at runtime (#158608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158608
Approved by: https://github.com/zou3519
ghstack dependencies: #158583, #158606, #158607
2025-07-25 20:55:41 +00:00
0f31e9a656 [torchbind] fix fakifying a staitc tensor returns dynamic accidentally (#158607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158607
Approved by: https://github.com/zou3519
ghstack dependencies: #158583, #158606
2025-07-25 20:55:41 +00:00
0427e439aa [test][torchbind] turn on inductor backend for compile torchbind tests (#158606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158606
Approved by: https://github.com/zou3519
ghstack dependencies: #158583
2025-07-25 20:55:41 +00:00
4aa69ae336 [torchbind] support register_autocast for torchbind custom op (#158583)
Fix https://github.com/pytorch/pytorch/issues/158414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158583
Approved by: https://github.com/zou3519
2025-07-25 20:55:41 +00:00
14c314b30d [nativert] make per-node benchmark work with memory planning (#159117)
Summary: this will use-after-free otherwise

Rollback Plan:

Differential Revision: D78934104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159117
Approved by: https://github.com/SherlockNoMad
2025-07-25 20:46:17 +00:00
0b01e11416 [ez][export] add sym_sum to verified ops (#159111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159111
Approved by: https://github.com/angelayi
2025-07-25 20:42:42 +00:00
806d9e3fe7 [Inductor][TMA] Split config-gated and pure compatibility logic for TMA template eligibility checks (#159123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159123
Approved by: https://github.com/drisspg
2025-07-25 20:35:49 +00:00
d90ce83027 add a util function _make_all_gather_out_tensor to reduce code duplication (#149912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149912
Approved by: https://github.com/H-Huang
2025-07-25 20:29:01 +00:00
dfcb07bdfa [Inductor] disable windows failed UTs temporary. (#159163)
Disable windows failed UTs temporary.
<img width="1238" height="107" alt="image" src="https://github.com/user-attachments/assets/c8a40408-a793-4016-99bb-19c1bb09860a" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159163
Approved by: https://github.com/desertfire
2025-07-25 20:25:36 +00:00
fa0355c18d Fix full_like decomposition to preserve strides (#158898)
Summary:
See original PR at: https://github.com/pytorch/pytorch/pull/144765, which landed internally but was reverted due to test failures. Addressing reviewer comments and trying again.

Rollback Plan:

Differential hack Revision: D78783627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158898
Approved by: https://github.com/eellison
2025-07-25 20:21:36 +00:00
28ee8be5bf [NativeRT] Apply Device placement once when loading the graph (#158996)
Summary:
Placement is leaked to too many classes!

In this diff, we consolidate all placement lookup into one place: Graph::ApplyDevicePlacement.

After applying placement, the in-memory graph, tensorMeta, weightMeta would already have the re-mapped device.
The subsequence weight loading, sample input loading, target device inference would look up the re-mapped device from graph's tensorMeta.

graph's tensorMeta becomes the only ground truth!

Test Plan:
Need to add some tests before landing.
This is a big change.

Rollback Plan:

Differential Revision: D78841818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158996
Approved by: https://github.com/henryoier
2025-07-25 20:11:35 +00:00
ed472257d1 [associative_scan] stop manually set example inputs in dynamo (#159065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159065
Approved by: https://github.com/zou3519
ghstack dependencies: #159063, #159064
2025-07-25 20:08:08 +00:00
57eea56a9a [scan] stop manually set example inputs in dynamo (#159064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159064
Approved by: https://github.com/zou3519
ghstack dependencies: #159063
2025-07-25 20:08:08 +00:00
dd681f7f59 [while_loop] stop manually setting example inputs in dynamo (#159063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159063
Approved by: https://github.com/zou3519
2025-07-25 20:08:08 +00:00
0d4d3e8a89 [TCPStore] Allow ping to be retried (#159165)
On client setup we retry connections with server:

f8fafdc7a6/torch/csrc/distributed/c10d/TCPStore.cpp (L313-L350)

I noticed `ping()` raises `TORCH_INTERNAL_ASSERT` AKA a runtime error rather than a `DistNetworkError`. So updating that so it can be retried as well.

We have seen this pop up internally:
- https://fb.workplace.com/groups/319878845696681/permalink/1478849733132914/
- https://fb.workplace.com/groups/319878845696681/permalink/1479368959747658/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159165
Approved by: https://github.com/d4l3k
2025-07-25 20:03:00 +00:00
ee4c5c7cd2 Add torchcheck for replication_pad3d_backward (#151986)
Fixes #142833

Add check on channel dimension, logic same to the CUDA implementation 78bbb468c6/aten/src/ATen/native/cuda/ReplicationPadding.cu (L347)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151986
Approved by: https://github.com/mikaylagawarecki
2025-07-25 19:48:51 +00:00
51cd6697cd Fix: Use memory_order_relaxed instead of memory_order_relaxed (#159105)
Addresses #159074 by using `memory_order_release` instead of `memory_order_relaxed` here:

9c10760662/c10/core/DeviceType.cpp (L161)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159105
Approved by: https://github.com/colesbury
2025-07-25 19:39:04 +00:00
ba949c54a7 [inductor] fix test_save_graph_repro on Windows. (#159148)
The issue is caused by Windows path separator work as escape character. Fixed by `normalize_path_separator` in torch front end codegen.

Error message:
<img width="855" height="542" alt="image" src="https://github.com/user-attachments/assets/ad08b521-05e6-4c93-9507-ad19c68ac7b5" />

Fixed:
<img width="855" height="312" alt="image" src="https://github.com/user-attachments/assets/4a0a142a-2dbe-4226-a4cb-8eacfab2c3fc" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159148
Approved by: https://github.com/desertfire
2025-07-25 19:11:08 +00:00
2a528e80ce Add more type hints for _inductor/ir.py (#159049)
Fixes #146167

Incremental step to add type hints for _inductor/ir.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159049
Approved by: https://github.com/Skylion007
2025-07-25 18:56:30 +00:00
56c45f863b Add aot_export_joint_with_descriptors and aot_compile_joint_with_descriptors (#158715)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158715
Approved by: https://github.com/fmassa, https://github.com/wconstab, https://github.com/xmfan
ghstack dependencies: #158624, #158708, #158734
2025-07-25 18:49:00 +00:00
d30f89b9b8 Add host protoc script back (#159157)
Following comment from https://github.com/pytorch/pytorch/pull/158475#issuecomment-3116518904

Also this is a fake issue as protoc is dead anyways: https://github.com/pytorch/pytorch/issues/159156

Also also, macos cross compilation is not something that is tested :/ But I guess we're ok with that given how niche it it...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159157
Approved by: https://github.com/janeyx99
2025-07-25 18:44:20 +00:00
3fb78501f0 Revert "enable compiled autograd on CPU windows (#158432)"
This reverts commit a369350065493109d1abfbb994695777ab11bcf4.

Reverted https://github.com/pytorch/pytorch/pull/158432 on behalf of https://github.com/atalman due to Broke audio cuda windows builds see: https://github.com/pytorch/audio/issues/3992 ([comment](https://github.com/pytorch/pytorch/pull/158432#issuecomment-3119912177))
2025-07-25 18:29:16 +00:00
8a0508335f [export] Fix public bindings (#159109)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159109
Approved by: https://github.com/jbschlosser
2025-07-25 18:18:52 +00:00
4c0d5ad4be Fix docstring for clip_grads_with_norm_ to reflect clamping behavior (#158200)
Fix docstring for clip_grads_with_norm_ to reflect clamping behavior
This PR updates the docstring for torch.nn.utils.clip_grads_with_norm_ to accurately reflect the implementation behavior. The current documentation suggests that gradients are always scaled by:

grad = grad * (max_norm / (total_norm + eps))

However, the actual implementation clamps the scale coefficient to a maximum of 1.0, ensuring gradients are only scaled down, not up. This PR corrects the formula and adds a clarifying note to avoid confusion for users.

Updated the formula in the docstring to:

grad = grad * min(max_norm / (total_norm + eps), 1.0)

Added a note explaining the rationale for clamping (to prevent gradient amplification).
Ensured consistency with the behavior of clip_grad_norm_.

Fixes #151554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158200
Approved by: https://github.com/mikaylagawarecki
2025-07-25 18:07:41 +00:00
316c188a5e Remove torch.functional entries from the doc ignore list (#158581)
Options to address the "undocumented python objects":
1. Reference the functions in the .rst via the `torch.functional` namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. Document these functions by referencing them from the `torch.` namespace instead, in line with common usage. This would also require setting the `__module__` for these functions and moving entries from `torch.functional`'s `__all__` -> `torch`'s `__all__`, which is BC-breaking.
3. Update the .rst files to also document the `torch.functional` forms of these functions, duplicating docs.

This PR takes option (3) above and:
* Removes all 20 `torch.functional` entries from the doc ignore list
* Removes `torch.functional.align_tensors()` entirely, since we don't want to document it.
    * This is technically BC-breaking, although the previous impl simply errored out. This change could be moved to a separate isolated PR for safety.
* Introduces `torch.aliases.md` as a hidden page for the `torch.functional` aliases to the `torch` analogue functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158581
Approved by: https://github.com/janeyx99
2025-07-25 17:19:01 +00:00
191eca0bf0 Use simple_wraps instead of functools.wraps in AOTAutograd (#158734)
Wrapping is load bearing for things that introspect argument signatures,
but use of functools.wraps to do this is undesirable as this overrides
the name/module of the wrapping function, which is bad for tracking down
exactly what code is actually being run at runtime.  simple_wraps is
like wraps but it doesn't override the name information, so you still
get an appropriate printout.  To see the stack of all functions wrapping
each other, there is now a helper fn_stack.

I also make some assertions tighter in the descriptor PR.  These didn't
catch any bugs but I figure might as well.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158734
Approved by: https://github.com/wconstab
ghstack dependencies: #158624, #158708
2025-07-25 17:08:54 +00:00
74f64d3c84 Add inputs and outputs in Triton Kernel FX Graph segment (#158174)
Summary: Add inputs and outputs in Triton Kernel FX Graph segment

The FX graph segment in Triton kernel does not include the input tensors and return tensors, for example
Python code:
```
  @torchdynamo.optimize("inductor")
  def fn(a, b, c):
      x = torch.nn.functional.linear(a, b)
      x = x.sin()
      x = x.t() + c * 2
      return x
```

```
# %sin : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {})
# %permute_1 : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {})
# %mul : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 2), kwargs = {})
# %add : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {})

```
The fix is to add the input and output tensors into FX graph segment

```
# %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm]
# %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1]
# %sin : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {})
# %permute_1 : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {})
# %mul : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 2), kwargs = {})
# %add : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {})
# return %add
```

Differential Revision: D78131358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158174
Approved by: https://github.com/jansel
2025-07-25 17:01:17 +00:00
f8fafdc7a6 Revert "[BE] remove torch deploy - conditionals (#158288)"
This reverts commit ab26d4fbeb5bc4b4e6ef1c37fbec9fab6e5a9edd.

Reverted https://github.com/pytorch/pytorch/pull/158288 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks.  @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))
2025-07-25 16:09:39 +00:00
c8316d0e79 Revert "[BE] Remove torch deploy | remove torch deploy specific files (#158290)"
This reverts commit 6ed2cb6ccd00e64f67fd414d42dff54393140c8f.

Reverted https://github.com/pytorch/pytorch/pull/158290 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks.  @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))
2025-07-25 16:09:39 +00:00
a9f6770edd Revert "[BE] Remove __reduce_deploy__ (#158291)"
This reverts commit 9c68c4d08f4c4da49f0086b80e382f0cdd518f60.

Reverted https://github.com/pytorch/pytorch/pull/158291 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks.  @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))
2025-07-25 16:09:39 +00:00
5620e617c9 Revert "[BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407)"
This reverts commit 255c0545e7eac2ec6d00a41a3fc9d6d8201f8f39.

Reverted https://github.com/pytorch/pytorch/pull/158407 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks.  @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))
2025-07-25 16:09:39 +00:00
ee84ba42ea [Experiment] Run PT2 benchmark twice a day (#159162)
Running every 4 hours seems too many, lower it to twice a day.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159162
Approved by: https://github.com/desertfire, https://github.com/eellison
2025-07-25 15:58:29 +00:00
561193e5f2 [CI][testing] Use 3 processes for testing on sm89 and sm90 jobs (#158691)
3 procs were used for sm86, but we switched to sm89 and the check failed so it switched back to 2

sm90 is H100, but idk what unittests we have running there, but I assume they also have a lot of memory

They use larger runners, which have more GPU memory, so its usually ok.  I think it's ~22GB -> 10GB per proc if 2, 6GB per proc if 3 (cuda context maybe 1GB)

I've applied skips to the ones that OOMed

Time decreases from ~2.7hr per test job -> ~2hr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691
Approved by: https://github.com/huydhn
2025-07-25 15:26:29 +00:00
9535995bbc Revert "Remove tensorexpr tests (#158928)"
This reverts commit a0bc865123dba047aa1507e281bf2462780cf271.

Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/clee2000 due to broke cpp static runtime test? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16517697273/job/46715871457) [HUD commit link](a0bc865123) ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3118554478))
2025-07-25 15:22:51 +00:00
6fcb2b4413 [dynamo] unimplemented -> unimplemented_v2 for user_defined.py (#156652)
For https://github.com/pytorch/pytorch/issues/147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156652
Approved by: https://github.com/zou3519

Co-authored-by: Sidharth <ssubbarao8@meta.com>
2025-07-25 15:04:17 +00:00
204eb4da5e Add expanded_def option for FX printing, render descriptor, update tests (#158708)
----

- First, we add a new expanded_def to FX, which will expand the
  definitions of variables into multiple lines, one per variable
  definition.  This makes extremely long args/return lists much
  more readable.

- Next, we extend this mechanism to also print out descriptors on
  placeholders and return values, as comments, if available.  This
  is how we will test descriptors.

- We update tlparse for AOTAutograd to use this format.

- We update expect tests to use this format and update their formats,
  so you can inspect what it can look at.  There may be other tests
  I should update, open to suggestions.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158708
Approved by: https://github.com/wconstab
ghstack dependencies: #158624
2025-07-25 13:22:32 +00:00
bf311141d6 Track descriptors for all inputs/outputs of AOTAutograd traced graph (#158624)
One of the recurring challenges of working with FX graphs produced by
AOTAutograd is that there is a very intricate input/output calling
convention that is essentially impossible to understand without actually
reverse engineering the AOTAutograd code.  It is so bad that there
is a bit of logic for stashing indices of relevant arguments/outputs
in TracingContext so Inductor can figure out what the correct arguments
are.

This PR introduces the necessary scaffolding to keep track of
"descriptors" of every input/output to a (joint) FX graph produced
by AOTAutograd.  First read through descriptors.py to get a sense for
what is available: for inputs, you can figure out if you have
a plain input, tangent, parameter, or something more exotic like
one of the fields of a subclass or view base.  For outputs, you can
determine if you have a plain output or grad, or something more exotic
like the contents of a mutated input or an intermediate base of several
views that were returned.

There are two distinct parts of this patch: AOTInput tracking, and
AOTOutput tracking.

**AOTInput tracking.**  The way this works is that AOTAutograd starts of
with some Tensor `flat_args` that are the inputs to the graph being
traced, and then updates these arguments as it modifies the input
calling convention.  Anywhere these `args` are passed around, we now add a
news argument `args_descs` which is updated in synchrony with args.  Add
a new arg?  Add a new AOTInput to `args_descs`.

**AOTOutput tracking.**  Originally, I wanted to also add an `outs_descs`
analogous to `args_descs` tracking output metadata.  However, it is
often difficult to compute what the output will be until you're actually
tracing the function for real (and are able to peek at the real
outputs).  So we only compute `outs_desc` when we actually trace.  To do
this, we change the calling convention of the function we trace to
return not just outputs, but a tuple of `outs` and `outs_descs`.  Before
we bottom out at the `make_fx` invocation, we save `outs_descs` to a
nonlocal and bottom out.

To actually make use of this information in a useful way, see the next PR. Potentially the two PRs could be combined together but I think it's actually clearer for them to be separate.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158624
Approved by: https://github.com/xmfan
2025-07-25 13:22:32 +00:00
92e93bb580 [inductor][cpu] Stop lowering div to reciprocal multiplication to preserve precision when the divisor is a scalar and device is on cpu (#158231)
## Fixes https://github.com/pytorch/pytorch/issues/157959
## mini repro from issue
```c++
import torch
from torch import nn

class Foo(nn.Module):

    def __init__(
        self,
        use_parameter: bool
    ) -> None:
        super().__init__()
        self.b = 101
        if use_parameter:
            self.b = nn.Parameter(torch.Tensor([self.b]), requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # return x + self.b
        # return x - self.b
        return x / self.b
        # return x * self.b

torch.manual_seed(42)
x = torch.rand((5, 5))
expected = Foo(False)(x)

models = [
    Foo(False),
    Foo(True),
    torch.compile(Foo(False), fullgraph=True),
    torch.compile(Foo(True), fullgraph=True),
]

for m in models:
    print((m(x) - expected).sum())
```

all outputs equal zero except the result of  torch.compile(Foo(False), fullgraph=True)

## summary:
when divisor is a scalar, inductor will lower div to mul the scalar's reciprocal.
this could lead precision lost in c++ kernel. but not in triton kernel
## why:
Generated C++ kernel; thanks to @xmfan for supplying the code.
```c++
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(25L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(16L)))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = static_cast<float>(0.009900990099009901);
                    auto tmp2 = at::vec::Vectorized<float>(tmp1);
                    auto tmp3 = tmp0 * tmp2;
                    tmp3.store(out_ptr0 + static_cast<int64_t>(x0));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(16L) && x0 < static_cast<int64_t>(25L)))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                    auto tmp1 = static_cast<float>(0.009900990099009901);
                    auto tmp2 = at::vec::Vectorized<float>(tmp1);
                    auto tmp3 = tmp0 * tmp2;
                    tmp3.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                }
            }
        }
    }
}
```
The float type in C typically has 6 to 7 significant digits, while the double type has 15 to 16 significant digits.
```c++
#include <iostream>
#include <iomanip>

int main() {
 auto tmp1 = static_cast<float>(0.009900990099009901);
 auto tmp2 = static_cast<double>(0.009900990099009901);
 std::cout << std::setprecision(20) << "tmp1 = " << tmp1 << std::endl;
 std::cout << std::setprecision(20) << "tmp2 = " << tmp2 << std::endl;
    return 0;
}
```
the ouput is

```bash
tmp1 = 0.0099009899422526359558
tmp2 = 0.0099009900990099011103
```
 `auto tmp1 = static_cast<float>(0.009900990099009901);` This will cause tmp1 to become 0.0099009, resulting in a loss of precision, so the final result will not match the expected value.
I also found that the bug occurred at that position
86d8af6a6c/torch/_inductor/lowering.py (L6238)

The commit states that the precision lost is expected in cuda implementation.
original commit
03439d4c1c
cuda implementation
0636c11811/aten/src/ATen/native/cuda/BinaryDivTrueKernel.cu (L36-L38)

What is interesting is that the Triton kernel works correctly due to the precision of float type in python.
```python
def triton_poi_fused_div_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 25
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = 0.009900990099009901
    tmp2 = tmp0 * tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158231
Approved by: https://github.com/eellison
2025-07-25 08:57:17 +00:00
cyy
a0bc865123 Remove tensorexpr tests (#158928)
The tests are not maintained.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD
2025-07-25 08:37:51 +00:00
aaa384b2d4 move view_meta to fake impl (#158406)
Python dispatcher is not always enabled in fake tensors and have to be called explicitly.
While it should be, it requires some work to get all tests working.

 I have been running in several issues where I add to add enable_python_dispatcher ex
  XLA, Helom ..etc to avoid issues related to that for the view specifically i moved it to fake tensor impl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158406
Approved by: https://github.com/bobrenjc93
2025-07-25 08:21:27 +00:00
0fd5f1c294 [ROCm][CI] upgrade wheels to 6.4.2 patch release (#158886)
Upgrade wheels to ROCm 6.4.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158886
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-25 08:11:41 +00:00
e38a2b3d0f [inductor] add missing ignore_errors parameter for Windows. (#159025)
The origin code comemnts:
```python
# Let's not fail if we can't clean up the temp dir. Also note that for
# Windows, we can't delete the loaded modules because the module binaries
# are open.
```
But we are missing the `ignore_errors` parameter for Windows. I help to add it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159025
Approved by: https://github.com/jansel
2025-07-25 07:58:22 +00:00
ae183d6092 Aten vector default constructors set to 0, add fnmadd and fnmsub (#158508)
cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 jerryzh168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158508
Approved by: https://github.com/swolchok
2025-07-25 06:55:37 +00:00
659f8fb115 [dynamo][guards] Add some relational guard helpers (#159077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159077
Approved by: https://github.com/jansel
ghstack dependencies: #158995
2025-07-25 06:28:10 +00:00
05a748d287 [dynamo][guards] Expand is_immutable_object to have None (#158995)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158995
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
2025-07-25 06:12:05 +00:00
02ca965560 Device agnostic for DCP (#158337)
Enable device-agnostic implementation of DCP-related functionality, allowing the new DCP features to be supported on XPU as well.
use_cuda_non_blocking_copy to use_non_blocking_copy because non-blocking copy is supported by most GPUs and is not exclusive to CUDA devices.

Test plan: test cases have not yet been updated to be fully device agnostic; this will be addressed in future work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158337
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/Saiteja64

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-25 05:24:09 +00:00
511d987378 only call re-plan if historic max's were updated. (#159016)
Summary: wasteful. only update the plan if a new maximum has been found.

Test Plan:
ci

Rollback Plan:

Reviewed By: SherlockNoMad

Differential Revision: D78859344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159016
Approved by: https://github.com/SherlockNoMad
2025-07-25 05:07:30 +00:00
9685fc36d4 Add missing optional for tensor ops (#159028)
## Test Result

<img width="872" height="340" alt="image" src="https://github.com/user-attachments/assets/20c3f1a2-0160-4ea3-b9f3-14630b4ec06d" />
<img width="906" height="429" alt="image" src="https://github.com/user-attachments/assets/68f8d8da-0570-4ae8-8e45-573b2c64cae5" />
<img width="906" height="429" alt="image" src="https://github.com/user-attachments/assets/42d133f6-94eb-4a38-8b4b-5586f52bff88" />
<img width="878" height="285" alt="image" src="https://github.com/user-attachments/assets/d3ad8950-81fa-4c4c-a5b5-621b0d9df99b" />

<img width="889" height="430" alt="image" src="https://github.com/user-attachments/assets/9aabeaff-bb8f-4990-b253-1bb053e72aca" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159028
Approved by: https://github.com/Skylion007
2025-07-25 04:36:55 +00:00
9e5cfd3ee5 [audio hash update] update the pinned audio hash (#159108)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159108
Approved by: https://github.com/pytorchbot
2025-07-25 04:35:21 +00:00
cdf8e9ec1a [MPS] Add support for unsigned types (#159094)
As both Metal and MPS support uint16/uint32 and uint64

Test plan: `python3 -c "import torch;print(torch.randint(55, 66, (16,), device='mps', dtype=torch.uint16)[10:])"`

Fixes https://github.com/pytorch/pytorch/issues/159076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159094
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-07-25 04:31:42 +00:00
bcf34d24eb [vllm hash update] update the pinned vllm hash (#159107)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159107
Approved by: https://github.com/pytorchbot
2025-07-25 04:03:39 +00:00
9b29166f57 [ROCm] add flag torch.backends.miopen.immediate (#158951)
The MIOpen integration has changed over the years.  In the past, the MIOpen default for benchmark was True and if it were set to False it would use MIOpen Immediate Mode.  But with #145294 the MIOpen benchmark default changed to False and to activate immediate mode you would set the deterministic flag to True.  This has proved too restrictive because benchmark and deterministic flags are independent from immediate mode.  Thus, immediate mode needs its own flag.  Though MIOpen still masquerades behind torch.backends.cudnn and its flags, it seemed inappropriate to add an miopen-exclusive flag to the set of cudnn flags.  This PR adds the first miopen-only flag to control its immediate mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158951
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-25 04:01:51 +00:00
1fced0c7d5 [ROCm] enable hipblaslt on gfx908 for ROCm >= 6.3 (#159092)
Fixes #159030.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159092
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-25 03:54:30 +00:00
16c0ccd669 [ROCm][CI] upgrade to 6.4.2 patch release (#158887)
Upgrade to ROCm 6.4.2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158887
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-25 03:45:44 +00:00
f5e2de928b [BE] fix remaining flake8 v7 warnings (#159044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159044
Approved by: https://github.com/Skylion007
ghstack dependencies: #159043
2025-07-25 02:56:34 +00:00
f903bc475c [BE] add noqa for flake8 rule B036: found except BaseException without re-raising (#159043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159043
Approved by: https://github.com/Skylion007
2025-07-25 02:56:34 +00:00
4261e26a8b [OpenReg] move fallback tests into test_openreg.py (#158441)
----

- move fallback tests into test_operneg
- remove the test_cpp_extensions_open_device_registration.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158441
Approved by: https://github.com/albanD
ghstack dependencies: #158415, #158440
2025-07-25 02:39:41 +00:00
b635359e4c [OpenReg] add pyproject.toml for openreg (#158440)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158440
Approved by: https://github.com/albanD
ghstack dependencies: #158415
2025-07-25 02:39:41 +00:00
f1a1aa9490 [OpenReg] Improve README.md and optimize some codes for OpenReg (#158415)
----

- add description for DSO dependencies
- remove unnecessary code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158415
Approved by: https://github.com/albanD
2025-07-25 02:39:41 +00:00
6fc0ad22f0 Using the latest torch.library.register_fake API instead of torch.library.impl_abstract (#158839)
As the title stated.

`torch.library.impl_abstract` have beed deprecated in PyTorch2.4, so change to use the new API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158839
Approved by: https://github.com/jingsh, https://github.com/zou3519
ghstack dependencies: #158838
2025-07-25 02:37:30 +00:00
c60d382870 Add tests for torch.ops.load_library (#158838)
According to this [comment](https://github.com/pytorch/pytorch/pull/157524#issuecomment-3097899129), adding a related test to keep BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158838
Approved by: https://github.com/zou3519
2025-07-25 02:37:30 +00:00
64cb349b81 Extract a method that filters frames in the captured stack trace (#158266)
Summary: The subclass can override the filtering logic to customize which frames to keep or drop.

Test Plan:
```
buck run caffe2/test:test_export -- -r  test_stack_trace
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:others -- -r test_constant_random
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r test_custom_obj_list_out
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx  -- -r class_member_back_compat
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158266
Approved by: https://github.com/ezyang, https://github.com/yushangdi
2025-07-25 02:22:03 +00:00
a53db90e21 Revert "[inductor] consolidate common GEMM triton param retrieval (#158015)"
This reverts commit 9faef3d17c2e422d5d62f62b266155e2deb52c40.

Reverted https://github.com/pytorch/pytorch/pull/158015 on behalf of https://github.com/henrylhtsang due to breaking tests ([comment](https://github.com/pytorch/pytorch/pull/158015#issuecomment-3115384824))
2025-07-25 00:16:50 +00:00
9c10760662 [SymmMem] Use host/nvshmem_api.h for backward compat (#159061)
Resolves #159045

`nvshmem_host.h` was introduced in 3.3.9.
Use `host/nvshmem_api.h` and `host/nvshmemx_api.h` for prior versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159061
Approved by: https://github.com/ngimel, https://github.com/fduwjj, https://github.com/fegin
2025-07-24 22:56:26 +00:00
8d2a1d6e18 Revert "Graph break with error message (#158800)"
This reverts commit cae4746952afbb6d26ecf7599cb7c6c449c69ef4.

Reverted https://github.com/pytorch/pytorch/pull/158800 on behalf of https://github.com/clee2000 due to broke some tests on main inductor/test_distributed_patterns.py::DistributedPatternTests::test_nn_param_return4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/16507837934/job/46685704688) [HUD commit link](cae4746952), note to self: bad TD, but also dynamo/test_repros failed but didn't get skipped by TD so maybe a landrace, or I just blaming the wrong commit entirely.. ([comment](https://github.com/pytorch/pytorch/pull/158800#issuecomment-3115224608))
2025-07-24 22:45:58 +00:00
751285cb22 Revert "Move some of vec into headeronly in preparation for Half.h (#158976)"
This reverts commit 5564f2ca2e0836d75c4ee45899b1b981582c3e2d.

Reverted https://github.com/pytorch/pytorch/pull/158976 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D78924504 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158976#issuecomment-3115198443))
2025-07-24 22:31:49 +00:00
efc810c7d0 [Bugfix] Fix circular import between export and dynamo from tensor fn map (#158931)
Fixes #158120

The issue was caused by populating a builtin tensor fn map at import time; if torch.export.export was called before any dynamo imports with the `meta` device, this map would not be populated, and so would populate on import time which would try to call `torch.disable`, which would not yet be initialized

Fix is to populate this map lazily

```
python test/dynamo/imports_non_circular_repro.py TestImports.test_circular_import_with_export_meta
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158931
Approved by: https://github.com/StrongerXi, https://github.com/mlazos, https://github.com/anijain2305
2025-07-24 22:24:57 +00:00
abb0bf45df [AOTI] skip crashed case on Windows temporary. (#158929)
skip crashed case on Windows temporary.

This case will crashed application:
<img width="1053" height="275" alt="image" src="https://github.com/user-attachments/assets/3225e9c8-cbe7-4998-86da-f20fbb12ead2" />

Quick analysis:
<img width="1400" height="261" alt="image" src="https://github.com/user-attachments/assets/9c21fefc-9ed8-40f2-84c5-edde2004777c" />

1. It is crashed on OpenMP.
2. stack is dameged, need consider how to debug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158929
Approved by: https://github.com/desertfire
2025-07-24 22:08:19 +00:00
b533f12120 Revert "[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446)"
This reverts commit da94023b0205bf98c3da366f2f86e0a443f4db17.

Reverted https://github.com/pytorch/pytorch/pull/155446 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @sraikund16 can you please help validate the fix? (See D78845227 for details). You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/155446#issuecomment-3115072504))
2025-07-24 21:46:00 +00:00
e20736bf1d Dont't GC as often when collecting cudagraphs (#158193)
TL;DR: Cuts vLLM cudagraph collection from 80s -> 24s

Stop garbage collecting by default on every cudagraph recording. The old behavior can be re-enabled by setting `TORCH_CUDAGRAPH_GC=1` or the config `force_cudagraph_gc`.

We were previously garbage collecting at the beginning of each cudagraph
capture. vLLM collects 5427 graphs and most of those garbage collections weren't
actually collecting any memory (CPU or GPU). This changes it to not collect more
than every 10s so if we're capturing in a loop we don't burn all our cycles
looking for garbage.

(These number have a lot of variance from run to run but give the correct
general scale)
```
       | calls | total | synchronize |  gcs | collect | empty cache | sys freed | cuda freed |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
before |  5427 |   78s |       1.48s | 5427 |  53.22s |       1.21s |    145855 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
after  |  5427 |   24s |          0s |    3 |   1.53s |       0.84s |       592 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
```
total - this is the total time reported by vLLM's "Graph capturing finished" log.
The rest of these are measured in torch.cuda.graphs.graph.__enter__():
  calls - number of times torch.cuda.graphs.graph.__enter__ was called
  synchronize - this is the duration taken by the cuda.synchronize call
  gcs - number of times gc.collect was called
  collect - this is the duration taken by the gc.collect call
  empty cache - this is the duration taken by the torch.cuda.empty_cache call
  sys freed - the number of bytes reported freed by gc.collect
  cuda freed - the number of bytes reported freed by torch.cuda.memory_reserved

So it seems like the heavy lifting is done by torch.cuda.empty_cache() which is
fairly quick.

Cudagraph results from the TorchInductor Performance DashBoard (this is from the original version using the GC clock so the real results will be slightly better than this):
<img width="1494" height="382" alt="image" src="https://github.com/user-attachments/assets/69b705ef-47ce-4b6e-9733-1ec941cad93d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158193
Approved by: https://github.com/ngimel
2025-07-24 21:37:11 +00:00
cae4746952 Graph break with error message (#158800)
Fixes #157452

Test with
```
python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks
```

### Release Notes

Change to nn.Parameter Constructor Behavior in Dynamo

Semantic change introduced in the nn.Parameter constructor; previously, if the constructor lacked a clean source, the system would attempt to infer arguments to construct a clone and lift this synthetic proxy in the computation graph. This approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source. Users are now suggested to manually move the constructor out of the graph in such cases. This change improves clarity and reduces complexity in graph construction and debugging.  Users can escape hatch to old semantics with `torch.dynamo.config.graph_break_on_nn_param_ctor=False` if this cannot be done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800
Approved by: https://github.com/anijain2305
2025-07-24 21:05:17 +00:00
4a13d4d7d0 [ROCm] Update jit_utils.cpp for compatibility with ROCm7.0 (#158868)
Resolves error when running tests such as `test_nn.py::TestNN::test_L1Loss_no_reduce_complex_cuda` etc. on ROCm7.0:

```
/tmp/comgr-4cd8ad/input/CompileSourceU53Ndb:1016:7: error: no template named 'is_floating_point'; did you mean '__hip_internal::is_floating_point'?
 1016 |       is_floating_point<_Tp>::value,
      |       ^~~~~~~~~~~~~~~~~
      |       __hip_internal::is_floating_point
/tmp/comgr-4cd8ad/include/hiprtc_runtime.h:1481:31: note: '__hip_internal::is_floating_point' declared here
 1481 | template<typename _Tp> struct is_floating_point : public false_type {};
      |                               ^
/tmp/comgr-4cd8ad/input/CompileSourceU53Ndb:1017:16: error: too few template arguments for class template '__libcpp_complex_overload_traits'
 1017 |       typename __libcpp_complex_overload_traits<_Tp>::_ComplexType
      |                ^
/tmp/comgr-4cd8ad/input/CompileSourceU53Ndb:850:10: note: template is declared here
  847 |   template <class _Tp, bool = is_integral<_Tp>::value,
      |   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  848 |                        bool = is_floating_point<_Tp>::value
      |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  849 |                        >
      |                        ~
  850 |   struct __libcpp_complex_overload_traits {};
      |          ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated when compiling for gfx90a.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158868
Approved by: https://github.com/jeffdaily
2025-07-24 21:00:37 +00:00
da35562bba [ONNX] Filter out torchscript sentences (#158850)
Fixes #157300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158850
Approved by: https://github.com/justinchuby, https://github.com/svekars
2025-07-24 20:59:06 +00:00
de85ee73ae Update context in unimplemented_v2 when exception bubbles up to the interpreter (#158924)
Before:
```
.Observed exception
  Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
  Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
  Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.

  Developer debug context:
```

After:
```
Observed exception
  Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
  Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
  Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.

  Developer debug context: raised exception TypeError([ConstantVariable(str: "unhashable type: <class 'torch._dynamo.variables.dicts.SetVariable'>")])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158924
Approved by: https://github.com/williamwen42, https://github.com/zou3519
2025-07-24 20:50:22 +00:00
eqy
8573a2beda [CUDA] Fix missing __syncthreads in MultiMarginLoss backward (#158994)
Turns out issue in #158921 is detectable with a simple unit test and adding the missing sync fixes it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158994
Approved by: https://github.com/malfet, https://github.com/Skylion007

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-07-24 20:47:29 +00:00
13398dab79 Revert "Remove tensorexpr tests (#158928)"
This reverts commit a3f9f79f591102afa93145bb67dc7e34df44f9a4.

Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/clee2000 due to Theres still some references to the things removed in this PR in test.sh, the jobs on this PR are failing because of that but log classifier is probably pointing to a wrong line, should be an easy fix tho ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3114873706))
2025-07-24 20:45:30 +00:00
5564f2ca2e Move some of vec into headeronly in preparation for Half.h (#158976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158976
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-24 20:32:33 +00:00
f3edcac23a [dynamo] Added back weblink generation (#159011)
Added back weblink generation for v2.9 development

Note: It is fine to bring the weblink generation back since v2.9 isn't released for a while

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159011
Approved by: https://github.com/williamwen42
2025-07-24 20:27:11 +00:00
90c241dedd [precompile] Support user defined function calls from bytecode. (#158947)
Previously precompile was implemented under the assumption that dynamo always inlines the user code and generate resume functions when a graph break is hit. In cases like nanogpt training, there exists nontrivial amount of code causing dynamo to fail the speculation and stop inlining certain type of user function. This results in more code objects to be tracked by CompilePackage.

Since these new code objects are user defined, we need to also serialize the location of these code so that we can load the precompile entries to the these code objects in another process.

With this fix, we are able to run nanogpt inference+training with precompile under torchbench.

Differential Revision: [D78691422](https://our.internmc.facebook.com/intern/diff/D78691422/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158947
Approved by: https://github.com/jamesjwu
2025-07-24 20:10:57 +00:00
5ab0eb28f7 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-24 20:10:51 +00:00
0b2ef76e85 DDE-Free select with unbacked index. (#157605)
When select has data dependent input, we cant tell if the actual index shall be index+size or index.
to avoid throwing dde, we allocate a new unbacked symbol to represent the storage offset of the
output view and we compute its value dynamically at runtime when inductor is lowered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
2025-07-24 20:08:05 +00:00
9faef3d17c [inductor] consolidate common GEMM triton param retrieval (#158015)
\# Why

- Make loop iteration simpler
- Have a common spot where to make modifications that affect
  all the GEMM Triton templates, avoiding missed spots

\# What

- pull out commong logic of taking the BaseConfig objects
  and turning them into kwargs to feed into maybe_append_choice
  for Triton GEMM templates

Differential Revision: [D78081314](https://our.internmc.facebook.com/intern/diff/D78081314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158015
Approved by: https://github.com/PaulZhang12, https://github.com/jansel
2025-07-24 19:17:48 +00:00
aeaa20083f [profiler] update CUDA runtime kernel identification logic (#157890)
Update CUDA kernel detection to exclude memory API calls

References:
- https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html
- https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157890
Approved by: https://github.com/sraikund16
2025-07-24 19:14:08 +00:00
5be7e187ba Support sort and scatter_add strategy (#159022)
Add `sort`, `scatter_add` strategy.  I am reusing the strategy for `scatter` related ops for a quick support. The strategy can be potential improved after we fix index related strategies.

Minor fix: fix `replicate_op_strategy` to support output multiple tensors, which is required by aten.sort.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159022
Approved by: https://github.com/XilunWu, https://github.com/wconstab
2025-07-24 18:33:18 +00:00
347a97da66 [MPS] Enable dlpack integration (#158888)
Though testing is a lie and dependent on https://github.com/pytorch/pytorch/pull/153835

Fixes https://github.com/pytorch/pytorch/issues/153789
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158888
Approved by: https://github.com/albanD
ghstack dependencies: #158874
2025-07-24 18:05:41 +00:00
78aa3bd6b6 Added Emscripten __assert_fail declaration to Macros.h (#158580)
Summary: __assert_fail is declared slightly differently in the Emscripten stdlib. This may cause errors when compiling with Emscripten.

Test Plan:
N/A

Rollback Plan:

Differential Revision: D78500790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158580
Approved by: https://github.com/JacobSzwejbka
2025-07-24 17:10:29 +00:00
ee97dbf2e7 [ROCm][CI] update HIP patch for 6.4.1, again (#159001)
Another fix for hipGraph capture of MIOpen OCL kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159001
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-24 16:36:19 +00:00
b7d41729e0 Add zerotensor design description in code (#158837)
Fix `TODO: add a note explaining the design decisions` by adding design description in https://github.com/pytorch/pytorch/issues/69687 to codebase. Make it easier to get and read by other developers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158837
Approved by: https://github.com/soulitzer
2025-07-24 16:35:42 +00:00
abcb24f4de [Dynamo][Better Engineering] Add typing annotations to guard and source (#158397)
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py`

Running
```
mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1227 | 2208 | 55.57% | 207 | 362 | 57.18% |
| This PR | 2217 | 2217 | 100.00% | 362 | 362 | 100.00% |
| Delta    | +990 | +9 | +44.43% | +155 | 0 | +42.82% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158397
Approved by: https://github.com/anijain2305
2025-07-24 15:55:18 +00:00
fd48681b6a [DeviceMesh][ez] Make the logic within flatten simpler (#158999)
While looking at the code of device mesh I find that this logic can be simplified. Also the naming needs to be correct. Because this mesh is not "flattened" yet, so we can just call it flatten.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158999
Approved by: https://github.com/wz337, https://github.com/wconstab
ghstack dependencies: #158900
2025-07-24 15:40:13 +00:00
cyy
a3f9f79f59 Remove tensorexpr tests (#158928)
The tests are not maintained.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD
2025-07-24 15:38:36 +00:00
2fc0b1605e [a2av] Make test input more random (#157029)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Use torch.randn to fill input buffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157029
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743, #156881, #157026
2025-07-24 15:35:12 +00:00
11ea3736dd Revert "[CI][testing] Use 3 processes for testing on sm89 and sm90 jobs (#158691)"
This reverts commit 0c0fcb53ff5ee1eb5f0d1f535ed3726d01f8abb5.

Reverted https://github.com/pytorch/pytorch/pull/158691 on behalf of https://github.com/ZainRizvi due to Sorry but these are causing jobs to fail with out of memory errors on trunk ([comment](https://github.com/pytorch/pytorch/pull/158691#issuecomment-3113922186))
2025-07-24 15:31:53 +00:00
43d4ff6851 [a2av] Test dispatch-then-combine (#157026)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Putting both the dispatch API and combine API in battlefield, one following the other, i.e.
```
all_to_all_vdev_2d(inp, out, inp_splits, out_splits_offsets, ...)

all_to_all_vdev_2d_offset(
    input=out,
    out=combine_out,
    in_splits_offsets=out_splits_offsets,
    out_splits_offsets=combine_out_splits_offsets
)
```
Here the `out_splits_offsets` from dispatch perfectly serves as the `in_splits_offsets` argument for combine.

Then we assert that the output of combine is exactly the same as the original input to shuffle, and combine's output splits are exactly the same as the original input splits.

It works!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157026
Approved by: https://github.com/Skylion007, https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743, #156881
2025-07-24 15:21:02 +00:00
83957d1c03 [a2av] Add token combine operator (#156881)
Added `all_to_all_vdev_2d_offset`, which:

Perform a 2D AllToAllv operation, with input split and offset
information provided on device. The input offsets need not to be
exact prefix sum of the input splits, i.e. paddings are allowed between the
splitted chunks. The paddings, however, will not be transferred to peer
ranks.

In Mixure of Experts models, this operation can be used to combine tokens
processed by experts on remote ranks. This operation can be viewed as an
"reverse" operation to the `all_to_all_vdev_2d` operation (which shuffles
tokens to experts).

The change may seem a bit dense, sorry.  But it is mainly two changes:
1. templating existing device functions (to use provided input offset or calculate it)
2. generalizing variable names, e.g. npes, ne --> minor_size, major_size,
so that I can use the same alltoall function for matrix of (nranks, ne) as well as matrix of (ne, nranks).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156881
Approved by: https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743
2025-07-24 15:08:04 +00:00
48fe4ff247 [export] set enable_gqa in export flash->math decomp (#158604)
Differential Revision: D78524147

For `scaled_dot_product_attention(..., enable_gqa=True)`:
- the Math backend passes the flag through, performing the extra [KV broadcast](6e07d6a0ff/aten/src/ATen/native/transformers/attention.cpp (L902)) if set to True
- the Flash backend has no flag, and relies on correct indexing in the C++ kernel
- Export used to default to Math for `enable_gqa=True`, but https://github.com/pytorch/pytorch/pull/157893 landed and enabled Flash. At the same time, there's an export-only [decomp](6e07d6a0ff/torch/_decomp/decompositions.py (L4968)) redirecting flash -> math, calling with `enable_gqa` unset, because that info isn't available. This led to https://fb.workplace.com/groups/1028545332188949/posts/1264609398582540 crashing, calling the Math non-GQA variant, with GQA inputs.

This assumes GQA for seqlen mismatches in the export decomp, setting `enable_gqa = <q seqlen> != <kv seqlen>`, relying on prior backend checks to raise on invalid input shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158604
Approved by: https://github.com/angelayi, https://github.com/drisspg
2025-07-24 14:46:13 +00:00
f55c5d085e [Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847)
This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.

The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PRecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming.
- log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.

## Test Plan

After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance  --inference --backend inductor  --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Warm Start (~1 s):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847
Approved by: https://github.com/zhxchen17
2025-07-24 14:09:54 +00:00
a3025e17b2 Fix inductor non-stable argsort/sort test (#146622)
- Prevent the inductor test for argsort/sort from wrongly failing when the argsort/sort output with stable=False differs from pytorch but is still a valid argsort output.
- Add functionality to allow alternative assert_equal functions in inductor tests for future cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146622
Approved by: https://github.com/eellison

Co-authored-by: George Wigley <georgewi@graphcore.ai>
2025-07-24 14:02:12 +00:00
afd6eb0d49 [docker release] Remove build layer as not used (#158988)
[docker release] Remove build layer as not used in any of the : https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=Build%20Official

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158988
Approved by: https://github.com/oulgen, https://github.com/malfet
2025-07-24 12:22:55 +00:00
3ced1079a4 [inductor] Fix collectives_reordering overwrite real_dep with fake_dep with the same name (#158960)
Differential Revision: [D78839734](https://our.internmc.facebook.com/intern/diff/D78839734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158960
Approved by: https://github.com/wconstab
2025-07-24 11:08:58 +00:00
3e954d3943 better testing for subclasses + compile (#158742)
Fixes #114398

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158742
Approved by: https://github.com/ezyang
2025-07-24 10:28:44 +00:00
fb067de550 [NativeRT] Remove device_ member from OpKernel base class (#158944)
Summary:
In general, device_ is not very useful in OpKernel.  Remove it to avoid misuse.

Also, the meaning of `device_` is also ambiguous in the OpKernel.
For StaticDispatch kernels, we always call cpu kernel.
For C10Kernel, we rely on input tensor's device and dispatcher to determine which device to run on.
For ops involves multiple device, e.g. aten._to_copy(device), the meaning of device is ill-defined.

Test Plan:
CI

Rollback Plan:

Reviewed By: henryoier, dolpm, kqfu, zhxchen17

Differential Revision: D78704840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158944
Approved by: https://github.com/dolpm
2025-07-24 09:21:37 +00:00
693197eed6 [doc] remove FSDP1 developer note (#158991)
this resolve pytorch doc audit - we remove fsdp1 doc and promote fsdp2

https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158991
Approved by: https://github.com/svekars, https://github.com/mori360
ghstack dependencies: #158989
2025-07-24 08:21:54 +00:00
cyy
65c1109ca2 Remove CUDA 11 CMake code (#156795)
CUDA 11 is no longer supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156795
Approved by: https://github.com/atalman, https://github.com/malfet
2025-07-24 08:00:41 +00:00
70fb5bb6fb [CI] Add smoke test for NVSHMEM availability (#158938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158938
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-07-24 06:34:21 +00:00
30bb7636da removed zero dim cpu logic from fake_tensor.py (#147501)
Fixes #144748
In #144748, the inconsistency between the eager mode and the inductor mode is reported as a bug.
The root cause is fake_tenosr.py's find-common-device method, 0b0da81021/torch/_subclasses/fake_tensor.py (L833), takes zero dim cpu tensor into account but  the device check in adaption.h doesn't.

This fix is to add a list for some ops to bypass zero-dim-cpu-tensor check to align with the eager mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147501
Approved by: https://github.com/ezyang
2025-07-24 06:19:46 +00:00
68349118b5 [doc] add weifengpy to torch distributed pocs (#158989)
<img width="415" height="355" alt="Screenshot 2025-07-23 at 16 02 12" src="https://github.com/user-attachments/assets/35b6bb45-d5ed-4d74-8369-e8e66aaa2618" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158989
Approved by: https://github.com/mori360
2025-07-24 04:42:33 +00:00
e09d80c545 [vllm hash update] update the pinned vllm hash (#158997)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158997
Approved by: https://github.com/pytorchbot
2025-07-24 04:04:17 +00:00
07df6ba7f5 [BE] Remove unused test_python_gloo_with_tls (#158964)
This was last modified in 2021 and has not been invokved at least since 2.0 release
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158964
Approved by: https://github.com/Camyll, https://github.com/atalman
ghstack dependencies: #158961, #158962, #158963
2025-07-24 02:34:27 +00:00
d61153a300 Delete mobile merge rule (#158963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158963
Approved by: https://github.com/atalman
ghstack dependencies: #158961, #158962
2025-07-24 02:34:27 +00:00
da9e120e3f [BE] Remove unused build-android action (#158962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158962
Approved by: https://github.com/Camyll, https://github.com/atalman
ghstack dependencies: #158961
2025-07-24 02:34:27 +00:00
611b61e758 [BE] Remove android build rules (#158961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158961
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-07-24 02:34:27 +00:00
cyy
d352c28dd1 [2/N] Remove FindPackageHandleStandardArgs.cmake (#156559)
Following #157188, this PR removes FindPackageHandleStandardArgs.cmake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156559
Approved by: https://github.com/albanD
2025-07-24 02:34:10 +00:00
0c0fcb53ff [CI][testing] Use 3 processes for testing on sm89 and sm90 jobs (#158691)
3 procs were used for sm86, but we switched to sm89 and the check failed so it switched back to 2

sm90 is H100, but idk what unittests we have running there, but I assume they also have a lot of memory

They use larger runners, which have more GPU memory, so its usually ok.  I think it's ~22GB -> 10GB per proc if 2, 6GB per proc if 3 (cuda context maybe 1GB)

I've applied skips to the ones that OOMed

Time decreases from ~2.7hr per test job -> ~2hr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691
Approved by: https://github.com/huydhn
2025-07-24 01:51:28 +00:00
febf3c475e fix forced loglevel in pytorch oss code (#158820)
Differential Revision: [D78715806](https://our.internmc.facebook.com/intern/diff/D78715806/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158820
Approved by: https://github.com/Skylion007, https://github.com/pradeepfn
2025-07-24 00:40:28 +00:00
7001d6fbc9 Skip slow tests for aarch64-inductor-benchmarks (#158842)
This PR suggests adding some models to `cpu_skip_list` which are currently being run in TIMM and Torchbench.
The suggested models takes a long time which leads to the benchmark runs being `timeout`.  [benchmark runs for aarch64](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-aarch64.yml)

•	The issue stems from unoptimized groupwise convolution (BF16 /F16 dtype) kernels for aarch64 platforms  , which significantly slow down execution leading to the timeout.
**Action:**
•	An optimized BF16 groupwise convolution kernel is currently being developed in oneDNN, targeted for release in Q4 2025.

To maintain dashboard consistency and signal clarity, I’ve skipped the affected tests in:
      * timm benchmarks
      * torchbench benchmarks

 As suggested, skip is applied at the CPU - arch level, explicitly branching for aarch64 and adding models which needs to be skipped. This keeps the logic clean, but:
•	An alternative considered was increasing shard counts for aarch64 runners, but given the known performance bottleneck, skipping avoids wasted compute cycles. Suggestions around this will be appreciated.

Benchmark does not timeout after the suggested change: https://github.com/pytorch/pytorch/actions/runs/16447200138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158842
Approved by: https://github.com/malfet
2025-07-24 00:21:38 +00:00
0118931e27 [Inductor] Fix a user-defined Triton kernel bool param codegen issue (#158845)
Summary: Fixes https://github.com/pytorch/pytorch/issues/158778. When handling a boolean type parameter to a user-defined Triton kernel, we need to treat it differently from integer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158845
Approved by: https://github.com/davidberard98, https://github.com/eellison
2025-07-24 00:19:27 +00:00
ebb032a202 [docker release] Fix push nightly tag (#158984)
This is a typo.
I see that this step is not executing in nightly builds:
https://github.com/pytorch/pytorch/actions/runs/16464544564/job/46538759844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158984
Approved by: https://github.com/oulgen
2025-07-23 23:39:49 +00:00
60ac3414eb [a2av] Split in_out_splits into in_splits and out_splits_offsets (#156743)
So that it would be easier if user would like to feed `out_splits_offsets` as input to a combining a2av (coming next).
An example is in #157029.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156743
Approved by: https://github.com/ngimel
ghstack dependencies: #158234, #158235
2025-07-23 23:34:48 +00:00
d34cee4cf3 Revert "[Torch Native] Add test for packaging weight (#158750)"
This reverts commit 85ee2fb8c5c57b513526b0cc968ba13012167572.

Reverted https://github.com/pytorch/pytorch/pull/158750 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing on trunk: inductor/test_aot_inductor_package.py::TestAOTInductorPackageCpp_cuda::test_compile_with_exporter_weights [GH job link](https://github.com/pytorch/pytorch/actions/runs/16478978095/job/46590552109) [HUD commit link](85ee2fb8c5) ([comment](https://github.com/pytorch/pytorch/pull/158750#issuecomment-3111188266))
2025-07-23 23:24:55 +00:00
5cdb3d896e [FSDP][Replicate] added replicate function that uses FSDP instead of DDP (#158207)
**Summary**
Users would like to use Replicate with TP. Currently, the replicate function uses DDP, which has not been maintained resulting in a lack of integration options. Since users can use FSDP with TP, we will make the replicate function use FSDP so that users can use replicate with FSDP. To that end I have created a replicate function that uses FSDP instead of DDP. One blocker that I ran into is that the replicate function has a contract which assigns a module "replicate" attribute in registry. This would mean that fully_shards is_composable requirement would not be satisfied making it impossible to apply fully_shard to a replicate module. The solution to this was to copy the fully_shard function and state and modify it for replicate. In the future, it should be explored making the replicate_state inherit from FSDP_state to get rid of code duplicity. I have attached below the profile tracing of a replicated Net Module.

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/anshulsi_270fcc36-194a-42f5-9841-cace984c2132_devgpu263.prn2.facebook.com_1792146.1753232748025155780.pt.trace.json

**Test Case**
1.  pytest test/distributed/_composable/test_replicate_with_fsdp.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158207
Approved by: https://github.com/weifengpy

Co-authored-by: Anshul Sinha <50644008+sinhaanshul@users.noreply.github.com>
2025-07-23 22:53:06 +00:00
0204099762 Raise exception in Dynamo if op fails in the interpreter (#158661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158661
Approved by: https://github.com/williamwen42
ghstack dependencies: #158660
2025-07-23 22:31:51 +00:00
b67f97c166 Correctly handle OP_CONTAINS (#158660)
CPython can fallback to `__iter__` if object doesn't implement
`__contains__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158660
Approved by: https://github.com/zou3519
2025-07-23 22:31:51 +00:00
7f649ed4f8 Add basic torch.hash_tensor op (#154149)
Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor.

- The hash is always uint64.
- Integers will be casted to uint64 before performing the xor_sum reduction
- Floats will be upcasted to double and then bitcasted to uint64 before performing the xor_sum reduction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
2025-07-23 22:28:03 +00:00
86df3ff1f1 fix xnnpack build on mac (#158881)
Summary: Fix a bug for not getting the correct sources

Test Plan:
CI

on my mac:
```
buck2 build @//fbobjc/mode/profile --show-full-output //xplat/executorch/examples/portable/executor_runner:executor_runner_opt
File changed: fbsource//xplat/caffe2/third_party/xnnpack.buck.bzl
Buck UI: https://www.internalfb.com/buck2/67b59179-4de8-462a-9202-0b9c34a35aef
Network: Up: 2.4MiB  Down: 1.3KiB  (reSessionID-f687a7cd-5961-4851-bc67-b07043baa52a)
Loading targets.   Remaining     0/1                                                                                                          504 targets declared
Analyzing targets. Remaining     0/42                                                                                                         1960 actions, 2424 artifacts declared
Executing actions. Remaining     0/975                                                                                                        37.2s exec time total
Command: build.    Finished 40 local
Time elapsed: 7.7s
BUILD SUCCEEDED
fbsource//xplat/executorch/examples/portable/executor_runner:executor_runner_opt /Users/maxren/fbsource/buck-out/v2/gen/fbsource/267ffdee31edf15e/xplat/executorch/examples/portable/executor_runner/__executor_runner_opt__/executor_runner_opt
```

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D78771697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158881
Approved by: https://github.com/digantdesai
2025-07-23 22:06:27 +00:00
82f8e04f27 Update distributed maintainers (#158900)
I maintain couple components of distributed like devicemesh, c10d and PGNCCL, gloo, etc. Can I be marked not as emeritus? Thanks!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158900
Approved by: https://github.com/albanD
2025-07-23 21:53:27 +00:00
5619bf9971 Enable MI355X PyTorch CI testing. (#158889)
This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.

- Rework aotriton cmake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION` as aotriton depnds on hip. Hip loosely track the rocm major version, but the two are not actually synchronized as observed in the ROCm 7 alpha build.
- Bump composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](df6023e305) for mi350 compatibility.
- Extend the change docker permissions step to the MI355x runners as well. This step is included to apply the required permission change to the test folder for a successful upload of artifacts in k8s docker.
- Create new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST.
- Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: ca7d5fae11 (rocm-mi300)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158889
Approved by: https://github.com/jeffdaily
2025-07-23 21:50:31 +00:00
d8425e9c75 [1/N] support of replication fallback strategy (#158046)
#### 1. Provide a default fallback strategy that can apply to arbitrary operator with output in type of single tensor.

We can call register_op_strategy to register using the `fallback_op_strategy`:
- For op without List[Tensor] as input, call:
```
register_op_strategy(op_overload)(replicate_op_strategy)
```
- For op contains List[Tensor] as input, call:
```
register_op_strategy(op_overload, schema_info=RuntimeSchemaInfo(needs_pytree=True))(replicate_op_strategy)
```
The strategy will force all input and output to be replicated with the corresponding redistribute_cost.

#### 2. Add a test function as a necessary condition for strategy function.
```
detect_exists_identical_opspec(*args, op, mesh, strategy_function)
```
This function detects if identical strategies will be produced given the sample `args`. It will iterate all combinations of placements for each arg and produce the output strategy from the registered `strategy_function`.

#### 3. Provide a context manger `op_strategy_context` to easily register/unregister strategies for testing.
E.g.,
```
with op_strategy_context(test_op.default, replicate_op_strategy):
    ...
```
#### 4. Fix a bug that TupleStrategy never get flatten as expected:
9df0176408/torch/distributed/tensor/_op_schema.py (L286)
Basically we need to 1) register_pytree_node for TupleStrategy, 2) propagate the schema_info to `strategy_schema` after  `strategy_schema = _wrap_with_op_strategy(op_schema)`.

This is the first implementation. Plan to add support to enable sharding on the batch dim as the output strategy next.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158046
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2025-07-23 21:14:20 +00:00
633d5faf3f [DeviceMesh] Enable slicing a submesh with warnings (#158899)
We don't create new PGs when doing slicing in DeviceMesh so it is relatively safe to relax the requirement of one can only do slicing from root mesh. But this does come with caveat when it is asymmetric, for example, only some have the sliced out submesh, for example. So aside from removing the requirement we also add a warning here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158899
Approved by: https://github.com/wz337
2025-07-23 21:13:41 +00:00
4d5d56a30e [dynamo] lintrunner for gb_registry adds/updates (#158460)
This PR adds automation to adding/updating the JSON registry through the lintrunner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158460
Approved by: https://github.com/williamwen42
2025-07-23 21:02:54 +00:00
64e8d7d66b [BE] bump test dependency z3-solver to drop using deprecated pkg_resources (#158905)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158905
Approved by: https://github.com/albanD, https://github.com/ezyang
ghstack dependencies: #158904
2025-07-23 21:01:02 +00:00
b935ad17d5 [BE][Easy] add missing Python 3.14 PyPI classifier (#158904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158904
Approved by: https://github.com/albanD
2025-07-23 21:01:02 +00:00
f7f550649f [cutlass backend] Change default inst level mm config number (#158901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158901
Approved by: https://github.com/ColinPeppler, https://github.com/jingsh, https://github.com/Skylion007
2025-07-23 20:53:22 +00:00
255c0545e7 [BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407)
This PR makes some less risky changes to PyObjectSlot as there is a lot of stuff we do not need since there is only one interpreter. Specifically `check_interpreter` and `has_pyobj_nonhermetic` are removed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290, #158291
2025-07-23 20:27:28 +00:00
9c68c4d08f [BE] Remove __reduce_deploy__ (#158291)
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).

Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
2025-07-23 20:27:28 +00:00
6ed2cb6ccd [BE] Remove torch deploy | remove torch deploy specific files (#158290)
This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290
Approved by: https://github.com/albanD
ghstack dependencies: #158288
2025-07-23 20:27:28 +00:00
ab26d4fbeb [BE] remove torch deploy - conditionals (#158288)
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always

Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
2025-07-23 20:27:28 +00:00
da94023b02 [Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446)
Hi team,

Please help review this patch.

This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.

I found the root cause is not that we cannot get C function frames by `PyFrame_GetBack` when PythonTracer is filling start frames, but the c call event loss problem bug on Python 3.12.0-3.12.4. And that problem was fixed by 257c413cd1 on 3.12.5.

So I think the https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, this patch reverts the change of it.

There are solutions to fix the problem correctly, such as we can add a new monitoring callback to compensate call events of methods with C function or we can override the callback registered by `PyEval_SetProfile`.  These solutions may make the code hard to maintain.

~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16, https://github.com/cyyever
2025-07-23 20:03:52 +00:00
c996aff6ed [ROCm] UT verifies a runtime error is raised if tensor.item() is captured in a cudagraph (#158878)
Unit test for this PR: https://github.com/pytorch/pytorch/pull/158165

This unit test verifies that a runtime error is raised when tensor.item() operation is captured in a cudagraph. Equally valid for ROCm and CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158878
Approved by: https://github.com/jeffdaily, https://github.com/ngimel
2025-07-23 20:01:50 +00:00
691736ae07 Add kernel options to flex docs (#158875)
Fixes https://github.com/pytorch/pytorch/issues/158741
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158875
Approved by: https://github.com/BoyuanFeng, https://github.com/albanD
2025-07-23 19:05:19 +00:00
fe8f556006 Fix Triton GEMM templates with k=1 (#158650)
Thanks to @davidberard98 for much of the analysis here. For GEMMs of K=1, the hints, `tl.multiple_of` and `tl.max_contiguous` apply completely, as the indices to the loads are only dependent on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, due to the fact that for all BLOCK_M and BLOCK_N sizes, the last tile is not a contiguous load. With K > 1 case, the hint is not as strict given the dependency on the k indices for the load as well. In the K=1 case, only `offs_m` and `offs_n` are used and broadcasted to the index shape.

One can say these hints are "wrong", but in various cases in the hints being wrong, such as with the shape `9999x4, 4x9999`, there is a substantial performance improvement with the hint.

For nice shapes with K=1, where M, N are a multiple 8 to where these hints are fine and there is no misaligned address, there is no performance regression observed on H100:
<img width="547" height="402" alt="Screenshot 2025-07-18 at 5 05 47 PM" src="https://github.com/user-attachments/assets/fee2bbaa-784c-422e-bb8c-43c6c2607ad2" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158650
Approved by: https://github.com/davidberard98
2025-07-23 18:45:51 +00:00
85ee2fb8c5 [Torch Native] Add test for packaging weight (#158750)
Add test that require weights to be packaged for torch native

For now, we need `package_weights_in_so=True` for compile standalone. The constants are in a `.o` file and will be added as a source to the CMakeLists.txt of the model.

After we added weight deduping, we should be able to let this config be False.

```
python test/inductor/test_aot_inductor_package.py  -k test_compile_with_exporter_weights
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158750
Approved by: https://github.com/desertfire
2025-07-23 18:36:10 +00:00
fef236da69 Add zero_() and empty_like(t) to torch/csrc/stable/ops.h (#158866)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158866
Approved by: https://github.com/janeyx99
2025-07-23 18:31:05 +00:00
76be282e3a Revert "[Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847)"
This reverts commit d898d0d437bfdc0719e6c69d5005606c5e64fca8.

Reverted https://github.com/pytorch/pytorch/pull/158847 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI jobs on MI200 and MI300 ([comment](https://github.com/pytorch/pytorch/pull/158847#issuecomment-3109664713))
2025-07-23 18:25:46 +00:00
9905ed616a [Inductor] Expose decomposeK knobs as envvars (#158745)
Fix up decomposeK autotuning, by removing condition to return more than `k_splits_limit` and setting default to 10 instead of 5. Allow `k_splits_limit` to be configurable to the user via `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS` and also allow user to configure threshold in which to use decompose_k via `TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158745
Approved by: https://github.com/eellison
2025-07-23 18:23:44 +00:00
30b0ad5c68 Revert "Fix decorators skipping NCCL tests (#158846)"
This reverts commit 57024913c409764f129d6a7792625f5b05462e31.

Reverted https://github.com/pytorch/pytorch/pull/158846 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking trunk. See distributed/_composable/fsdp/test_fully_shard_logging.py::LoggingTests::test_fsdp_logging [GH job link](https://github.com/pytorch/pytorch/actions/runs/16472103496/job/46564570609) [HUD commit link](57024913c4) ([comment](https://github.com/pytorch/pytorch/pull/158846#issuecomment-3109553414))
2025-07-23 17:47:35 +00:00
41b6cdaf76 Revert "Fix Triton GEMM templates with k=1 (#158650)"
This reverts commit 9df0f565972a8a034fd77d65aff2c53e6e9856d1.

Reverted https://github.com/pytorch/pytorch/pull/158650 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78805560 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158650#issuecomment-3109538827))
2025-07-23 17:42:10 +00:00
1b456c580d [dynamo][guards] Add type info of the guarded value in guard managers (#158765)
tlparse looks like this

<img width="1165" height="226" alt="image" src="https://github.com/user-attachments/assets/04c4e6b1-34a3-4d9d-8304-6eb6d9a94980" />

This will aid in reading guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158765
Approved by: https://github.com/Lucaskabela, https://github.com/StrongerXi
2025-07-23 16:59:15 +00:00
5e386eec94 [AOTI] enable aot inductor on Windows (#158915)
With many PRs landed, we can run the first aot inductor example on Windows.

<img width="640" height="427" alt="image" src="https://github.com/user-attachments/assets/131db159-ce17-4857-a3d5-a4b03638f01d" />

Let's remove the Windows check on `AotCodeCompiler`.

CC: @angelayi , @desertfire , @jansel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158915
Approved by: https://github.com/desertfire
2025-07-23 16:29:15 +00:00
00da8e63eb CI for Windows Arm64 (#148753)
This pull request adds a new CI workflow for Windows Arm64, named win-arm64-build-test.yml.
It can be triggered on any pull request by including the ciflow/win-arm64 tag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148753
Approved by: https://github.com/malfet
2025-07-23 16:12:20 +00:00
576253c476 [math] Trace float.fromhex (#156976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156976
Approved by: https://github.com/zou3519
ghstack dependencies: #156975, #156977
2025-07-23 16:12:08 +00:00
f5314f89c8 [struct] Add struct.pack and struct.unpack polyfills (#156977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156977
Approved by: https://github.com/XuehaiPan, https://github.com/jansel
ghstack dependencies: #156975
2025-07-23 16:12:08 +00:00
671e22a951 [math] Raise exception in Dynamo if constant fold call fail (#156975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156975
Approved by: https://github.com/zou3519
2025-07-23 16:12:08 +00:00
d3d9bc1c31 [inductor] Allow backends to register their own custom config object (#158254)
An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254
Approved by: https://github.com/eellison
2025-07-23 15:56:06 +00:00
7d296d5c19 [aoti][mps] Enable more tests (#158703)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158703
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #158349, #158350, #158351
2025-07-23 15:38:56 +00:00
2a60b8fc97 [export][ez] Fix packaging (#158855)
Summary: as title, seems ytpo

Test Plan:
CI

Rollback Plan:

Differential Revision: D78758466

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158855
Approved by: https://github.com/henryoier
2025-07-23 15:36:14 +00:00
d898d0d437 [Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847)
This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.

The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PRecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming.
- log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.

## Test Plan

After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance  --inference --backend inductor  --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Warm Start (~1 s):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847
Approved by: https://github.com/zhxchen17
2025-07-23 15:06:54 +00:00
5998cd4eaa [MPS] Speedup torch.full for 1-byte types (#158874)
By using [`fillBuffer:range:value:`](https://developer.apple.com/documentation/metal/mtlblitcommandencoder/fillbuffer:range:value:?language=objc) rather than MPSGraph op, which should be faster and also does not have INT_MAX limit

Which in turn fixes `test_index_put_accumulate_large_tensor_mps` test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158874
Approved by: https://github.com/dcci
2025-07-23 14:00:40 +00:00
57024913c4 Fix decorators skipping NCCL tests (#158846)
Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`

In particular it will not try to start the test (causing forks into subprocess) just to stop them (killing the subprocess) which is done in the test setup

Using `unittest.skip` decorators avoids the starting of the test in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
2025-07-23 13:31:21 +00:00
ee72338f0c [Inductor] MSVC use pointer when generating temporary array pointer (#158913)
MSVC cannot implicitly convert a const iterator to a const pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158913
Approved by: https://github.com/desertfire

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-07-23 13:19:11 +00:00
c665594c1e [AOTI] fix extract file failed on Windows. (#158702)
Changes:
1. rename zip index filename, and keep it out of normalize path.
2. normalize output path for extract file.

Extract files successful:
<img width="683" height="247" alt="image" src="https://github.com/user-attachments/assets/72dff7b9-5ec0-4523-a6ee-7768b37bbe63" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158702
Approved by: https://github.com/angelayi
2025-07-23 08:00:14 +00:00
255a04baf1 [pt2 event logging] send autotuning data for strides and hinted shapes (#158852)
Summary:
# Why

capture relevant data for offline lookup table generation

# What

report the hinted sizes not just the symbolic sizes

Test Plan:
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx040
```

This only validates that this change does not break anything, as the schema is not on scuba yet (not actualized)

Rollback Plan:

Reviewed By: stashuk-olek

Differential Revision: D77837548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158852
Approved by: https://github.com/jingsh
2025-07-23 06:44:27 +00:00
1d302eaee8 [vllm] add vllm test base docker image (#158755)
# description
Add base docker image for vllm.

It seems like we use the base docker image for both pytorch build, and tests. Configure a base image for vllm against pytorch CI.

# Others
Added readme regarding how the base docker images are used, and how to add one, this also explain what is the right file to modify

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158755
Approved by: https://github.com/seemethere, https://github.com/huydhn
2025-07-23 05:42:44 +00:00
a6b7bea244 [inductor] support linear & layer_norm unbacked (#155267)
### What
- Use `statically_known_true` over `guard_size_oblivious` in cases where we're checking an optimization path. Otherwise, it will DDE and we can't take the safe/slower path.
- For broadcast checks, use `fallback=False` if we encounter a DDE. Typically, unbackeds would be ≥2 and that falls inline with size-oblivious reasoning (i.e. when `size_oblivious=True`).

### Example DDE
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)).  (Size-like symbols: u0)

Caused by: (_inductor/lowering.py:488 in broadcast_symbolic_shapes)
```
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)).  (Size-like symbols: u0)

Caused by: (_inductor/ir.py:2797 in create)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155267
Approved by: https://github.com/eellison
2025-07-23 05:42:01 +00:00
be72bcf828 [vllm hash update] update the pinned vllm hash (#158806)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158806
Approved by: https://github.com/pytorchbot
2025-07-23 04:41:53 +00:00
f80f97d192 [audio hash update] update the pinned audio hash (#158807)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158807
Approved by: https://github.com/pytorchbot
2025-07-23 04:39:50 +00:00
42a69f7c2b [MTIA Aten Backend] Migrate addmm.out / baddbmm.out / bmm.out (#158749)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate addmm.out / baddbmm.out / bmm.out to in-tree.

Differential Revision: [D78578483](https://our.internmc.facebook.com/intern/diff/D78578483/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158749
Approved by: https://github.com/albanD, https://github.com/nautsimon
ghstack dependencies: #158748
2025-07-23 03:45:28 +00:00
b87471e66f [MTIA Aten Backend] Migrate addcdiv.out / addcmul.out / eq.Tensor_out / eq.Scalar_out (#158748)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate addcdiv.out / addcmul.out / eq.Tensor_out / eq.Scalar_out to in-tree.

Differential Revision: [D78568103](https://our.internmc.facebook.com/intern/diff/D78568103/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158748
Approved by: https://github.com/albanD, https://github.com/nautsimon
2025-07-23 03:45:20 +00:00
f10e4430e2 [AOTI] normalize path and process model files. (#158705)
Continued to https://github.com/pytorch/pytorch/pull/158702 , split `zip_filename_str` and real file path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158705
Approved by: https://github.com/desertfire
2025-07-23 02:58:21 +00:00
2dccff7dcf [inductor] pass_fds not supported on Windows, skip them on Windows. (#158830)
<img width="1366" height="806" alt="image" src="https://github.com/user-attachments/assets/ddf3d27a-36da-47ce-9ba9-00c43805bb06" />

Almost UTs are failed on `AssertionError: pass_fds not supported on Windows.`, let's skip them on Windows.
TODO: I will also debug and confirm `pass_fds` on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158830
Approved by: https://github.com/jansel
2025-07-23 02:24:35 +00:00
dec0d3101c [export] fix unbacked range deserialization (#158681)
Fixes https://github.com/pytorch/pytorch/issues/151809, by reading shape assertion nodes into ShapeEnv, and deferring instantiation of node example values, to be done node-by-node.

Differential Revision: D78588406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158681
Approved by: https://github.com/ydwu4, https://github.com/avikchaudhuri
2025-07-23 02:13:11 +00:00
9df0f56597 Fix Triton GEMM templates with k=1 (#158650)
Thanks to @davidberard98 for much of the analysis here. For GEMMs of K=1, the hints, `tl.multiple_of` and `tl.max_contiguous` apply completely, as the indices to the loads are only dependent on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, due to the fact that for all BLOCK_M and BLOCK_N sizes, the last tile is not a contiguous load. With K > 1 case, the hint is not as strict given the dependency on the k indices for the load as well. In the K=1 case, only `offs_m` and `offs_n` are used and broadcasted to the index shape.

One can say these hints are "wrong", but in various cases in the hints being wrong, such as with the shape `9999x4, 4x9999`, there is a substantial performance improvement with the hint.

For nice shapes with K=1, where M, N are a multiple 8 to where these hints are fine and there is no misaligned address, there is no performance regression observed on H100:
<img width="547" height="402" alt="Screenshot 2025-07-18 at 5 05 47 PM" src="https://github.com/user-attachments/assets/fee2bbaa-784c-422e-bb8c-43c6c2607ad2" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158650
Approved by: https://github.com/davidberard98
2025-07-23 02:05:57 +00:00
91602a9254 Cleanup old caffe2 scripts (#158475)
Testing on this one is grep based: if there were no reference to that script I can find, I deleted.
We can easily add any of these back if needed!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158475
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/cyyever
2025-07-23 01:21:31 +00:00
cc372ad557 [aoti][mps] Improve tabbing in cpp generation (#158351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158351
Approved by: https://github.com/desertfire, https://github.com/malfet
ghstack dependencies: #158349, #158350
2025-07-23 00:54:53 +00:00
84058d1179 [aoti][mps] Fix cpu kernel generation (#158350)
In the case where we have both mps and cpu code which can be inductor compiled, we need to case on the device -- this requires the device field to be correctly passed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158350
Approved by: https://github.com/malfet
ghstack dependencies: #158349
2025-07-23 00:54:53 +00:00
096dc35d77 [aoti][mps] Fix update constants buffer (#158349)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158349
Approved by: https://github.com/malfet
2025-07-23 00:54:52 +00:00
56d07d0bde Add merge_rules category for Dynamo; add guilhermeleobas (#158620)
Adds guilhermeleobas to merge_rules for Dynamo and functorch.
Guilherme has done good work on both of these subsystems and I am tired
of him approving my PRs and me not being able to merge them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158620
Approved by: https://github.com/anijain2305
2025-07-23 00:44:27 +00:00
39b54b78d7 [export] runtime asserts for while HOP subgraphs (#158467)
Differential Revision: D78431075

For #158366
- Calls runtime asserts pass for HOP subgraphs (in reenter_make_fx)
- For while_loop only (can be expanded), clones input tensors for subgraph tracing, so unbacked memos (item, nonzero, etc.) aren't reused

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158467
Approved by: https://github.com/ydwu4
2025-07-23 00:34:18 +00:00
3703dabe42 [ROCm] delete un-needed workaround for tensor.item() (#158486)
Deleting unused workaround per discussion here:
https://github.com/pytorch/pytorch/pull/158165#discussion_r2207968880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158486
Approved by: https://github.com/jeffdaily, https://github.com/houseroad
2025-07-23 00:31:57 +00:00
d3f9107d68 Remove top limit for cpython version and fix lint appropriately. (#158853)
As per title.
Sorry for the churn in the main commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158853
Approved by: https://github.com/seemethere, https://github.com/Skylion007, https://github.com/jingsh, https://github.com/malfet, https://github.com/ZainRizvi
2025-07-22 23:59:00 +00:00
cab96b5879 [tests] Reduce sizes of unnecessarily large tensors to reduce OOM flakes (#158456)
Downsizes several tensors that were massively oversized to test the problem at hand, to reduce test flaking.

Fixes #126867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158456
Approved by: https://github.com/desertfire
2025-07-22 23:41:48 +00:00
6100ed457c [ROCm] Improve Type Safety of C10_WARP_SIZE (#158271)
# Background

The `C10_WARP_SIZE`, although always be `32` on CUDA platform, varies across different AMD GPUs.
Therefore, to correctly refer this value, the host code must be a variable instead of a literal defined by macro, or a `constexpr int`.

This PR may cause more compiler errors for third party code on AMD GPU, which is intentional. Having a fixed `C10_WARP_SIZE` value on host code for AMD GPU only defers compile time error to runtime.

This PR is recommended to be included as part of Release Notes to describe an API change for whoever uses this macro.

Users are recommended to use `C10_WARP_SIZE` directly, which adapts for various scenarios, or define a macro to use `C10_WARP_SIZE`. Assignment of this macro to symbols shared by host/device code causes problems on ROCM platform. (See the fix at `aten/src/ATen/native/cuda/layer_norm_kernel.cu` for a concrete example)

# Behaviors

* If compiling with HIPCC (i.e `defined(__HIPCC__)`):
  + Define `C10_WARP_SIZE` to be non-`constexpr` `at::cuda::warp_size()` for host-compilation pass (as compared to `static constexpr int C10_WARP_SIZE = 1;` set in 04bd7e6850e8efec77994963ffee87549555b9c3)
  + Define `C10_WARP_SIZE` to be a function returning `constexpr int` `64` for `__GFX9__`, and `32` otherwise, for device-compilation pass
    - `__GFX8__` is also 64 but we do not support any GFX8 GPU.
* If not compiling with HIPCC:
  + Define `C10_WARP_SIZE` to be non-constexpr `at::cuda::warp_size()`

# `constexpr` variant for host code

For host-compilation cases where a `constexpr` value is needed for warp size (eg. launch bounds), use `C10_WARP_SIZE_STATIC`, which is defined as `64`. This macro follows the pre 04bd7e6850e8efec77994963ffee87549555b9c3 behavior of `C10_WARP_SIZE`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158271
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-07-22 23:19:38 +00:00
badfebf29e Revert "[Inductor] Expose decomposeK knobs as envvars (#158745)"
This reverts commit eac777c4f46b381106f2f2b78fe05b506f8c558c.

Reverted https://github.com/pytorch/pytorch/pull/158745 on behalf of https://github.com/jeffdaily due to sorry but rocm CI is broken due to this PR ([comment](https://github.com/pytorch/pytorch/pull/158745#issuecomment-3105071170))
2025-07-22 23:04:16 +00:00
fc5a404eb1 [gtest][listing] fixing caffe2:verify_api_visibility - main (#158229)
Summary: Remove the custom main from this test file

Test Plan:
https://www.internalfb.com/intern/testinfra/testrun/9570149303161031

Rollback Plan:

Reviewed By: patskovn

Differential Revision: D78015676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158229
Approved by: https://github.com/Skylion007
2025-07-22 22:45:28 +00:00
04a393507b Fused RMSNorm implementation (#153666)
Relevant #72643

Benchmarked versus unfused torch implementation and torch.compile implementation. Around 9x speedup vs unfused implementation on cuda and slightly faster vs inductor compile on 5090.

```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)

    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```

Here are the triton compile tests ran on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)

device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-07-22 22:25:44 +00:00
a626dc8f16 [AOTI] windows package load dev (#158671)
changes:
1. add extract file fail handler for Windows develop.
2. normalize more file paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158671
Approved by: https://github.com/angelayi, https://github.com/desertfire
2025-07-22 21:35:57 +00:00
fd47401536 [doc] Updates to distributed.md for XCCL backend (#155834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155834
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-22 21:01:43 +00:00
e44e05f7ae [dynamo] Move skipIf decorator to class level in test_fx_graph_runnable (#157594)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157594
Approved by: https://github.com/xmfan
ghstack dependencies: #157162
2025-07-22 20:41:49 +00:00
ddd74d10fc More fixes to MakeTensor::computeStorageSize() (#158813)
Followup after https://github.com/pytorch/pytorch/pull/158690 that fixessimilar logic if `strides` are not explicitly specified
Expanded testing to cover both cases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158813
Approved by: https://github.com/ZainRizvi, https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #158690
2025-07-22 20:36:12 +00:00
823e223893 [ROCm] logsumexp on ROCm needs scaling back to natural base. (#156903)
Fixes #156012

This is a temporary solution that makes context parallelism working before logsumexp behavior changes landed in AOTriton.

After discussion we are not going to release AOTriton 0.10.1 to fix this due to
* Even if the interface is not changed, changing the behavior of returned logsumexp tensor should still be considered as an ABI break. Such changes do not fall into the "ABI compatible" category and should be postponed to next release.
* AOTriton 0.11 is scheduled to be released before end of July, which is less than five weeks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156903
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-22 20:32:34 +00:00
6499420e45 [DeviceMesh] Make the repr shorter when debug ENV not set (#158822)
Users want a shorter repr so this PR is trying to address that when TORCH_DISTRIBUTED_DEBUG is not set to DETAIL. Feedback and discussion is welcomed. Somehow I found that torch.set_printoptions is global, so I am hesitated to use it.

Now the print is like

<img width="435" height="79" alt="image" src="https://github.com/user-attachments/assets/8f173287-7138-4fbe-a4a3-8483523b21e4" />

or

<img width="485" height="104" alt="image" src="https://github.com/user-attachments/assets/21e34db9-56b5-47e2-9767-750d6105a273" />

or

<img width="675" height="97" alt="image" src="https://github.com/user-attachments/assets/53aa763e-7edd-4622-9cdb-37e2af8ec11f" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158822
Approved by: https://github.com/wz337, https://github.com/wconstab, https://github.com/xmfan
2025-07-22 20:31:44 +00:00
e17538022a Making input dynamically adjust. (#157324)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157324
Approved by: https://github.com/Skylion007, https://github.com/d4l3k
2025-07-22 20:14:05 +00:00
37ded2ac90 Using torch.accelerator in comm_mode_features_example.py and visualize_sharding_example.py (#157317)
Continuation of https://github.com/pytorch/pytorch/pull/153213  .

 @guangyey
 @kwen2501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157317
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-22 19:58:48 +00:00
767791943d [ONNX] Set default opset to 20 (#158802)
Bump default opset to 20, which is a newer opset and the max torchscript exporter supports.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158802
Approved by: https://github.com/titaiwangms
2025-07-22 19:55:05 +00:00
c917c63282 [ROCm][tunableop] UT tolerance increase for matmul_small_brute_force_tunableop at FP16 (#158788)
TunableOp will sometimes find a less precise solution due to the small input vectors used in this UT. Bumping op tolerance to eliminate flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158788
Approved by: https://github.com/jeffdaily
2025-07-22 19:45:35 +00:00
659bfbf443 Revert "We do support 3.14" (#158856)
Reverting to fix lint
This reverts commit 2a249f1967d29626fe6ac6a07f28440348d1cc93.

An emergency fix since the change needed to fix this is a little more complex than expected (see https://github.com/pytorch/pytorch/pull/158853 for reference)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158856
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-07-22 19:40:53 +00:00
832ab990c9 Use init_device_mesh API for select tests where possible (#158675)
This addresses reviews made for:
#158538
#108749

It interchanged all the specific DevideMesh constructor calls with the API provided by the test cases, to improve abstraction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158675
Approved by: https://github.com/wconstab
2025-07-22 19:28:42 +00:00
56df025d51 Add caching for _rename_without_collisions (#158594)
Fixes #158357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158594
Approved by: https://github.com/pianpwk
2025-07-22 19:19:13 +00:00
55ff4f85e9 [FP8][CUTLASS] xFail honor_sm_carveout on sm100 (#152378)
CUTLASS only supports SM carveout via green contexts on `sm100`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152378
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/nWEIdia
2025-07-22 18:39:50 +00:00
7d2ceaff21 [dynamo] skip tracing functions registered in sys.monitoring (#158171)
Fixes https://github.com/pytorch/pytorch/issues/158164

This was fixed by applying `skip_code_recursive` to any function registered to `sys.monitoring` (via `PyThreadState_GET()->interp->monitoring_callables`). This check is done whenever we attempt to set the eval frame callback from Python.

Microbenchmark: `benchmarks/dynamo/microbenchmarks/overheads.py`:

BEFORE:
```
requires_grad=False
eager    7.1us (warmup=0.0s)
compiled 24.6us (warmup=10.0s)

requires_grad=True
eager    8.9us (warmup=0.0s)
compiled 57.8us (warmup=0.1s)

inference_mode()
eager    6.5us (warmup=0.0s)
compiled 23.4us (warmup=0.1s)
```

AFTER:
```
requires_grad=False
eager    7.0us (warmup=0.0s)
compiled 23.2us (warmup=15.2s)

requires_grad=True
eager    9.0us (warmup=0.0s)
compiled 55.1us (warmup=0.1s)

inference_mode()
eager    6.4us (warmup=0.0s)
compiled 22.2us (warmup=0.1s)
```

Followup thought: how do we let users know that a frame is skipped because the code object is a callable registered to sys.monitoring? (or any other reason?)

Differential Revision: [D78530528](https://our.internmc.facebook.com/intern/diff/D78530528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158171
Approved by: https://github.com/jansel
2025-07-22 18:02:30 +00:00
2a249f1967 We do support 3.14
This has been added a bit back.
2025-07-22 10:40:18 -07:00
52c294008e [hop] allow non fake inputs when check input alias and mutation (#158798)
https://github.com/pytorch/pytorch/pull/154193 gets reverted due to a test failure. The root cause being that: an executorch pass turns int inputs into a scalar tensor in cond's subgraph. The pass have been around on the critical path of executorch since two years ago. Changing it would be difficult. So we just allow non-fake inputs for check input mutation and aliasing, which shoudn't affect the correctness of the analysis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158798
Approved by: https://github.com/pianpwk
2025-07-22 17:22:37 +00:00
0971637c11 Fix torch.tensor warning in ONNX symbolic_opset10 export (#158835)
Fix PyTorch tensor copying warning in ONNX export

## Problem

PyTorch ONNX exporter was generating a warning about incorrect tensor copying method:

```
UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158835
Approved by: https://github.com/justinchuby
2025-07-22 16:32:49 +00:00
7d6f340238 Revert "[AOTI] Add more default options to compile_standalone (#158560)"
This reverts commit a991e285ae35159680b0ad4be24669906a6fa256.

Reverted https://github.com/pytorch/pytorch/pull/158560 on behalf of https://github.com/jeffdaily due to broke rocm CI, no test signal was available from rocm ciflow/trunk, need to add ciflow/rocm to reland ([comment](https://github.com/pytorch/pytorch/pull/158560#issuecomment-3103633964))
2025-07-22 16:20:17 +00:00
4060f30042 [AOTI] Convert C-struct zip handling to RAII container (#158687)
Attempts to fix a memory leak reported in #158614 by wrapping manually managed MiniZ C-structs in an RAII container. I have been unable to reproduce the reported leak, but this seems like the most likely candidate.

Fixes #158614 (hopefully)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158687
Approved by: https://github.com/desertfire
2025-07-22 16:01:51 +00:00
9a28e23d97 Revert "removed zero dim cpu logic from fake_tensor.py (#147501)"
This reverts commit 9e0473b56621162bd85e94943a516be4727e5651.

Reverted https://github.com/pytorch/pytorch/pull/147501 on behalf of https://github.com/ZainRizvi due to Seems to have broken ROCm. See inductor/test_aot_inductor_package.py::TestAOTInductorPackageCpp_cuda::test_compile_standalone_cos [GH job link](https://github.com/pytorch/pytorch/actions/runs/16428359564/job/46426243808) [HUD commit link](a991e285ae) ([comment](https://github.com/pytorch/pytorch/pull/147501#issuecomment-3103494041))
2025-07-22 15:45:34 +00:00
d0c00d9a69 [MPS] Do not crash if tensor dim > INT_MAX (#158824)
Looks like all MPS operations will crash if one of tensor dimentions are
greater than `2**31-1`

Change it into a structured exception, by checking tensor size before
attempting to create MPS Tensor

Add regression test for it. Before this change running following will abort with exception
```
% python3 -c "import torch; torch.randint(0, 10, (2**31,), dtype=torch.uint8, device='mps')"
/AppleInternal/Library/BuildRoots/1c8f7852-1ca9-11f0-b28b-226177e5bb69/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
zsh: abort      python3 -c·
```

Skip the test on MacOS-13, as it crashes somewhere deep in MPSGraph framework with
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSTemporaryNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158824
Approved by: https://github.com/dcci
ghstack dependencies: #158690, #158823
2025-07-22 15:12:26 +00:00
371ffaf415 [bucketing] Support case of several pgs in graph (#158632)
Main changes:
- bucketing collectives only from the same process_group by group_name
- Support of groups like [0,2,4,6], [0,1,3,5] using `rank_idx_dict` for in pass operations for slice idxs etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158632
Approved by: https://github.com/wconstab
2025-07-22 14:50:39 +00:00
1b772de397 Still run TritonBundler with BundledAOTAutogradCache, save autotune results (#158048)
When running BundledAOTAutogradCache with precompile, we still need to run triton bundling so that the precompiled CompiledFxGraph has triton cuda kernels. We also pre save the autotune results in the precompile artifact.

It would be even better to pre trim the cuda kernels on save and apply them, which we can work on later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158048
Approved by: https://github.com/zhxchen17
2025-07-22 14:12:21 +00:00
8e99714204 [EZ][BE][MPS] Remove unused ndArrayFromTensor (#158823)
And `printTensorNDArray`, both of which according to https://github.com/search?type=code&q=ndArrayFromTensor+org%3Apytorch are not used anywhere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158823
Approved by: https://github.com/dcci
ghstack dependencies: #158690
2025-07-22 14:06:42 +00:00
9b4d938f04 [dynamo][fsdp] Consistent behavior of int attributes (#157262)
Reimpl of https://github.com/pytorch/pytorch/pull/150954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157262
Approved by: https://github.com/bdhirsh
2025-07-22 11:26:54 +00:00
0142d5f4e2 Revert "Remove is_arvr_mode() from xnnpack.buck.bzl (#158682)"
This reverts commit f09a484b8164aaadd57a79354f0ccf47733f365e.

Reverted https://github.com/pytorch/pytorch/pull/158682 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158682#issuecomment-3101648365))
2025-07-22 08:33:08 +00:00
91b69deeb0 [ROCm][CI] update fbgemm_gpu hash used by inductor tests (#158602)
fbgemm_gpu build started failing with asmjit errors.  Moving to latest tip of fbgemm for inductor tests resolves the build failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158602
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-22 08:04:59 +00:00
392fa75411 Change from import trace to import config (#158796)
Summary:
for this particular instance, we're doing

 from torch._inductor.config import trace

...trace.provenance_tracking...

but for all other call sites, we're doing

from torch._inductor import config
... config.trace.provenance_tracking....

Test Plan:
CI

Rollback Plan:

Differential Revision: D78699876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158796
Approved by: https://github.com/c00w
2025-07-22 06:10:38 +00:00
3a67bf9c62 [PGNCCLx] Bring split and merge for PGNCCLx (#158790)
Summary: We added group split in D78300794 and remote_group_merge in D78450094. We first want to upstream this change to PGNCCLx as well so that NCCLx can use this new API and we can continue our c10d clean up in https://github.com/pytorch/pytorch/pull/158488.

Test Plan:
CI

```
buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx -- test_group_split_and_merge
```

Rollback Plan:

Differential Revision: D78521060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158790
Approved by: https://github.com/d4l3k
2025-07-22 06:05:00 +00:00
d984143a74 [ci][cutlass backend] Add ci for cutlass backend tests (#156626)
redo of https://github.com/pytorch/pytorch/pull/156136

Differential Revision: [D77327309](https://our.internmc.facebook.com/intern/diff/D77327309)

I want to try land the full version first. If the ci is taking too long, we can revert back to only testing for a few names.
```
 -k 'test_max_autotune_cutlass_backend_regular_mm and not test_max_autotune_cutlass_backend_regular_mm_streamk'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156626
Approved by: https://github.com/huydhn, https://github.com/mlazos
2025-07-22 05:18:13 +00:00
21c97bd565 [reland] Transfer "stack_trace" in post_grad passes (#158752)
Summary:
We transfer stack trace in post_grad passes.

We shouldn't add "stack_trace" to _COPY_META_FIELDS because _COPY_META_FIELDS is used in proxy.py where stack_trace is explicitly set.

Since the stack_trace is being used by more and more debugging tools, we should also start testing it more rigorously. This PR start by adding a first test for testing that stack trace is preserved through post_grad_passes.

Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r test_pattern_matcher_transfer_meta

buck run mode/dev-nosan
 fbcode//caffe2/test/inductor:auto_functionalize -- --rcaffe2/test/inductor:auto_functionalize_old
```

Rollback Plan:

Differential Revision: D78669729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158752
Approved by: https://github.com/jingsh
2025-07-22 03:49:13 +00:00
a155f742ad [benchmark] allow default mode for compile (#158792)
Allow default mode for compile when users cannot run "max-autotune-no-cudagraphs" due to compilation time. Overall, "default" mode is slower than "[max-autotune-no-cudagraphs](https://github.com/pytorch/pytorch/pull/158536)" depending on input shapes.

<img width="3564" height="2368" alt="CrossEntropyBackward_bench" src="https://github.com/user-attachments/assets/5d25c0e4-6714-42bb-a544-b7ef9cbc1b17" />
<img width="3564" height="2368" alt="CrossEntropyForward_bench" src="https://github.com/user-attachments/assets/40e0bbf9-657f-48f2-ac0c-1f0fd6a0ac1d" />
<img width="3564" height="2368" alt="LayerNormBackward_bench" src="https://github.com/user-attachments/assets/db582bb2-d8d4-414a-9de7-b9af061ad0cd" />
<img width="3564" height="2368" alt="LayerNormForward_bench" src="https://github.com/user-attachments/assets/2ce18bd8-73fc-434a-820f-46aa9ad9ddce" />
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/f4cb5f4b-93d3-4d96-973f-37643912325a" />
<img width="3564" height="2368" alt="RMSNormForward_bench" src="https://github.com/user-attachments/assets/231c5805-b156-4587-9c5f-504a33b60883" />
<img width="3564" height="2368" alt="SoftmaxBackward_bench" src="https://github.com/user-attachments/assets/f651c578-813b-4a8e-bffc-b5b34bd879fc" />
<img width="3564" height="2368" alt="SoftmaxForward_bench" src="https://github.com/user-attachments/assets/bfdcc043-4370-4355-af84-9f463426b21a" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158792
Approved by: https://github.com/zou3519
2025-07-22 03:07:22 +00:00
cyy
3639d29ea1 Fix warnings of unused-variable (#158627)
Fixes
```
/var/lib/jenkins/workspace/test/cpp/tensorexpr/test_kernel.cpp:42:22: error: unused variable 'verification_pattern' [-Werror,-Wunused-variable]
```
and also extra semicolons.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158627
Approved by: https://github.com/albanD
2025-07-22 02:49:06 +00:00
aee8a2e985 Remove duplicated installation for python dependencies. (#158339)
As the title stated.

The `Common` Section have installed the python dependencies
1b389025ba/README.md (L247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158339
Approved by: https://github.com/ezyang
2025-07-22 02:39:28 +00:00
eac777c4f4 [Inductor] Expose decomposeK knobs as envvars (#158745)
Fix up decomposeK autotuning, by removing condition to return more than `k_splits_limit` and setting default to 10 instead of 5. Allow `k_splits_limit` to be configurable to the user via `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS` and also allow user to configure threshold in which to use decompose_k via `TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158745
Approved by: https://github.com/eellison
2025-07-22 01:59:51 +00:00
1a6b21c59f [AOTI] fix load_pt2 split wrong model name on Windows (#158711)
fix load_pt2 split wrong model name on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158711
Approved by: https://github.com/jansel
2025-07-22 01:54:44 +00:00
abe0c9538a [BE] Fix extra-semi warnings (#158730)
And prevent new ones from appearing by removing `-Wno-error=extra-semi` (not sure what was thereason behind adding the warning but not erroring on on it when building with -Werror introduced by https://github.com/pytorch/pytorch/pull/140236 )

300+ violations of that rule were fixed by running `sed -i -e "s/});/})/" /` against `torch/nativert`
Other 3p deps that needs updates:
 - TensorPipe
 - LLVM
 - FBGEMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158730
Approved by: https://github.com/Skylion007
2025-07-22 01:05:03 +00:00
95b658427d Revert "Add DeviceAllocator as the base device allocator (#138222)"
This reverts commit 1179e333237b02ed8fe2ba10cb9a23adf98d7d7a.

Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/ZainRizvi due to Very sorry but this is still breaking internally. @albanD would you be able to help get this past the finish line? D78496124 has more details on the failure and the workaround might be to do something like what's in D78684669. To validate the fixes internally, you can follow the instructions here to ghimport the changes: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3100195370))
2025-07-22 01:01:41 +00:00
6341311333 Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 2ad5c25cfc603c3656e6699d6137419dbb009495.

Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/ZainRizvi due to Very sorry but this is still breaking internally. @albanD would you be able to help get this past the finish line? D78496124 has more details on the failure and the workaround might be to do something like what's in D78684669. To validate the fixes internally, you can follow the instructions here to ghimport the changes: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3100195370))
2025-07-22 01:01:41 +00:00
350d6af52c [AOTI] add windows support for get_cpp_compile_command (#158732)
add windows support for `get_cpp_compile_command`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158732
Approved by: https://github.com/desertfire
2025-07-22 00:23:10 +00:00
9281625a9b Revert "Setup TorchBench in Docker (#158613)"
This reverts commit cab28330f8c49cdb66d6a299755dc09c87c14a9d.

Reverted https://github.com/pytorch/pytorch/pull/158613 on behalf of https://github.com/ZainRizvi due to Seems to have broken trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/16429779764/job/46430634676) [HUD commit link](b3c868d603) ([comment](https://github.com/pytorch/pytorch/pull/158613#issuecomment-3100023071))
2025-07-22 00:12:49 +00:00
2c37acfd89 [AOTI][CPU] Consider bias=None case for fbgemm_linear_fp16_weight (#158535)
Test Plan:

Rollback Plan:

Differential Revision: D78458214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158535
Approved by: https://github.com/houseroad, https://github.com/henryoier, https://github.com/jingsh
2025-07-21 23:42:44 +00:00
08540b13c6 Use cuda error code instead of error text in get_cuda_error_help (#158688)
Use cudaError_t and switch through the enum to prevent impact by upstream changes in wording
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158688
Approved by: https://github.com/q10, https://github.com/aorenste
2025-07-21 23:34:50 +00:00
187c2deb40 Fix clamp(min/max) strategy (#158619)
Part of plan https://github.com/pytorch/pytorch/issues/157495.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158619
Approved by: https://github.com/wanchaol
2025-07-21 23:26:08 +00:00
67be2f27e1 [CI][lintrunner] Only run on non deleted changed files (#158794)
My PR was failing lint because I removed a file, and then lintrunner would try to run on the deleted file and error, so this changes how the changed files are retrieved to only retrieve changed files that have not been removed.

I don't think this is possible through `gh pr view`, so instead it uses `gh api`

Testing: https://github.com/pytorch/pytorch/pull/158795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158794
Approved by: https://github.com/seemethere
2025-07-21 23:22:37 +00:00
d293022c47 [cutass backend] memorize parts of cache key to reduce general overhead (#158311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158311
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #156781
2025-07-21 23:21:12 +00:00
ee5a434f8c Revert "[BE] remove torch deploy - conditionals (#158288)"
This reverts commit 1a4268b8113d5160d71225bab980f03c2318a0a4.

Reverted https://github.com/pytorch/pytorch/pull/158288 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:39 +00:00
4c18e85300 Revert "[BE] Remove torch deploy | remove torch deploy specific files (#158290)"
This reverts commit a6de309ca15cda6b2792fc74e82814dc8d2f9dd9.

Reverted https://github.com/pytorch/pytorch/pull/158290 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:39 +00:00
920f26c761 Revert "[BE] Remove __reduce_deploy__ (#158291)"
This reverts commit 0b9fb91f17edfbc51ae36584dcb8350b2d8bb23b.

Reverted https://github.com/pytorch/pytorch/pull/158291 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:38 +00:00
99cc3633f6 Revert "[BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407)"
This reverts commit d9426a81d2ab54f809a3b32a6ab2e606073fe66f.

Reverted https://github.com/pytorch/pytorch/pull/158407 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:38 +00:00
15a50dcf1c Revert "[BE] Make PyObjectSlot use a global PyInterpreter and remove (#158427)"
This reverts commit eb7365072315be2bc4259114e25e269801441748.

Reverted https://github.com/pytorch/pytorch/pull/158427 on behalf of https://github.com/ZainRizvi due to Reverting this as part of reverting the stack for https://github.com/pytorch/pytorch/pull/158288 ([comment](https://github.com/pytorch/pytorch/pull/158427#issuecomment-3099815367))
2025-07-21 23:14:57 +00:00
1227ed6674 [dynamic shapes] fix _maybe_evaluate_static axioms bug (#158672)
Summary: couldn't get a minimal repro, but xref for size change during dict iteration error: https://fb.workplace.com/groups/1075192433118967/posts/1709439696360901

Test Plan:
-

Rollback Plan:

Differential Revision: D78047846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158672
Approved by: https://github.com/bobrenjc93
2025-07-21 23:14:19 +00:00
2bb684304d Fix the typos in the right nav by pulling the latest theme (#158746)
This will fix broken links in the right nav.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158746
Approved by: https://github.com/malfet
2025-07-21 22:51:07 +00:00
f09a484b81 Remove is_arvr_mode() from xnnpack.buck.bzl (#158682)
Summary:
**Changes**
*   Deleted function import from build definition utilities
    *   Removed `load("//tools/build_defs:fbsource_utils.bzl", "is_arvr_mode")`
*   Replaced is_arvr_mode() function calls with direct references to configuration flags
    *  Changed from `is_arvr_mode()` to `"ovr_config//build_mode:arvr_mode"`
*   Changed conditional expressions to Buck `select()` statements

Test Plan:
Check if CI passes

Rollback Plan:

Differential Revision: D78520947

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158682
Approved by: https://github.com/malfet
2025-07-21 22:49:26 +00:00
feaa02f9ad Revert "[build] pin setuptools>=77 to enable PEP 639 (#158104)"
This reverts commit a78fb63dbdf98a1db219095293de1a11005e0390.

Reverted https://github.com/pytorch/pytorch/pull/158104 on behalf of https://github.com/malfet due to It still breaks inductor-perf-nightly, see https://github.com/pytorch/pytorch/actions/runs/16425364208/job/46417088208, I'm going to dismiss all previous reviews ([comment](https://github.com/pytorch/pytorch/pull/158104#issuecomment-3099706457))
2025-07-21 22:46:53 +00:00
b3c868d603 [vllm]Add vllm.txt for pinned commit (#158754)
It seems the nightly.yml won't auto-generate txt file when it does not existed, so added the file with latest merged commit from vllm:

[vllm commit](https://github.com/vllm-project/vllm/commits/main)

Error:
https://github.com/pytorch/pytorch/actions/runs/16405915719/job/46351847504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158754
Approved by: https://github.com/huydhn
2025-07-21 22:41:07 +00:00
cab28330f8 Setup TorchBench in Docker (#158613)
This reduces the time spending to setup TorchBench in A100/H100 by another half an hour

### Testing

* H100 benchmark https://github.com/pytorch/pytorch/actions/runs/16396172453.  Once this done, I will review the results on [HUD](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2011%20Jul%202025%2023%3A01%3A24%20GMT&stopTime=Fri%2C%2018%20Jul%202025%2023%3A01%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/huydhn/6/head&lCommit=14a38c719b29a19f518239b5edb084838ac5d2fb&rBranch=main&rCommit=0a99b026d6bd0f67dc2c0a20fe3228ddc4144854) to confirm that all models are there
* A100 benchmark https://github.com/pytorch/pytorch/actions/runs/16396173932

Signed-off-by: Huy Do <huydhn@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158613
Approved by: https://github.com/janeyx99
2025-07-21 22:34:08 +00:00
4366610f5a [c10d] block_current_stream: correctness fixes (#158757)
This fixes a number of issues that were present in https://github.com/pytorch/pytorch/pull/156883 as pointed out by @ngimel

Test plan:

Expanded tests to cover use after free behavior + non-default stream

```
pytest test/distributed/test_c10d_pypg.py -v -k block_current_stream
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158757
Approved by: https://github.com/ngimel
2025-07-21 22:23:44 +00:00
dd0adc9386 [SymmMem] Add NVSHMEM broadcast support into Triton (#158514)
Adds broadcast collective operation for distributing data from root PE to all other PEs in NVSHMEM Triton kernels.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_broadcast`
<details>
<summary> Quick debug print for sanity check </summary>

```markdown
============================================================
[Rank 0] Starting broadcast test with world_size=2
============================================================
[Rank 0] Configuration:
  - nelems: 4
  - dtype: torch.int64, element_size: 8 bytes
  - nelems_bytes: 32
============================================================
[Rank 1] Starting broadcast test with world_size=2
============================================================
[Rank 1] Configuration:
  - nelems: 4
  - dtype: torch.int64, element_size: 8 bytes
  - nelems_bytes: 32
[Rank 1] Non-root source data: [-1, -1, -1, -1]
[Rank 0] Root source data: [100, 101, 102, 103]
[Rank 1] Initial destination: [-999, -999, -999, -999]
[Rank 0] Initial destination: [-999, -999, -999, -999]
[Rank 0] Executing broadcast operation...
[Rank 1] Executing broadcast operation...
[Rank 0] Broadcast operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 1] Broadcast operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 1] Results after broadcast:
[Rank 0] Results after broadcast:
[Rank 1] Destination buffer: [100, 101, 102, 103]
[Rank 1] Expected: [100, 101, 102, 103]
[Rank 0] Destination buffer: [100, 101, 102, 103]
[Rank 0] Expected: [100, 101, 102, 103]
[Rank 1] Match: ✓
[Rank 0] Match: ✓
[Rank 1] ============================================================
[Rank 1] Broadcast test PASSED ✓
[Rank 1] Summary: Root PE 0 broadcasted [100, 101, 102, 103] to all PEs
[Rank 1] ============================================================
[Rank 0] ============================================================
[Rank 0] Broadcast test PASSED ✓
[Rank 0] Summary: Root PE 0 broadcasted [100, 101, 102, 103] to all PEs
[Rank 0] ============================================================
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158514
Approved by: https://github.com/fduwjj, https://github.com/mandroid6
ghstack dependencies: #158511, #158512, #158513
2025-07-21 22:23:26 +00:00
734826d88e Revert "[AOTI] windows package load dev (#158671)"
This reverts commit d42c40976727fed4c9908d4194f26917d0a3da66.

Reverted https://github.com/pytorch/pytorch/pull/158671 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @angelayi can you please help them validate the fixes internally? You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158671#issuecomment-3099570374))
2025-07-21 22:20:46 +00:00
5a56e6a72b Revert "[AOTI] fix extract file failed on Windows. (#158702)"
This reverts commit 7cc1a9546c135f8e7635e0d38aa2bba797f8907d.

Reverted https://github.com/pytorch/pytorch/pull/158702 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158702#issuecomment-3099556215))
2025-07-21 22:18:19 +00:00
e8af168ee0 Revert "[AOTI] normalize path and process model files. (#158705)"
This reverts commit ff0da08f4bc5ee135b495926cd58a36a1c0e1a5b.

Reverted https://github.com/pytorch/pytorch/pull/158705 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158705#issuecomment-3099532516))
2025-07-21 22:16:03 +00:00
97d7dc197f Revert "[AOTI] Convert C-struct zip handling to RAII container (#158687)"
This reverts commit 8ed5e1844c77d952bcea89ca7d0225d876fec4e8.

Reverted https://github.com/pytorch/pytorch/pull/158687 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158687#issuecomment-3099515618))
2025-07-21 22:13:26 +00:00
9498d95b9c [Dynamo][BetterEngineering] Type trace_rules.py (#158679)
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to a core file, `trace_rules.py`
Running
```
mypy torch/_dynamo/trace_rules.py   --linecount-report /tmp/coverage_log
```
| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2564 | 3997 | 64.15% | 34 | 53 | 64.15% |
| This PR | 4022 | 4022 | 100.00% | 53 | 53 | 100.00% |
| Delta    | +1458 | +25 | +35.85% | +19 | 0 | +35.85% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158679
Approved by: https://github.com/williamwen42
2025-07-21 22:12:59 +00:00
0e46f54286 [ROCm][CI] update HIP patch for 6.4.1 (#158651)
patch is intended to fix hipGraph capture for some miopen kernels

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158651
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-21 22:09:36 +00:00
216ba6e5f2 Fix MaskedTensor to device ignored mask (#151205)
Fixes #147140

## Changes

- Add `to` implementation in `MaskedTensor` to support move `mask` to target device

## Test Result

```python
In [1]: import torch
   ...: from torch.masked import as_masked_tensor
   ...: data = torch.tensor([1,2,3])
   ...: mask = torch.tensor([True,False,True])
   ...: mt = as_masked_tensor(data, mask).to('cuda')
   ...: mt.get_data().device, mt.get_mask().device
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(data, mask)
/home/zong/code/pytorch/torch/masked/maskedtensor/_ops_refs.py:354: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(new_data, _maybe_get_mask(args[0]))
Out[1]: (device(type='cuda', index=0), device(type='cuda', index=0))

In [2]: mt.sum(dim=0)
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(data, mask)
Out[2]: MaskedTensor(4, True)

```

```bash
pytest test/test_maskedtensor.py -vv
```

![image](https://github.com/user-attachments/assets/640b809c-b4f0-4aca-a09e-04049017a745)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151205
Approved by: https://github.com/ezyang
2025-07-21 21:44:49 +00:00
c774180e59 Bump requests from 2.32.2 to 2.32.4 in /tools/build/bazel (#158006)
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/releases">requests's releases</a>.</em></p>
<blockquote>
<h2>v2.32.4</h2>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file. (<a href="https://redirect.github.com/psf/requests/issues/6965">#6965</a>)</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
<li>Dropped support for pypy 3.9 following its end of support. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
</ul>
<h2>v2.32.3</h2>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's changelog</a>.</em></p>
<blockquote>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS.</li>
<li>Dropped support for pypy 3.9 following its end of support.</li>
</ul>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="021dc729f0"><code>021dc72</code></a> Polish up release tooling for last manual release</li>
<li><a href="821770e822"><code>821770e</code></a> Bump version and add release notes for v2.32.4</li>
<li><a href="59f8aa2adf"><code>59f8aa2</code></a> Add netrc file search information to authentication documentation (<a href="https://redirect.github.com/psf/requests/issues/6876">#6876</a>)</li>
<li><a href="5b4b64c346"><code>5b4b64c</code></a> Add more tests to prevent regression of CVE 2024 47081</li>
<li><a href="7bc45877a8"><code>7bc4587</code></a> Add new test to check netrc auth leak (<a href="https://redirect.github.com/psf/requests/issues/6962">#6962</a>)</li>
<li><a href="96ba401c12"><code>96ba401</code></a> Only use hostname to do netrc lookup instead of netloc</li>
<li><a href="7341690e84"><code>7341690</code></a> Merge pull request <a href="https://redirect.github.com/psf/requests/issues/6951">#6951</a> from tswast/patch-1</li>
<li><a href="6716d7c9f2"><code>6716d7c</code></a> remove links</li>
<li><a href="a7e1c745dc"><code>a7e1c74</code></a> Update docs/conf.py</li>
<li><a href="c799b8167a"><code>c799b81</code></a> docs: fix dead links to kenreitz.org</li>
<li>Additional commits viewable in <a href="https://github.com/psf/requests/compare/v2.32.2...v2.32.4">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=requests&package-manager=pip&previous-version=2.32.2&new-version=2.32.4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158006
Approved by: https://github.com/Skylion007

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-21 21:35:38 +00:00
a991e285ae [AOTI] Add more default options to compile_standalone (#158560)
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
2025-07-21 21:16:48 +00:00
9e0473b566 removed zero dim cpu logic from fake_tensor.py (#147501)
Fixes #144748
In #144748, the inconsistency between the eager mode and the inductor mode is reported as a bug.
The root cause is fake_tenosr.py's find-common-device method, 0b0da81021/torch/_subclasses/fake_tensor.py (L833), takes zero dim cpu tensor into account but  the device check in adaption.h doesn't.

This fix is to add a list for some ops to bypass zero-dim-cpu-tensor check to align with the eager mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147501
Approved by: https://github.com/ezyang
2025-07-21 21:11:10 +00:00
5e17932c22 [DCP] Add support for ShardedTensor to PgTransport (#158573)
Add support for ShardedTensors in when PGTransport is used for send/recv checkpoints

Test is pulled from https://github.com/pytorch/pytorch/pull/157963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158573
Approved by: https://github.com/meetv18
2025-07-21 21:04:23 +00:00
6b0526a2c4 ban fusion of large amount of reads (#158667)
This is an reland attempt of https://github.com/pytorch/pytorch/pull/157563, but insteading of introducing the `realize_acc_reads_size_threshold` config and setting to a default value, we set it to `None` for now to unblock an internal use case. Will deep dive into the issue and harden the logic in later PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158667
Approved by: https://github.com/yf225
2025-07-21 21:00:40 +00:00
bc379aebe2 Revert "Still run TritonBundler with BundledAOTAutogradCache, save autotune results (#158048)"
This reverts commit 8e57cdb746b4ab28865fdf01532f87b0d21700e9.

Reverted https://github.com/pytorch/pytorch/pull/158048 on behalf of https://github.com/jeffdaily due to rocm failures due to unit test introduced in this PR, but no pre-merge signal available ([comment](https://github.com/pytorch/pytorch/pull/158048#issuecomment-3098746624))
2025-07-21 20:45:21 +00:00
b1a0c34dd3 [pt2 event logging] add configurable prefix (#157678)
Summary:
# Why

make experiments easier to find

# What

- dynamo config to provide a prefix
- use the prefix when sending data to scuba through the self.id_ field

Test Plan:
```
# code edited to set the prefix as `coconutruben-02`
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx040
```

on scuba

```
| autotune_dtypes | autotune_offset | autotune_shape | autotune_strides | event | run_id |
| -----| -----| -----| -----| -----| ----- |
| "torch.float16, torch.float16" | "0, 0" | "4096x3008, 3008x2048" | "[3008, 1], [2048, 1]" | "mm_template_autotuning" | "coconutruben-02-e6bdccc5-6dcf-4d68-9a04-b34f2c6d94fd" |
| "torch.float16, torch.float16" | "0, 0" | "4096x3008, 3008x2048" | "[3008, 1], [2048, 1]" | "mm_template_autotuning" | "coconutruben-02-14165153-5842-4eaa-9e6c-3b0cbc016375" |

```

Rollback Plan:

Differential Revision: D77837550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157678
Approved by: https://github.com/stashuk-olek
2025-07-21 20:41:03 +00:00
851e953f68 ci: Only run lint jobs on relevant files (#158773)
Conditionally run lint jobs on relevant files, this
is mainly targetd at clangtidy since it takes a long time
but also includes mypy since that's an additional 4 minutes
of runtime that we can save.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158773
Approved by: https://github.com/malfet
2025-07-21 20:21:34 +00:00
b66f429827 Fix torch.randint, torch.mul param missing description (#158731)
Wrong separator cause param description truncated.

- Change separator of param and its description
- Remove quote make `torch.dtype` display as reference to the class

## Test Result

### Before

<img width="1092" height="784" alt="image" src="https://github.com/user-attachments/assets/e8d96b26-07e9-40ff-9392-fa6665d4bbe4" />
<img width="1111" height="457" alt="image" src="https://github.com/user-attachments/assets/a3c2e333-f861-4aeb-b4fb-05c8d880ae81" />

### After

<img width="897" height="820" alt="image" src="https://github.com/user-attachments/assets/d1b5cefa-717a-4223-84b0-4346b7eecf44" />
<img width="872" height="409" alt="image" src="https://github.com/user-attachments/assets/96223c37-cd9d-4656-9e55-032d09cbe5c1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158731
Approved by: https://github.com/ngimel
2025-07-21 20:17:27 +00:00
ea5b06ed5b [Dynamo][BetterEngineering] Type side_effects.py (#158605)
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to a core file, `side_effects.py`
Running
```
mypy torch/_dynamo/side_effects.py   --linecount-report /tmp/coverage_log
```
| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  365 | 1166 | 31.30% | 16 | 51 | 31.37% |
| This PR | 1185 | 1185 | 100.00% | 51 | 51 | 100.00% |
| Delta    | +820 | +19 | +68.70% | +35 | 0 | +68.63% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158605
Approved by: https://github.com/StrongerXi
2025-07-21 19:34:14 +00:00
25fbf09d5f Use more fine-grained locks in sym mem kernels (#158523)
Summary: Use only acq in the beginning of the kernel, and only release in the end

Test Plan:
Existing tests

Rollback Plan:

Differential Revision: D78458020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158523
Approved by: https://github.com/drisspg, https://github.com/kwen2501
2025-07-21 19:23:47 +00:00
22920c9138 Grab bag of (mostly) typing improvements (#158075)
Collects some scattershot improvements made while attempting to enable training for AOTInductor. Non-typing changes are:

1. Swapping a few custom searches for the output node in an FX graph for calling `graph.output_node()`.
2. Removing two unused parameters from `torch.export._unlift._unlift`.
3. Switching handles to constants in `cpp_wrapper_cpu` to use C++ references for memory efficiency.
4. Cleaning out unused, unexported imports from `torch/export/__init__.py`, and adding one missing export to `__all__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158075
Approved by: https://github.com/Skylion007
2025-07-21 19:17:01 +00:00
ad2dec1997 [SymmMem] Add NVSHMEM alltoall support into Triton (#158513)
Implements collective alltoall operation for NVSHMEM Triton kernels. Enables data exchange where each PE sends unique data to every other PE in the team.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_alltoall`

<details>
<summary>Quick debug print for sanity check</summary>

```markdown
============================================================
[Rank 0] Starting alltoall test with world_size=2
============================================================
[Rank 0] Configuration:
  - nelems_per_pe: 2
  - dtype: torch.int64, element_size: 8 bytes
  - nelems_bytes: 16
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
[Rank 0] Preparing source data:
[Rank 1] Preparing source data:
  - Data for PE 0: [0, 0] (indices 0-1)
  - Data for PE 1: [1, 1] (indices 2-3)
[Rank 0] Complete source buffer: [0, 0, 1, 1]
  - Data for PE 0: [100, 100] (indices 0-1)
  - Data for PE 1: [101, 101] (indices 2-3)
[Rank 1] Complete source buffer: [100, 100, 101, 101]
[Rank 1] Initial destination buffer: [-1, -1, -1, -1]
[Rank 0] Initial destination buffer: [-1, -1, -1, -1]
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[rank0]:[W716 15:30:06.215666766 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[rank1]:[W716 15:30:06.215752786 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
NCCL version 2.27.5+cuda12.4
[Rank 1] Executing alltoall operation...
[Rank 0] Executing alltoall operation...
[Rank 1] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 0] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 0] Results after alltoall:
[Rank 1] Results after alltoall:[Rank 0] Destination buffer: [0, 0, 100, 100]
[Rank 0] Verifying results:
  - From PE 0 (indices 0-1):
    Expected: [0, 0]
    Actual:   [0, 0]
[Rank 1] Destination buffer: [1, 1, 101, 101]
[Rank 1] Verifying results:
  - From PE 0 (indices 0-1):
    Expected: [1, 1]
    Actual:   [1, 1]
    Match:    ✓
    Match:    ✓
  - From PE 1 (indices 2-3):
    Expected: [100, 100]
  - From PE 1 (indices 2-3):
    Expected: [101, 101]
    Actual:   [100, 100]
    Actual:   [101, 101]
    Match:    ✓
    Match:    ✓
[Rank 0] ============================================================
[Rank 0] Summary: ALL TESTS PASSED ✓
[Rank 0] Data flow explanation:
  - Each rank sends 2 elements to every other rank
[Rank 1] ============================================================
[Rank 1] Summary: ALL TESTS PASSED ✓
  - Rank 0 sent: [0, 0, 1, 1]
[Rank 1] Data flow explanation:
  - Each rank sends 2 elements to every other rank
  - Rank 0 received: [0, 0, 100, 100]
  - My data for PE 0 (0) went to PE 0's buffer
  - I received PE 0's data for me (0)
  - My data for PE 1 (1) went to PE 1's buffer
  - Rank 1 sent: [100, 100, 101, 101]
  - I received PE 1's data for me (100)
[Rank 0] ============================================================
  - Rank 1 received: [1, 1, 101, 101]
  - My data for PE 0 (100) went to PE 0's buffer
  - I received PE 0's data for me (1)
  - My data for PE 1 (101) went to PE 1's buffer
  - I received PE 1's data for me (101)
[Rank 1] ============================================================
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158513
Approved by: https://github.com/fduwjj, https://github.com/mandroid6
ghstack dependencies: #158511, #158512
2025-07-21 19:14:47 +00:00
662dd7db5b [cutlass backend] cache maybe_append_choices (#156781)
This PR attempts to cache:
* codegen for cutlass backend for the same kernel. Even if runtime params are different.

From some profiling, most of the time spent is on render. So we only target to cache that part for now.

The output of render is `code`, and we are able to cache that easily. Also, I have to cache size_args, since it depends on `kernel.get_dynamic_shape_args()`, which depends on the state of self when we call render.

make_key is doing most of the work here: We are hashing on input node layouts, output node layout and op.configuration_name() (this is what hash(op) would do anyway).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156781
Approved by: https://github.com/ColinPeppler
2025-07-21 19:02:39 +00:00
72db0a98a3 Revert "[DTensor] Assert DTensorSpec has valid placements (#158133)"
This reverts commit 1839e8d04b81ee6eda0cff6fbfc218a7a600f6f7.

Reverted https://github.com/pytorch/pytorch/pull/158133 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D78496151 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158133#issuecomment-3097994857))
2025-07-21 18:54:07 +00:00
8ed5e1844c [AOTI] Convert C-struct zip handling to RAII container (#158687)
Attempts to fix a memory leak reported in #158614 by wrapping manually managed MiniZ C-structs in an RAII container. I have been unable to reproduce the reported leak, but this seems like the most likely candidate.

Fixes #158614 (hopefully)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158687
Approved by: https://github.com/desertfire
2025-07-21 18:53:14 +00:00
393fecb2cc [Optimus][Unit test] clean up the unit test (#158696)
Summary: We should only patch the specific pattern(s) for each unit test.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```

Buck UI: https://www.internalfb.com/buck2/f8d37674-91c4-4244-90fa-f24fc3f91e4b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275088644915
Network: Up: 100KiB  Down: 233KiB  (reSessionID-92039f44-bc6f-4e78-87b1-93bca1bd1c66)
Analyzing targets. Remaining     0/296
Executing actions. Remaining     0/20196                                                                    5.8s exec time total
Command: test.     Finished 2 local, 2 cache (50% hit)                                                      4.6s exec time cached (79%)
Time elapsed: 3:55.1s
Tests finished: Pass 13. Fail 0. Fatal 0. Skip 0. Build failure 0

Rollback Plan:

Differential Revision: D78598127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158696
Approved by: https://github.com/Skylion007, https://github.com/masnesral
2025-07-21 18:05:09 +00:00
9285b8245c [BE][testing] fix test_cat_max_autotune_triton (#158589)
Summary: This test often fails internally -- looks like it's because autotuning sometimes chooses not to do the epilog tuning. Turning off `benchmark_epilogue_fusion` seems to fix.

Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -- --exact 'caffe2/test/inductor:max_autotune - test_cat_max_autotune_triton (caffe2.test.inductor.test_max_autotune.TestMaxAutotune)' --run-disabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158589
Approved by: https://github.com/eellison
2025-07-21 18:02:18 +00:00
637e75433c [BE] always use uv pip if possible in pip_init.py for lintrunner init (#157199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157199
Approved by: https://github.com/ezyang, https://github.com/ZainRizvi
2025-07-21 17:56:05 +00:00
a78fb63dbd [build] pin setuptools>=77 to enable PEP 639 (#158104)
For reference here is the link PEP 639: [peps.python.org/pep-0639](https://peps.python.org/pep-0639/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158104
Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman
2025-07-21 17:46:40 +00:00
7205458b85 [Easy] Show some clear error when torch.ops.load_library fails. (#157524)
**Background**:

```Shell
torch       2.5.1+cpu
torchvision 0.20.1
```

```Python
import torch
import torchvision

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```

**Cause**:

```
torchvision's .so file lacks some symbol definitions, because these symbols come from CUDA, but the current environment does not have CUDA and GPU. The above error message is very confusing.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
2025-07-21 17:32:31 +00:00
35f1b4ad9e Revert "Fused RMSNorm implementation (#153666)"
This reverts commit 15ef4f28df0a14e9f0d55a57a4e2db415a303be7.

Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking tests internally. @albanD can you please help land this change?You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts.  See D78599667 for more info ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3097690935))
2025-07-21 17:31:42 +00:00
cbe1cb7018 [CMake] Move xpu flag to xpu.cmake (#158542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158542
Approved by: https://github.com/gujinghui, https://github.com/ezyang
2025-07-21 17:19:59 +00:00
9894d43b6c [AOTI] explicit aoti wrapper functions for Windows. (#158713)
On Windows, we need to explicit declaration for export APIs. Because the package loader call these API via GetProcAddress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158713
Approved by: https://github.com/desertfire
2025-07-21 15:59:44 +00:00
f168cf49a8 [BE] Always use python 3.9 for pre-push hook's lintrunner (#158693)
A follow up to https://github.com/pytorch/pytorch/pull/158389

Sets up the pre-push lintrunner to always use python 3.9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158693
Approved by: https://github.com/atalman
2025-07-21 15:19:27 +00:00
393377d215 Revert "[CI] update flake8 and mypy lint dependencies (#158720)"
This reverts commit a527e816935957a164d74dd7c5069310b2857695.

Reverted https://github.com/pytorch/pytorch/pull/158720 on behalf of https://github.com/malfet due to This broke lint, see 8e57cdb746/1 ([comment](https://github.com/pytorch/pytorch/pull/158720#issuecomment-3096893256))
2025-07-21 13:58:50 +00:00
8e57cdb746 Still run TritonBundler with BundledAOTAutogradCache, save autotune results (#158048)
When running BundledAOTAutogradCache with precompile, we still need to run triton bundling so that the precompiled CompiledFxGraph has triton cuda kernels. We also pre save the autotune results in the precompile artifact.

It would be even better to pre trim the cuda kernels on save and apply them, which we can work on later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158048
Approved by: https://github.com/zhxchen17
2025-07-21 13:35:46 +00:00
d5a29fc58a De-abstract premature generalization with InductorWrapper (#158528)
See docblock on InductorWrapper for the distinction.  This will matter
on a later refactor PR where I will change the signature for one of
these but not the other.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158528
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158449
2025-07-21 13:27:07 +00:00
979fae761c Rename modules in AOTAutograd (#158449)
Fixes https://github.com/pytorch/pytorch/issues/158382

```
renamed:    torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py -> torch/_functorch/_aot_autograd/graph_capture.py
renamed:    torch/_functorch/_aot_autograd/traced_function_transforms.py -> torch/_functorch/_aot_autograd/graph_capture_wrappers.py
renamed:    torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py -> torch/_functorch/_aot_autograd/graph_compile.py
```

Everything else is ONLY import changes. I did not rename any functions
even if we probably should have.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158449
Approved by: https://github.com/jamesjwu
2025-07-21 13:27:07 +00:00
1eb6b2089f [Inductor] Set the default value of min_chunk_size to 512 (#150762)
Change the default value of min_chunk_size from 4096 to 512 to allow more for loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-07-21 12:46:05 +00:00
bbc32d680f [SymmMem] Add NVSHMEM sync_all support into Triton (#158512)
Adds `sync_all()` function for local store visibility synchronization in NVSHMEM Triton kernels. Provides memory ordering for local operations without remote completion guarantees.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_sync`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158512
Approved by: https://github.com/fduwjj
ghstack dependencies: #158511
2025-07-21 10:27:59 +00:00
a527e81693 [CI] update flake8 and mypy lint dependencies (#158720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158720
Approved by: https://github.com/Skylion007
2025-07-21 09:24:29 +00:00
1c6328a588 [EZ][BE] Fix compilation warning in Pooling.metal (#158729)
This one
```
Compiling /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal to Pooling_30.air
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:172:1: warning: non-void function does not return a value in all control paths [-Wreturn-type]
}
^
1 warning generated.
```
Although functionally one is not supposed to hit this codepath ever, it's not not to throw warning
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158729
Approved by: https://github.com/Skylion007
2025-07-21 04:34:14 +00:00
70b4a8880b [SymmMem] Add NVSHMEM barrier_all, my_pe, n_pes support into Triton (#158511)
Adds device-side barrier synchronization and PE identification functions for NVSHMEM Triton integration. Includes `barrier_all()` for collective synchronization and `my_pe()`/`n_pes()` for PE identification within kernels.

We are launching with cooperative grid launch (for all the PRs in this stack) because the `nvshmemx_collective_launch` function must be used to launch kernels on the GPU when the kernels use NVSHMEM synchronization or collective APIs, and `nvshmemx_collective_launch` essentially boils down to a CUDA cooperative group launch.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_barrier`

Also tested that if you remove the barrier, you get an assertion error/race conditions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158511
Approved by: https://github.com/fduwjj
2025-07-21 02:37:33 +00:00
5e1232871b Revert "[build] pin setuptools>=77 to enable PEP 639 (#158104)"
This reverts commit a4ec381302f8acd279033707b182bed30ffd2091.

Reverted https://github.com/pytorch/pytorch/pull/158104 on behalf of https://github.com/malfet due to This break inductor-perf-nighly-macos by failing to build torchvision, see https://github.com/pytorch/pytorch/issues/158728 ([comment](https://github.com/pytorch/pytorch/pull/158104#issuecomment-3095048940))
2025-07-21 02:24:11 +00:00
ff0da08f4b [AOTI] normalize path and process model files. (#158705)
Continued to https://github.com/pytorch/pytorch/pull/158702 , split `zip_filename_str` and real file path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158705
Approved by: https://github.com/desertfire
2025-07-21 01:08:59 +00:00
2cdafab0bd [BE] Raise ValueError from torch.cat meta func (#158249)
Followup after https://github.com/pytorch/pytorch/pull/155460

From [Python documentation](https://docs.python.org/3/library/exceptions.html#ValueError):
> Raised when an operation or function receives an argument that has the right type but an inappropriate value, and the situation is not described by a more precise exception such as IndexError.

Raise [`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError) when input-output types are incompatible with each other
> Raised when an operation or function is applied to an object of inappropriate type. The associated value is a string giving details about the type mismatch.

> This exception may be raised by user code to indicate that an attempted operation on an object is not supported, and is not meant to be. If an object is meant to support a given operation but has not yet provided an implementation, [NotImplementedError](https://docs.python.org/3/library/exceptions.html#NotImplementedError) is the proper exception to raise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158249
Approved by: https://github.com/jbschlosser, https://github.com/Skylion007, https://github.com/albanD
2025-07-20 23:49:18 +00:00
4b02bd76d3 DCP safetensors test fix (#158685)
https://github.com/pytorch/pytorch/pull/158069 removed the consolidated output path argument without updating the test. Reported by a user here https://github.com/pytorch/pytorch/pull/156705#issuecomment-3090748034.
Adding back the logic from the original PR https://github.com/pytorch/pytorch/pull/158069 and fixing the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158685
Approved by: https://github.com/teja-rao
2025-07-20 22:52:54 +00:00
2e038793ef [inductor][templates] Finalize all registered hooks (#157270)
This refactor ensures all registered template hooks have been finalised before accessing the code object of the template. In `simd.SimdScheduling.codegen_template` the template hooks are finalised manually with `template.finalize_hook(hook_name)` calls, so it is the responsibility of the caller to finalise all the template hooks. This PR adds:
- `RenderPartial.finalize_remaining` a function that can be called at the end to finalise the remaining active hooks after a selection of hooks have been finalised manually.
- A test with a custom template implementation that registers custom hooks that the scheduler needs to finalise. This test should fail if the scheduler does not finalise the registered custom hook.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157270
Approved by: https://github.com/eellison
2025-07-20 22:07:32 +00:00
5e149a6482 Add deprecation warning (#158203)
Summary: export_for_training exist because we couldn't migrate internal usages of export to the final IR. Now that we have completed the migration, we should deprecate and delete this API.

Test Plan:
CI

Rollback Plan:

Differential Revision: D78240836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158203
Approved by: https://github.com/JacobSzwejbka
2025-07-20 17:02:01 +00:00
badf002014 [Reland] Add warning about removed sm50 and sm60 arches (#158700)
Related to https://github.com/pytorch/pytorch/issues/157517

Detect when users are executing torch build with cuda 12.8/12.9 and running on Maxwell or Pascal architectures.
We would like to include reference to the issue: https://github.com/pytorch/pytorch/issues/157517 as well as ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.

Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
    Found <GPU Name> which is of cuda capability 5.0.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is 7.0.

  warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
                        Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
                        Please see https://github.com/pytorch/pytorch/issues/157517
                        Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```

Please note I reverted original PR https://github.com/pytorch/pytorch/pull/158301 because it broke internal users. This is a reland, added added check for non empty torch.cuda.get_arch_list()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158700
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/eqy
2025-07-20 14:57:46 +00:00
4869f71170 don't set CUDA_MODULE_LOADING (#158712)
If needed, it'll be set in `_C._cuda_init()`. setenv is not threadsafe, so this can cause segfaults due to getenv/setenv races.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158712
Approved by: https://github.com/eqy
2025-07-20 01:36:26 +00:00
b4abf41425 Raise BufferError for DLPack buffer-related errors. (#150691)
This PR addresses the Array API documentation for [`__dlpack__`][1] and
[`from_dlpack`][2] by making some buffer-related errors `BufferError`
instead of `RuntimeError`, e.g. incompatible dtype, strides, or device.

[1]: https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html
[2]: https://data-apis.org/array-api/latest/API_specification/generated/array_api.from_dlpack.html#from-dlpack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150691
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #150216, #150217, #150218
2025-07-20 00:46:21 +00:00
a10f15718d [DLPack] Add support for missing keyword-arguments. (#150218)
This PR introduces the rest of the keyword-arguments added in DLPack
version 2023.12: `dl_device` and `copy`.

In summary, we handle these arguments in the C++ implementation of
`to_dlpack(...)` at _torch/csrc/Module.cpp_, by calling the
`maybeCopyTensor` function at _aten/src/ATen/DLConvertor.cpp_. It also
introduces the following changes:

- Add a new Python API `torchDeviceToDLDevice()`, which is simply a
  refactoring of the `getDLDevice()` function at
  _aten/src/ATen/DLConvertor.cpp_.
- Add both keyword-arguments to the `from_dlpack()` function at
  _torch/utils/dlpack.py_ and to the `Tensor.__dlpack__()` dunder
  method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150218
Approved by: https://github.com/albanD
ghstack dependencies: #150216, #150217
2025-07-20 00:46:20 +00:00
1d526fe78f Fix DLPack stream logic. (#150217)
This PR fixes the logic for dealing with CUDA and ROCm streams whenever
we are trying to create a DLPack capsule from a tensor.

In summary, this PR:

- Uses the legacy default stream if `tensor.__dlpack__(stream=None)` is
  called for a CUDA tensor.
- Errors if `tensor.__dlpack__(stream=2)` is called for a CUDA tensor:
  PyTorch doesn't support the per-thread default stream.
- Errors if `tensor.__dlpack__(stream=stream)`, where `stream` is 1 or
  2, is called for a CUDA tensor using ROCm.

For more details, see [the documentation][1].

[1]: https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150217
Approved by: https://github.com/msaroufim, https://github.com/albanD
ghstack dependencies: #150216
2025-07-20 00:46:20 +00:00
b64f338da4 [DLPack] add NumPy exchange tests. (#150216)
This PR resolves an old TODO that requested NumPy DLPack exchange tests
once version 1.22 was required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150216
Approved by: https://github.com/msaroufim, https://github.com/albanD
2025-07-20 00:46:20 +00:00
a1cfe7f1df [nativert] benchmark util (#158678)
Differential Revision: D78514241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158678
Approved by: https://github.com/SherlockNoMad, https://github.com/georgiaphillips
2025-07-20 00:28:09 +00:00
d36afac83b Build domain libraries for all workflows with TorchBench config (#158601)
They are expensive GPU runners and should not spend time building packages

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158601
Approved by: https://github.com/ZainRizvi
2025-07-19 21:51:39 +00:00
7cc1a9546c [AOTI] fix extract file failed on Windows. (#158702)
Changes:
1. rename zip index name, and keep it out of normalize path.
2. normalize output path for extract file.

Extract files successful:
<img width="683" height="247" alt="image" src="https://github.com/user-attachments/assets/72dff7b9-5ec0-4523-a6ee-7768b37bbe63" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158702
Approved by: https://github.com/angelayi
2025-07-19 08:58:42 +00:00
7cc5d03dfc Document the rest of the specific optimizer module APIs (#158669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158669
Approved by: https://github.com/albanD
ghstack dependencies: #158483
2025-07-19 07:27:15 +00:00
f73594164a [BE] document Adadelta and Adagrad APIs properly (#158483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158483
Approved by: https://github.com/albanD
2025-07-19 07:27:15 +00:00
a9f84021fb [CI] Fixes CI for CUDA Version > 12.9 (#157385)
Compute capabilities older than volta (inclusive) is no longer supported in CUDA Version > 12.9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157385
Approved by: https://github.com/eqy
2025-07-19 06:51:57 +00:00
22d82222c6 GenAI Layer Benchmark (#158536)
This PR adds GenAI layer benchmark. It compares pytorch eager, pytorch compiler, liger, and quack.

It covers all kernels supported by [quack](https://github.com/Dao-AILab/quack?tab=readme-ov-file#kernels-) (CrossEntropy Fwd/Bwd, Softmax Fwd/Bwd, RMSNorm Fwd/Bwd, LayerNorm Fwd) and LayerNormBwd.

## Motivations

- Many OSS users asked how to properly benchmark torch.compile generated kernels. One common error is to compile a kernel/layer for one shape (e.g., batch size=1) and benchmark for another shape (e.g., batch size = 1024), which leads to bad performance. This provides an simple & clear example for proper benchmark.
- We recently added GenAI model benchmark (based on [vLLM](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm)). But it's usually hard to optimize models directly due to complexity. Layer benchmarks are easier to reason and optimize.

## Key Settings

- Avoid reusing a kernel specializing on 1 shape for benchmark on another shape.
```python
torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000
```

- For forward, people may mark batch size as dynamic to avoid runtime recompilation. We respect the setting in this kernel-level benchmark.
```
torch._dynamo.mark_dynamic(x, 0)
```

GPU: H100 (devvm006.dkl0)

Results: [P1874246170](https://www.internalfb.com/phabricator/paste/view/P1874246170)

Note: for numerical accuracy, we use the default tolerance of torch.testing.assert_close (i.e., for `torch.bfloat16`, use rtol `1.6e-2` and atol `1e-5`). It shows numerical issues for some backends and kernels.

Next step is to add roofline analysis, add to ci for checking regression, cover more GenAI Kernels, and include GenAI Layers for common fusion patterns.

<img width="3564" height="2368" alt="CrossEntropyBackward_bench" src="https://github.com/user-attachments/assets/7aa77ad1-83eb-41ea-a27d-50fd5b1dd6be" />
<img width="3564" height="2368" alt="CrossEntropyForward_bench" src="https://github.com/user-attachments/assets/a26ec028-3791-4a41-a12a-05e10f60e9aa" />
<img width="3564" height="2368" alt="LayerNormBackward_bench" src="https://github.com/user-attachments/assets/cc6673ed-c148-4dd2-a729-5f02e717ab3e" />
<img width="3564" height="2368" alt="LayerNormForward_bench" src="https://github.com/user-attachments/assets/f71f9f9d-7b45-4ce7-89d0-e9bce727efae" />
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/e012821a-b7e6-4e83-a24c-c97fa8cd37b5" />
<img width="3564" height="2368" alt="RMSNormForward_bench" src="https://github.com/user-attachments/assets/2d52ee1e-9a8c-4bd1-a180-97b93f07171d" />
<img width="3564" height="2368" alt="SoftmaxBackward_bench" src="https://github.com/user-attachments/assets/02aad056-3ce1-4b40-8cfe-adae81fd017a" />
<img width="3564" height="2368" alt="SoftmaxForward_bench" src="https://github.com/user-attachments/assets/779f6b0d-a102-4164-8300-86fff0329ddf" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158536
Approved by: https://github.com/yf225, https://github.com/eellison
2025-07-19 05:41:01 +00:00
5cde34473c Fix MakeTensor::computeStorageSize() (#158690)
For tensor with non-zero offset, it must be multiplied by element size

Add regression test by creating Tensor in array of 6 elements with offset 3, which before the fix crashed with
```
C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15
Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib)
frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib)
frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib)
frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib)
frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib)
frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib)
frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib)
frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib)
frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib)
frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib)
frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib)
frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic)
frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158690
Approved by: https://github.com/angelayi
2025-07-19 05:21:33 +00:00
fac0be7b9c [async-TP] Turn asserts back into silent skips (#158572)
https://github.com/pytorch/pytorch/pull/149946 modified some checks that verify whether async-TP is "applicable" to a given collective operation in a graph. Before, the pattern-mathcing+replacement would just be skipped, but now these are asserts that fail and raise.

This is causing concrete issues in some graphs where 2-dimensional device meshes are being used (e.g., TP + CP) but only one dimension has symm-mem enabled. See #158569.

This PR is turning these asserts back into harmless early-exits. Note that this only needed to be done for reduce-scatters, as it was already the case for all-gathers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158572
Approved by: https://github.com/danielvegamyhre, https://github.com/atalman
2025-07-19 04:54:38 +00:00
64dabb2cf5 only fail regressions>10% on pr_time benchmarks (#158577)
Moving to a new framework, maintaitning the pr_time benchmark test right now is hard and often breaking.
1. only fail PRs >10% regressions.
2. post monitor with pr_time benchmarks dashboard (oncall), and update expected results (frequently or on big changes)
(supposed to already be doing https://www.internalfb.com/unidash/dashboard/pt2_diff_time_metrics)
3. setting up some one detections  detectors warnings that would be triggered at regressions and notify internally post land
https://www.internalfb.com/monitoring/detector/1140915271179237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158577
Approved by: https://github.com/xmfan, https://github.com/janeyx99
2025-07-19 04:35:31 +00:00
ab557421a4 [cca] [c10d] Refactor CUDAEventCache into separate files (#158616)
Summary:
Refactored CUDAEventCache from ProcessGroupNCCL.hpp/.cpp into dedicated header and implementation files for better code organization and maintainability.

Split out CUDAEventCache into:
- New header file: CUDAEventCache.hpp
- New implementation file: CUDAEventCache.cpp
- Updated build_variables.bzl to include the new file

This change improves code maintainability, readability, and follows better code organization practices.
---
> Generated by [Confucius Code Assist (CCA)](https://www.internalfb.com/wiki/Confucius/Analect/Shared_Analects/Confucius_Code_Assist_(CCA)/)
[Session](https://www.internalfb.com/confucius?session_id=61b9029a-636b-11f0-9d9a-f1bcc55be1ce&tab=Chat), [Trace](https://www.internalfb.com/confucius?session_id=61b9029a-636b-11f0-9d9a-f1bcc55be1ce&tab=Trace)

Test Plan:
Verified build with:
```
buck build //caffe2/test/distributed:c10d
```
---
> Generated by [Confucius Code Assist (CCA)](https://www.internalfb.com/wiki/Confucius/Analect/Shared_Analects/Confucius_Code_Assist_(CCA)/)
[Session](https://www.internalfb.com/confucius?session_id=61b9029a-636b-11f0-9d9a-f1bcc55be1ce&tab=Chat), [Trace](https://www.internalfb.com/confucius?session_id=61b9029a-636b-11f0-9d9a-f1bcc55be1ce&tab=Trace)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158616
Approved by: https://github.com/fduwjj
2025-07-19 02:51:28 +00:00
90b082e207 enable_caching_generated_triton_templates=True by default (#158592)
Got some risk, but good to catch issues if there is any, easy to revert single flag flip.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158592
Approved by: https://github.com/eellison
2025-07-19 02:19:34 +00:00
a741094159 Build domain libraries on the build job (#158600)
By setting the name of the domain libraries to build via `BUILD_ADDITIONAL_PACKAGES` environment variable, the build job will build them and make them available as artifacts in the same way as the PyTorch CI wheel. To ensure that this doesn't break CI, the test job will still build them as usual if the wheels are not there.  Building dependencies like FBGEMM on the test job is bad, especially for GPU jobs, because it leave the GPU resource idle

Fixes https://github.com/pytorch/pytorch/issues/152024

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158600
Approved by: https://github.com/yangw-dev
ghstack dependencies: #158598, #158599
2025-07-19 02:03:50 +00:00
2955acaed6 Clean up some unused build env variables (#158599)
* Parameter build-with-debug isn't needed, it isn't even passed into Docker. Debug build is detected via the build environment name
* AWS_DEFAULT_REGION is a leftover from ARC and isn't used anywhere in .ci/pytorch nor .github

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158599
Approved by: https://github.com/cyyever, https://github.com/ZainRizvi
ghstack dependencies: #158598
2025-07-19 01:59:00 +00:00
2c16eb9f3d [dynamo] Support more basic output types for nonstrict_trace (#157969)
Fixes #157397 and improves the user-facing error message for remaining
unsupported cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157969
Approved by: https://github.com/zou3519
2025-07-19 00:59:54 +00:00
c2c88846a9 Revert "[Easy] Show some clear error when torch.ops.load_library fails. (#157524)"
This reverts commit 555f3562541992b66a550eca8e8740884b1247f8.

Reverted https://github.com/pytorch/pytorch/pull/157524 on behalf of https://github.com/wdvr due to reverting for now to reopen the discussion ([comment](https://github.com/pytorch/pytorch/pull/157524#issuecomment-3091317252))
2025-07-19 00:45:31 +00:00
5b40f6581e Revert "Add warning about removed sm50 and sm60 arches (#158301)"
This reverts commit fb731fe371cb1b5bf95de84b19c213590526acb2.

Reverted https://github.com/pytorch/pytorch/pull/158301 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158301#issuecomment-3091307023))
2025-07-19 00:32:04 +00:00
d42c409767 [AOTI] windows package load dev (#158671)
changes:
1. add extract file fail handler for Windows develop.
2. normalize more file paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158671
Approved by: https://github.com/angelayi
2025-07-19 00:06:40 +00:00
a3aacd6cb2 [DTensor] fix copy_ strategy (#158538)
The previous strategy directly used 'self' input strategy for 'src'
input.  The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.

E.g. for this program, broadcasting will occur on dims 0,1,3 of self.

```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```

These are the correct sharding combinations:

|   self   |     src |
|-------|------|
| Shard(0)  |   Replicate() |
| Shard(1)  |   Replicate() |
| Shard(2)  |   Shard(0) |
| Shard(3)  |   Shard(1) |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158490
2025-07-18 23:44:43 +00:00
36bddcd18c [DTensor] Fix default_strategy and rename for clarity (#158490)
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
  by mixing input_specs that were frozen from
  select_strategy.strategies[0] with output_specs that varied across
  select_strategy.strategies[0..N] (e.g. we could create a nonsense
  strategy like input:Shard(0) output(Replicate) for an op like clone
- fixes the redistribute costs: they should not actually be 0, they
  should be the cost of redistributing our single input from another
  strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
  input strategy instead of creating this new object
- Currently, using default_strategy is incorrect becuase it maps 'self'
  tensor's strategies directly onto 'src' tensor without accounting for
  the fact that copy_ supports broadcasting a smaller rank tensor into a
  larger one.

Separates out copy_  op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR

Renames to `propagate_single_input_strategy` since that's more
descriptive

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2025-07-18 23:44:42 +00:00
15ef4f28df Fused RMSNorm implementation (#153666)
Relevant #72643

Benchmarked versus unfused torch implementation and torch.compile implementation. Around 9x speedup vs unfused implementation on cuda and slightly faster vs inductor compile on 5090.

```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)

    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```

Here are the triton compile tests ran on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)

device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/albanD
2025-07-18 23:24:21 +00:00
60b9b06a53 [caffe2] Fix Missing override in get_buffer of NCCLSymmetricMemory (#158597)
Summary:
Fix the error that occurs in the devarm environment when compiling with Clang:
```
caffe2/torch/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu:97:20: error: 'get_buffer' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
97 | virtual at::Tensor get_buffer(int
| ^
caffe2/torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp:56:20: note: overridden virtual function is here
56 | virtual at::Tensor get_buffer(int rank, c10::IntArrayRef sizes, c10::ScalarType dtype, int64_t storage_offset) = 0;
| ^
1 error generated.
```

Test Plan:
See D78520305

Rollback Plan:

Differential Revision: D78517953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158597
Approved by: https://github.com/janeyx99
2025-07-18 23:12:29 +00:00
a835dbc096 [c10d][ez] Fix error message to reflect the correct API name (#158668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158668
Approved by: https://github.com/VieEeEw
2025-07-18 23:10:47 +00:00
f76f4abf3f Track monitor (#156907)
Tracking gpu mem allocation, we were tracking the gpu bandwidth memory, the mem allocation is the one reflect wether the gpu is oom or not, upcoming ui fix.

UI fix: https://github.com/pytorch/test-infra/pull/6878/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156907
Approved by: https://github.com/huydhn
2025-07-18 22:54:13 +00:00
be483a5481 setup pinned commit for vllm in pytorch ci (#158591)
Set up pinned commit for vllm in nightly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158591
Approved by: https://github.com/seemethere, https://github.com/huydhn
2025-07-18 22:30:20 +00:00
bc7b1f5252 [AOTI] Use libstdc++ only for fbcode cpu case (#158659)
Differential Revision: D78567218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158659
Approved by: https://github.com/kflu, https://github.com/zoranzhao
2025-07-18 22:27:10 +00:00
07c4c2a792 [dynamo][be] hide warnings without invalidating warnings cache (#158520)
I feel uneasy about touching `__warningregistry__` since it is undocumented and private surface. The only public API hook that doesn't increment warnings version seems to be https://docs.python.org/3/library/warnings.html#warnings.showwarning.

So we could wack a mole all the warnings muters in compile to just not display warnings, and we wouldn't invalidate warnings cache. This PR adds it for torch/_dynamo, and I didn't find any warnings versioning mutation from torch/_inductor.

There is a behavior change if someone calls a compiled graph with simplefilter("error"):
```python
# e.g. test/dynamo_expected_failures/TestAutogradFallback.test_no_autograd_kernel_inplace_mode_nothing
with warnings.catch_warnings():
    warnings.simplefilter("error")  # turns all warnings into errors
    compiled_fn()  # will throw if any of the muted warnings fire
```

FIXES https://github.com/pytorch/pytorch/issues/128427

A note for the future: The warnings module doesn't offer a thread safe way of using it. Even regular filters have this problem, directly editing `__warningregistry__` would be very bad, and this PR would mute all threads. Someone will need to build a thread safe warnings interface.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158520
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-07-18 22:02:31 +00:00
89850bbc07 [Dynamo] Use proper sources for constructing dataclass defaults (#157993)
Partially fixes https://github.com/pytorch/pytorch/issues/154009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157993
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2025-07-18 21:51:40 +00:00
3bb729df97 Revert "Fix test consolidate hf safetensors (#157386)"
This reverts commit fa1c20ae9285f7994a73d2d06025065f96b67a57.

Reverted https://github.com/pytorch/pytorch/pull/157386 on behalf of https://github.com/jithunnair-amd due to Need to revert this so we can revert PR 156705, which introduced errors on ROCm CI. These errors were not seen on CUDA CI because CUDA CI docker images do not have safetensors installed and the test silently passes ([comment](https://github.com/pytorch/pytorch/pull/157386#issuecomment-3090706074))
2025-07-18 21:00:12 +00:00
e3351b3ddf Revert "[DCP][HF] [ez]Change where sharded tensors are saved (#158069)"
This reverts commit 627ba411366bcc15019c49756d3f22fd3914bd50.

Reverted https://github.com/pytorch/pytorch/pull/158069 on behalf of https://github.com/jithunnair-amd due to Didn't remove reference to `consolidated_output_path` in test_hf_safetensor_e2e.py; CUDA runs do not surface issue because safetensors is not installed and the test silently passes ([comment](https://github.com/pytorch/pytorch/pull/158069#issuecomment-3090692336))
2025-07-18 20:54:19 +00:00
1ab1ab38a0 Use linux.12xlarge.memory to build for H100/sm_90 (#158598)
Use a bigger runner here because CUDA_ARCH 9.0 is only built for H100 or newer GPUs, so it doesn't benefit much from existing compiler cache from trunk. Also use a memory-intensive runner here because memory is usually the bottleneck

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158598
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2025-07-18 20:31:56 +00:00
8b2a650572 pt2_remote_cache: Log sample for failures, and log the explicit reason we're faling. (#156874)
Summary: This allows us to start alerting on cache failures, based on scuba data

Test Plan:
Added new tests explicitly for the Remote Cache API.

Note that we have existing tests for memcache, but not for manifold AFAICT.

There are two potential wrinkles. One we're adding a new field (and everything uses ScubaData AFAICT, so this should just work).

The other one is the implicit api contract that if the sample is None, then it will be ignored (and not crash). I believe the second one is implemented correctly (and tested). The first one is a little more nebulous, but I think won't cause any breakages.

Also manually ran a compile and made sure it didn't break - P1851504490 as well as forcing it to break and checking we didn't screw up the exception handling - P1851504243

Rollback Plan:

Differential Revision: D77054339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156874
Approved by: https://github.com/oulgen, https://github.com/masnesral
2025-07-18 20:28:27 +00:00
ec0b538961 [inductor] Make times and repeat parameters command line args (#158590)
Summary: Small change to make the `times` and `repeat` variables controllable as command line args.

Test Plan:
Execute:
```
buck2 run <run params> <path>:inductor_benchmark -- --times=1 --repeat=1
```
Only runs once, and without passing the args it runs with default values of 10.

Rollback Plan:

Reviewed By: malfet

Differential Revision: D78458680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158590
Approved by: https://github.com/FindHao, https://github.com/malfet
2025-07-18 20:07:55 +00:00
599f94e7b9 [AOTI] add Windows file ext to package loader. (#158578)
Add `object` and `extension` file type for Windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158578
Approved by: https://github.com/angelayi
2025-07-18 19:57:12 +00:00
04ac258cf6 [BE][testing] Fix test_cudacodecache.py (#158259)
Summary: According to internal test failures, looks like we're missing a check for cuda: https://fburl.com/testinfra/eznzkyha

Test Plan:c`buck test`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158259
Approved by: https://github.com/exclamaforte, https://github.com/BoyuanFeng
2025-07-18 19:56:13 +00:00
1b5fdb23b9 [BE] Add pre-push hook for lintrunner to the PyTorch repo (#158389)
Adds a pre-commit hook (technically a pre-push hook) to the PyTorch repo.
**This is currently an opt-in feature**, which one can opt into by running `python scripts/setup_hooks.py` locally.

### Features
- **Run Lintrunner Before Push**: Before every `git push`, automatically runs lintrunner on your changes.
  - Really need to skip the checks? Run `git push --no-verify`
- **Consistent, Isolated, Lintrunner Environment**: During pre-push, Lintrunner runs in it's own virtual en environment that contain all lintrunner dependencies in a consistent, isolated environment.  No more lintrunner failures because you created a new .venv. (Did you know you needed to run `lintrunner init` every time you make a new .venv?)
- **Dependencies Automatically Updated**: If .lintrunner.toml is updated, this will automatically re-run `lintrunner init` to ensure you install the latest dependencies specified

### Installation
- Run `python scripts/setup_hooks.py`. Now every `git push` will first run lintrunner.

### Additional details
- The lintrunner used by the pre-push hook runs in a special per-repo virtual environment managed by the commit-hook tool located under `$USER/.cache/pre-commit`
- Does not affect your regularly used lintrunner
  - Manual invocations of lintrunner will continue to depend on your local environment instead of the special pre-push one. If there's enough interest, we could explore consolidating them.
- Does not run `lintrunner -a` for you.
  - You still need to manually run that (can be changed later though!)
- Have staged/unstaged changes? No worries
  - This runs `git stash` before running the pre-commit hooks and pops back your changes afterwards, so only the changes actaully being pushed will be tested

### Downsides
- No streaming UI updates
  - While you still get the same output from lintrunner that you're used to, the commit-hook framework doesn't show any output while lintrunner is actually running. Instead, it shows the entire output after linter has completed execution, which could be a few minutes (especially if it has to run `lintrunner init` first)
- `uv` installation is required to run the setup script. The setup script will ask users to install uv if it's not available.
  - This is required to be able to install the pre-commit package in a safe way that's available no matter what .venv you are running in.

### Opting out
- Disable hook for a single push: Run `git push --no-verify`
- Disable hook permanently: If something goes wrong and you need to wipe your setup:
  - Delete the `$USER/.cache/pre-commit` folder and the `.git/hooks/pre-push` file in your local repo.
  - You can now rerun `python scripts/setup_hooks.py` to setup your git push hook again if you want.

### Potential Future Changes
Things that could be done to make this even better if folks like these ideas:
- Automatic setup
  - Our `CONTRIBUTING.md` file tells devs to run `make setup-env`.  That could be a good entry point to hook the installation into
- Fix the console output streaming
- Make every lintrunner invocation (including manual ones) use the same repo-specific venv that the commit-hook uses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158389
Approved by: https://github.com/seemethere
2025-07-18 19:55:35 +00:00
75e2628782 Add lower bounds for fsspec and networkx dependencies (#158565)
Fixes #156587

This sets lower bounds for fsspec and networkx in both setup.py and requirements,txt.

- fsspec>= 0.8.5 (released December 15, 2020)
- netowrkx>= 2.5.1 (released April 3, 2021)

These are the first stable versions released after Python 3.9 came out on October 5, 2020. Since Python 3.8 is no longer maintained, setting these minimums helps ensure PyTorch won't be installed alongside unexpectedly old versions of these packages.

Tested with these versions locally to make sure they don't break anything. Adding CI for lower-bound testing could be a follow up later if need.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158565
Approved by: https://github.com/janeyx99
2025-07-18 19:42:09 +00:00
79e49efadd Pull latest Sphinx theme (#158595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158595
Approved by: https://github.com/albanD
2025-07-18 18:46:47 +00:00
b87e50db5e [BE][testing] Fix internal test failures in test/dynamo/test_unspec (#158485)
Summary: These tests failing internally because the number of underlying calls to the rng differ by virtue of various library initializations that get sucked in with an internal build.

Test Plan:
```
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_object' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_values_with_graph_break' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_feed_random_values_into_graph_only' --run-disabled
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158485
Approved by: https://github.com/williamwen42
2025-07-18 18:41:03 +00:00
656885b614 [Dynamo][Better Engineering] Type devices, resume_execution and testing utils (#158593)
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to a set of utilities in dynamo, `device_interface.py`, `resume_execution.py`, `tensor_version_ops.py`, `test_case.py`, and `test_minifier_common.py`

Running
```
mypy torch/_dynamo/device_interface.py torch/_dynamo/resume_execution.py torch/_dynamo/tensor_version_op.py torch/_dynamo/test_case.py torch/_dynamo/test_minifier_common.py  --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  976 | 1672 | 58.37% | 76 | 112 | 67.86% |
| This PR | 1719 | 1719 | 100.00% | 112 | 112 | 100.00% |
| Delta    | +743 | +47 | +41.63% | +36 | 0 | +32.14% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158593
Approved by: https://github.com/mlazos
2025-07-18 18:22:06 +00:00
6e07d6a0ff [Dynamo][Better Engineering] Add typing support for _dynamo/repro and debug_utils (#158504)
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to an important set of utilities in dynamo, `repro/` and the base `debug_utils.py`

Running
```
mypy torch/_dynamo/repro/ torch/_dynamo/debug_utils.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  905 | 3268 | 27.69% | 22 | 81 | 27.16% |
| This PR | 3368 | 3368 | 100.00% | 81 | 81 | 100.00% |
| Delta    | +2463 | +100 | +72.31% | +59 | 0 | +72.84% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158504
Approved by: https://github.com/mlazos
2025-07-18 18:15:55 +00:00
b4358c5e87 [inductor] Explicitly link c10 in inductor. (#158622)
MSVC have error "unresolved external symbol" when compiling inductor. Explicitly link c10 in inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158622
Approved by: https://github.com/desertfire

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-07-18 18:00:50 +00:00
86675af3f0 Revert "[ROCm][CI] update fbgemm_gpu hash used by inductor tests (#158602)"
This reverts commit 9308261a2afb69d807ea06508bb8582b066d9ccd.

Reverted https://github.com/pytorch/pytorch/pull/158602 on behalf of https://github.com/ZainRizvi due to The lint job failure was hiding a real lint failure. See here for more details: [GH job link](https://github.com/pytorch/pytorch/actions/runs/16375911199/job/46275682191) [HUD commit link](6f73e06796) ([comment](https://github.com/pytorch/pytorch/pull/158602#issuecomment-3090209891))
2025-07-18 17:46:11 +00:00
725cdb218e Name threads in caffe2/torch/distributed/checkpoint AsyncCheckpointExecutor (#158612)
Differential Revision: D78493333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158612
Approved by: https://github.com/d4l3k
2025-07-18 17:33:12 +00:00
8c3f84908b [aot] fix greater_than_max build fail on Windows. (#158479)
Error snapshot:
<img width="937" height="110" alt="image" src="https://github.com/user-attachments/assets/10195f84-83c4-42db-af3c-76f875a6a983" />

Reason:
`std::numeric_limits::max` is confilct to windef.h:`max(a, b)`

Fix code:
<img width="488" height="269" alt="image" src="https://github.com/user-attachments/assets/3328c37b-7c89-435e-944c-4ca7c9b6c5b6" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158479
Approved by: https://github.com/desertfire
2025-07-18 17:18:10 +00:00
6f73e06796 [iter] exhaust ListIterator when unpack_var_sequence is called (#156370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156370
Approved by: https://github.com/zou3519
ghstack dependencies: #156369
2025-07-18 16:48:27 +00:00
acffd1a297 [iter] Update some of the tests to not call pickle (#156369)
Some tests in test_iter only fail because of pickle. I'm skipping the pickle section as Dynamo doesn't support it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156369
Approved by: https://github.com/zou3519
2025-07-18 16:48:27 +00:00
bf4aa78279 Revert "[DTensor] Fix default_strategy and rename for clarity (#158490)"
This reverts commit d8b084312b54e97bdbaf6a178fe2fc628a23243b.

Reverted https://github.com/pytorch/pytorch/pull/158490 on behalf of https://github.com/clee2000 due to broke lint? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16361950974/job/46231492581) [HUD commit link](d8b084312b) ([comment](https://github.com/pytorch/pytorch/pull/158490#issuecomment-3090042448))
2025-07-18 16:45:32 +00:00
50f33a6fca Revert "[DTensor] fix copy_ strategy (#158538)"
This reverts commit 7b05bdd925f0f4b49e68662f9761fabaa27f2faf.

Reverted https://github.com/pytorch/pytorch/pull/158538 on behalf of https://github.com/clee2000 due to broke lint? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16361950974/job/46231492581) [HUD commit link](d8b084312b) ([comment](https://github.com/pytorch/pytorch/pull/158490#issuecomment-3090042448))
2025-07-18 16:45:32 +00:00
35df895d05 [AOTI] package loader normalize path separator (#158630)
Add `normalize_path_separator` to handle Windows path simplify.

This solution is working well on `torch/_inductor/cpp_builder.py`: a00cd8cf25/torch/_inductor/cpp_builder.py (L406-L409)

Let's copy it to package loader.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158630
Approved by: https://github.com/angelayi
2025-07-18 15:55:24 +00:00
193b29ee0c [BE][EZ] Minor doc fixes (#158574)
[BE] Minor doc fixes
2025-07-18 10:34:55 -05:00
036eb1f65d [precompile] Filter out ID_MATCH family of guards with caching_precompile. (#158368)
Summary: For case like caching_precompile, we almost always want to drop ID_MATCH-type guards since they will block serialization. This diff add this behavior when this global flag is toggled on so that ID_MATCH guards are excluded from compilation and serialization.

Test Plan:
test_dynamo -- -k test_id_match_with_config

Rollback Plan:

Differential Revision: D78363609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158368
Approved by: https://github.com/jamesjwu
2025-07-18 14:47:11 +00:00
e882c761dd Add STD_TORCH_CHECK to headeronly (#158377)
Differential Revision: [D78366519](https://our.internmc.facebook.com/intern/diff/D78366519/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158377
Approved by: https://github.com/albanD
2025-07-18 14:35:20 +00:00
0eae6b68f4 Unify torch.tensor and torch.ops.aten.scalar_tensor behavior (#158537)
Fixes #158376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158537
Approved by: https://github.com/atalman
2025-07-18 14:05:52 +00:00
a4ec381302 [build] pin setuptools>=77 to enable PEP 639 (#158104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158104
Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman
2025-07-18 11:49:54 +00:00
27af877f84 [ATen][CUDA][SDPA] Flash Attention: Refactor sm version checks (#158558)
The architecture version checks are unnecessary fine-grained in PyTorch. Considering the fact that PyTorch's Flash Attention works on all `sm_80+` machines, it makes more sense to just check for lower bound.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158558
Approved by: https://github.com/eqy
2025-07-18 09:59:41 +00:00
7b05bdd925 [DTensor] fix copy_ strategy (#158538)
The previous strategy directly used 'self' input strategy for 'src'
input.  The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.

E.g. for this program, broadcasting will occur on dims 0,1,3 of self.

```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```

These are the correct sharding combinations:

|   self   |     src |
|-------|------|
| Shard(0)  |   Replicate() |
| Shard(1)  |   Replicate() |
| Shard(2)  |   Shard(0) |
| Shard(3)  |   Shard(1) |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158495, #158490
2025-07-18 09:59:37 +00:00
ead80f3202 Fix s390x CI: ensure that all python dependencies are installed when … (#158552)
…building pytorch for tests on s390x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158552
Approved by: https://github.com/huydhn
2025-07-18 09:13:41 +00:00
32aade9d8d Revert "Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)"
This reverts commit 39ac189808c61588f3594dbc2fc1d69bb6194c47.

Reverted https://github.com/pytorch/pytorch/pull/158037 on behalf of https://github.com/jithunnair-amd due to Ignored ROCm failures while ROCm was unstable, but HUD clearly shows this PR introduced failures on trunk ([comment](https://github.com/pytorch/pytorch/pull/158037#issuecomment-3087982975))
2025-07-18 07:47:46 +00:00
be896d6b41 Revert "Forward-fix unused variables warning/error (#158549)"
This reverts commit eeda1a75ace75ce8a6763050fb91d236a6d3287b.

Reverted https://github.com/pytorch/pytorch/pull/158549 on behalf of https://github.com/jithunnair-amd due to Sorry, need to revert this first, so we can revert PR 158037, which broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/158549#issuecomment-3087942475))
2025-07-18 07:44:14 +00:00
a3396a9b85 [hop] set capture_scalar_outputs=True by default for compiled hops (#158480)
We want to do it for two reasons:
1. It's tedious for users to manually turn on capture_scalar_outputs=True when compiling map and scan with inductor, where we decomposing them into while_loop and use the idx tensor.item() to select a slice of output buffer and write into it. This pr turns on the flag by default.
2. a graph break caused by capture_scalar_outputs=False would cause the hop to fail, and we should turn it on by default so that the error message is more meaningful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158480
Approved by: https://github.com/zou3519
2025-07-18 07:16:50 +00:00
fda3f3b2ec [while_loop] fix constant tensor used as carried inputs (#158381)
Address second part of #158366, where torch.tensor(0), is treated as a constant tensor and its .item() gets specailized to 0 which causes a silent specialization. The fix is to unspecialize the constant carries and make them non-constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158381
Approved by: https://github.com/zou3519
2025-07-18 07:08:11 +00:00
a00cd8cf25 Add a way to disable compile for debugging flex-attention (#158534)
Finally got around to doing this, this flag lets us do:

```Python

#!/usr/bin/env python3
"""
FlexAttention Debug: Using breakpoints and unwrap
"""

import torch
import torch.nn.attention.flex_attention as fa

unwrap = torch._C._functorch.get_unwrapped

def score_mod(score, batch, head, q_idx, kv_idx):
    # Set breakpoint here to debug
    breakpoint()

    # In debugger, unwrap to see actual tensor values:
    # >>> actual_score = unwrap(unwrap(unwrap(unwrap(score))))
    # >>> actual_batch = unwrap(batch)
    # >>> actual_head = unwrap(head)
    # >>> actual_q_idx = unwrap(q_idx)
    # >>> actual_kv_idx = unwrap(kv_idx)
    # >>> print(actual_score)
    # >>> print(f"q_idx: {actual_q_idx}, kv_idx: {actual_kv_idx}")

    return torch.where(q_idx >= kv_idx, score, torch.tensor(float('-inf')))

def main():
    # Enable debug mode
    fa._FLEX_ATTENTION_DISABLE_COMPILE_DEBUG = True

    # Small example
    B, H, S, D = 1, 2, 4, 8
    q = torch.randn(B, H, S, D)
    k = torch.randn(B, H, S, D)
    v = torch.randn(B, H, S, D)

    # Run - will hit breakpoint
    output = fa.flex_attention(q, k, v, score_mod=score_mod)

    # Disable debug mode
    fa._FLEX_ATTENTION_DISABLE_COMPILE_DEBUG = False

if __name__ == "__main__":
    main()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158534
Approved by: https://github.com/Chillee, https://github.com/zou3519
2025-07-18 05:33:45 +00:00
eb73650723 [BE] Make PyObjectSlot use a global PyInterpreter and remove (#158427)
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there only is one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.

For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from  the `check_pyobj` call.

```
 mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
-    // fast case: tensor is live in python
-    std::optional<PyObject*> mb_obj =
-        t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
-    if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
-        return *mb_obj;
-    }
-    return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+  // fast case: tensor is live in python
+  std::optional<PyObject*> mb_obj =
+      t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+          /*ignore_hermetic_tls=*/false);
+  if (mb_obj.has_value() &&
+      !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+    return *mb_obj;
+  }
+  return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
2025-07-18 05:23:00 +00:00
9308261a2a [ROCm][CI] update fbgemm_gpu hash used by inductor tests (#158602)
fbgemm_gpu build started failing with asmjit errors.  Moving to latest tip of fbgemm for inductor tests resolves the build failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158602
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-18 05:02:31 +00:00
9a7c2f1f64 Revert "Add torch compile force disable caches alias (#158072)"
This reverts commit 2ecf083b7247f265a03ec296ba9d7b795f035118.

Reverted https://github.com/pytorch/pytorch/pull/158072 on behalf of https://github.com/jeffdaily due to fails on rocm, signal ignored while rocm was unstable ([comment](https://github.com/pytorch/pytorch/pull/158072#issuecomment-3086740829))
2025-07-18 04:58:24 +00:00
d8b084312b [DTensor] Fix default_strategy and rename for clarity (#158490)
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
  by mixing input_specs that were frozen from
  select_strategy.strategies[0] with output_specs that varied across
  select_strategy.strategies[0..N] (e.g. we could create a nonsense
  strategy like input:Shard(0) output(Replicate) for an op like clone
- fixes the redistribute costs: they should not actually be 0, they
  should be the cost of redistributing our single input from another
  strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
  input strategy instead of creating this new object
- Currently, using default_strategy is incorrect becuase it maps 'self'
  tensor's strategies directly onto 'src' tensor without accounting for
  the fact that copy_ supports broadcasting a smaller rank tensor into a
  larger one.

Separates out copy_  op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR

Renames to `propagate_single_input_strategy` since that's more
descriptive

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #158495
2025-07-18 04:09:32 +00:00
1e86fa2e5b Add stack trace to Inductor IR nodes if inductor.config.trace.provenance_tracing=True (#158576)
Summary:
- Split `create_mapping` to `create_mapping_pre_post_grad_nodes` and  ` create_node_mapping_kernel_to_post_grad`
- Store a mapping from pre_grad graph node names to stack traces in `_inductor_pre_grad_node_stack_trace`
- Add `stack_traces` member to ir.Node and add it to the string representation of ir.Node
- When we create an IR node, if `inductor.config.trace.provenance_tracing=True`, we populate `stack_traces` from `origins`. The nodes in `origins` are post_grad graph nodes. If a node has `node.stack_trace`, we store the stack_trace directly. This is particularly important for backward graph nodes because they don't have a mapping to pre-grad graph nodes. If a node doesn't have `.stack_trace ` (such as `linear`-> `addmm` nodes), we use the stack trace of the pre_grad graph nodes that it maps to.
  - A post grad graph node might not have stack trace if it correspond to multiple pre grad graph nodes, e.g. [GroupLinearFusion](a00442421a/torch/_inductor/fx_passes/group_batch_fusion.py (L299))

Example:

```
scheduling ExternKernelOut(
  python_kernel_name='extern_kernels.mm',
  name=buf0,
  layout=FixedLayout('cuda:0', torch.float32, size=[8, 16], stride=[16, 1]),
  inputs=[InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float32, size=[8, 10], stride=[10, 1])), ReinterpretView(
    StorageBox(
      ConstantBuffer(name='fc1_weight', layout=FixedLayout('cuda:0', torch.float32, size=[16, 10], stride=[10, 1]))
    ),
    FixedLayout('cuda:0', torch.float32, size=[10, 16], stride=[1, 10]),
    origins=OrderedSet([mm_default_1]),
    stack_traces = {,
    File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
        x = self.fc1(x),
      File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
        return F.linear(input, self.weight, self.bias),
    }
  )],
  constant_args=(),
  kwargs={},
  output_view=None,
  python_kernel_name=extern_kernels.mm,
  cpp_kernel_name=at::mm_out,
  ordered_kwargs_for_cpp_kernel=(),
  op_overload=None,
  arg_properties=[{}, {}],
  allarg_properties={},
  kwarg_properties=None,
  unbacked_bindings={},
  mutation_outputs=[],
  origin_node=mm_default_1,
  origins=OrderedSet([mm_default_1]),
  stack_traces = {,
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
      x = self.fc1(x),
    File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
      return F.linear(input, self.weight, self.bias),
  }
)
```

Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```

Rollback Plan:

Differential Revision: D78365534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158576
Approved by: https://github.com/angelayi
2025-07-18 04:05:17 +00:00
86dbc0ef67 [NativeRT] Remove makeProxyExecutor from ModelRunner interface (#158587)
Summary: makeProxyExecutor shouldn't be exposed to ModelRunner Interface.

Test Plan:
CI

Rollback Plan:

Differential Revision: D78501011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158587
Approved by: https://github.com/yiming0416, https://github.com/henryoier
2025-07-18 03:20:40 +00:00
89d842fec5 Make torch.distributed.breakpoint() set a long timeout (#158481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158481
Approved by: https://github.com/d4l3k
ghstack dependencies: #158469
2025-07-18 02:18:43 +00:00
ce4554352b Shunt fx_interpreter graphmodule print on error into tlparse (#158469)
Include both the error stacktrace and the graphmodule in a new
structured trace artifact.  Log the shortened version to the console,
and also log a hint to look at the tlparse for more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158469
Approved by: https://github.com/ezyang
2025-07-18 02:18:43 +00:00
583138d170 [Dynamo][Better Engineering] Add typing for comptime, cache, and convert_frame (#158379)
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to a critical tracing point for dynamo, primarily for`comptime.py` but also `cache_size.py` and `convert_frame.py`.

Running
```
mypy torch/_dynamo/comptime.py torch/_dynamo/cache_size.py torch/_dynamo/convert_frame.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1837 | 2215 | 82.93% | 45 | 82 | 54.88% |
| This PR | 2230 | 2230 | 100.00% | 82 | 82 | 100.00% |
| Delta    | +393 | +15 | +17.07% | +37 | 0 | +45.12% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158379
Approved by: https://github.com/mlazos
2025-07-18 02:11:57 +00:00
eqy
6fd6fc418d [B200] Fix flex-attention heuristic for test_tma_with_customer_kernel_options_cuda (#158494)
Otherwise fails with
```
torch._inductor.exc.InductorError: RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_tem_fused__to_copy_ones_sort_sum_zeros_2 Required: 264224 Hardware limit: 232448 Reducing block sizes or `num_stages` may help.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158494
Approved by: https://github.com/drisspg
2025-07-18 02:03:49 +00:00
ddbecdfb66 [DTensor] Document redistribute_costs (#158495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158495
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-07-18 01:43:38 +00:00
ef38edb284 Add stride check for attn_mask on non-cpu device (#158424)
Fixes #158374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158424
Approved by: https://github.com/Valentine233, https://github.com/drisspg, https://github.com/atalman
2025-07-18 01:10:58 +00:00
6673ac746c Fix test linalg for MKL upgrading (#158312)
Fixes #158054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158312
Approved by: https://github.com/albanD
2025-07-18 01:08:33 +00:00
7b72e5b3ad Fix Pandas version mismatch upon reinstalling numpy (#158584)
If you reinstall numpy after having installed pandas, it will error out sometimes if the versions are different enough (see below snippet). This change forces pandas to be reinstalled when installing numpy. It doesn't work in a separate pip call, because then pip takes the version of numpy requested by pandas as the one to install, undoing the command in the first place.
```
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip list
Package            Version
------------------ -----------
attrs              25.3.0
build              1.2.2.post1
certifi            2025.7.14
charset-normalizer 3.4.2
cmake              4.0.3
exceptiongroup     1.3.0
expecttest         0.3.0
filelock           3.18.0
fsspec             2025.5.1
hypothesis         6.135.32
idna               3.10
importlib_metadata 8.7.0
Jinja2             3.1.6
lintrunner         0.12.7
MarkupSafe         2.1.5
mpmath             1.3.0
networkx           3.2.1
ninja              [1.11.1.4](https://www.internalfb.com/phabricator/paste/view/1.11.1.4)
opt-einsum         3.3.0
optree             0.16.0
packaging          25.0
pip                25.1
psutil             7.0.0
pyproject_hooks    1.2.0
python-dateutil    2.9.0.post0
pytz               2025.2
PyYAML             6.0.2
requests           2.32.4
setuptools         78.1.1
six                1.17.0
sortedcontainers   2.4.0
sympy              1.14.0
tomli              2.2.1
typing_extensions  4.14.0
tzdata             2025.2
urllib3            2.5.0
uv                 0.7.21
wheel              0.45.1
zipp               3.23.0
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install numpy==1.22.4
Collecting numpy==1.22.4
  Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Installing collected packages: numpy
Successfully installed numpy-1.22.4
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install pandas==2.0.3
Collecting pandas==2.0.3
  Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: tzdata>=2022.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: numpy>=1.20.3 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (1.22.4)
Requirement already satisfied: six>=1.5 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas==2.0.3) (1.17.0)
Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
Installing collected packages: pandas
Successfully installed pandas-2.0.3
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install --pre numpy==2.0.2
Collecting numpy==2.0.2
  Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.4
    Uninstalling numpy-1.22.4:
      Successfully uninstalled numpy-1.22.4
Successfully installed numpy-2.0.2
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ python
Python 3.9.23 (main, Jun  5 2025, 13:40:20)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import is_numpy_dev as _is_numpy_dev  # pyright: ignore # noqa:F401
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/__init__.py", line 25, in <module>
    from pandas.compat.numpy import (
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/numpy/__init__.py", line 4, in <module>
    from pandas.util.version import Version
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/__init__.py", line 2, in <module>
    from pandas.util._decorators import (  # noqa:F401
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/_decorators.py", line 14, in <module>
    from pandas._libs.properties import cache_readonly
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/_libs/__init__.py", line 13, in <module>
    from pandas._libs.interval import Interval
  File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158584
Approved by: https://github.com/huydhn
2025-07-18 00:14:16 +00:00
33c9b414aa [CI][MPS] Enable test_indexing on MPS (#158582)
- Skip `test_index_put_accumulate_large_tensor_mps` as it crashes with
```
/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
```
while running `torch.ones([2**31+5], dtype=torch.int8, device='mps')`

- Adjust types for `test_index_put_src_datatype` as index_put on MPS is not implemented for complex (yet)
- Adjust `test_index` to avoid using DoubleTensors for MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158582
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/manuelcandales
2025-07-17 23:33:52 +00:00
b0e325c2c8 [Dynamo][Better Engineering] Add type coverage to decorators (#158509)
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to an important file in dynamo, `decorators.py`

NOTE: Untyped fns are because there is a conflict with `__init__.py` in compiler so we can't type these at this time

Running
```
mypy torch/_dynamo/decorators.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  209 | 908 | 23.02% | 9 | 39 | 23.08% |
| This PR | 870 | 943 | 100.00% | 36 | 39 | 100.00% |
| Delta    | +661 | +35 | +76.98% | +27 | 0 | +76.92% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158509
Approved by: https://github.com/williamwen42
2025-07-17 23:31:26 +00:00
f63988ae00 [BE]Clean up old APIs in AOTI c shim (#158400)
Summary:
The shims for aten ops are now generated by torchgen. But there are some still old APIs in `aoti_torch/c/shim.h`

This diff moves the old to-be-deprecated APIs for aten ops to a separate header file `shim_deprecated.h`

The to-be-deprecated APIs are determined by comparing APIs in `shim.h` and ops in `fallback_ops.py`

Test Plan:
CI

Rollback Plan:

Differential Revision: D78378373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158400
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-07-17 23:24:50 +00:00
2df2e3bb51 [ROCm][CI] Last known good HIP patch (#158596)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158596
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-17 22:52:16 +00:00
0ecfb93a0b Avoid globally modifying torch.testing._internal.common_methods_invocations.wrapper_set_seed (#158548)
Test modules that depend on the original definition of `wrapper_set_seed` will inadvertently be affected if they import from test_torchinductor_opinfo.py. Additionally, using pytest `test_torchinductor_opinfo.py test_other_module.py` when run in the same process may affect the test behaviour of `test_other_module.py` if the tests depend on `wrapper_set_seed`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158548
Approved by: https://github.com/janeyx99
2025-07-17 22:31:59 +00:00
74f4cf4bd5 Add missing <vector> in c10/util/WaitCounter.h (#158354)
It seems that `#include <vector>` is being pulled in indirectly, but it is being used directly, so it is best to explicitly include it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158354
Approved by: https://github.com/janeyx99
2025-07-17 22:23:05 +00:00
cyy
1b91954b9f Suppress volatile type error (#158435)
Fixes
```
/var/lib/jenkins/workspace/torch/csrc/dynamo/guards.cpp:5320:10:
error: compound assignment to object of volatile-qualified type 'volatile char' is deprecated [-Werror,-Wdeprecated-volatile]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158435
Approved by: https://github.com/janeyx99
2025-07-17 22:21:04 +00:00
41b2c4d119 Reduce random reads for offset metadata when calling torch.load under FakeTensorMode (#157931)
We already test the `_get_offset` functionality with that TORCH_SERIALIZATION_DEBUG flag that is set in CI, so I didn't add more testing specifically for FakeTensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157931
Approved by: https://github.com/albanD
2025-07-17 22:17:52 +00:00
af6624023e [dynamo] Skip training flag check id already guarding on nn modules (#158492)
This might help some legacy models that still have
inline_inbuilt_nn_modules False for some reason.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158492
Approved by: https://github.com/StrongerXi
2025-07-17 21:42:19 +00:00
a00442421a [CI][TD] Enable TD on all test configs (#158163)
I think the main one that was missing is dynamo_wrapped

There's also slow and inductor, but the filter later for workflows stops TD from running on those anyways

dynamo_wrapped is the second longest jobs for pull right now
<img width="1265" height="311" alt="image" src="https://github.com/user-attachments/assets/d4ca034c-a8f0-4b31-a80f-0f4f21fce32a" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158163
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2025-07-17 21:05:25 +00:00
ced5cf042d Revert "Cleanup old caffe2 scripts (#158475)"
This reverts commit 94d7f0c1ef9a4cb4db0eb5d6b1ffc55941cbeab1.

Reverted https://github.com/pytorch/pytorch/pull/158475 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158475#issuecomment-3085447409))
2025-07-17 20:58:34 +00:00
1b88da1cac [MPS] Improve performance of max_pool3d (#157875)
To check how the changes from this PR affect performance, I wrote a script here: 55ef32a127/max_pool_mps/perf.py.

Before this PR, I get this:

```
===================
max_pool3d
===================
0: 0.013105 ms, max_pool3d, (3, 2, 2, 2), {'kernel_size': 2}
1: 0.038003 ms, max_pool3d, (3, 10, 10, 10), {'kernel_size': 5}
2: 0.212963 ms, max_pool3d, (3, 100, 100, 100), {'kernel_size': 5}
3: 1.224645 ms, max_pool3d, (3, 200, 200, 200), {'kernel_size': 5}
4: 7.317867 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 4, 'padding': 1}
5: 34.679233 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20}
6: 34.626383 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1}
7: 44.835892 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1, 'stride': 40}
8: 0.083579 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2}
9: 0.936575 ms, max_pool3d, (10, 10, 30, 30, 30), {'kernel_size': 2}
10: 5.329883 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2}
11: 11.713617 ms, max_pool3d, (10, 10, 70, 70, 70), {'kernel_size': 2}
12: 25.450454 ms, max_pool3d, (10, 10, 90, 90, 90), {'kernel_size': 2}
13: 0.058375 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2, 'dilation': 2}
14: 3.757558 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2, 'dilation': 2}
15: 33.451588 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 2, 'dilation': 2}
```

After this PR, I get this:

```
===================
max_pool3d
===================
0: 0.007202 ms, max_pool3d, (3, 2, 2, 2), {'kernel_size': 2}
1: 0.018596 ms, max_pool3d, (3, 10, 10, 10), {'kernel_size': 5}
2: 0.130717 ms, max_pool3d, (3, 100, 100, 100), {'kernel_size': 5}
3: 0.966795 ms, max_pool3d, (3, 200, 200, 200), {'kernel_size': 5}
4: 4.095804 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 4, 'padding': 1}
5: 12.833446 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20}
6: 12.859346 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1}
7: 14.080529 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1, 'stride': 40}
8: 0.029283 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2}
9: 0.175700 ms, max_pool3d, (10, 10, 30, 30, 30), {'kernel_size': 2}
10: 0.742750 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2}
11: 1.939596 ms, max_pool3d, (10, 10, 70, 70, 70), {'kernel_size': 2}
12: 4.074821 ms, max_pool3d, (10, 10, 90, 90, 90), {'kernel_size': 2}
13: 0.028425 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2, 'dilation': 2}
14: 0.384375 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2, 'dilation': 2}
15: 2.623346 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 2, 'dilation': 2}
```

Every case is improved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157875
Approved by: https://github.com/malfet
2025-07-17 20:34:12 +00:00
66c9bc5062 [export] Add runnable code to export docs (#158506)
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/158506/export.html

Yay I can add runnable code to export docs now
Also moved export API reference to a different file.

With these changes, we can start to consolidate the [export tutorial](https://docs.pytorch.org/tutorials/intermediate/torch_export_tutorial.html) with the docs on pytorch docs. We just need to move the section on DDE and 0/1 specialization, and then I think we can delete the export tutorial.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158506
Approved by: https://github.com/pianpwk, https://github.com/svekars
2025-07-17 20:15:22 +00:00
80ac73c057 [ca] reset between tests (#158418)
CA reset is much faster than dynamo reset, so it's probably okay to run it every time. I'm not sure if this will fix the flaky autograd tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158418
Approved by: https://github.com/jansel
2025-07-17 20:14:29 +00:00
eeb0783fe6 [simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062)
Differential Revision: [D78159013](https://our.internmc.facebook.com/intern/diff/D78159013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158062
Approved by: https://github.com/wconstab
2025-07-17 20:04:42 +00:00
ef256ad17b Make Inductor imports TYPE_CHECKING only (#158524)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158524
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-07-17 19:55:19 +00:00
fd51bcdd21 check if USE_ROCM is defined (#158571)
Summary:
check if USE_ROCM is defined

D78424375 broke some builds: see T231304402

Test Plan:
rerunning failed builds

Rollback Plan:

Reviewed By: Camyll

Differential Revision: D78493019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158571
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-07-17 19:48:26 +00:00
7ebbf2cae7 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563) (#158550)
This reverts commit 8554c8007ddaa8029e7e01bb1af12f358bf597c2 #157563 due to causing a few breakages on ROCm

Reverted expected_results.csv to 26807dcf277feb2d99ab88d7b6da526488baea93

> @xuanzhang816 Sorry, but I have to revert this PR yet again because it clearly reintroduced failures on ROCm after the remerge: f4d8bc46c7/2
and the failures are still showing up on tip-of-tree on HUD

Context
https://github.com/pytorch/pytorch/pull/157563#issuecomment-3083350857

Needs to be relanded in non bc-breaking way, or sanity checked for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158550
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-07-17 19:47:41 +00:00
8dcebaa7b0 [AOTI] add WIN32 implement for create_temp_dir (#158570)
add Windows implement for `create_temp_dir`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158570
Approved by: https://github.com/angelayi
2025-07-17 19:22:59 +00:00
7e34f9c292 Add torch._C._log_api_usage_once to datapipes (mapper) (#155489)
This is to get a better understanding of how datapipes is used right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155489
Approved by: https://github.com/ramanishsingh
2025-07-17 19:01:49 +00:00
25f4d7e482 Use new type statement to fix public API of types (#158487)
Since type statement breaks older python version, trying to find equivalent behavior without the type mechanics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158487
Approved by: https://github.com/andrewor14
2025-07-17 18:46:44 +00:00
ad223a6c5f Add FP8 Types (#158430)
Summary: Add FP8 Types

Test Plan:
sandcastle

Rollback Plan:

Differential Revision: D78395110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158430
Approved by: https://github.com/henryoier
2025-07-17 18:09:56 +00:00
f92a2035e4 ci: Update lint workflow to only run on changed files for PRs (#158518)
This modifies the lint workflow to use the new get-changed-files
workflow to optimize lint execution by only running on files
that have actually changed in pull requests.

This more closely mirrors the type of behavior that users
expect when running lint locally on their PRs.

This also leaves the default behavior as a fallback for when
you're not running on a pull request.

Since lint runs on the pull_request event I'm not really worried about
any type of ciflow shenanigans in this.

This also splits mypy into its own job since mypy needs to run on all-files all the time.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158518
Approved by: https://github.com/huydhn
ghstack dependencies: #158517
2025-07-17 18:00:44 +00:00
bff69f25c2 [BE][testing] fix test/dynamo/test_repros:test_longtensor_list (#158458)
Summary: This test is failing internally because the number of underlying calls to the rng differ by virtue of various library initializations that get sucked in with an internal build.

Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_longtensor_list' --run-disabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158458
Approved by: https://github.com/jansel
2025-07-17 17:27:00 +00:00
6d31d38965 recovering node source from dict (#158373) (#158473)
Summary:

this diff recovers NodeSource object from its dict representation, which is crucial for NodeSource serde.

Test Plan:
ci

Rollback Plan:

Differential Revision: D78434648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158473
Approved by: https://github.com/angelayi
2025-07-17 17:00:19 +00:00
bfe5674e22 Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 0797b2b6a80cf70a7accc3d5413186e7693d4451.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/wdvr due to reverting as discussed with @drisspg - @eqy please reach out to @drisspg for more info  ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-3084759671))
2025-07-17 16:55:55 +00:00
94d7f0c1ef Cleanup old caffe2 scripts (#158475)
Testing on this one is grep based: if there were no reference to that script I can find, I deleted.
We can easily add any of these back if needed!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158475
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/cyyever
2025-07-17 16:50:06 +00:00
23550ab735 Revert "DDE-Free select with unbacked index. (#157605)"
This reverts commit 79d7c754ab8ae0e5c3a614521632d2cfbfa0fdba.

Reverted https://github.com/pytorch/pytorch/pull/157605 on behalf of https://github.com/laithsakka due to fail pr time benchmarks  ([comment](https://github.com/pytorch/pytorch/pull/157605#issuecomment-3084663020))
2025-07-17 16:20:02 +00:00
16b21fa8b2 [AOTI] skip ld and objcopy on Windows. (#158545)
Skip `ld` and `objcopy` on Windows. They are not support on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158545
Approved by: https://github.com/desertfire
2025-07-17 15:43:24 +00:00
2ecf083b72 Add torch compile force disable caches alias (#158072)
Bunch of people keep thinking current alias only disables inductor cache because it has the name inductor in it. lets globalize the name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072
Approved by: https://github.com/ezyang
2025-07-17 15:40:36 +00:00
813c76b98d Revert "Unify torch.tensor and torch.ops.aten.scalar_tensor behavior (#158537)"
This reverts commit 58c7cf9ede6311da5533dbcaf238a912176a6a85.

Reverted https://github.com/pytorch/pytorch/pull/158537 on behalf of https://github.com/albanD due to This broke C++ tests ([comment](https://github.com/pytorch/pytorch/pull/158537#issuecomment-3084425920))
2025-07-17 15:06:43 +00:00
288bf54a23 Revert "Move off of deprecated API in 2.9 (#158527)"
This reverts commit 9636e2cfd3e995ef977f670ad47e8e895296d992.

Reverted https://github.com/pytorch/pytorch/pull/158527 on behalf of https://github.com/albanD due to breaks trunk ([comment](https://github.com/pytorch/pytorch/pull/158527#issuecomment-3084385585))
2025-07-17 14:55:28 +00:00
da4c7b4ced [AOTI] align signature to model_base.h (#158554)
Remove `const` keyword, align its signature to `model_base.h` eeda1a75ac/torch/csrc/inductor/aoti_runtime/model_base.h (L51-L53)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158554
Approved by: https://github.com/desertfire
2025-07-17 14:44:32 +00:00
a04bd11895 [AOTI] Use format_consts_to_cpp on Windows. (#158543)
`format_consts_to_asm` is not supported on Windows, force use `format_consts_to_cpp` on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158543
Approved by: https://github.com/desertfire
2025-07-17 14:40:34 +00:00
58c7cf9ede Unify torch.tensor and torch.ops.aten.scalar_tensor behavior (#158537)
Fixes #158376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158537
Approved by: https://github.com/atalman
2025-07-17 13:39:25 +00:00
38c04415a9 [oss][hf][bug fix] Remove buggy consolidation logic (#158380)
Summary: I tried to add some logic that could optimize for the non-row wise sharded case and do it more efficiently, but this has some bugs, so removing it for now and will find a better algorithm for the non-row wise sharded case to find the maximum number of bytes that we can write at a time.

Test Plan:
ensure tests pass

Rollback Plan:

Differential Revision: D78366701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158380
Approved by: https://github.com/Saiteja64
2025-07-17 13:05:06 +00:00
7892f5a007 [inductor][triton] Update HAS_WARP_SPEC to check triton.Config params. Update Triton Hash to top of release/3.4.x stack (#158459)
Update triton commit hash to `11ec6354315768a85da41032535e3b7b99c5f706`, which is the new release/3.4.x branch in triton-lang/triton.

Also, update HAS_WARP_SPEC handling: In triton 3.4, warp spec will have a different interface: num_consumer_groups will be determined automatically by the compiler. This breaks the current Inductor integration, so for now, update HAS_WARP_SPEC to check whether triton.Config takes num_consumer_groups and num_buffers_warp_spec as parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158459
Approved by: https://github.com/atalman
2025-07-17 12:50:46 +00:00
d5af0eca8d [BE][3/5] fix typos in aten/ (aten/src/ATen/native/) (#157552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157552
Approved by: https://github.com/albanD
ghstack dependencies: #156605, #157637, #157550, #157551
2025-07-17 12:08:34 +00:00
f57ef62ebc [BE][2/5] fix typos in aten/ (aten/src/ATen/native/) (#157551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157551
Approved by: https://github.com/albanD
ghstack dependencies: #156605, #157637, #157550
2025-07-17 12:08:33 +00:00
4c8b408d16 [BE][1/5] fix typos in aten/ (#157550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157550
Approved by: https://github.com/albanD
ghstack dependencies: #156605, #157637
2025-07-17 12:08:33 +00:00
c8d43cbc6e [BE][3/6] fix typos in test/ (#157637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157637
Approved by: https://github.com/yewentao256, https://github.com/albanD
ghstack dependencies: #156605
2025-07-17 12:08:33 +00:00
3f8e2e91ad [BE][15/16] fix typos in torch/ (torch/distributed/tensor/) (#156605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156605
Approved by: https://github.com/wanchaol, https://github.com/albanD
2025-07-17 12:08:33 +00:00
eeda1a75ac Forward-fix unused variables warning/error (#158549)
Introduced in https://github.com/pytorch/pytorch/pull/158037, didn't seem to trigger on PR, but trunk CI is failing in some `linux-jammy-cpu-py3.12-gcc11-inductor-*` jobs where this warning is turned into an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158549
Approved by: https://github.com/danthe3rd
2025-07-17 09:44:19 +00:00
f4d8bc46c7 Enable TF32 as fp32 internal precision for matmul/linear/conv (#157520)
### Description

This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`.

```
torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
```

Related kernel update and UTs update are done. And the wrapper `bf32_on_and _off` is updated to `reduced_f32_on_and_off`, and it can run tests 3 times, one is reduced_f32 OFF, the other two are reduced_f32 ON (including `bf32 ON` and `tf32 ON`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-07-17 08:57:34 +00:00
39ac189808 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-17 08:26:27 +00:00
d76323d417 [NativeRT] Remove normalizeDevice (#158489)
Summary:
In pytorch, tensor.to("cuda") behaves differently from tensor.to("cuda:0).

tensor.to("cuda") will read from thread local DeviceGuard, aka cuda::current_device(), to infer the device index.

TBEPermute is relying on this behavior to route output tensor to a device specified by current thread.

For this reason, we remove the normalizeDevice(), and disallow index-less cuda device in Placement.

Device-to-device mapping must be done between concrete device!

Test Plan:
CI

Rollback Plan:

Differential Revision: D78443109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158489
Approved by: https://github.com/henryoier
2025-07-17 06:48:25 +00:00
04349f9ee5 [PT2]: Skip AOTI Weight Loading during Init (#158416)
Summary: AOTI already has weights embedded in .so file. So for the initial load, no need to load the weights again. This allows lowered modules can have different set of weights on different hardwares.

Test Plan:
```
MODEL_TYPE=ads_mtml_offsite_cvr_oba_optout_dedicated_model
MODEL_ENTITY_ID=895279202
SNAPSHOT_ID=0
MODULE=merge

buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true fbcode//caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE} --moduleName ${MODULE} --predictor-hardware-type 1 --submodToDevice ""  --benchmarkDontRebatchSamples=true --benchmarkNumIterations 1000
```

Rollback Plan:

Differential Revision: D78383881

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158416
Approved by: https://github.com/henryoier, https://github.com/SherlockNoMad
2025-07-17 06:47:47 +00:00
09db3a22e8 [BE] Get rid of final mentions of BUILD_SPLIT_CUDA (#158453)
BUILD_SPLIT_CUDA logic has been removed for a while

Differential Revision: [D78418191](https://our.internmc.facebook.com/intern/diff/D78418191/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158453
Approved by: https://github.com/albanD
ghstack dependencies: #158358, #158365
2025-07-17 06:47:10 +00:00
a38f433be2 [Docker builds] Move from Miniconda to Miniforge (#158370)
This is related to: https://www.anaconda.com/legal/terms/terms-of-service

Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799

Rocm and XPU builds since they use Miniforge are not affected

```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
 > [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93     • https://repo.anaconda.com/pkgs/main
11.93     • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93     ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence solution is:
1. using `` conda tos accept --override-channels --channel defaults``
2. use Miniforge instead of Miniconda.

Using solution 2.

Solution Tried that don't work:
1. Using ``CONDA_ALWAYS_YES = true ``

4. Using older version of miniconda
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-07-17 06:33:08 +00:00
9f37cce693 Revert "[Docker builds] Move from Miniconda to Miniforge (#158370)"
This reverts commit 0a99b026d6bd0f67dc2c0a20fe3228ddc4144854.

Reverted https://github.com/pytorch/pytorch/pull/158370 on behalf of https://github.com/laithsakka due to this fail pr time benchmarks ([comment](https://github.com/pytorch/pytorch/pull/158370#issuecomment-3082744071))
2025-07-17 06:28:49 +00:00
9636e2cfd3 Move off of deprecated API in 2.9 (#158527)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158527
Approved by: https://github.com/danielvegamyhre
2025-07-17 06:18:13 +00:00
d9426a81d2 [BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407)
This PR makes some less risky changes to PyObjectSlot as there is a lot of stuff we do not need since there is only one interpreter. Specifically `check_interpreter` and `has_pyobj_nonhermetic` are removed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290, #158291
2025-07-17 05:56:26 +00:00
0b9fb91f17 [BE] Remove __reduce_deploy__ (#158291)
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).

Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
2025-07-17 05:56:26 +00:00
a6de309ca1 [BE] Remove torch deploy | remove torch deploy specific files (#158290)
This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290
Approved by: https://github.com/albanD
ghstack dependencies: #158288
2025-07-17 05:56:18 +00:00
1a4268b811 [BE] remove torch deploy - conditionals (#158288)
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always

Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
2025-07-17 05:56:07 +00:00
79d7c754ab DDE-Free select with unbacked index. (#157605)
When select has data dependent input, we cant tell if the actual index shall be index+size or index.
to avoid throwing dde, we allocate a new unbacked symbol to represent the storage offset of the
output view and we compute its value dynamically at runtime when inductor is lowered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
2025-07-17 05:08:11 +00:00
415dfabe9b [Easy] Fix the format (#158450)
When I modify the code located in test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg,
some unrelated format error occurred.

```Python
Lint for torch/_inductor/fx_passes/fuse_attention.py:

  Error (CODESPELL) spelling error
    Failed due to ValueError:
    /pytorch/pytorch/torch/_inductor/fx_passes/fuse_attention.py:587: differnt
    ==> different

    Please either fix the error or add the word(s) to the dictionary file.
    HINT: all-lowercase words in the dictionary can cover all case variations.

Lint for torch/fx/traceback.py:

  Error (MYPY) [assignment]
    Incompatible types in assignment (expression has type "str", variable has
    type "None")

        101  |
        102  |    def _get_action_string(self):
        103  |        if self._action_string is None:
        104  |            self._action_string = "+".join([a.name.lower() for a in self.action])
        105  |        return self._action_string
        106  |
        107  |    def print_readable(self, indent=0):

  Error (MYPY) [assignment]
    Incompatible types in assignment (expression has type "dict[str, Any]",
    variable has type "None")

        121  |        if self._dict is None:
        122  |            # Convert the object to a dictionary
        123  |            action_string = self._get_action_string()
        124  |            self._dict = {
        125  |                "name": self.name,
        126  |                "target": self.target,
        127  |                "graph_id": self.graph_id,

  Error (MYPY) [return-value]
    Incompatible return value type (got "None", expected "dict[Any, Any]")

        130  |                "from_node": [node.to_dict() for node in self.from_node],
        131  |            }
        132  |
        133  |        return self._dict
        134  |
        135  |    def __eq__(self, other: object):
        136  |        if not isinstance(other, NodeSource):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158450
Approved by: https://github.com/Skylion007
2025-07-17 04:56:10 +00:00
8eaa9f2701 Fix mask construction when dispatching index_put to masked_fill (#158472)
Fixes #158413
Previously trailing Nones in the index were incorrectly handled as implicit broadcasting dims in the mask, whereas they should just be ignored.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158472
Approved by: https://github.com/ezyang
2025-07-17 04:21:43 +00:00
ebf83b8b77 [audio hash update] update the pinned audio hash (#158402)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158402
Approved by: https://github.com/pytorchbot
2025-07-17 04:19:06 +00:00
24b49b9881 [Fix] Rework CUDA error explanation framework to be less destructive … (#158484)
…in fbsource

Fix-forward for #158395

Added `std::string c10::cuda::get_cuda_error_help(const char* error_string)` to provide a framework for appending clarifying messages to CUDA errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158484
Approved by: https://github.com/aorenste
2025-07-17 03:36:47 +00:00
1839e8d04b [DTensor] Assert DTensorSpec has valid placements (#158133)
This helped identify buggy sharding rules during debugging, why not
check it in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158133
Approved by: https://github.com/XilunWu, https://github.com/zpcore
ghstack dependencies: #158132
2025-07-17 02:32:26 +00:00
2ad5c25cfc Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following API will be put under torch.accelerator
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-07-17 01:56:01 +00:00
1179e33323 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with [RFC] [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories, such as HuggingFace [so many if-else conditional code](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under torch.accelerator namespace to generalize these user cases.

<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>

```python
torch.xxx.empty_cache
```

</td>
<td>

```python
torch.accelerator.empty_cache
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_peak_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_peak_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_accumulated_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_accumulated_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_stats
```

</td>
<td>

```python
torch.accelerator.memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_allocated
```

</td>
<td>

```python
torch.accelerator.memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_allocated
```

</td>
<td>

```python
torch.accelerator.max_memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_reserved
```

</td>
<td>

```python
torch.accelerator.memory_reserved
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_reserved
```

</td>
<td>

```python
torch.accelerator.max_memory_reserved
```

</td>
</tr>

</table>
</div>

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD, https://github.com/Camyll
2025-07-17 01:56:01 +00:00
f6d138807f Always disable ShardingPropagation cache if compiling (#156868)
Fixes #151106

Addresses issue (2) in #152963 for the DTensor sharding propagation cache being brittle under compile. The existing `_are_we_tracing` from `distributed._functional_collectives`, which mostly determines if currently tracing based on Fake Tensor dispatch mode, is reused here.

**Test Plan**:
There are already tests for DTensor + Compile with dynamic shape ([test_dtensor_dynamic](https://github.com/pytorch/pytorch/blob/main/test/distributed/tensor/test_dtensor_compile.py#L260),
[test_dynamo_dtensor_from_local_dynamic_shapes](https://github.com/pytorch/pytorch/blob/main/test/distributed/tensor/test_dtensor_compile.py#L402)) that cover the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156868
Approved by: https://github.com/xmfan
2025-07-17 01:33:53 +00:00
c09eba877f [Device] Add support for PrivateUse1 device type in parse_type function (#157609)
This pull request refactors the `parse_type` function in `c10/core/Device.cpp` to improve the handling of the `PrivateUse1` device type. The main change involves reordering the logic to check for the `PrivateUse1` device type earlier in the function for better clarity and efficiency.

This help to migrate existed backend to PrivateUse1 smoothly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157609
Approved by: https://github.com/jgong5, https://github.com/albanD
2025-07-17 01:27:44 +00:00
2179afd714 [easy][guards] Add developer comment for posterity (#158471)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158471
Approved by: https://github.com/StrongerXi
2025-07-17 01:17:04 +00:00
d7e1b8b11d [dynamo] Constant fold torch.autograd._profiler_enabled (#158482)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158482
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-07-17 01:07:42 +00:00
b6454a9058 [AOT_inductor] model_base.h add Windows include files. (#158477)
model_base.h add Windows include files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158477
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-07-17 00:57:48 +00:00
e9367a7a42 ci: Add reusable workflow to get changed files in PRs (#158517)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158517
Approved by: https://github.com/huydhn
2025-07-17 00:57:43 +00:00
clr
e78f2ac92b inductor: Fix crash in split_cat when tensors is a Node (#157155)
If there is only one node passed to aten::cat, the argument is a single node,
rather than a list of nodes with a valid length.

Example stack
```
  File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/pattern_matcher.py", line 1115, in apply
    self.handler(match, *match.args, **match.kwargs)
  File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/fx_passes/split_cat.py", line 1786, in merge_split_cat_aten
    if len(cat_inputs) < threshold_to_cat:
torch._inductor.exc.InductorError: TypeError: object of type 'Node' has no len()
```

This has failed about 7 internal jobs in the last week, running pytorch trunk code from 06/15

I've attached a test which reproduces this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157155
Approved by: https://github.com/jansel
2025-07-17 00:57:38 +00:00
82a1ee1135 Refactor Provenance Tracking (#158399)
Summary:
As inductor provenance tracking is getting more use cases, we want to separate the inductor provenance tracking guarding flag from the general `trace.enabled`, so we can enable provenance tracking without all the overhead of `trace.enabled`

- change the guard flag from `trace.enabled` to `trace.provenance_tracking`.  It is turned on by either `TORCH_COMPILE_DEBUG=1` or `INDUCTOR_PROVENANCE=1`.
- Move the provenance tracking logic and variables out of DebugContext, because DebugContext is only enabled with `trace.enabled`. Since the variables are now global variables, added `reset_provenance_globals()` context manager to reset them for each `compile_fx()` call.
- Move `set_kernel_post_grad_provenance_tracing` from `util.py` to `debug.py` so now all provenance related logic is in `debug.py`.

In the future, if we want to enable it further, we can change the provenance tracking flag to be enabled when `TORCH_TRACE` is set. I think we should do that in a separate PR, so it's easier to revert if this flag change creates any problem.

See more motivation in internal Diff

Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r graph_provenance
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```

Differential Revision: D78287976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158399
Approved by: https://github.com/angelayi
2025-07-17 00:23:00 +00:00
306dd19216 update expeced results (#158497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158497
Approved by: https://github.com/xmfan
2025-07-17 00:02:52 +00:00
1d58476162 [PP] Add eval() API to schedule (#157795)
These change add an `eval()` API to PP schedules

## Context

Currently, you can run "Forward only" for a schedule in two ways:
1. Use a custom schedule `_ScheduleForwardOnly`
2. Do not pass in `loss_fn` in schedule constructor, and no backward computations will be executed.

However, this is still limiting because we may want to run forward through the pipeline / calculate the loss, but without backward, e.g. during validation. These changes allow for this.

```python
if self.rank == 0:
    schedule.eval(x)
elif self.rank == self.world_size - 1:
    losses = []
    schedule.eval(target=target, losses=losses)
else:
    schedule.eval()
```

TODO:
- in later PRs, we will deprecate the `_ScheduleForwardOnly`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157795
Approved by: https://github.com/wconstab
2025-07-16 23:48:45 +00:00
a4d753295e [Dynamo][Better Engineering] Add enhanced typing support to _dynamo/eval_frame.py (#158276)
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to the main entrypoint for dynamo, `eval_frame.py`

Running
```
mypy torch/_dynamo/eval_frame.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  623 | 2232 | 27.91% | 19 | 68 | 27.94% |
| This PR | 2285 | 2285 | 100.00% | 68 | 68 | 100.00% |
| Delta    | +1662 | +63 | +72.09% | +49 | 0 | +72.06% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158276
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <williamwen@meta.com>
2025-07-16 23:31:10 +00:00
a9f902add0 [CUDA] Use runtime driver API for cuStreamWriteValue32 (#158295)
Reopen https://github.com/pytorch/pytorch/pull/156097

Fixes https://github.com/pytorch/pytorch/issues/154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR https://github.com/pytorch/pytorch/pull/156097 and https://github.com/pytorch/pytorch/pull/154097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-07-16 23:14:36 +00:00
e311886e3d Add transpose to torch/csrc/stable (#158160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158160
Approved by: https://github.com/janeyx99
2025-07-16 22:50:57 +00:00
3cb11877aa [aoti][mps] Enable test_aot_inductor.py tests (#155598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155598
Approved by: https://github.com/yushangdi
2025-07-16 22:26:57 +00:00
5951fcd50a [Dynamo][Better Engineering] Support typing in codegen.py (#158386)
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to a critical tracing point for dynamo, primarily for `codegen.py` but also `config.py`

Running
```
mypy torch/_dynamo/codegen.py torch/_dynamo/config.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  347 | 1330 | 26.09% | 24 | 50 | 48.00% |
| This PR | 1334 | 1334 | 100.00% | 50 | 50 | 100.00% |
| Delta    | +987 | +4 | +73.91.% | +26 | 0 | +52.00% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158386
Approved by: https://github.com/StrongerXi
2025-07-16 22:09:01 +00:00
ada44e5ba7 [Dynamo][Better Engineering] Add typing to bytecode analysis and transform (#158293)
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to a critical tracing point for dynamo, `bytecode_transformation.py` and by extension, `bytecode_analysis.py`

Running
```
mypy torch/_dynamo/bytecode_transformation.py torch/_dynamo/bytecode_analysis.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1422 | 1920 | 74.06% | 73 | 93 | 78.49% |
| This PR | 1968 | 1968 | 100.00% | 93 | 93 | 100.00% |
| Delta    | +546 | +48 | +25.94% | 20 | 0 | +21.51% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158293
Approved by: https://github.com/StrongerXi, https://github.com/Skylion007
2025-07-16 21:50:55 +00:00
9df0176408 [BE][testing] Disable test_static_cuda_launcher:test_floats internally (#158296)
Summary: it seems the check for 'Offd' vs. 'Offf' doesn't work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158296
Approved by: https://github.com/davidberard98
2025-07-16 21:27:40 +00:00
94c746bb43 [DTensor][BE] add document to ShardingPropagator.register_op_strategy (#158362)
**Summary**
Add document to `ShardingPropagator.register_op_strategy` on how to draft
`strategy_func` and when to use `schema_info`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158362
Approved by: https://github.com/zpcore
2025-07-16 21:08:59 +00:00
473208cb18 [ez][lint] Add pr_time_benchmarks to merge conflictless csv linter (#158353)
Discovered this when looking at a PR I was trying to revert and was surprised that the PR got rid of the spaces but didn't trigger the linter.  Turns out the file was following the rule but wasn't actually being checked
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158353
Approved by: https://github.com/seemethere, https://github.com/Camyll
2025-07-16 20:31:07 +00:00
fb731fe371 Add warning about removed sm50 and sm60 arches (#158301)
Related to https://github.com/pytorch/pytorch/issues/157517

Detect when users are executing torch build with cuda 12.8/12.9 and running on Maxwell or Pascal architectures.
We would like to include reference to the issue: https://github.com/pytorch/pytorch/issues/157517 as well as ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.

Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
    Found <GPU Name> which is of cuda capability 5.0.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is 7.0.

  warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
                        Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
                        Please see https://github.com/pytorch/pytorch/issues/157517
                        Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158301
Approved by: https://github.com/nWEIdia, https://github.com/albanD
2025-07-16 20:11:18 +00:00
a9ee4250d5 [4/n] Remove references to TorchScript in PyTorch docs (#158317)
Summary: jit.rst

Test Plan:
CI

Rollback Plan:

Differential Revision: D78309840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158317
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-16 20:01:34 +00:00
14ecc03361 Revert "recovering node source from dict (#158373)"
This reverts commit 4d055982e38f59fdb2a4c9d8855e58548bc42c12.

Reverted https://github.com/pytorch/pytorch/pull/158373 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158373#issuecomment-3080093479))
2025-07-16 19:55:21 +00:00
1cc62c2cb9 [export] Update docs (#157750)
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/157750/export.html

Changes:
* Rename draft_export.md -> export.draft_export.md for consistency.
* Removed non-strict section in export, instead pointed to programming model doc.
* Extended "Expressing Dynamism" section to include Dim hints, ShapeCollection, and AdditionalInputs.
* Removed Specialization section in favor of programming model doc
* Added pt2 archive doc
* Cleaned up sidebar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157750
Approved by: https://github.com/pianpwk
2025-07-16 19:53:12 +00:00
f58a680d09 [c10d]Prototype of remote_group_merge (#158287)
Tentative implementation of merge_remote_group per the proposal here: [docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89](https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158287
Approved by: https://github.com/d4l3k
ghstack dependencies: #157716
2025-07-16 19:33:57 +00:00
944a140e90 Revert "[cuda][cupy] Improve cupy device placement when device is provided (#158320)"
This reverts commit 59f9b25f3cfc635053843372ea29ff4bf754da3f.

Reverted https://github.com/pytorch/pytorch/pull/158320 on behalf of https://github.com/wdvr due to reverting because most likely causing test/test_numba_integration.py::TestNumbaIntegration::test_from_cuda_array_interface_inferred_strides to fail ([comment](https://github.com/pytorch/pytorch/pull/158320#issuecomment-3079960616))
2025-07-16 19:15:33 +00:00
cyy
79ab84e9b8 Fix invalid formatting (#158436)
It causes errors under C++20
```
/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:330:40:
error: call to consteval function 'fmt::fstring<>::fstring<std::string, 0>' is not a constant expression
```
Indeed the printed value is treated as format string and it may contain special chars in some cases. While this is not true in our case, it can't be determined in compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158436
Approved by: https://github.com/Skylion007
2025-07-16 18:47:09 +00:00
2b0f9b1f61 Move c10/macros/Macros.h to headeronly (#158365)
^

Differential Revision: [D78361893](https://our.internmc.facebook.com/intern/diff/D78361893/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158365
Approved by: https://github.com/swolchok
ghstack dependencies: #158358
2025-07-16 18:46:52 +00:00
b40f48d191 Move the rest of c10/macros/Export.h (#158358)
Differential Revision: [D78356975](https://our.internmc.facebook.com/intern/diff/D78356975/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158358
Approved by: https://github.com/swolchok
2025-07-16 18:46:52 +00:00
4d055982e3 recovering node source from dict (#158373)
Summary: this diff recovers NodeSource object from its dict representation, which is crucial for NodeSource serde.

Test Plan:
ci

Rollback Plan:

Differential Revision: D78363882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158373
Approved by: https://github.com/yushangdi
2025-07-16 18:46:09 +00:00
bc9091a524 Fix indexing with multi-dimensional boolean mask (#158369)
Fixes #71673

This fixes a bug in PyTorch indexing, that shows up when mixing multi-dimensional boolean masks with other forms of indexing. Examples:
```python
>>> import torch
>>> x = torch.ones([2, 2, 3])
>>> m = torch.tensor(((True, False), (False, False)))  # (2x2 boolean mask)

>>> x[m].shape  # this works fine (the boolean mask acts on the 2x2 subspace selecting one row)
torch.Size([1, 3])

>>> x[m, 0]  # this should produce a tensor of shape (1,)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 3] at index 1

>>> x[m, ::2]  # this should produce a tensor of shape (1, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 3] at index 1

>>> x[m, None]  # this should produce a tensor of shape (1, 1, 3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 2, 3] at index 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158369
Approved by: https://github.com/ngimel
2025-07-16 18:30:57 +00:00
a26bf38927 Don't need to handle PyTrace_EXCEPTION in pyProfileFn (#154392)
According to the [document](https://python.readthedocs.io/fr/stable/c-api/init.html#c.PyTrace_EXCEPTION) and [comment](https://github.com/python/cpython/blob/3.9/Modules/_lsprof.c#L407), we don't need to handle PyTrace_EXCEPTION in pyProfileFn.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154392
Approved by: https://github.com/sraikund16, https://github.com/cyyever
2025-07-16 18:00:11 +00:00
da05b7fb94 [cond] add _FlopCounterMode support for cond (#158067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158067
Approved by: https://github.com/zou3519
ghstack dependencies: #158077
2025-07-16 17:26:20 +00:00
82b1c48292 [hop] add supports_higher_order_operators flag to TorchDispatchMode (#158077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158077
Approved by: https://github.com/zou3519
2025-07-16 17:26:20 +00:00
a369350065 enable compiled autograd on CPU windows (#158432)
compiled autograd on windows is disabled in PR #144707 because cuda windows cannot compile this code.
However these code can be compiled on CPU. This PR enable these code on CPU windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158432
Approved by: https://github.com/jansel, https://github.com/xmfan

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-07-16 17:22:37 +00:00
ff611d971f [ROCm] check stream graph capture status in memcpy_and_sync inline function (#158165)
Check for stream graph capture when using hipMemcpyWithStream.

Fixes https://github.com/pytorch/pytorch/issues/155684, https://github.com/pytorch/pytorch/issues/155231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158165
Approved by: https://github.com/jeffdaily
2025-07-16 17:17:34 +00:00
4805a6ead6 [aot][XPU] switch xpu to use consts cpp build. (#158425)
Intel compiler is not support `format_consts_to_asm`, let's use `format_consts_to_cpp`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158425
Approved by: https://github.com/jansel
2025-07-16 16:19:33 +00:00
a8b9736737 [BE][testing] disable test_custom_op_square internally (#158367)
Summary: test is failing with `ld.lld: error: unable to find library -laoti_custom_ops`

Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:test_aot_inductor_custom_ops -- --exact 'caffe2/test/inductor:test_aot_inductor_custom_ops - test_custom_op_square_cuda (caffe2.test.inductor.test_aot_inductor_custom_ops.AOTInductorTestABICompatibleCuda)' --run-disabled`

Differential Revision: [D78364617](https://our.internmc.facebook.com/intern/diff/D78364617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158367
Approved by: https://github.com/desertfire
2025-07-16 16:16:14 +00:00
4b11428cb5 [BE][testing] Skip test_repeated_masked_load internally (#158355)
Summary: Test is failing internally because of the import from functorch.einops. _Maybe_ there's a way to get this dependence in the TARGETS file, but the obvious things didn't work. I'm wondering if this test is that important to have running in OSS and internally anyway?

Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:cuda_repro -- --exact 'caffe2/test/inductor:cuda_repro - test_repeated_masked_load (caffe2.test.inductor.test_cuda_repro.CudaReproTests)' --run-disabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158355
Approved by: https://github.com/eellison
2025-07-16 16:15:44 +00:00
a04a13c449 [BE][testing] Skip test_triton_interpret internally (#158260)
Summary: Subprocesses in fbcode are tricky because of .par files. I'm thinking it's not an important enough test to get it running and skipping is fine.

Test Plan: `buck test`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158260
Approved by: https://github.com/eellison
2025-07-16 16:14:44 +00:00
a23f4471b9 [ROCm][Windows] Fix finding ROCm/HIP version (#156486)
This commit fixes Windows build issue related to trying to use rocm-core (rocm-core doesn't exist on HIP SDK)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156486
Approved by: https://github.com/jeffdaily, https://github.com/stellaraccident
2025-07-16 15:31:43 +00:00
06a67a8948 Fix sha256 for aotriton ROCm7.0 tarball (#158420)
Fixes following issue of building PyTorch with ROCm7.0:
```
-- verifying file...
       file='/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz'
-- SHA256 hash of
    /var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz
  does not match expected value
    expected: '7e29c325d5bd33ba896ddb106f5d4fc7d715274dca7fe937f724fffa82017838'
      actual: '1e9b3dddf0c7fc07131c6f0f5266129e83ce2331f459fa2be8c63f4ae91b0f5b'
-- Hash mismatch, removing...
CMake Error at aotriton_external-prefix/src/aotriton_external-stamp/download-aotriton_external.cmake:163 (message):
  Each download failed!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158420
Approved by: https://github.com/jeffdaily
2025-07-16 15:24:20 +00:00
9513b9d03f Revert "Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)"
This reverts commit bc65253369933160a2da3fc786d027a572faf6b7.

Reverted https://github.com/pytorch/pytorch/pull/158037 on behalf of https://github.com/lw due to OSX failures are real ([comment](https://github.com/pytorch/pytorch/pull/158037#issuecomment-3079042171))
2025-07-16 15:04:10 +00:00
0b19d463d9 forward fix lint (#158448)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158448
Approved by: https://github.com/adamomainz
2025-07-16 14:55:33 +00:00
5763ec5f8d [BE] Replace lib with TORCH_INSTALL_LIB_DIR (#158235)
Their values are actually the same. Just staying in line with other `INSTALL` commands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158235
Approved by: https://github.com/Skylion007
ghstack dependencies: #158234
2025-07-16 14:20:19 +00:00
2043f6911e [BE] Rename libnvshmem_extension to libtorch_nvshmem (#158234)
`libnvshmem_extension.so` creates an illusion that it is a shared library from NVSHMEM. But indeed it is built from torch source code, for symmetric tensor infrastructure and operations, though leveraging NVSHMEM APIs. Thus this PR renames `libnvshmem_extension.so` to `libtorch_nvshmem.so`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158234
Approved by: https://github.com/albanD
2025-07-16 14:20:19 +00:00
bc65253369 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-16 13:54:09 +00:00
51a708ffc6 [nativert] libtorch kernel registry (#157150)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77451703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157150
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
2025-07-16 12:36:55 +00:00
55d888a616 Add framework for explanations for common CUDA errors (#158395)
As popularly requested in user groups.

Test plan:
```
import torch

a = torch.randn(10000)
device = torch.device('cuda:1')
a = a.to(device)
```

Before:
```
Traceback (most recent call last):
  File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
    a = a.to(device)
        ^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

After:
```
Traceback (most recent call last):
  File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
    a = a.to(device)
        ^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
GPU device may be out of range, do you have enough GPUs?
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158395
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@fb.com>
2025-07-16 12:31:18 +00:00
0a99b026d6 [Docker builds] Move from Miniconda to Miniforge (#158370)
This is related to: https://www.anaconda.com/legal/terms/terms-of-service

Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799

Rocm and XPU builds since they use Miniforge are not affected

```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
 > [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93     • https://repo.anaconda.com/pkgs/main
11.93     • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93     ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence solution is:
1. using `` conda tos accept --override-channels --channel defaults``
2. use Miniforge instead of Miniconda.

Using solution 2.

Solution Tried that don't work:
1. Using ``CONDA_ALWAYS_YES = true ``

4. Using older version of miniconda
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-07-16 10:52:47 +00:00
ac706bfc7f disable multi kernel rocm (#158299)
Fixes https://github.com/pytorch/pytorch/issues/158274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158299
Approved by: https://github.com/huydhn
2025-07-16 10:20:09 +00:00
9d184bda2f add device generalization support for distributed tests (#156796)
MOTIVATION
To generalize Distributed test cases for non-CUDA devices

CHANGES

- test/distributed/checkpoint/test_fsspec.py
- test/distributed/checkpoint/test_state_dict.py
- test/distributed/test_multi_threaded_pg.py

Replaced hard coded device names with torch.accelerator.current_accelerator

- torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py

support for hccl backend

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156796
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-07-16 09:37:03 +00:00
ea74fdd24a [Inductor][Triton] Update TMA Compatibility Requirements (#157881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157881
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-07-16 09:31:44 +00:00
e71bb021b9 Add a periodic test for older NVIDIA driver (#158300)
This is needed because of the botched landing of https://github.com/pytorch/pytorch/pull/156097 which crashed on older NVIDIA drivers `525.*`.  I add a periodic job to install the `525.105.17` on CI, then run:

1. A smoke to make sure that CUDA can be initialized
2. And the whole the test suite on the older driver
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158300
Approved by: https://github.com/ngimel
2025-07-16 08:18:18 +00:00
fb9a5d248f Fix torch._numpy to match NumPy when empty ellipsis causes advanced indexing separation (#158297)
Fixes #141563

In NumPy, an ellipsis always acts as a separator between advanced indices, even when the ellipsis doesn't actually match any dimensions. In PyTorch an empty ellipsis doesn't cause a separation. This leads to differing behavior between Numpy and PyTorch in this edge case.

This difference in behavior leads to a bug when using torch.compile:
```python
>>> import numpy as np
>>> f = lambda x: x[:,(0,1),...,(0,1)].shape
>>> a = np.ones((3, 4, 5))
>>> f(a)
(2, 3)
>>> torch.compile(f)(a)
(3, 2)
```

Similarly to #157676, this PR doesn't change PyTorch's behavior, but it fixes the translation layer, ensuring torch._numpy compatibility with NumPy. I am marking this PR as fixing #141563, even though PyTorch behavior isn't modified.

Notice that there are still some other bugs in PyTorch's advanced indexing, that need to be fixed (mainly regarding proper accounting of dimensions when multidimensional boolean masks are present). But those need to be fixed at the ATen operator level. Examples:
- #71673
- #107699
- #158125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158297
Approved by: https://github.com/soumith
2025-07-16 08:11:53 +00:00
ddf502c988 [AOTI] add -lstdc++ into aoti link cmd for Meta internal (#158325)
Differential Revision: D78123716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158325
Approved by: https://github.com/desertfire
2025-07-16 07:55:08 +00:00
555f356254 [Easy] Show some clear error when torch.ops.load_library fails. (#157524)
**Background**:

```Shell
torch       2.5.1+cpu
torchvision 0.20.1
```

```Python
import torch
import torchvision

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```

**Cause**:

```
torchvision's .so file lacks some symbol definitions, because these symbols come from CUDA, but the current environment does not have CUDA and GPU. The above error message is very confusing.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
2025-07-16 07:33:22 +00:00
59f9b25f3c [cuda][cupy] Improve cupy device placement when device is provided (#158320)
This is an improvement over https://github.com/pytorch/pytorch/pull/132595 . That PR improves the case where `device` is not given. This PR tries to improve the case where `device` is given but the first step of auto-infer device from `cudaPointerGetAttributes` can be wrong (undesired). See https://github.com/pytorch/pytorch/issues/158316 for more details on when this can happen.

I think this is a reasonable improvement, as people expect `torch.as_tensor` + cupy should be zero-copy as much as possible. However, it does change some behaviors, because previously it might incur a device-to-device copy.

I will leave it to pytorch developers to see if the improvement is worthwhile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158320
Approved by: https://github.com/ezyang
2025-07-16 07:12:36 +00:00
fedbd1a48e Enable ROCm 7.0 Alpha docker builds for PyTorch CI (#158390)
This PR adds ROCm 7.0 alpha docker builds to start testing latest ROCm in PyTorch CI and enable new MI350x hardware.

Highlights:
* Stop building `pytorch-linux-jammy-rocm-n-1-py3` docker images, as they're not currently used in any CI workflows
* Add `pytorch-linux-noble-rocm-alpha-py3` docker images that will use ROCm alpha (newer than latest official release) builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158390
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-07-16 06:09:37 +00:00
5484890539 Add better typing to avaialbe kernel options for flex attention (#158383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158383
Approved by: https://github.com/joydddd, https://github.com/BoyuanFeng
2025-07-16 06:06:29 +00:00
61a7b09ef3 [BE][Easy] split build system requirements.txt to a separate file (#158111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158111
Approved by: https://github.com/ezyang
2025-07-16 05:03:30 +00:00
e92e3eaf4e [Profiler] the doc of _ExperimentalConfig is incorrectly truncated by commas (#156586)
Hi team,

Please help review this trivial fix.

Without this change:

``` python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None

    capture_overload_names (bool) : whether to include ATen overload names in the profile
```

With this change:

```python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None

An experimental config for Kineto features. Please note thatbackward compatibility is not guaranteed.
    profiler_metrics : a list of CUPTI profiler metrics used
       to measure GPU performance events.
       If this list contains values Kineto runs in CUPTI profiler mode
    profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
       or for the entire measurement duration.
    verbose (bool) : whether the trace file has `Call stack` field or not.
    performance_events : a list of profiler events to be used for measurement.
    enable_cuda_sync_events : for CUDA profiling mode, enable adding CUDA synchronization events
       that expose CUDA device, stream and event synchronization activities. This feature is new
       and currently disabled by default.
    adjust_profiler_step (bool) : whether to adjust the profiler step to
       match the parent python event duration. This feature is new and currently disabled by default.
    disable_external_correlation (bool) : whether to disable external correlation
    profile_all_threads (bool) : whether to profile all threads
    capture_overload_names (bool) : whether to include ATen overload names in the profile

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156586
Approved by: https://github.com/sraikund16, https://github.com/cyyever
2025-07-16 04:10:49 +00:00
0a9d450168 [DTensor] implement histc (#158298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158298
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-07-16 04:10:32 +00:00
e265b719bd Extract out prepare_aot_module_simplified for use in next PR (#158319)
Also a small amount of extra code cleanup.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158319
Approved by: https://github.com/jingsh
ghstack dependencies: #158149, #158150, #158173, #158176, #158213, #158251
2025-07-16 03:59:41 +00:00
7637c9718a Move functions from torch._functorch.aot_autograd that are not frontend functions to frontend_utils (#158251)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158251
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176, #158213
2025-07-16 03:59:41 +00:00
49d0332cef Introduce stages to aot_dispatch (#158213)
The starting point for this refactor is that I need access to the fully
general joint graph representation in an export-like interface, but I
then subsequently need a way to feed this joint graph into the rest of
the compilation pipeline so I can get an actual callable that I can run
once I've finished modifying it.  Previously, people had added export
capabilities to AOTAutograd by having an export flag that toggled what
exactly the functions return and triggering aot_dispatch to go to a
different "export" implementation, but I've found this difficult to
understand and has lead to a bit of duplicate code for the export path.

So the idea here is to reorganize the structure of the function calls in AOTAutograd. Here, it is helpful to first describe how things used to work:

* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These call:
  * create_aot_dispatcher_function. This does a bunch of stuff (forward metadata collection) and adds many context managers. This calls:
    * One of aot_dispatch_base, aot_dispatch_export or aot_dispatch_autograd, which:
      * Call aot_dispatch_autograd_graph or aot_dispatch_base_graph to actually do the graph capture
      * Do some base/export/autograd specific post-processing on the graph

Notice the pattern of nested function invocations means that there is no way to easily get the graph capture result from the autograd case; furthermore, the export path is "bolted" on to force the entire chain of functions to have a different return result than normal, and no way to *resume* the rest of the post-processing to actually get a callable.

Here is the new structure:

* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These now orchestrate this top level flow:
  * Start a context manager (stack); this stateful context block takes care of all of the nested context managers which originally necessitated the nested call structure
  * Call create_aot_state to do initial setup and setup all the context managers on stack. These context managers do NOT exit upon return of this.
  * Call aot_stage1_graph_capture to do the graph capture
  * Call aot_stage2_compile or aot_stage2_export depending on what postprocessing you want

With this new structure, it's now possible (although not done in this PR) to return the graph after aot_stage1_graph_capture and do something with it, before running aot_stage2_compile to finish the job.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158213
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176
2025-07-16 03:59:32 +00:00
84dec060b7 Hoist choose_dispatcher to top level, remove unnecessary returns (#158176)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158176
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173
2025-07-16 03:56:25 +00:00
5b0df2565e Pipeline _create_aot_dispatcher_function (#158173)
Two main things of note:

- Review this diff without whitespace changes
- To ensure that context managers correctly propagate to later pipeline
  stages, I am using the ExitStack trick: there is an ExitStack which is
  in scope for the entire pipeline, and inside of the individual
  pipeline stages we push context managers onto this stack when we want
  them to survive into the next pipeline stage.  This is not obviously
  what the best final form of the code is, but
  create_aot_dispatcher_function is called from multiple locations so I
  can't just inline the context managers into the call site.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158173
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149, #158150
2025-07-16 03:56:25 +00:00
0cb36e2d62 cache dict and string rep for better perf (#158372)
Summary: NodeSouce should not be updated after created, so that it would be better if we cache its dict and string representation for better perf.

Test Plan:
ci

Rollback Plan:

Reviewed By: yushangdi

Differential Revision: D78298501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158372
Approved by: https://github.com/yushangdi
2025-07-16 02:15:32 +00:00
584a0510b3 [inductor] fix windows path for fresh cache. (#158324)
`normalize_path_separator` for windows path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158324
Approved by: https://github.com/jansel
2025-07-16 01:54:35 +00:00
9768d393fa add sfdp pattern (#155792)
add sfdp pattern for MBartForCausalLM/PLBartForCausalLM in transformers==4.44.2.
Improve the inference performance of these model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155792
Approved by: https://github.com/Valentine233, https://github.com/jansel
2025-07-16 01:52:05 +00:00
900fba4c07 Update warning of TF32 (#158209)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158209
Approved by: https://github.com/jansel
2025-07-16 01:28:50 +00:00
03852ddc22 Revert "[ROCm] logsumexp on ROCm needs scaling back to natural base. (#156903)"
This reverts commit 1ea9cde598ead20194dbb6c5cb26e74e36e6ad55.

Reverted https://github.com/pytorch/pytorch/pull/156903 on behalf of https://github.com/atalman due to Breaks torchao and torchtitan nightly builds ([comment](https://github.com/pytorch/pytorch/pull/156903#issuecomment-3076423488))
2025-07-16 01:28:46 +00:00
8554c8007d [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.

**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-16 01:05:25 +00:00
651b4a68f2 [hop][dynamo] track run-ahead sym variables in side effects (#158273)
Before the PR, for code like this:
```
        class Example2(torch.nn.Module):
            def forward(self, x, trigger, target):
                return torch.cond(
                    trigger == 1,
                    lambda: x + target,
                    lambda: x * target,
                    (),
                )

        m = Example2()
        x = torch.randn(2)
        trigger = 0
        target = 2
        args = (x, trigger, target)
        ep = torch.export.export(
            m, args, dynamic_shapes=(None, Dim.DYNAMIC, Dim.DYNAMIC)
        )
```
dynamo will wrap "target" (i.e. a symInt) twice, once when we speculate the first lambda and find target is a symint and decides to wrap it up, creating a new SymNodeVariable and a placeholder input to the top-level graph.

The second time happens when we speculate the second lambda. Tensors are de-duplicated by checking tracked side effects to make sure object with the same id (though different sources) is mapped to the same TensorVaraible. For symints, two things are missing:
1. it's not in the _can_lift_attrs_to_input list (the change in builder.py)
2. it's not in the tracked by runahead_side_effects, so when speculate_subgraph finishes, they're discarded (the change in side_effects.py)

Note: the auto lifting mechanism for HOPs happens at proxy level when we trace the subgraph, which is after SymNodeVariable are created (they're created when realizing the args and bind them to subgraph). At that time, builder has created two unique SymNodeVariable for the same symint so the auto lifting in hops cannot de-dup them.

Differential Revision: [D78298163](https://our.internmc.facebook.com/intern/diff/D78298163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158273
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
2025-07-15 23:48:20 +00:00
144965ca9a [BE][S538760] get rid of TORCH_CHECK_.* and CHECK macros (#158269)
Summary: check will be crit, causing program to exit, which is quite dangerous

Test Plan:
CI

Rollback Plan:

Differential Revision: D78050595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158269
Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier
2025-07-15 22:04:12 +00:00
ee0992871c Add test for user-managed weights with load_state_dict (#157496)
Summary:
Adds a unit test to verify that when 'user_managed=True' is passed to 'update_constant_buffer', the compiled AOTI model properly shares parameter storage with the eager model.

The test specifically covers the following:
1. Passes model weights to the AOTI model with 'user_managed=True''.
2. Updates the eager model weights using 'load_state_dict()', which performs in-place
3. Asserts that the compiled AOTI model reflects the updated weights, confirming shared memory behavior.

Fixes: #157474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157496
Approved by: https://github.com/desertfire
2025-07-15 21:17:24 +00:00
05dfd312cf [3/n] Remove references to TorchScript in PyTorch docs (#158315)
Summary:
- cpp_index.rst
- fx.md
- jit_builtin_functions.rst
- jit_python_reference.md
- jit_unsupported.md

cpu_threading
large_scale_deployment

Test Plan:
CI

Rollback Plan:

Differential Revision: D78309320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158315
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-15 21:14:18 +00:00
abeae997a3 Use brew suggested miniconda install command (#158347)
Use ```brew install --cask miniconda``` as specified by https://formulae.brew.sh/cask/miniconda

Forward fix After: https://github.com/pytorch/pytorch/pull/156898#issuecomment-3074207175

Seeing in CI:
```
Run if [[ -n "$REINSTALL_BREW_MINICONDA" ]]; then
==> Caveats
Please run the following to setup your shell:
  conda init "$(basename "${SHELL}")"

Alternatively, manually add the following to your shell init:
  eval "$(conda "shell.$(basename "${SHELL}")" hook)"

==> Downloading https://repo.anaconda.com/miniconda/Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh
Already downloaded: /Users/ec2-user/Library/Caches/Homebrew/downloads/2e356e8b147647692e4da77ce4c0c14eefee65ec86f29cc7e8c21a26ac9397ca--Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh
==> Installing Cask miniconda
==> Running installer script 'Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh'
PREFIX=/opt/homebrew/Caskroom/miniconda/base
Unpacking payload ...
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.

Installing base environment...

Preparing transaction: ...working... done
Executing transaction: ...working...
done
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
installation finished.
==> Linking Binary 'conda' to '/opt/homebrew/bin/conda'
🍺  miniconda was successfully installed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158347
Approved by: https://github.com/seemethere
2025-07-15 21:08:25 +00:00
3f83e3eeca [ONNX] Remove legacy registration and dispatcher (#158283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158283
Approved by: https://github.com/Skylion007, https://github.com/justinchuby
ghstack dependencies: #158258, #158262, #158282
2025-07-15 21:00:49 +00:00
0640cfa38c [2/n] Remove references to TorchScript in PyTorch docs (#158306)
Summary: Removed jit_language_reference.md

Test Plan:
CI

Rollback Plan:

Differential Revision: D78308133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158306
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-15 20:57:23 +00:00
e4c17d5e1c [ONNX] Remove fx_onnx_interpreter.py (#158282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158282
Approved by: https://github.com/Skylion007, https://github.com/justinchuby
ghstack dependencies: #158258, #158262
2025-07-15 20:46:06 +00:00
cc0faeb80f [dynamo][guards] Instruction count for guard eval for development work (#158214)
Its turned off  by default. Even the code is hidden before of the define preprocessing flag. It will be used only for development work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158214
Approved by: https://github.com/StrongerXi
ghstack dependencies: #158215
2025-07-15 20:29:23 +00:00
205241a0d5 [ONNX] Remove legacy dynamo graph extractor (#158262)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158262
Approved by: https://github.com/justinchuby
ghstack dependencies: #158258
2025-07-15 20:21:49 +00:00
19625daf88 [1/n] Remove references to TorchScript in PyTorch docs (#158305)
Summary: Removed jit_language_reference_v2.md

Test Plan:
CI

Rollback Plan:

Differential Revision: D78308009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158305
Approved by: https://github.com/jingsh, https://github.com/svekars
2025-07-15 20:16:53 +00:00
dbf7d421da [BE][testing] fix aot_inductor_package internally (#158270)
Summary: We have internal test failure for several aot_inductor_package tests. It looks like we're translating args like:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor/script.ld
```

To:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor//tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```

This PR changes to strings like:
```
-Wl,--script=/tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```

Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:aot_inductor_package --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158270
Approved by: https://github.com/desertfire
2025-07-15 20:15:18 +00:00
b86d5cef68 [dynamo][tensor] Skip HASATTR attribute on tensor guards (#158215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158215
Approved by: https://github.com/StrongerXi
2025-07-15 20:10:47 +00:00
30587195d3 Migrate c10/macros/cmake_macros.h.in to torch/headeronly (#158035)
Summary: As above, also changes a bunch of the build files to be better

Test Plan:
internal and external CI

did run buck2 build fbcode//caffe2:torch and it succeeded

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D78016591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
2025-07-15 19:52:59 +00:00
250ae2531c Fix types in graphs.py (#158192)
Added type annotations for torch/cuda/graphs.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158192
Approved by: https://github.com/oulgen
2025-07-15 19:49:38 +00:00
011026205a make node source hashable (#158322)
Summary: as title

Test Plan:
ci

Rollback Plan:

Reviewed By: yushangdi

Differential Revision: D78296410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158322
Approved by: https://github.com/yushangdi
2025-07-15 19:31:00 +00:00
4657a84bc5 [Optimus][fp8_activation_quantization] Only log when there's some node to be quantized (#158129)
Summary:
We add some extra check on whether there's some node has been marked as should quantize, otherwise we skip the quantizaton and tlparse log.

Rollback Plan:

Differential Revision: D78173788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158129
Approved by: https://github.com/Skylion007, https://github.com/avicizhu
2025-07-15 19:22:26 +00:00
5606c516fd [ONNX] Remove legacy Dort (#158258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158258
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-07-15 19:14:06 +00:00
7afb834f93 Inline dispatch_and_compile into its call site. (#158150)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158150
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149
2025-07-15 19:08:55 +00:00
148789ddd8 Avoid AOTAutogradCache.load in stack trace on cache miss path (#158149)
The general context for the upcoming stack of commits is I am attempting
to "pipeline" AOTAutograd.  Instead of having function f call function g
which is the next "stage" of compilation, instead f should return with
its outputs, which are then piped to g for the next stage.  This will
make it easier to implement early exit / resume pipeline without forcing
callback structure, which is good for export-style use cases.  It also
reduces the size of our stack traces, which makes tools like Perfetto
happy.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158149
Approved by: https://github.com/jamesjwu
2025-07-15 19:08:55 +00:00
3beb915004 Update CODEOWNERS for dataloading (#158348)
Adding Scott

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158348
Approved by: https://github.com/scotts, https://github.com/janeyx99
2025-07-15 19:06:18 +00:00
cf3247b74a Standalone compile API in _Exporter (#158139)
Given an `package: _ExportPackage`, users can get a ready-to-use workspace in `tmp_dir` by calling:
```python
package._compiled_and_package(
                tmp_dir + "/pt2_pacakge_name.pt2", True, package_example_inputs = True
            )
```

`tmp_dir` will contains:
- `main.cpp` (an example cpp file that create the models, if package_example_inputs is True, it'll also load the example inputs and run the models)
- `CMakeLists.txt`
- `pt2_pacakge_name/` (this is where the models are)
- `pt2_pacakge_name.pt2`
- `inputs.pt` files if package_example_inputs is True

Remaining TODOs
- support loading contants/weights
- the `package_example_inputs = True` option only supports a list of Tensors for now
- eventually we should remove the `torch` dependency, and use `SlimTensor`/`StableIValue` instead.

Test Plan:
```
python test/inductor/test_aot_inductor_package.py  -k test_compile_with_exporter
```

Example generated `main.cpp`:

```cpp
#include <dlfcn.h>
#include <fstream>
#include <iostream>
#include <memory>
#include <torch/torch.h>
#include <vector>
#include <torch/csrc/inductor/aoti_torch/tensor_converter.h>
#include "package/data/aotinductor/Plus__default/Plus__default.h"
#include "package/data/aotinductor/Minus__default/Minus__default.h"

using torch::aot_inductor::AOTInductorModelPlus__default;
using torch::aot_inductor::AOTInductorModelMinus__default;
using torch::aot_inductor::ConstantHandle;
using torch::aot_inductor::ConstantMap;

int main(int argc, char* argv[]) {
    std::string device_str = "cpu";
    try {
        c10::Device device(device_str);
        // Load input tensors for model Plus__default
        std::vector<at::Tensor> input_tensors1;
        for (int j = 0; j < 2; ++j) {
            std::string filename = "Plus__default_input_" + std::to_string(j) + ".pt";
            std::ifstream in(filename, std::ios::binary);
            if (!in.is_open()) {
                std::cerr << "Failed to open file: " << filename << std::endl;
                return 1;
            }
            std::vector<char> buffer((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
            torch::IValue ivalue = torch::pickle_load(buffer);
            input_tensors1.push_back(ivalue.toTensor().to(device));
        }

        // Load input tensors for model Minus__default
        std::vector<at::Tensor> input_tensors2;
        for (int j = 0; j < 2; ++j) {
            std::string filename = "Minus__default_input_" + std::to_string(j) + ".pt";
            std::ifstream in(filename, std::ios::binary);
            if (!in.is_open()) {
                std::cerr << "Failed to open file: " << filename << std::endl;
                return 1;
            }
            std::vector<char> buffer((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
            torch::IValue ivalue = torch::pickle_load(buffer);
            input_tensors2.push_back(ivalue.toTensor().to(device));
        }

// Create array of input handles
        auto input_handles1 =
            torch::aot_inductor::unsafe_alloc_new_handles_from_tensors(input_tensors1);
        auto input_handles2 =
            torch::aot_inductor::unsafe_alloc_new_handles_from_tensors(input_tensors2);

// Create array for output handles
        AtenTensorHandle output_handle1;
        AtenTensorHandle output_handle2;

// Create and load models
        auto constants_map1 = std::make_shared<ConstantMap>();
        auto constants_array1 = std::make_shared<std::vector<ConstantHandle>>();
        auto model1 = AOTInductorModelPlus__default::Create(
            constants_map1, constants_array1, device_str,
            "package/data/aotinductor/Plus__default/");
        model1->load_constants();
        auto constants_map2 = std::make_shared<ConstantMap>();
        auto constants_array2 = std::make_shared<std::vector<ConstantHandle>>();
        auto model2 = AOTInductorModelMinus__default::Create(
            constants_map2, constants_array2, device_str,
            "package/data/aotinductor/Minus__default/");
        model2->load_constants();

// Run the models
        torch::aot_inductor::DeviceStreamType stream1 = nullptr;
        model1->run(&input_handles1[0], &output_handle1, stream1, nullptr);
        torch::aot_inductor::DeviceStreamType stream2 = nullptr;
        model2->run(&input_handles2[0], &output_handle2, stream2, nullptr);

// Convert output handles to tensors
        auto output_tensor1 =
            torch::aot_inductor::alloc_tensors_by_stealing_from_handles(&output_handle1, 1);
        auto output_tensor2 =
            torch::aot_inductor::alloc_tensors_by_stealing_from_handles(&output_handle2, 1);

// Validate outputs
        std::cout << "output_tensor1" << output_tensor1 << std::endl;
        std::cout << "output_tensor2" << output_tensor2 << std::endl;
        return 0;
    } catch (const std::exception &e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
}

```

Rollback Plan:

Differential Revision: D78124705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158139
Approved by: https://github.com/desertfire
2025-07-15 18:47:56 +00:00
46915b1361 Revert "Introduce AcceleratorAllocatorConfig as the common class (#149601)"
This reverts commit 1e8e9f745e43fa38bbfc7b67b30bc66c0e7ebbd6.

Reverted https://github.com/pytorch/pytorch/pull/149601 on behalf of https://github.com/huydhn due to See https://github.com/pytorch/pytorch/pull/149601#discussion_r2208325379 ([comment](https://github.com/pytorch/pytorch/pull/149601#issuecomment-3074965720))
2025-07-15 18:40:59 +00:00
8c3f206457 Fix AArch64 segfaults by disabling strict-aliasing in GridSamplerKernel for GCC 12 and above (#158117)
This PR disables `strict-aliasing` GCC C++ optimization flag on all AArch64 cpus for GCC versions 12 and above.

Pull Request #152825 upgraded gcc version from 11 to 13 in manywheel which caused several segmentation faults in unit tests ( not visible in CI workflows because the jammy gcc version has not been updated yet ).

We Identified the problem also exists in GCC12 hence the ` __GNUC__ >= 12`

Fixes #157626

fixes these tests failures when pytorch is built in GCC12 and above
```
test_ops.py::TestCommonCPU::test_noncontiguous_samples_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_grid_sampler_2d_cpu Fatal Python error: Segmentation fault
test_ops.py::TestMathBitsCPU::test_neg_view_nn_functional_grid_sample_cpu_float64 free(): invalid next size (fast)
test_ops.py::TestCompositeComplianceCPU::test_backward_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_nn_functional_grid_sample_cpu Fatal Python error: Segmentation fault

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158117
Approved by: https://github.com/malfet
2025-07-15 18:26:38 +00:00
41971335c9 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)"
This reverts commit e241a07e6b88aa49d604803bc5a6562f0d9f94d2.

Reverted https://github.com/pytorch/pytorch/pull/150312 on behalf of https://github.com/huydhn due to Sorry for reverting your change but because https://github.com/pytorch/pytorch/pull/157908 has been reverted + this PR caused issue earlier, I think it is better to revert the whole stack and reland it from scratch to be sure ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3074897532))
2025-07-15 18:24:36 +00:00
ea5f88dca6 Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)"
This reverts commit e40ade5182233f548b25f2732effe3719d16e9ad.

Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/huydhn due to Sorry for reverting your change but because https://github.com/pytorch/pytorch/pull/157908 has been reverted + this PR caused issue earlier, I think it is better to revert the whole stack and reland it from scratch to be sure ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3074897532))
2025-07-15 18:24:36 +00:00
f2ecf6145f Revert "Enable AcceleratorAllocatorConfig key check (#157908)"
This reverts commit 65fcca4f8c97de82d35d51ad9b790d10433e9b91.

Reverted https://github.com/pytorch/pytorch/pull/157908 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing internally per https://github.com/pytorch/pytorch/pull/157908#discussion_r2208204782 ([comment](https://github.com/pytorch/pytorch/pull/157908#issuecomment-3074833696))
2025-07-15 18:17:43 +00:00
b26da7741b Revert "[CI] Fixes CI for CUDA Version > 12.9 (#157385)"
This reverts commit 6c5227ba00a2904365af566c24b4681cd01a041c.

Reverted https://github.com/pytorch/pytorch/pull/157385 on behalf of https://github.com/clee2000 due to broke some slow tests test_cpp_extensions_jit.py::TestCppExtensionJIT::test_jit_cuda_archflags [GH job link](https://github.com/pytorch/pytorch/actions/runs/16286465717/job/45986677885) [HUD commit link](6c5227ba00) ([comment](https://github.com/pytorch/pytorch/pull/157385#issuecomment-3074737541))
2025-07-15 18:06:52 +00:00
243b12e565 [Optimus] add einsum_to_pointwise_pass pattern (#155666)
Summary: More context: https://docs.google.com/document/d/1ipiskqG13ZKNX1SGygB3QnHcSyXNQ8pACazPIcS4bnI/edit?tab=t.0

Test Plan:
### how to enable

```
torch._inductor.config.pre_grad_fusion_options={
            "einsum_to_pointwise_pass": {},
        },
```

### unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test 'fbcode//mode/dev-nosan' //caffe2/test/inductor:kernel_optimization
```
Buck UI: https://www.internalfb.com/buck2/267263ff-6f5b-4fff-bfc0-d8f013440ba0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5629499820839168
Network: Up: 61KiB  Down: 675KiB  (reSessionID-fda8edfc-6eef-4bf0-b268-0f8d2e666571)
Loading targets.   Remaining     0/1                                                            1 dirs read, 2310 targets declared
Analyzing targets. Remaining     0/345                                                          284 actions, 329 artifacts declared
Executing actions. Remaining     0/18334                                                        8.0s exec time total
Command: test.     Finished 6 local
Time elapsed: 1:15.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### local reproduce

baseline:

| Metric                | Value       |
|:----------------------|:------------|
| Batch size            | 4096        |
| GPU type              | H100        |
| Latency               | 196.06 ms   |
| Model size            | 1205.21 MB  |
| Flops                 | 7671.30 G   |
| Flops/example         | 1.87 G      |
| TFLOPS/sec            | 39.13       |
| MFU                   | 4.89%       |
| Activation/example    | 1.51 MB     |
| CPU time total        | 602.28 ms   |
| GPU time total        | 798.60 ms   |
| Estimated avg BW      | 234.62 GB/s |
| Estimated avg BW util | 9.78%       |
Trace link: https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/efficient_module_suite/fused_attention_mlp.Jun_09_22_12_38_trace.json.gz&bucket=pyper_traces

with the pattern:

| Metric                | Value       |
|:----------------------|:------------|
| Batch size            | 4096        |
| GPU type              | H100        |
| Latency               | 184.94 ms   |
| Model size            | 1205.21 MB  |
| Flops                 | 7671.30 G   |
| Flops/example         | 1.87 G      |
| TFLOPS/sec            | 41.48       |
| MFU                   | 5.18%       |
| Activation/example    | 1.15 MB     |
| CPU time total        | 562.44 ms   |
| GPU time total        | 754.36 ms   |
| Estimated avg BW      | 201.40 GB/s |
| Estimated avg BW util | 8.39%       |
Trace link: https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/efficient_module_suite/fused_attention_mlp.Jun_10_22_03_34_trace.json.gz&bucket=pyper_traces

### E2E

baseline: f713998364
with patter:

Rollback Plan:

Differential Revision: D76400889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155666
Approved by: https://github.com/Yuzhen11
2025-07-15 17:50:23 +00:00
b7b1109f49 Expose opt_einsum in torch.backends (#157740)
Fixes the following issue:
```
:/tmp# python -c "import torch; print(torch.__version__)"
2.7.1+cu126
:/tmp# python -c "import torch; print(torch.backends.opt_einsum.is_available())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'torch.backends' has no attribute 'opt_einsum'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157740
Approved by: https://github.com/Skylion007, https://github.com/benjaminglass1
2025-07-15 17:46:43 +00:00
26807dcf27 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)"
This reverts commit c062550a3598d27c2d6572db7c0f4ff90a84cc84.

Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/clee2000 due to broke test_linear_and_cel on main c062550a35, caused OOM? Also broken on PR, Dr. CI classification is wrong (claims the test is disabled by an issue but the issue is for a different test).  Also I'm pretty sure the expected results json is supposed to have a ton of empty lines, its to prevent merge conflicts, I will add it to the linter ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3074355331))
2025-07-15 16:35:55 +00:00
4f36743f5e Revert "[simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062)"
This reverts commit 5a54db14e3843cfa87fd8d27487dbf2f2dfb6c47.

Reverted https://github.com/pytorch/pytorch/pull/158062 on behalf of https://github.com/clee2000 due to sorry I want to revert something else and this is causing a merge conflict, all you should need to do is rebase and remerged ([comment](https://github.com/pytorch/pytorch/pull/158062#issuecomment-3074342140))
2025-07-15 16:31:13 +00:00
05d7288e31 Fix incorrect bin edge description in histogramdd docs (#158275)
Fixes #124435

This updates the torch.histogramdd documentation to correctly state that bins are inclusive of their left edges, not exclusive as currently written. There was a previous PR addressing this but it was closed due to inactivity. This picks that up and applies the fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158275
Approved by: https://github.com/albanD
2025-07-15 16:25:01 +00:00
5a54db14e3 [simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062)
Differential Revision: [D78159013](https://our.internmc.facebook.com/intern/diff/D78159013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158062
Approved by: https://github.com/wconstab
2025-07-15 14:27:57 +00:00
90618581e9 Fix grouped MM output strides when compiled but not max-autotuned (#158143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158143
Approved by: https://github.com/ngimel
2025-07-15 11:53:13 +00:00
4e13eca713 [BE] Remove CUDA 11.8 artifacts (#158303)
We are including cufile by default in all CUDA 12+ builds. Since CUDA 11.8 is removed we can safely remove this code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158303
Approved by: https://github.com/Camyll, https://github.com/cyyever
2025-07-15 11:52:08 +00:00
156a377f4c [AOTI][CPP] add flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL (#157949)
Summary: Add flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL to force inline the kernel function when TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL=1. It's disabled by default because force inlining may increase the build time.

Differential Revision: D77915987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157949
Approved by: https://github.com/desertfire
2025-07-15 10:51:43 +00:00
6200584193 [cutlass backend][BE] remove force disable cache in tests (#158053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158053
Approved by: https://github.com/coconutruben
2025-07-15 10:35:34 +00:00
e40ade5182 Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #150312
2025-07-15 10:14:35 +00:00
e241a07e6b Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We would deprecate those option that overleap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
2025-07-15 10:14:35 +00:00
7f9fc7e67c [Inductor] Add CPU_MAX_FIRST_DIMENSION_DECOMPOSITION and CPU_MAX_OTHER_DIMENSION_DECOMPOSITION for decompose_mm_pass (#158183)
Differential Revision: D78209993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158183
Approved by: https://github.com/houseroad
2025-07-15 10:07:25 +00:00
1b389025ba Refactor and Improve the OpenReg Module (#158090)
----
# Refactor and Improve the OpenReg Module

## Background

Since PrivateUse1 has become the main path for integrating new devices with PyTorch, there have been some feature requests related to PrivateUse1 regarding interfaces, documentation, reference examples, etc., such as the following:

- https://github.com/pytorch/pytorch/issues/155864
- https://github.com/pytorch/pytorch/issues/144955
- https://github.com/pytorch/pytorch/issues/144845

Taking these requests into consideration and combining them with the position of OpenReg, which is currently used as the test backend for PrivateUse1, I'm planning to make the following optimizations:

- Optimize the implementation of OpenReg to make it align with the standard specifications for real backend (C++) access, serving as a reference for new device integration code.
- Add comprehensive documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html) to guide new accelerator integration, functioning as a reference manual.

## Design Principles:

- Minimization Principle: Keep the code small and clear; only implement the minimum set of code required for verification and as an integration reference.
- Authenticity Principle: Integrate OpenReg in the same way that real accelerators access PyTorch.

## More Infos:

Pleaes refer to [this](6b8020f1ab/test/cpp_extensions/open_registration_extension/torch_openreg/README.md) for more information about `OpenReg`.

## Current Progress:
- Refer to the implementation of [torch_xla](https://github.com/pytorch/xla) to refactor all of OpenReg's code, making it easier to understand.
- Ensure all tests in [test/test_openreg.py](https://github.com/FFFrog/pytorch/blob/openreg/test/test_openreg.py) pass after refactoring.

## Next Steps:
- Add more features to cover all integration points.
- Gradually add user guides and documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158090
Approved by: https://github.com/seemethere, https://github.com/albanD
2025-07-15 08:10:05 +00:00
6c5227ba00 [CI] Fixes CI for CUDA Version > 12.9 (#157385)
Compute capabilities older than volta (inclusive) is no longer supported in CUDA Version > 12.9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157385
Approved by: https://github.com/huydhn
2025-07-15 07:04:54 +00:00
c8c221c0b3 [Inductor][Float8] Add float8_e4m3fn into assertion dtype list. (#157684)
Fix assert issue.
Add float8_e4m3fn into dtype list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157684
Approved by: https://github.com/Xia-Weiwen, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-07-15 06:02:01 +00:00
3341c131b7 [SymmMem] Fix NCCL Hang in NVSHMEM Triton Wait Until Test (#158167)
The `test_triton_wait_until` test was hanging due to an NCCL synchronization issue stemming from mismatched NVSHMEM operations. Specifically, the flag variable was updated using `nvshmemx_signal_op` (a signaling operation), but waited on with `nvshmem_wait_until` (intended for put/get updates). Per NVSHMEM documentation (see documentation reference section below), signal-updated variables require `nvshmem_signal_wait_until` for proper completion guarantees, so the mismatch caused a deadlock and NCCL hang.

**Fix:**
- A simple fix was to replace the flag update with a regular `nvshmem_putmem_block` (via `put_kernel`) to match `nvshmem_wait_until`. I also added a fence (`nvshmem_fence`) between data and flag puts on the sender (Rank 1) for ordered delivery.

- In a follow-up PR I will add a kernel/test to demonstrate usage of `nvshmemx_signal_op`

**Testing:**
- I ran `python test/distributed/test_nvshmem_triton.py` and  `python test/distributed/test_nvshmem_triton.py  -k test_triton_wait_until`

- I also verified with debug prints (Sender completes puts/fence before receiver's wait returns, and assertions confirm correct state). Multiple runs show no hangs or failures.

**Documentation Referenced:**
- [NVSHMEM Point-To-Point Synchronization](https://docs.nvidia.com/nvshmem/api/gen/api/sync.html) explicitly states: *"the sig_addr object at the calling PE is expected only to be updated as a signal, through the signaling operations available in Section NVSHMEM_PUT_SIGNAL and Section NVSHMEM_PUT_SIGNAL_NBI"*
- [NVIDIA's Official Ring Broadcast Example](https://docs.nvidia.com/nvshmem/api/examples.html) demonstrates the correct pairing: `nvshmemx_signal_op` with `nvshmem_signal_wait_until` (not `nvshmem_wait_until`)
- [NVSHMEM Signaling Operations](https://docs.nvidia.com/nvshmem/api/gen/api/signal.html) documents that signal operations work on special "signal data objects" with specific atomicity guarantees distinct from regular RMA operations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158167
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2025-07-15 05:57:27 +00:00
9cd521de4d Fix torchrec multiprocess tests (#158159)
Summary: The new version of `get_device_tflops` imported something from testing, which imported common_utils.py, which disabled global flags.

Test Plan:
Fixing existing tests

Rollback Plan:

Differential Revision: D78192700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158159
Approved by: https://github.com/nipung90, https://github.com/huydhn
2025-07-15 05:44:37 +00:00
058fb1790f Fix compilation and "import torch" issues for cpython 3.14 (#158184)
Beginning of process for 3.14 bringup.

State of things from this PR:
- Nothing too scary looking from the Dynamo CPython side, nothing we heavily rely on seems to be missing @williamwen42
- The existing check that makes torch.compile() nicely fail is working as expected. So all these empty functions shouldn't cause any weirdness.
- The `__module__` update changes look suspicious, we should investigate what is the reason and impact of that, in particular for our public API checking @jbschlosser
- Leaving the weakref.py thread safety change as a follow up to keep this a bit simpler. I vendored the whole struct in the meantime FYI @ezyang

EDIT: The `__module__` change is even more cursed than I though due to changes to Union and Optional type where the `__module__` field cannot be changed anymore. See https://github.com/python/cpython/issues/132139 for details.
For now, I'm just skipping the `__module__` setting for 3.14 which will trip the public API checks. Will revisit once I have a final answer on the cpython issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158184
Approved by: https://github.com/msaroufim
2025-07-15 05:06:55 +00:00
add0b450bd [DTensor][BE] improve DTensor ops correctness check utils (#158112)
**Summary**
Implemented the test pattern described in https://github.com/pytorch/pytorch/pull/157991#discussion_r2196363170 as a util method in `DTensorTestBase`. The difference to `DTensorTestBase._test_op` is:
1. allowing users to specify the `Partial` placement.
2. supporting tree-like output structure.

**Test**
so far only adopt `DTensorTestBase._test_op_on_dtensor` in `DistTensorOpsTest.test_split_on_partial`.
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158112
Approved by: https://github.com/Skylion007, https://github.com/zpcore
ghstack dependencies: #158051
2025-07-15 04:50:34 +00:00
4c1fabf2c9 [DTensor] have split_strategy return OpStrategy instead of TupleStrategy (#158051)
**Summary**
`split_strategy` used `TupleStrategy` as return type because DTensor sharding
propagation's `OpStrategy` support on multi-returns only applies to `Tuple`.

However, `TupleStrategy`'s not a good fit for `split` op. `TupleStrategy` was
initially introduced to handle the sharding strategy of `foreach_*` ops where
the input args can be split into independent subsets regarding sharding decisions,
so are the outputs.

To address the misuse, this PR adds `OpStrategy` propagation for `List[Tensor]`
(note that this support is INCOMPLETE because it only checks the return type
to be `torch.ListType`). Nevertheless, the logic for `Tuple` returns also made
similar assumption so I think it's fine to unblock in such a way.

Besides adding `OpStrategy` support to ops having `List[Tensor]` return type,
this PR also changes `split_strategy`'s return from `TupleStrategy` to `OpStrategy`.

**Test**
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158051
Approved by: https://github.com/wconstab, https://github.com/zpcore
2025-07-15 04:50:34 +00:00
a2ad16be72 [ONNX] Remove legacy Dort tests (#158294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158294
Approved by: https://github.com/justinchuby
ghstack dependencies: #158255, #158256, #158257
2025-07-15 04:44:14 +00:00
5fb07acbc3 [ONNX] Remove legacy modularization (#158257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158257
Approved by: https://github.com/justinchuby
ghstack dependencies: #158255, #158256
2025-07-15 04:36:01 +00:00
336bff6d58 [ONNX] Remove legacy graph passes (#158256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158256
Approved by: https://github.com/justinchuby
ghstack dependencies: #158255
2025-07-15 04:27:30 +00:00
12151c96d9 [ONNX] Remove legacy io_adapter (#158255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158255
Approved by: https://github.com/justinchuby
2025-07-15 03:39:18 +00:00
4486a6dbfd [DTensor] Fix grouped_mm strategy for invalid stride cases (#158245)
local_tensor input to grouped_mm has a stride requirement.

(see `_meta_grouped_mm_common` in meta_registrations.py or
`check_valid_strides_and_return_transposed` in native/cuda/Blas.cpp)

Don't allow sharding a tensor if its shape would result in an
incompatible local_tensor stride.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158245
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-07-15 03:29:49 +00:00
a5e68814d5 Allow dynamic shapes for DTensor slice (#157953)
This PR allows for symints in `gen_slice_strategy` which is the strategy for `aten.slice.Tensor`. Previously, using dynamic shapes with slicing would result in
```
   File ".../pytorch/torch/distributed/tensor/_ops/_tensor_ops.py", line 348, in gen_slice_strategy
     assert isinstance(end, int)
 torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function getitem>(*(DTensor(local_tensor=FakeTensor(..., device='cuda:0', size=(s3, 2)), device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Shard(dim=0),)), slice(None, (s77//2), None)), **{}): got AssertionError()
```

Questions before merge:
1. `dim` is still asserted to be int. Is this fine, or is this potentially dynamic as well?
2. I'm using argtype ignore for `normalize_dim`. Should I instead change types for `normalize_dim` and further dependency to be `IntLike` as well?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157953
Approved by: https://github.com/wconstab
2025-07-15 00:54:01 +00:00
ef4cca2d79 [precompile] Increment frame and add compile ids when loading packages (#158028)
When loading a package and calling package.install(backends), we create a new frame and compile id for each package load, so that tlparse and chromium events still show compile times on warm start.

There is an argument for not doing this in AOT precompile, as no "compile" occurs. So for now, we put it in `package.install`, which hopefully won't be a thing for AOT precompile.

## Recompiles
Recompiles get saved to the same frame and code entry, so on warm start, each recompile will get collapsed into the same entry. Therefore, dynamo compiles that have recompiles on cold start (0/0, 0/1, 0/2, etc) will all get collapsed into a single compile id (0/0), as warm start will load all of the entries properly.

## Graph breaks
Graph breaks get their own compile id, and therefore their own code entry. These are replicated on warm start, so if cold start you had 4 different graphs (and therefore 4 compile ids), you'll have 4 compile ids on warm start as well.

## Test plan
Added a frame counter check to existing unit tests for automatic dynamic, showing that old and new frame counter between old and new load is the same.

This is the chromium event for test_automatic_dynamo_graph_breaks_device_cuda:
```
python test/dynamo/test_package.py -k test_automatic_dynamo_graph_breaks_device_cuda
```

<img width="2216" height="508" alt="image" src="https://github.com/user-attachments/assets/f604ed33-5c31-464b-9320-d67b2e6f57a1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158028
Approved by: https://github.com/oulgen
2025-07-15 00:53:52 +00:00
1c6057fd17 add eq function to NodeSource (#158170)
Summary: add eq function to NodeSouce by comparing their dict representation.

Test Plan:
ci

Rollback Plan:

Differential Revision: D78200762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158170
Approved by: https://github.com/ezyang, https://github.com/yushangdi
2025-07-15 00:50:06 +00:00
7e433d5f42 [cutlass backend] cache a few things for codegen and properties (#158158)
Differential Revision: [D78193404](https://our.internmc.facebook.com/intern/diff/D78193404/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158158
Approved by: https://github.com/ColinPeppler
2025-07-15 00:18:31 +00:00
b7def5ff1c dist2: add support for passing custom configs directly to PG (#158147)
This is intended to make it easier to have backend specific "hints" that can be provided by the user to hint about certain options.

```py
import torch.distributed._dist2 as dist2

pg = dist2.new_group(backend="my_custom_backend", device=..., timeout=..., foo=1234, bar="1234")
pg.allreduce(...)
```

Test plan:

```
pytest test/distributed/test_dist2.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158147
Approved by: https://github.com/fduwjj
2025-07-15 00:02:54 +00:00
7cf31b4a42 [dynamo] fix NamedTupleVariable cloning (#158190)
FIXES https://github.com/pytorch/pytorch/issues/157945

## Explanation
1. Some VTs add additional attrs e.g. NamedTupleVariable has "dynamic_attributes"
a0308edb6c/torch/_dynamo/variables/lists.py (L1048-L1051)

2. VT.clone passes everything by dict, includes "dynamic_attributes"
a0308edb6c/torch/_dynamo/variables/base.py (L255-L259)

3. Non-handled args become kwargs in VT's `__init__`, `super().__init__()` passes kwargs to Base VT
a0308edb6c/torch/_dynamo/variables/lists.py (L1048-L1051)

4. Base VT's `__init__` gets unexpected "dynamic_attributes" kwarg
a0308edb6c/torch/_dynamo/variables/base.py (L609-L613)

You could also let Base VT's `__init__` ignore additional kwargs, but that seemed a bit too permissive, and I don't think many VT's add these derived class only attrs.

## After fix

```python
 ===== __compiled_fn_1_7f9541ed_e166_43fe_8322_c5225ce4207f =====
 /home/xmfan/core/miniconda3/envs/0712/lib/python3.12/site-packages/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, L_x_: "f32[4, 8, 6][48, 6, 1]cpu"):
        l_x_ = L_x_

         # File: /home/xmfan/core/a/torchtitan/wtf.py:10 in forward, code: U, S = torch.linalg.svd(x)[:2]
        linalg_svd = torch._C._linalg.linalg_svd(l_x_);  l_x_ = None
        U: "f32[4, 8, 8][64, 1, 8]cpu" = linalg_svd[0]
        S: "f32[4, 6][6, 1]cpu" = linalg_svd[1];  linalg_svd = None

         # File: /home/xmfan/core/a/torchtitan/wtf.py:11 in forward, code: reduced = U[:, :, :self.k] @ torch.diag_embed(S[:, :self.k])
        getitem_3: "f32[4, 8, 5][64, 1, 8]cpu" = U[(slice(None, None, None), slice(None, None, None), slice(None, 5, None))];  U = None
        getitem_4: "f32[4, 5][6, 1]cpu" = S[(slice(None, None, None), slice(None, 5, None))];  S = None
        diag_embed: "f32[4, 5, 5][25, 5, 1]cpu" = torch.diag_embed(getitem_4);  getitem_4 = None
        reduced: "f32[4, 8, 5][40, 5, 1]cpu" = getitem_3 @ diag_embed;  getitem_3 = diag_embed = None
        return (reduced,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158190
Approved by: https://github.com/StrongerXi
2025-07-14 23:39:25 +00:00
08799217ae [CI] Move main branch rocm binary builds to its own workflow (#158161)
Petition to move out of ciflow/trunk and into ciflow/rocm because it's a long pole for TTS

<img width="1192" height="312" alt="image" src="https://github.com/user-attachments/assets/b12a097a-3763-4c62-b09f-094ee9ae1c37" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158161
Approved by: https://github.com/seemethere
2025-07-14 23:07:49 +00:00
48315181c7 [CI] Do not run inductor rocm on ciflow/inductor (#158162)
Petition to only run inductor-rocm on ciflow/inductor-rocm and not ciflow/inductor because it's a long pole for TTS
<img width="1266" height="315" alt="image" src="https://github.com/user-attachments/assets/b3587bf7-b1a6-45f3-9b6a-c0e6d473d13b" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158162
Approved by: https://github.com/seemethere
2025-07-14 23:07:45 +00:00
38371f693b ci: Switch lintrunner-noclang to use linter image (#158261)
This changes the image the lintrunner jobs utilizes to be the base linter image
instead of the CUDA image. This is done to reduce the image size and speed up the
build time.

This was switched in https://github.com/pytorch/pytorch/pull/110502 when
clang used to run in the lintrunner jobs but it is now split out so we can
use the default image for non-clang jobs.

Difference in pull time (from running job): ~5min --> ~1min (80% reduction), this should result in an overall runtime decrease of ~25min --> ~20min (20% reduction)

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158261
Approved by: https://github.com/Camyll, https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/Skylion007
2025-07-14 22:54:51 +00:00
c062550a35 [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.

**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-14 22:27:21 +00:00
9345279c6e skip inductor/test_torchinductor_opinfo in windows (#158225)
During enabling inductor CI in Windows, `test_torchinductor_opinfo.py` cost too many time (about 12 hours). This UT was seriously exceeding the time limit of CI. The compiler building was slower 4x in Windows than Linux after analyzing.

Thus, we decide to skip the UT temporary and @xuhancn will keep searching the solution of compiler building in Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158225
Approved by: https://github.com/jansel

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-07-14 22:14:52 +00:00
194539e9c3 Address NaNs if SDPA is called with all values masked from query (#157727)
Fixes #156707

Detect if all values along the softmax axis are infs and overwrite the outputs for those computations with zeros before the final matmul. The behavior should be aligned with the CPU implementation.

These types of cases where all values along the dimension in the attention mask are false leading to the undefined outputs in softmax occur with left padded batches for generation in HF transformers according to the original issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157727
Approved by: https://github.com/malfet
2025-07-14 22:09:35 +00:00
bcf50636ba [CI] Removing --user flag from all pip install commands (#154900)
Related to https://github.com/pytorch/pytorch/issues/148335

python virtualenv doesn't support using `--user` flag:

```
ERROR: Can not perform a '--user' install. User site-packages are not visible in this virtualenv.
+ python3 -m pip install --progress-bar off --user ninja==1.10.2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154900
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-07-14 21:09:42 +00:00
6b2bef10af [c10d] Prototype of group_split for dist2 work (#157716)
This is to implement group_split as proposed in [docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89](https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157716
Approved by: https://github.com/d4l3k
2025-07-14 21:04:12 +00:00
1e4d8b5a4a Fix land race typos from #157290 (#158272)
TSIA, this is a new grammar linter being added recently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158272
Approved by: https://github.com/clee2000
2025-07-14 20:55:13 +00:00
725c327284 [nativert] add memory overlap debug assertion (#157290)
Summary: better safe than sorry. will throw if memory overlap detected when using planned tensors and debug mode is enabled -- this will make our planning unit tests more robust.

Test Plan:
ci

Rollback Plan:

Differential Revision: D77327841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157290
Approved by: https://github.com/SherlockNoMad, https://github.com/zhxchen17
2025-07-14 19:12:41 +00:00
f87d117939 redo of [Inductor][Cutlass] verify cutlass has cache_file attribute before moving...resolves cutlass cute exception (#158206)
trying to land https://github.com/pytorch/pytorch/pull/156672

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158206
Approved by: https://github.com/lessw2020, https://github.com/Skylion007
2025-07-14 18:50:23 +00:00
5633283574 [reland][DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#158204)
This PR is identical to https://github.com/pytorch/pytorch/pull/157216, which got reverted because of removing an outdated import of `torch._dynamo` https://www.internalfb.com/diff/D78021229?transaction_fbid=1713683499308113

The issue has been fixed by @weifengpy by D78199546, so this PR should be good to re-land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158204
Approved by: https://github.com/weifengpy
2025-07-14 18:07:21 +00:00
5b10b0a96f Slightly improve error message from repeat_interleave kernel (#157996)
Summary:
In many investigations relating to invalid feature values, the three-argument form of `repeat_interleave` currently prints the following message if there is an inconsistency between `sum(repeats)` and `output_size`:
```
Assertion `result_size == cumsum_ptr[size - 1]` failed.
```

This is a bit hard for model authors to understand so I made the error slightly more comprehensible. After the fix the stdout contains the actual values of these parameters: https://fburl.com/mlhub/cfyyhh3q

```
Invalid input! In `repeat_interleave`, the `output_size` argument (949487) must be the same as the sum of the elements in the `repeats` tensor (949687).
```

In many cases, this is potentially useful information since we know for example that the difference between the two values above (949687-949487=200) happens to be the lengths of one of the features.

## What are my concerns with this change?
1. Outputs from `__assert_fail` go to `stderr` whereas `printf` writes to `stdout`. This is not the usual debugging flow where all logs can be found in `stderr`. I could not find a way to redirect `printf` to stderr or `__assert_fail` to stdout
2. Two checks happen instead of one in the error path. I wanted to preserve the semantics of what happens inside `__assert_fail`.
3. I have not seen this pattern in other PyTorch kernels but `repeat_interleave` with three arguments seems special in other ways too.

Test Plan:
* Built an ephemeral package with my changes:
https://www.internalfb.com/intern/servicelab/build/736441058/

* Verified that a job with these changes indeed prints out the expected message to stdout: https://fburl.com/mlhub/jgbqk8eg

* I will export to GH and run CI/CD tests.

Rollback Plan:
steps:
  - manual.note:
      content: >-
        Just reverting this diff should be sufficient. Since this change is in
        CUDA kernels, I do not believe there is a way to change the error
        message via a JK.

Reviewed By: mradmila

Differential Revision: D77904753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157996
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-07-14 17:55:14 +00:00
fb462cec8d Normalize placeholder names in AOTAutogradCache (#157916)
This PR adds a pass to sanitize_gm_for_cache which normalizes all placeholder names across input dynamo graphs to AOTAutogradCache. This is safe because nothing underneath AOTAutograd uses the node names on the
original dynamo graph: AOTAutograd re-traces with its own nodes, and guards are
in terms of original sources rather than placeholder names.

Note that the dynamo output graphs traced by tlparse will not show this change because it's done before this sanitization step. The aot autograd outputs also will not change because AOTAutograd's own traced graphs don't use the original placeholders of the dynamo graph. Thus, this change is essentially a no-op from everyone's perspective except for cache key checks.

Fixes #157792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157916
Approved by: https://github.com/zou3519
2025-07-14 17:45:11 +00:00
9b0013c6bb [CI] Update mobile build docker image (#158153)
The docker image got removed and then the job started building its own -> takes a long time

I don't know why it uses the asan image

<img width="1906" height="330" alt="image" src="https://github.com/user-attachments/assets/72fbf40c-3cd6-44ea-b61b-6335d2a4b589" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158153
Approved by: https://github.com/Skylion007
2025-07-14 17:35:58 +00:00
6ea91f0672 Revert "[Inductor] Set the default value of min_chunk_size to 512 (#150762)"
This reverts commit 3321acc92e24859dbe2ac6499067d1afde5622c3.

Reverted https://github.com/pytorch/pytorch/pull/150762 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but an inductor compilation error shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/150762#issuecomment-3070286787))
2025-07-14 16:58:13 +00:00
6fe7456aa1 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)"
This reverts commit 03b307575a98dc1d953c9d3521a9489e0e61e70c.

Reverted https://github.com/pytorch/pytorch/pull/150312 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing to build PyTorch internally ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3070218901))
2025-07-14 16:33:48 +00:00
e8cca7bac7 Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)"
This reverts commit 85857181ebca86e9c709e9922a9d9ef41a9c4ef9.

Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing to build PyTorch internally ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3070218901))
2025-07-14 16:33:48 +00:00
59c3cac454 Tag CPython test files with the commit or tag they were copied from. (#158038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158038
Approved by: https://github.com/XuehaiPan, https://github.com/zou3519
ghstack dependencies: #157799, #157800, #157801, #157802, #156981
2025-07-14 15:42:19 +00:00
826f12b829 [SymmMem] Avoid library mismatch in CMake search (#157836)
Before, if NVSHMEM is installed at *BOTH* system location (e.g. `/usr/local`) and conda location (e.g. `/path/to/conda/lib/python3.10/site-packages/nvidia/nvshmem`, there can be a mismatch in where host lib and device lib are found:
```
-- NVSHMEM_HOME set to:  ''
-- NVSHMEM wheel installed at:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB:  '/usr/local/lib/libnvshmem_host.so'
-- NVSHMEM_DEVICE_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```

The reason is that CMake prioritize name search over dir search. In the script below, CMake will search all locations for `libnvshmem_host.so` first, before it searches for `.so.3`.
```
find_library(NVSHMEM_HOST_LIB
      # In pip install case, the lib suffix is `.so.3` instead of `.so`
      NAMES nvshmem_host nvshmem_host.so.3
      HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}
      PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)
```

This PR adds the `NAMES_PER_DIR` flag, according to CMake's doc:
> The NAMES_PER_DIR option tells this command to consider one directory at a time and search for all names in it.

After this PR:
```
-- NVSHMEM_HOME set to:  ''
-- NVSHMEM wheel installed at:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3'
-- NVSHMEM_DEVICE_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157836
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #157513, #157695
2025-07-14 14:13:02 +00:00
86d8af6a6c Add sm_70 to windows 12.9 build (#158126)
Please see: https://github.com/pytorch/pytorch/issues/157517
Volta architectures will be kept for 12.8/12.9 builds for release 2.8 (12.8 win build does not need change since already including sm70)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158126
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-07-14 13:11:10 +00:00
0bb733ba23 Add cuda 12.4 build in CI (#157958)
Fixes to https://github.com/pytorch/pytorch/issues/156747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157958
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-07-14 13:01:16 +00:00
0f21fa84fb Documentation Fix: torch.empty_like memory preservation (#158050)
updated docs for torch.empty_like to reflect view and dense memory behavior

Fixes #158022

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158050
Approved by: https://github.com/ngimel, https://github.com/cyyever
2025-07-14 06:02:54 +00:00
aa11628576 Issue warning with reference to user code rather than torch (#155112)
Re-raising of #129959 as that was closed.

Warning message before:
```
/home/admin/.local/share/hatch/env/virtual/toms-project-1/Qv9k_r_5/dev/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:120: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
```

Warning message after:
```
/path/to/my/code:91: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
```

Helps the user find where the issue stems from in their code. What do you think?

(Looks like "skip_file_prefixes" is not available until Python 3.12 minimum...)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155112
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-07-14 05:24:23 +00:00
9ca080db87 [MPS] Extend atomic operations to all int types (#158179)
That fixes `index_put(..., accumulate=True)` for all dtypes

int64 operation is not really atomic, but eventually consistent from the `index_put_accumulate` kernel point of view: i.e. by the end of the operation results in the global memory are indeed accumulation of the operands at given indices
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158179
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #158064, #158178
2025-07-14 04:25:05 +00:00
1ea9cde598 [ROCm] logsumexp on ROCm needs scaling back to natural base. (#156903)
Fixes #156012

This is a temporary solution that makes context parallelism working before logsumexp behavior changes landed in AOTriton.

After discussion we are not going to release AOTriton 0.10.1 to fix this due to
* Even if the interface is not changed, changing the behavior of returned logsumexp tensor should still be considered as an ABI break. Such changes do not fall into the "ABI compatible" category and should be postponed to next release.
* AOTriton 0.11 is scheduled to be released before end of July, which is less than five weeks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156903
Approved by: https://github.com/jeffdaily, https://github.com/XilunWu
2025-07-14 02:50:36 +00:00
edb92e16ba feat(dynamo): raise UnsupportedError for ndarray.astype(object) (#157810)
Fixes #157720

###  What's in this PR?

This PR improves the error handling in `torch.compile` for `ndarray.astype('O')` (or `object`). It now explicitly raises a `torch._dynamo.exc.Unsupported` exception with a clear explanation, instead of failing with a less intuitive error during fake tensor propagation.

This is achieved by adding a check within `NumpyNdarrayVariable.call_method` for this specific `astype` pattern.

A new test, `test_ndarray_astype_object_graph_break`, is also added to `test/test_numpy_interop.py` to verify this new behavior.

### Background

Previously, attempting to `torch.compile` a function containing `ndarray.astype('O')` would result in a `TorchRuntimeError` wrapping a `TypeError: data type 'O' not understood`. This error message, originating deep within the tensor mechanism, was not very user-friendly and didn't clearly state *why* it was unsupported.

This change makes the failure more explicit and provides a better user experience by giving a direct, actionable error message.

**Old Behavior (Error Traceback):**
```
torch.dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: ... got TypeError("data type 'O' not understood")
```

**New Behavior (Error Message):**
```
torch.dynamo.exc.Unsupported: ndarray.astype(object)
Explanation: ndarray.astype('O') or ndarray.astype(object) is not supported by torch.compile, as there is no equivalent to object type in torch.
```

### Testing

A new test has been added to `test_numpy_interop.py` which decorates a function containing `ndarray.astype("O")` with `torch.compile`. The test asserts that a `torch._dynamo.exc.Unsupported` exception is raised, confirming the new error handling works as expected.

The test can be run with:
`pytest test/test_numpy_interop.py -k test_ndarray_astype_object_graph_break`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157810
Approved by: https://github.com/jansel
2025-07-14 01:22:49 +00:00
3321acc92e [Inductor] Set the default value of min_chunk_size to 512 (#150762)
Change the default value of min_chunk_size from 4096 to 512 to allow more for loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-07-14 01:14:30 +00:00
1f57e0e04d [CPU] Support GQA for flash attention (#157893)
As many models require GQA, we support it in flash attention for CPU path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157893
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-07-13 09:49:02 +00:00
c68af9af1b Fix XPU CI UT test_circular_dependencies (#158189)
# Motivation
fix https://github.com/pytorch/pytorch/issues/110040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158189
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-07-13 09:30:57 +00:00
5aee022d8b [BE] Move repeated code into helper functions (#158178)
Namely `index_get_offsets`, giving thread index computes offsets into
input, output and indices tensors
And `index_apply_indices` applies offests to either input or output
tensor index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158178
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #158064
2025-07-12 18:24:12 +00:00
31326a9ad7 Fix typo in torch.set_float32_matmul_precision docs (#158191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158191
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-07-12 18:23:11 +00:00
a0308edb6c [build] remove wheel from build requirements (#158027)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158027
Approved by: https://github.com/Skylion007
2025-07-12 16:45:51 +00:00
9508d73307 remove allow-untyped-defs from torch/ao/nn/intrinsic/quantized/dynamic/modules/linear_relu.py (#157848)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157848
Approved by: https://github.com/Skylion007
ghstack dependencies: #157847
2025-07-12 15:42:12 +00:00
066bf29334 remove allow-untyped-defs from torch/_higher_order_ops/run_const_graph.py (#157847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157847
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-07-12 15:42:12 +00:00
5221448574 multi-kernel matmuls based on varying hint sizes (#156628)
The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts:

https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/
https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/
https://fb.workplace.com/groups/257735836456307/posts/906589324904285/

Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size:

![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301)

This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case:

![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213)

This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes:

![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1)

Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096:

![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce)

## How to review this PR

At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points:

1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments.
2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels.
3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape.
4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR.

## Results

The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec

Before
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0948   |   0.3124   |   4.9477
    256      |   0.2243   |   0.2256   |   3.3880
    4096     |   0.3384   |   0.3404   |   3.3010
```

After
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0951   |   0.2289   |   3.3013
    256      |   0.0952   |   0.2258   |   3.4045
    4096     |   0.0957   |   0.2231   |   3.3146
```

We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938

![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed)

NB: This is just the beginning and I plan on doing more investigation to see further improve on this initial result.

For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0

HUD benchmark runs:
base: https://github.com/pytorch/pytorch/actions/runs/15889871988
head: https://github.com/pytorch/pytorch/actions/runs/15889876842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628
Approved by: https://github.com/jansel
2025-07-12 15:08:21 +00:00
191693ac85 adding arg values and arg types to Strobelight USDT (#155185)
Summary: This diff makes changes to the USDT added by RihamSelim in D44636587. The "operator_start" USDT passes in the memory addresses of operator arguments and the argument types. This is so we can record argument values and types in the Strobelight GPUEvent Profiler. The previous diff records the ATEN operator, and this diff lays the groundwork to record ATEN op arguments.

Test Plan: I ensured this code builds by running the example in this diff, and testing profiler changes in this diff.

Reviewed By: RihamSelim

Differential Revision: D75606556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155185
Approved by: https://github.com/malfet
2025-07-12 12:00:08 +00:00
aacb944079 [aot inductor] fix clang-asan for consts_cpp. (#158175)
From the perivous PR: https://github.com/pytorch/pytorch/pull/157608 , I added `format_consts_to_cpp` to build consts bytes.

But it still raise clang ASAN `stack alloction`, when build large size consts.

This PR:
1. add `test_aot_inductor_consts_cpp_build` to stack allocation skip list.
2. add ATTRIBUTE_NO_SANITIZE_ADDRESS to skip ASAN check, because consts array is locate in global area.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158175
Approved by: https://github.com/jansel
2025-07-12 07:14:05 +00:00
6b84cb29f9 [dynamo] trace through torch.get_device_module (#157980)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157980
Approved by: https://github.com/anijain2305
2025-07-12 06:25:46 +00:00
7f14b42adf [BE][2/16] fix typos in torch/ (torch/_*/) (#156312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312
Approved by: https://github.com/albanD
2025-07-12 05:47:06 +00:00
e90148c91d Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)"
This reverts commit 4b9a6f7211123511e856ac8c8524bc332a741241.

Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it might contribute to a string of OOM error in trunk ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3064678929))
2025-07-12 04:52:11 +00:00
a529a5daf5 [test][distributed][vllm] stabilize the p2p sharing through ipc (#158089)
vLLM's RLHF integration cf75cd2098/examples/offline_inference/rlhf_utils.py (L93) depends on this hidden feature, adding the test so that PyTorch will not break it in a backward-incompatible way.

The goal is to create p2p shared tensors across devices, say sharing process 0's memory on GPU 0, to process 1's memory space on GPU 1, when GPU 0 and GPU 1 can use GPU direct p2p access.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158089
Approved by: https://github.com/houseroad, https://github.com/ngimel
2025-07-12 04:41:13 +00:00
e15f4248ad Revert "[BE][2/16] fix typos in torch/ (torch/_*/) (#156312)"
This reverts commit 7a92b5119654c07d15f5c0818e6ae804b01e836c.

Reverted https://github.com/pytorch/pytorch/pull/156312 on behalf of https://github.com/XuehaiPan due to landrace ([comment](https://github.com/pytorch/pytorch/pull/156312#issuecomment-3064672250))
2025-07-12 04:40:52 +00:00
9056279f81 don't error out in empty_cache under mempool context (#158152)
Now instead of erroring out on `empty_cache` call during graph capture or under mempool context, we will just silently do nothing. This used to be the behavior for mempools, cudagraphs used to error out, but it's fine to just ignore the call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158152
Approved by: https://github.com/zou3519, https://github.com/eqy
2025-07-12 04:37:05 +00:00
f45f6e86b9 Fix torch._numpy advanced indexing to match NumPy when indices are separated (#157676)
Written with Claude Code.

Fixes https://github.com/pytorch/pytorch/issues/157569
Fixes https://github.com/pytorch/pytorch/issues/158134

 NumPy and PyTorch handle advanced indexing differently when advanced indices are separated by slices (e.g., arr[:, [0], :, 0]). PyTorch uses "outer" indexing placing result dimensions in original positions, while NumPy uses "vectorized"
 indexing moving advanced index dimensions to the front.

This adds _numpy_style_advanced_indexing() to detect separated advanced indices and transpose results to match NumPy's dimension ordering, ensuring torch._numpy maintains compatibility with NumPy's indexing behavior.

Fixes cases like:
- arr[:, [0], :, 0] now returns shape (1, 5, 7) instead of (5, 1, 7)
- arr[:, [0, 1], :, 0] now returns shape (2, 5, 7) instead of (5, 2, 7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157676
Approved by: https://github.com/manuelcandales

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-12 04:35:04 +00:00
9c189ed29a Revert "multi-kernel matmuls based on varying hint sizes (#156628)"
This reverts commit 6c795306378c47341d58109da03371bba2bec46e.

Reverted https://github.com/pytorch/pytorch/pull/156628 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some ROCM jobs went crazy after this lands, so I try to see if reverting helps ([comment](https://github.com/pytorch/pytorch/pull/156628#issuecomment-3064617123))
2025-07-12 03:48:39 +00:00
2eff14c445 [ONNX] Delete torch.onnx.dynamo_export (#158130)
It's deprecated since torch==2.7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158130
Approved by: https://github.com/justinchuby
2025-07-12 02:30:47 +00:00
7a92b51196 [BE][2/16] fix typos in torch/ (torch/_*/) (#156312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312
Approved by: https://github.com/albanD
2025-07-12 01:47:22 +00:00
8b97e4dd8c #IS157973/numpy version issue (#158036)
Fixes #157973

`THPUtils_unpackNumberAsBool` now recognises `numpy.bool_ scalars` explicitly (using `torch::utils::is_numpy_bool`).
If the object is a NumPy boolean, we retrieve its truth value via `PyObject_IsTrue` and return it, avoiding the previous failing path that attempted to treat it as an integer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158036
Approved by: https://github.com/jansel
2025-07-12 01:36:28 +00:00
627ba41136 [DCP][HF] [ez]Change where sharded tensors are saved (#158069)
Summary: Previously was saving sharded tensors to same directory as full tensors. But am realizing this doesn't make sense because on load(), you would be loading for a directory which contains both, with no way to distinguish them, so they should be in separate folders.

Test Plan:
ensure existing tests pass

Rollback Plan:

Differential Revision: D78108144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158069
Approved by: https://github.com/teja-rao
2025-07-12 01:02:17 +00:00
f4406689b8 fix MPCT destroy_pg call (#157952)
I was seeing hangs / exceptions not raising in some cases. Only call `c10d.destroy_process_group()` for `MultiProcessContinuousTest` in the clean exit case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157952
Approved by: https://github.com/fduwjj
ghstack dependencies: #157589
2025-07-12 00:46:19 +00:00
7444debaca Revert "Fix logdet returning finite values for singular matrices on CUDA (#157910)"
This reverts commit 7d4228dbfd13d1ac8fac2c78c042dbb8314f042d.

Reverted https://github.com/pytorch/pytorch/pull/157910 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to fail some internal tests accuracy ([comment](https://github.com/pytorch/pytorch/pull/157910#issuecomment-3064368647))
2025-07-12 00:22:51 +00:00
8c928372b3 Make Q Indices optional (#157997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157997
Approved by: https://github.com/BoyuanFeng, https://github.com/Chillee
2025-07-12 00:16:20 +00:00
22f3347fd9 [MTIA Aten Backend] Change relu / relu_ back to use relu kernel (#158101)
# Context
In D75803582, we migrated relu/relu_ from out-of-tree to pytorch in-tree. With that, we also changed it to use the ATen op-layer logic:
https://www.internalfb.com/code/fbsource/[04ec3fcd0b09b601ae26a785e595ab960a6ba684]/fbcode/caffe2/aten/src/ATen/native/Activation.cpp?lines=512-520

To summarize:
**The behavior before D75803582:**
The Relu operator calls this code(https://fburl.com/code/pezspv40) and launches Relu kernel.

**The behavior after D75803582:**
The Relu operator uses the ATen logic, which delegates to the clamp_min operator, and no longer launch Relu kernel.

-----------------

But according to my discussion with @vvk, we should keep using the Relu kernel, instead of adopting ATen logic that delegates to clamp_min, because MTIA's Relu kernel has special optimization for MTIA device.

# This diff

Change relu / relu_  to launch relu kernel, which is same as the original behavior before D75803582.

Note: this doesn't mean to revert D75803582, because we still want to move relu/relu_ to in-tree.

Differential Revision: [D78109262](https://our.internmc.facebook.com/intern/diff/D78109262/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158101
Approved by: https://github.com/albanD
2025-07-12 00:12:29 +00:00
0d77364ee3 dist2: cleanup non-option methods on PG (missing, timeouts) (#158123)
This updates the ProcessGroup.* API to include timeouts on all non-option based overloaded methods. This also adds 2 missing ones `alltoall_base` and `barrier`.

Following design in: https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89

Test plan:

```
pytest test/distributed/test_dist2.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158123
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2025-07-12 00:06:37 +00:00
f44a9eee47 [AOTI] Add missing ops to set of C-shim ops which can have nullptr returns (#158073)
Most added ops are backwards ops, which have not been well-tested previously (thus why they were missed). Necessary ops were identified by manual examination of torch/_meta_registrations.py return values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158073
Approved by: https://github.com/desertfire
2025-07-11 23:35:26 +00:00
ff7dd1776f [cutlass backend] Global filter ops before situation based filter ops (#157866)
The idea of this PR is that, sometimes we are filtering ops based not based on the node specific information. For example, we always filter out simt ops. So I want to group them together into a global filtering function.

This can help shrink the config space as well. 20s -> 6s for instantiation 3332.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157866
Approved by: https://github.com/ColinPeppler
2025-07-11 23:13:20 +00:00
2a8795a981 [c10d] ProcessGroupGloo: support per operation timeouts (#158128)
This updates ProcessGroupGloo to support per operation timeouts. Previously the timeouts were ignored even if they were set.

* This checks if the timeout is `kUnsetTimeout` and conditionally uses the provided timeout or the default timeout from the context.
* This exposes `set_timeout` as a standard method on ProcessGroup/Backend so we can test the global timeout.

Test plan:

```
pytest test/distributed/test_c10d_gloo.py -v -k allreduce_timeout
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158128
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
2025-07-11 23:09:50 +00:00
a8ec7babcf [dynamo] expand_hints does exc() to expand graph_break_hints (#158078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158078
Approved by: https://github.com/williamwen42
2025-07-11 22:51:28 +00:00
beed033b6e [MPS] Fix index_kernel for large tensors (#158064)
Move `MetalShaderLibrary::bind_tensors` private method to OperatorUtils.h and extract `iter_tensor_offset` method, that returns an offset from the start of the storage associated with given tensor inside the iterator

Migrated `index`, `index_put[_accumulate][_serial]` to the new paradigm that does not require additional tensor for indices nor special handling for 32 vs 64-bit offset, which resulted in almost 2x perf gain for 2000x2000 tensor, see results below before
```
[------------------------------------------------------------  -----------------------------------------------------------]
                                                |  11x50x50  |  11x100x100  |  11x500x500  |  11x1000x1000  |  11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
      __getitem__ (torch.int8, torch.int64)     |   383.5    |    379.8     |    470.9     |     1232.9     |     4410.3
      __getitem__ (torch.float16, torch.int64)  |   379.6    |    354.5     |    533.2     |     1290.3     |     4442.2
      __getitem__ (torch.float32, torch.int64)  |   360.8    |    338.6     |    478.6     |     1348.9     |     4870.4

Times are in microseconds (us).
```
and after
```
[------------------------------------------------------------  -----------------------------------------------------------]
                                                |  11x50x50  |  11x100x100  |  11x500x500  |  11x1000x1000  |  11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
      __getitem__ (torch.int8, torch.int64)     |   349.8    |    330.5     |    432.6     |     764.5      |     1961.2
      __getitem__ (torch.float16, torch.int64)  |   342.5    |    330.7     |    434.7     |     741.0      |     1969.4
      __getitem__ (torch.float32, torch.int64)  |   332.2    |    326.1     |    445.4     |     751.3      |     1972.6

Times are in microseconds (us).
```

While migrating also fixed index_put_accumulate for boolean types, by using compare_and_exchange trick over uint

Fixes https://github.com/pytorch/pytorch/issues/153560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064
Approved by: https://github.com/dcci
2025-07-11 22:35:44 +00:00
93854e83b7 [DTensor] Rewrite doc of TupleStrategy (#158132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158132
Approved by: https://github.com/XilunWu
2025-07-11 22:08:57 +00:00
4b9a6f7211 [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.

**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-11 21:07:57 +00:00
4ff9b7fa31 Fix diagnostic message for CUDA version mismatch in cuda.cmake (#157370)
This PR fixes  #157354

It fixes the issue in 'cmake/public/cuda.cmake' where a diagnostic message incorrectly showed an empty CUDA version when 'FindCUDA' and header-reported versions differed.

The problem was caused by this line:

set(${cuda_version_from_findcuda} ${CUDA_VERSION_STRING})

This incorrectly used the value of cuda_version_from_findcuda as a variable name. As a result the version string wasn't assigned and the error message omitted the version. This has been corrected to:

set(cuda_version_from_findcuda ${CUDA_VERSION_STRING})

Now the diagnostic message properly displays the CUDA version reported by FindCUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157370
Approved by: https://github.com/soulitzer
2025-07-11 20:58:35 +00:00
eqy
00ae620b9f [CUDA] Allow cuDNN or flash attn in test_activation_checkpointing pattern match check (#153272)
Seems more robust than maintaining a mirror of dispatch condition based on compute capability etc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153272
Approved by: https://github.com/soulitzer
2025-07-11 20:58:12 +00:00
702a304b07 Revert "[CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit 9a5278225fc5e7b46d54a65ae1a3f049ee49824f.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/ngimel due to breaks 525 driver installs ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-3063742807))
2025-07-11 20:36:36 +00:00
eqy
9963845a4e [CUDA] Support family-conditional compute capabilies in TORCH_CUDA_ARCH_LIST (#157999)
Similar to arch-conditionals, such as 9.0a  and 10.0a, family conditionals such as 10.0f enable features specific to a family of architectures, such as between sm100 and sm103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157999
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-11 20:34:59 +00:00
6c79530637 multi-kernel matmuls based on varying hint sizes (#156628)
The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts:

https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/
https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/
https://fb.workplace.com/groups/257735836456307/posts/906589324904285/

Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size:

![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301)

This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case:

![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213)

This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes:

![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1)

Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096:

![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce)

## How to review this PR

At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points:

1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments.
2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels.
3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape.
4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR.

## Results

The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec

Before
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0948   |   0.3124   |   4.9477
    256      |   0.2243   |   0.2256   |   3.3880
    4096     |   0.3384   |   0.3404   |   3.3010
```

After
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0951   |   0.2289   |   3.3013
    256      |   0.0952   |   0.2258   |   3.4045
    4096     |   0.0957   |   0.2231   |   3.3146
```

We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938

![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed)

NB: This is just the beginning and I plan on doing more investigation to see further improve on this initial result.

For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0

HUD benchmark runs:
base: https://github.com/pytorch/pytorch/actions/runs/15889871988
head: https://github.com/pytorch/pytorch/actions/runs/15889876842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628
Approved by: https://github.com/jansel
2025-07-11 19:38:10 +00:00
bd364c901d Fix serialization of nans in torch.export (#155359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155359
Approved by: https://github.com/angelayi
2025-07-11 19:33:15 +00:00
b487003182 [PyTorch Core] MTIA supports arbitrary strides (#157883)
Summary:
Currently, on MTIA the following case will return false

```
options.device().supports_as_strided()
```
As a result, whenever moving a tensor from CPU to MTIA, strides will not be preserved ([see here](e5edd013ab/aten/src/ATen/native/TensorConversions.cpp (L351))). This is a primary reason why deserializing tensors from .pt files will be contiguous.

Reviewed By: egienvalue, andyanwang

Differential Revision: D77843224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157883
Approved by: https://github.com/albanD, https://github.com/andyanwang
2025-07-11 18:54:21 +00:00
cyy
b0556110e5 Remove unsafe PyTorchError constructor (#154961)
Use libfmt in call sites of PyTorchError.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154961
Approved by: https://github.com/albanD
2025-07-11 18:22:53 +00:00
1cb0597a89 [PyTorch] Deprecate numpy serialization for MTIA (#157884)
Summary:
NumPy based tensor rebuilding from serialization has been deprecated by other backends (eg. [XLA](https://github.com/pytorch/pytorch/pull/137444)). The new flow has CPU storage being constructed with data from the file and then moved to the target backend device.

Furthermore, relying on numpy for serialization will fail loudly when torch.load flips weights_only.

Reviewed By: andyanwang

Differential Revision: D77843238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157884
Approved by: https://github.com/albanD
2025-07-11 17:57:33 +00:00
157683d862 [Reducer] Remove custom handling of view tensors for MTIA (#157882)
Summary: Following implementation of the updated ATen Backend for mtia, and diffs enabling in tree view ops (D75266206, D75385411), we can remove custom logic from reducer to handle MTIA view operations.

Test Plan:
CI

Rollback Plan:

Reviewed By: egienvalue

Differential Revision: D77843212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157882
Approved by: https://github.com/albanD, https://github.com/andyanwang
2025-07-11 17:56:45 +00:00
92ee5bd9f6 Revert "[DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#157216)"
This reverts commit d75d30eeb610b164e69d0678a2e2b2dea81eec0f.

Reverted https://github.com/pytorch/pytorch/pull/157216 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it turns out that the internal failure was legit ([comment](https://github.com/pytorch/pytorch/pull/157216#issuecomment-3063075001))
2025-07-11 17:07:26 +00:00
c4cdcda754 [aot] add format_consts_to_cpp function for further development. (#157608)
Changes:
1. Split `format_consts_to_asm` function, which is current way to convert consts to object.
2. Add `format_consts_to_cpp` function, which would support for more compiler support, such as `msvc` and `icx`.
3. Add `config.aot_inductor.use_consts_asm_build` for `format_consts_to_asm` and `format_consts_to_cpp` control.
4. Add UT for `format_consts_to_cpp`.

For `format_consts_to_cpp`, I have local tested it:
Case: https://docs.pytorch.org/docs/main/torch.compiler_aot_inductor.html
Run it and `cat` cpp code:
<img width="674" alt="image" src="https://github.com/user-attachments/assets/d47ccf84-06d2-47f5-8a0d-9a43a9020aa3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157608
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-07-11 17:02:41 +00:00
bb3c911c2d [DTensor] support split op on Partial placement (#157991)
**Summary**
To enable use case where the input DTensor to `split` op has `Partial()` placement,
this PR treats `Partial()` in the same way with `Replicate()`. That means, `split` op
only unshards the `Shard(dim=x)` if `x == split_dim` and keep other placement
untouched.

**Test**
Added a new test because `test_dtensor_ops` doesn't test `Partial()` placement.
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157991
Approved by: https://github.com/zpcore
2025-07-11 16:19:31 +00:00
1f1f22991d Restore fake device (#157972)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157972
Approved by: https://github.com/ezyang
2025-07-11 16:12:01 +00:00
27c50799c1 Use new cuBLAS row-wise fp8 matmul for scaled-mm (#157905)
Most of the work had already been done by @jeffdaily in #154680, but there was one remaining check that needed to be modified in order for `torch._scaled_mm` to use cuBLAS over CUTLASS when available.

I tested this change by rebuilding PyTorch locally with CUDA 12.9 and ran `torch._scaled_mm` under the profiler, and observed that the kernel being launched is called `nvjet_qqtst_128x128_128x6_1x1_h_bz_coopA_algo2_ovscale_TNT` (where `ovscale` stands for "outer vector scaling", I believe, which is how cuBLAS calls this scaling mode).

I then benchmarked the new kernels against the old CUTLASS ones on a standard 700W H100 GPU. I used the same approach as in #134781, and obtained these speed-ups:
![image](https://github.com/user-attachments/assets/43dfb816-9ccf-40c5-8b2a-571ce9cb511d)
![image](https://github.com/user-attachments/assets/be7ac6f2-e16c-479b-ad5c-f8039caba4b1)

We see that the two kernels perform very closely (I'm surprised, I would have expected cuBLAS to outperform CUTLASS across the board), with some thin/skewed shapes becoming worse but some very large shapes becoming better.

I guess the questions are whether we consider this a net-zero change (given that there's improvements _and_ degradations), and how large we consider the burden of maintaining our own CUTLASS kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157905
Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/drisspg
2025-07-11 16:11:55 +00:00
0797b2b6a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-11 16:07:54 +00:00
7a08755c5f [BE][Ez]: Update ruff to 0.12.2 (#157937)
Updates to the latest version of ruff and apply some fixes that it flagged and silence a few new lints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157937
Approved by: https://github.com/ezyang
2025-07-11 15:16:20 +00:00
0d17029fea [BE][6/6] fix typos in test/ (test/distributed/) (#157640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157640
Approved by: https://github.com/yewentao256, https://github.com/malfet
2025-07-11 14:09:37 +00:00
4283d96bcd [build] pin setuptools>=70.1.0 for integrated bdist_wheel command (#157783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157783
Approved by: https://github.com/Skylion007
2025-07-11 12:10:42 +00:00
b4476ca378 Add cudaMallocAsync/cudaFreeAsync to cuda_to_hip_mappings (#158056)
Summary: Adding both functions as they're required for Hipification of https://fburl.com/code/165r7qhr

Test Plan:
Tested in D78090513

Rollback Plan:

Reviewed By: malfet, jiangyurong609

Differential Revision: D78090693
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158056
Approved by: https://github.com/Skylion007
2025-07-11 11:48:19 +00:00
85857181eb Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
2025-07-11 11:41:34 +00:00
03b307575a Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We would deprecate those option that overleap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
2025-07-11 11:25:43 +00:00
8088958793 port 4 dynamo test files to Intel GPU (#157779)
For https://github.com/pytorch/pytorch/issues/114850, we will port test cases to Intel GPU. Six dynamo test files were ported in PR [#156056](https://github.com/pytorch/pytorch/pull/156056) and [#156575](https://github.com/pytorch/pytorch/pull/156575.) In this PR we will port 4 more dynamo test files.
We could enable Intel GPU with following methods and try the best to keep the original code styles:

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- added XPU support in decorators like @requires_gpu
- enabled XPU for some test path
- added xfailIfXPU to skip xpu test when there is a bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157779
Approved by: https://github.com/guangyey, https://github.com/jansel
2025-07-11 10:11:49 +00:00
e1a20988f3 [Quant][CPU] Enable fp8 qconv (#157076)
**Summary**
Enable fp8 qconv on CPU. It's part of the plan to enable fp8 static quantization on CPU. This PR only adds FP8 support of the existing int8 qconv op. It does not add a new op nor does it affect frontend or quantization flow. The schema of the qconv op is not changed either.

So, the FP8 qconv shares the same op as INT8 qconv and the difference is that src/wei dtype is fp8 instead of int8. The output dtype can be fp8/float32/bfloat16. The implementation uses the oneDNN library.

Note:
OneDNN does not support quantized fp8 convolution until v3.9 but the version used in PyTorch is v3.7.2. So, the op goes to the reference kernel for now. And we have also update the oneDNN path so that it's compatible with the fp8 dtype. Once oneDNN is upgraded to v3.9 or newer, minimum changes are needed to enable the oneDNN path. And we have ensured that the behavior of the reference kernel is the same as the new oneDNN's implementation.
- oneDNN version < 3.9 (now)
  - Always go to the reference kernel
- oneDNN version >= 3.9 (future)
  - Go to reference kernel on old platforms (without AMX)
  - Use oneDNN on new platforms (with AMX)

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k "qconv and fp8"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157076
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-07-11 10:00:57 +00:00
ed508cc018 [inductor][triton] Add experimental use_tensor_descriptor config option (#157906)
Refactor to allow TMA descriptors to be used in general codegen. TMA descriptors can only be generated if the conditions listed in the triton documentation for [make_tensor_descriptor](https://triton-lang.org/main/python-api/generated/triton.language.make_tensor_descriptor.html) are met.

Some implementation details:
- The `TMACompatibilityChecker` class holds and checks the conditions required for a load / store operation to be represented by a tma descriptor load / store
- The current TMA API requires that the innermost block size loads atleast 16 bytes of data. e.g. if the block shape is [YBLOCK, XBLOCK] and the tensor dtype is float32, this requires that XBLOCK >= 4. It is therefore required that the triton heuristics are aware of the minimum block sizes for the IO operations in the kernel. The minimum block sizes are determined in the `TMACompatibilityChecker` class and are passed to the triton heuristics when the block sizes are not static. The heuristic config options are then filtered to ensure that the minimum block size restriction is met.

Testing:
- Refactored test_torchinductor_strided_blocks.py to also test the `use_tensor_descriptor` option.

This requires an upgrade to Triton version 3.4.0: https://github.com/pytorch/pytorch/issues/154206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157906
Approved by: https://github.com/jansel
2025-07-11 09:32:40 +00:00
02724b5f64 [Bugfix][Inductor] Fix dependency list merged incorrectly for a custom op with multiple mutated inputs and None return type. (#157133)
This is an attempt to fix a memory allocation issue when using `torch.compile` with a custom layernorm kernel in vllm:
```C++
  // In-place fused Add and RMS Normalization.
  ops.def(
      "fused_add_rms_norm(Tensor! input, Tensor! residual, Tensor weight, "
      "float epsilon) -> ()");
  ops.impl("fused_add_rms_norm", torch::kCUDA, &fused_add_rms_norm);
```
We observed abnormal extra memory allocations with this op enabled using `torch.compile`:
<img width="738" alt="{374E9FCF-FB46-4750-8B60-D31E3ADCE00A}" src="https://github.com/user-attachments/assets/6c45e1aa-ccde-4c56-99dc-bf4776d699d5" />
and without this op:
<img width="738" alt="{9BB08EFE-FFE3-4D06-82C0-C70BBE6ADD56}" src="https://github.com/user-attachments/assets/56e2ee43-ab87-492d-834c-69e9cafbb0df" />

After investigation, we found that this is because the compiler considers the two buffers for the two mutated inputs `Tensor input` and `Tensor residual` should share a same dependency list, which makes it can not reuse the buffer of `Tensor input`.
```
buf1.users = [
        NodeUser(node=ExternKernelSchedulerNode(name='op2'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op9'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op13'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=False),
    ]
buf16.users = [
        NodeUser(node=ExternKernelSchedulerNode(name='op2'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op9'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op13'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=False),
    ]
```
```
op13: ExternKernelSchedulerNode(FallbackKernel)
op13.writes =
    [   StarDep(name='buf17', mode=None),
        StarDep(name='buf18', mode=None),
        StarDep(name='buf19', mode=None)]
op13.unmet_dependencies =
    [   StarDep(name='buf13', mode=None),
        StarDep(name='buf16', mode=None),
        WeakDep(name='buf11', mutating_buf='buf18'),
        WeakDep(name='buf12', mutating_buf='buf18'),
        WeakDep(name='buf13', mutating_buf='buf18'),
        WeakDep(name='buf2', mutating_buf='buf18'),
        WeakDep(name='buf3', mutating_buf='buf18')]
op13.met_dependencies = [StarDep(name='arg11_1', mode=None)]
op13.outputs = [
    buf17: FallbackKernel
    buf17.layout = NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0])
    buf17.aliases = ['buf16', 'buf1']
    buf17.users = [
        NodeUser(node=ExternKernelSchedulerNode(name='op2'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op9'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op13'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=False),
    ]
    buf18: MutationOutput
    buf18.layout = NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0])
    buf18.mutations = ['buf16']
    buf18.users = [
        NodeUser(node=ExternKernelSchedulerNode(name='op14'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=True),
    ]
    buf19: MutationOutput
    buf19.layout = NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0])
    buf19.mutations = ['buf1']
    buf19.users = [NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False)]
]
op13.node.kernel = torch.ops._C.fused_add_rms_norm.default
```
Here we can see `buf16` shares the same dependency list with `buf1` because `buf16` and `buf1` are in the aliases list of `buf17`. This is incorrect since those two are two separate tensors. And this makes the compiler could not reuse `buf16` for subsequent ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157133
Approved by: https://github.com/jansel
2025-07-11 09:06:31 +00:00
44303caabf [APS] Expose max_autotune lookup table config to frontend (#158070)
Summary: As titled. We reuse optimus config to receive the yaml config file from users

Test Plan:
### how to enable max_autotune lookup table hardcode config

```
            inductor.config.post_grad_fusion_options = {
                "inductor_autotune_lookup_table":  <your yaml manifold path>
            }
```
for example, "manifold://ads_training_p9e/tree/max_autotune/mast_omnifm_v3_1kgpu/mast_omnifm_v3_lookup_table.yaml",

see D78052050

Rollback Plan:

Reviewed By: PaulZhang12, jackiexu1992

Differential Revision: D77202285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158070
Approved by: https://github.com/Mingming-Ding
2025-07-11 09:02:52 +00:00
11d6ad8b2e [Docs] Update PT2 Profiler Torch-Compiled Region Image (#158066)
Summary: In Pytorch 2.5 we added source code attribution to PT2 traces. Each Torch-Compiled Region will now have its frame id and frame compile id associated with it. Update the image in the doc and add a description of this in the doc itself

Test Plan:
{F1980179183}

Rollback Plan:

Differential Revision: D78118228

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158066
Approved by: https://github.com/aaronenyeshi
2025-07-11 07:56:45 +00:00
cd80f9a4c3 xpu: support custom ops with torch.library on xpu backend (#152879)
Fixes: https://github.com/intel/torch-xpu-ops/issues/1626

This PR started enabling of tests for `torch.library`, but more work is needed. Tests are using `torch._custom_ops` deprecated API planned for removal at pytorch 2.6 (not done). I think cleanup of pytorch would be nice before enabling more tests for xpu.
a2ccda3c60/torch/_custom_op/impl.py (L47)

CC: @EikanWang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152879
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/guangyey, https://github.com/albanD
2025-07-11 07:36:04 +00:00
442aca44d6 Fix XPU broken CI (#158092)
# Motivation
https://github.com/pytorch/pytorch/pull/157739 introduces the new UT `test_sdpfa` that block XPU CI since `_scaled_dot_product_flash_attention is not supported on XPU yet`.

# Additional Context
See https://github.com/pytorch/pytorch/actions/runs/16201010860/job/45741815895?pr=138222#step:15:6399
fix https://github.com/pytorch/pytorch/issues/158095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158092
Approved by: https://github.com/jansel, https://github.com/malfet
2025-07-11 07:23:27 +00:00
d89f30ad45 [MPS] Avoid calling tensor ops in max_pool3d impl (#157874)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157874
Approved by: https://github.com/malfet
2025-07-11 06:47:29 +00:00
b4fc42ca80 Add torch.segment_reduce docs (#154352)
Fixes #153138

## Test Result

![image](https://github.com/user-attachments/assets/62346d62-d048-4259-906b-f8261e10b4cc)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154352
Approved by: https://github.com/albanD
2025-07-11 06:16:38 +00:00
cec59b76ca [2/N] cost coverage improvment (#157738)
Part of plan https://github.com/pytorch/pytorch/issues/157495.

Details:
1. Fill in missing redistribute_cost in `cat` and `slice_scatter`;
2. Expand the `cat` strategy based on placement of each input tensor. Previously `cat` only outputs one strategy. Now it output at the level of number_of_input_tensor*number_OpSpec_each_tensor_input_strategy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157738
Approved by: https://github.com/wconstab
2025-07-11 05:54:16 +00:00
ecd73c58ee Revert "[BE] Replace std::runtime_error with TORCH_CHECK [2/N] (#152080)"
This reverts commit b85f10ea5006e8ae8fc769f48659ab7ad5eafb69.

Reverted https://github.com/pytorch/pytorch/pull/152080 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some internal tests ([comment](https://github.com/pytorch/pytorch/pull/152080#issuecomment-3060337857))
2025-07-11 03:58:31 +00:00
94995eba07 [Log] add a hook for recompile user context (#157961)
Users may want compile-related but customized logging info to dynamo_compile. One example is to logging the current training iteration index when recompilation happens. In general, current training iteration index is not available to compiler, since the same compiled function may be called multiple times in the same training iteration. The user could provide the training iteration index in a user hook where torch.compile logs it when recompilation happens.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157961
Approved by: https://github.com/masnesral
2025-07-11 03:41:33 +00:00
11a86ad2fa Remove pytorch quant docs since we are moving to torchao (#157766)
Summary:
att

Test Plan:
doc page generated from CI

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157766
Approved by: https://github.com/Skylion007
2025-07-11 03:21:47 +00:00
dd93883231 [exported_program] Remove _postprocess_graph_module_outputs (#158059)
Summary: Appears to be dead as of https://github.com/pytorch/pytorch/pull/120019.

Test Plan:
CI

Rollback Plan:

Differential Revision: D78112302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158059
Approved by: https://github.com/angelayi
2025-07-11 02:40:15 +00:00
326e751d07 [AOTI] Add device guard when launching autotune kernels (#158034)
Summary: Fix https://github.com/pytorch/pytorch/issues/157737. When launching Triton kernels in the autotune block, we need to consider the fact that the model may not always be on device 0. The reason this was not caught on CI is because test_on_gpu_device1 requires multi_gpu and was not run on a multi_gpu instance. Added test_on_gpu_device1 and other similar multi_gpu tests back.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158034
Approved by: https://github.com/eqy, https://github.com/yushangdi
2025-07-11 02:34:31 +00:00
7d4228dbfd Fix logdet returning finite values for singular matrices on CUDA (#157910)
Fixes https://github.com/pytorch/pytorch/issues/154312

Fix logdet returning finite values for singular matrices on CUDA (https://github.com/pytorch/pytorch/issues/154312
https://github.com/pytorch/pytorch/issues/154312)

PyTorch's logdet function returns mathematically incorrect finite values for
singular matrices on CUDA devices instead of the expected -inf. This occurs
because cuSOLVER and LAPACK produce tiny non-zero diagonal elements (~1e-16)
instead of exact zeros for singular matrices.

**Problem:**
Issue https://github.com/pytorch/pytorch/issues/154312 matrix returns finite values instead of -inf for singular matrices.

**Solution:**
Implemented NumPy-style two-tier singularity detection with GPU sync point removal:

1. **Primary detection**: Use LAPACK's built-in singularity detection via info parameter
2. **Backup detection**: Apply threshold-based detection for numerical edge cases
3. **Zero GPU sync points**: Eliminated all .item(), std::get<0>(), and scalar extractions
4. **Pure tensor operations**: All computations use tensor operations throughout

**Performance Impact:**
Based on comprehensive benchmarking across matrix sizes and data types:

- **Overall Impact**: 0.85× average speedup (+18.0% overhead)
- **CPU Performance**: 0.84× average speedup (+18.8% overhead)
- **CUDA Performance**: 0.85× average speedup (+17.3% overhead)

**Performance Trade-offs:**
- **Small matrices (16×16, 64×64)**: Higher overhead due to tensor operation setup costs
- **Large matrices (512×512, 2048×2048)**: Near-zero overhead, with some cases showing slight improvements
- **GPU sync elimination**: Removes expensive GPU→CPU synchronization bottlenecks

**Results:**
-  All singular matrices now correctly return -inf on both CPU and CUDA
-  Original issue https://github.com/pytorch/pytorch/issues/154312 matrix now works correctly
-  Results match NumPy's slogdet behavior exactly
-  Zero GPU synchronization points for improved performance
-  Comprehensive edge case testing added

**Verification:**
Before: torch.linalg.slogdet(singular_matrix) → finite values (incorrect)
After:  torch.linalg.slogdet(singular_matrix) → (sign=0, logabsdet=-inf) 

The implementation uses pure tensor operations to eliminate GPU sync points while
maintaining robust singularity detection through a two-tier approach.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157910
Approved by: https://github.com/lezcano, https://github.com/IvanYashchuk, https://github.com/albanD

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-11 02:23:46 +00:00
65fcca4f8c Enable AcceleratorAllocatorConfig key check (#157908)
# Motivation
Add a mechanism to ensure raise the key if the key is unrecognized in allocator config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157908
Approved by: https://github.com/albanD
ghstack dependencies: #149601
2025-07-11 02:11:08 +00:00
905b084690 Add size_hints to cache key (#158026)
Differential Revision: D78089705

Previously to support overriding autotune configs for post fusion kernels in Inductor with a lookup table, we only keyed on the source code. However, the same source code could have multiple optimal configs, due to the input sizes. With this, we have many collisions in our lookup table, leading to subpar configs. A way around this is to add the size_hints to the lookup key as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158026
Approved by: https://github.com/jansel
2025-07-11 01:47:50 +00:00
37ccc532f7 Update'unit_batch_dynamic_prepacked' tests to use ASSERT_NEAR instead of ASSERT_EQ (#157860) (#157861)
Summary:

Replaced ASSERT_FLOAT_EQ which defaults to fixed kMaxUlps ( = 4-ULP , See gtest-internal.h) with ASSERT_NEAR which lets us set epsilon to 1e-3, (approximately 3 ULPs). This allows for slightly stricter and tunable comparison.

Test Plan:
**Before Fix**

✗ Fail:
qnnpack:pytorch_qnnpack_testApple - FULLY_CONNECTED_SPARSE_OP_8x1/unit_batch_dynamic_prepacked (0.0s)
'Expected equality of these values:
  output_dynamic[i * outputChannels() + c]
    Which is: 9.9160004
  accumulators_float[i * outputChannels() + c]
    Which is: 9.9159956
at 0, 17: reference = 9.9159955978393555, optimized = 9.9160003662109375

------------------------------

**After Fix**

Everything passes

Rollback Plan:

Differential Revision: D77911682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157861
Approved by: https://github.com/kimishpatel, https://github.com/lucylq, https://github.com/malfet
2025-07-11 01:05:50 +00:00
7599bebead Add CPython test test_itertools (#156981)
Test the itertools module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156981
Approved by: https://github.com/zou3519
ghstack dependencies: #157799, #157800, #157801, #157802
2025-07-11 00:12:50 +00:00
397ca98510 Add CPython test test_with (#157802)
Test with statement behavior and dunder methods __enter__ and __exit__
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157802
Approved by: https://github.com/zou3519
ghstack dependencies: #157799, #157800, #157801
2025-07-11 00:12:50 +00:00
4809f43867 Add CPython test test_numeric_tower (#157801)
Test abstract numeric types and dunder methods like __int__, __float__, __index__, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157801
Approved by: https://github.com/zou3519
ghstack dependencies: #157799, #157800
2025-07-11 00:12:50 +00:00
0ebf2447da Add CPython test test_operator (#157800)
Test operators via operator module like add, sub, eq, lt, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157800
Approved by: https://github.com/zou3519
ghstack dependencies: #157799
2025-07-11 00:12:50 +00:00
91041f559d Add CPython test test_bool (#157799)
Test dunder methods `__bool__` and `__len__`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157799
Approved by: https://github.com/zou3519, https://github.com/XuehaiPan
2025-07-11 00:12:50 +00:00
ae86e8f6c8 [1/N] cost coverage improvment (#157504)
Part of plan https://github.com/pytorch/pytorch/issues/157495.

Details:
1. Fill missing redistribute_cost for ops like `aten::detach`, `aten::bernoulli `, `aten::_to_copy`, `aten::bucketize.Tensor`, `aten::stack`, `aten::clone`, `aten::copy_`, `aten::zero_ `.
2.  Fix redistribute_cost error in new_factory_strategy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157504
Approved by: https://github.com/wconstab
2025-07-10 23:55:45 +00:00
8b68e5b1bb [ROCm][Inductor][CK] update API for gemm-multiD change (#156122)
Fixes for the compilation errors in the generated code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156122
Approved by: https://github.com/chenyang78
2025-07-10 23:12:20 +00:00
e517066f41 Revert "[dynamo][fsdp] Consistent behavior of int attributes (#157262)"
This reverts commit 178fe7aa98987111a73534375099f4ad255e8b59.

Reverted https://github.com/pytorch/pytorch/pull/157262 on behalf of https://github.com/huydhn due to This fails some internal tests and needs to be relanded ([comment](https://github.com/pytorch/pytorch/pull/157262#issuecomment-3059463896))
2025-07-10 23:11:18 +00:00
1a195bf7d6 Tests for #158030 (#158033)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158033
Approved by: https://github.com/bdhirsh, https://github.com/albanD
ghstack dependencies: #158030
2025-07-10 22:51:28 +00:00
bfcababbcb [OrderedDict] Implement explicit OrderedDict dunder method call (#154943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154943
Approved by: https://github.com/zou3519
ghstack dependencies: #154003, #154793, #154794, #154942
2025-07-10 22:50:39 +00:00
ba71eb496b [dict] Implement dict.__eq__ and dict.__ne__ (#154942)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154942
Approved by: https://github.com/zou3519
ghstack dependencies: #154003, #154793, #154794
2025-07-10 22:50:39 +00:00
ba8d19ec02 [dict] Allow Dynamo to trace through explicit dict dunder method call (#154794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154794
Approved by: https://github.com/mlazos
ghstack dependencies: #154003, #154793
2025-07-10 22:50:39 +00:00
57d64298a0 [dict] Add dict.popitem (#154793)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154793
Approved by: https://github.com/mlazos, https://github.com/zou3519
ghstack dependencies: #154003
2025-07-10 22:50:39 +00:00
e84710d1e7 [dict] Raise TypeError in dict methods (#154003)
Raise TypeError in the following scenarios:
* #args mismatch
* arg is unhashable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154003
Approved by: https://github.com/mlazos, https://github.com/zou3519
2025-07-10 22:50:39 +00:00
9bf41633d7 Allow Custom Time Unit When Printing Profiler Table (#157913)
## Overview
This PR adds a kwarg to the `table()` method of the profiler allowing users to specify a time unit to be used for all results in the profiling table. The available options are: `s`, `ms` and `us`. If an invalid unit or no unit is provided, then a time unit is selected based on the size of the value (current default behaviour).

## Testing
A unit test has been added to verify this works correctly.

## Documentation
I couldn't find any documentation specific to the `table()` function beyond doc strings which have been updated.

## Example Output
```
import torch
from torch.profiler import profile

with profile() as prof:
    res = torch.mm(torch.rand(1024, 1024), torch.rand(1024, 1024))

print(prof.key_averages().table(time_unit="s"))
print(prof.key_averages().table(time_unit="ms"))
print(prof.key_averages().table(time_unit="us"))
print(prof.key_averages().table())

```

```
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
            aten::rand         0.04%        0.000s        10.36%        0.014s        0.007s             2
           aten::empty         0.04%        0.000s         0.04%        0.000s        0.000s             2
        aten::uniform_        10.27%        0.014s        10.27%        0.014s        0.007s             2
              aten::mm        89.64%        0.119s        89.64%        0.119s        0.119s             1
    aten::resolve_conj         0.00%        0.000s         0.00%        0.000s        0.000s             3
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 0.133s

----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
            aten::rand         0.04%       0.055ms        10.36%      13.735ms       6.868ms             2
           aten::empty         0.04%       0.054ms         0.04%       0.054ms       0.027ms             2
        aten::uniform_        10.27%      13.626ms        10.27%      13.626ms       6.813ms             2
              aten::mm        89.64%     118.892ms        89.64%     118.896ms     118.896ms             1
    aten::resolve_conj         0.00%       0.004ms         0.00%       0.004ms       0.001ms             3
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 132.631ms

----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
            aten::rand         0.04%      55.495us        10.36%   13735.202us    6867.601us             2
           aten::empty         0.04%      54.121us         0.04%      54.121us      27.061us             2
        aten::uniform_        10.27%   13625.586us        10.27%   13625.586us    6812.793us             2
              aten::mm        89.64%  118892.284us        89.64%  118895.981us  118895.981us             1
    aten::resolve_conj         0.00%       3.697us         0.00%       3.697us       1.232us             3
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 132631.183us

----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
            aten::rand         0.04%      55.495us        10.36%      13.735ms       6.868ms             2
           aten::empty         0.04%      54.121us         0.04%      54.121us      27.061us             2
        aten::uniform_        10.27%      13.626ms        10.27%      13.626ms       6.813ms             2
              aten::mm        89.64%     118.892ms        89.64%     118.896ms     118.896ms             1
    aten::resolve_conj         0.00%       3.697us         0.00%       3.697us       1.232us             3
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 132.631ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157913
Approved by: https://github.com/sraikund16
2025-07-10 22:44:34 +00:00
83700b4488 dist2: add group context manager (#157988)
This adds new context manager based PG management to dist2. This allows for managing the active process group much in the same way as a stream

```py
with dist2.process_group(pg):
   dist2.current_process_group().allreduce(...).wait()
```

matches

```py
with torch.cuda.stream(stream):
    torch.cuda.current_stream().synchronize()
```

Test plan:

```
pytest test/distributed/test_dist2.py -k context
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157988
Approved by: https://github.com/fduwjj
2025-07-10 22:30:19 +00:00
fca7013f85 Fix DCE eliminating random operations by improving is_impure() (#151524) (#157981)
DCE was incorrectly eliminating unused random operations like torch.rand() that have global RNG side effects, causing inconsistent results between eager and compiled execution modes.

**Root cause**: Python random functions (torch.rand, torch.randn, etc.) don't have the _nondeterministic_seeded attribute, so node.is_impure() returns False, allowing DCE to eliminate them despite advancing global RNG state.

**Solution**: Enhanced is_impure() in torch/fx/node.py to recognize Python random functions and mark them as impure when they use global RNG, regardless of the impure_random parameter setting. This ensures consistency between eager and compiled execution even when config.fallback_random=False.

**Key features**:
- Handles comprehensive list of random functions: rand, randn, randint, randperm, rand_like, randn_like, randint_like, normal, poisson, bernoulli, multinomial
- Generator optimization: Only marks as impure when using global RNG (no generator or generator=None). Operations with explicit generators don't affect global state and can be optimized.
- Works with both impure_random=True and impure_random=False cases
- Cleaner architecture: addresses root cause rather than working around it

**Tests**: Enhanced test_impure_random to verify both FX tracing and AOT compilation codepaths, ensuring random operations are preserved and eager/compiled execution consistency is maintained.

🤖 Generated with [Claude Code](https://claude.ai/code)

Fixes https://github.com/pytorch/pytorch/issues/151524

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157981
Approved by: https://github.com/mlazos

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-10 22:24:29 +00:00
590607c599 [cuDNN][SDPA] Bump cuDNN frontend submodule version to 1.12.1 (#158044)
Really we are just interested in this change which fixes an apparent regression for d=256 support on Hopper bc5f4fd88d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158044
Approved by: https://github.com/Skylion007
2025-07-10 22:01:18 +00:00
5f1225ef48 [EZ][BE] Delete redundant header (#157966)
Not sure why it was there in the first place. And why `Indexing.m`` needed to include QScheme.h is also unclear
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157966
Approved by: https://github.com/Skylion007
2025-07-10 21:59:36 +00:00
96897e721b Return false in statically_known_multiple_of if numerator has more than 20 unique symbols (#157855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157855
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #155590, #157845
2025-07-10 21:00:57 +00:00
d7e0098bf3 Fix is_unaligned usage of statically_known_true (#157845)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157845
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #155590
2025-07-10 21:00:57 +00:00
76ca23c41c [dynamo] Add FakeProcessGroup support for fx_graph_runnable with distributed collectives (#157162)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Summary:
- Modified generate_compiler_repro_string() to automatically detect distributed operations and inject FakeProcessGroup setup code
- Added distributed collective tests in test/dynamo/test_fx_graph_runnable.py using FakeProcessGroup API to test distributed collective operations
- Generated fx_graph_runnable code now runs successfully standalone when containing distributed operations

```import os
os.environ['TORCHINDUCTOR_CACHE_DIR'] = '/var/folders/fd/kcv8m1kn0lqgxz42wvgr46sc0000gn/T/torchinductor_skarjala'

import torch
from torch import tensor, device
import torch.fx as fx
from torch._dynamo.testing import rand_strided
from math import inf
import torch._inductor.inductor_prims
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

import torch._dynamo.config
import torch._inductor.config
import torch._functorch.config
import torch.fx.experimental._config

torch._functorch.config.functionalize_rng_ops = False
torch._functorch.config.fake_tensor_allow_unsafe_data_ptr_access = True
torch._functorch.config.unlift_effect_tokens = True

isolate_fails_code_str = None

# torch version: 2.9.0a0+gitf23d314
# torch cuda version: None
# torch git version: f23d31463ca452918e23063409a2bdc55efc0d46

# torch.cuda.is_available()==False, no GPU info collected

from torch.nn import *
class Repro(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, arg0_1):
        all_reduce = torch.ops._c10d_functional.all_reduce.default(arg0_1, 'sum', '0')
        wait_tensor = torch.ops._c10d_functional.wait_tensor.default(all_reduce);  all_reduce = None
        mul = torch.ops.aten.mul.Tensor(wait_tensor, 2)
        copy_ = torch.ops.aten.copy_.default(arg0_1, wait_tensor);  arg0_1 = wait_tensor = copy_ = None
        return (mul,)

def load_args(reader):
    buf0 = reader.storage(None, 64)
    reader.tensor(buf0, (4, 4), is_leaf=True)  # arg0_1
load_args._version = 0
mod = Repro()
if __name__ == '__main__':
    from torch._dynamo.repro.after_aot import run_repro
    # Initialize FakeProcessGroup for distributed operations
    store = FakeStore()
    dist.init_process_group(
        backend="fake",
        rank=0,
        world_size=2,
        store=store
    )
    with torch.no_grad():
        run_repro(mod, load_args, accuracy=False, command='run', save_dir=None, tracing_mode='real', check_str=None)
        # To run it separately, do
        # mod, args = run_repro(mod, load_args, accuracy=False, command='get_args', save_dir=None, tracing_mode='real', check_str=None)
        # mod(*args)
    dist.destroy_process_group()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157162
Approved by: https://github.com/xmfan
2025-07-10 20:30:27 +00:00
a3ec6d64b2 Update test after CUTLASS upgrade (#157903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157903
Approved by: https://github.com/ngimel
2025-07-10 20:10:20 +00:00
8c5b070d1f Documentation Fix: torch.tensor.scatter_ docs (#157929)
updated torch.tensor.scatter_ docs to reflect proper broadcasting behavior

Fixes #157419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157929
Approved by: https://github.com/albanD
2025-07-10 19:22:52 +00:00
da4e7c77a1 [caffe2] Enable auto vectorization (#157984)
Summary:
We are testing enabling back autovectorization in some codepaths.
These resulted in crashes when compiling using clang17, we are now relying on clang19.

Test Plan:
buck2 build //caffe2/caffe2/fb/transforms:sigrid_interface

We are going to deploy it on ads workloads

Rollback Plan:

Differential Revision: D77448445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157984
Approved by: https://github.com/Skylion007
2025-07-10 19:19:45 +00:00
5bd7804be2 Support caching if joint_custom_pre_pass/joint_custom_post_pass implement the proper interface (#157990)
Summary: Essentially, treat joint_custom_pre_pass/joint_custom_post_pass the same as post_grad_custom_post_pass/post_grad_custom_pre_pass.

Test Plan: More unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157990
Approved by: https://github.com/oulgen
2025-07-10 19:17:11 +00:00
e172309880 Documentation Fix: Torch gather broadcasting (#157920)
updated torch gather docs to reflect proper broadcasting behavior for specific backends

Fixes #157425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157920
Approved by: https://github.com/albanD
2025-07-10 19:08:51 +00:00
e2f64eedaf Fix DTensor handling of conjugate bit. (#158030)
Fixes https://github.com/pytorch/pytorch/issues/130646 specifically for DTensor

Fixes https://github.com/pytorch/torchtitan/issues/267

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158030
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2025-07-10 18:28:12 +00:00
2db1a54465 Add deprecation hint for accelerator APIs (#158013)
[torch.accelerator.set_device_idx](https://docs.pytorch.org/docs/stable/generated/torch.accelerator.set_device_idx.html#torch.accelerator.set_device_idx) and [torch.accelerator.current_device_idx](https://docs.pytorch.org/docs/stable/generated/torch.accelerator.current_device_idx.html#torch.accelerator.current_device_idx) are deprecated, but not reflect in their docs.

## Test Result

### Before
![image](https://github.com/user-attachments/assets/6e0d8c4a-d5e5-420c-8f3a-b2742f0fe263)
![image](https://github.com/user-attachments/assets/4bd99b15-31dc-4043-82e8-3d2c1dfcb57b)
![image](https://github.com/user-attachments/assets/a3d342da-79f2-4950-b17a-d01257603c97)

### After

![image](https://github.com/user-attachments/assets/faf138a8-bd92-4f31-bd7c-4414aee6da5b)
![image](https://github.com/user-attachments/assets/212456bc-1c6b-48c6-9d8c-075d5096b900)
![image](https://github.com/user-attachments/assets/49bb9c8c-203e-424e-bdc0-0f197239146e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158013
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-07-10 18:09:22 +00:00
e3f8141c25 Fix UB in BFloat16 round_to_nearest_even (#157942)
Type punning using unions is undefined behavior in C++ (you may not access a member of a union that is not the active member). bit_cast is the right way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157942
Approved by: https://github.com/Skylion007
2025-07-10 18:03:39 +00:00
a9ac9f2635 [cutlass backend] Change serialization protocol to use more json and cache (#157840)
Differential Revision: [D77949177](https://our.internmc.facebook.com/intern/diff/D77949177/)

What this diff does:
* use lru_cache for serialization and deserialization
* json dumps more. This seems to help perf.

For instantiation level 3332, the loading time decreases from 33s to 20s (roughly 40%) decrease.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157840
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #157839
2025-07-10 17:44:33 +00:00
1d0f45d5d1 [c10d][PGNCCL] Cleanup unused params for nccl comm split (#157978)
Previously we add global ranks as a input params for nccl comm. Now this is not needed, let's clean that up.

Differential Revision: [D78051047](https://our.internmc.facebook.com/intern/diff/D78051047)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157978
Approved by: https://github.com/Skylion007
2025-07-10 17:36:23 +00:00
b40c0b61eb Make guard collective logging less chatty (#157995)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157995
Approved by: https://github.com/Microve, https://github.com/albanD, https://github.com/Skylion007
2025-07-10 17:18:37 +00:00
fb45649df7 [cutlass backend] Make config request key depend on serialization.py and cutlass_utils.py (#157839)
Differential Revision: [D77893241](https://our.internmc.facebook.com/intern/diff/D77893241/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157839
Approved by: https://github.com/ColinPeppler
2025-07-10 17:09:32 +00:00
7caf6c801d [ez][CI] Add docker instructions for linux build (#157974)
Copied from linux-test.yml

I'm not sure how necessary this is because the wiki also has this info, and has more details about it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157974
Approved by: https://github.com/huydhn
2025-07-10 16:15:28 +00:00
493bd625e2 Revert "[BE]: Reduce binary size 40% using aggressive fatbin compression. (#157791)"
This reverts commit 9bdf87e8918b9a3f78d7bcb8a770c19f7c82ac15.

Reverted https://github.com/pytorch/pytorch/pull/157791 on behalf of https://github.com/albanD due to Reverting to avoid regressing on the driver supported ([comment](https://github.com/pytorch/pytorch/pull/157791#issuecomment-3058091176))
2025-07-10 16:14:06 +00:00
4781d72faa [AOTI] codegen for static linkage (#157129)
Design doc: https://docs.google.com/document/d/1ncV7RpJ8xDwy8-_aCBfvZmpTTL824C-aoNPBLLVkOHM/edit?tab=t.0 (internal)

- Add codegen for static linkage
- refactor test code for test_compile_after_package tests

For now,  the following options must be used together with `"aot_inductor.compile_standalone": True`.
"aot_inductor.package_cpp_only": True,

Will change `"aot_inductor.package_cpp_only"` to be automatically set to True in followup PR.

```
python test/inductor/test_aot_inductor_package.py -k test_compile_after_package
python test/inductor/test_aot_inductor_package.py -k test_run_static_linkage_model
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157129
Approved by: https://github.com/desertfire
2025-07-10 16:03:50 +00:00
9bdf87e891 [BE]: Reduce binary size 40% using aggressive fatbin compression. (#157791)
NVCC apparently has a [compression-mode flag](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#compress-mode-default-size-speed-balance-none-compress-mode) to tell it how you want to compress the fatbinary since 12.4. This mode defaults to speed (pick a low compression mode that loads the file quickly). Since we are running into PyPi size issues, this will allow us to upload smaller wheel files.

From: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#compress-mode-default-size-speed-balance-none-compress-mode
```
size
Uses a compression mode more focused on reduced binary size, at the cost of compression and decompression time.
```

Up to 37.2%  reduction in binary size with virtually no drawback (except potentially a little slower loading of the .so at PyTorch startup).

694 MB for CUDA 12.9 builds with 6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX
vs
1.08GB for CUDA 12.9 builds with 7.5;8.0;8.6;9.0;10.0;12.0+PTX

CUDA 12.9 ***694MB*** vs ***1.08GB***

CUDA 12.8 ***604MB*** vs ***845MB***

This ends up saving PyPi.org approximately 19.6 PiB of bandwidth per month for the CUDA 12.9 case.

This will also allow us to add back CUDA 12.8 12.0+PTX which will make the package forward compatible on newer GPUs. Undoing the need for PR https://github.com/pytorch/pytorch/pull/157516 and https://github.com/pytorch/pytorch/pull/157634

<img alt="Screenshot 2025-07-08 at 5 36 44 PM" width="1061" src="https://private-user-images.githubusercontent.com/7563158/463890713-a53ec774-b036-4c0b-a5d5-301756e3644f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTIwNzY3OTIsIm5iZiI6MTc1MjA3NjQ5MiwicGF0aCI6Ii83NTYzMTU4LzQ2Mzg5MDcxMy1hNTNlYzc3NC1iMDM2LTRjMGItYTVkNS0zMDE3NTZlMzY0NGYucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDcwOSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTA3MDlUMTU1NDUyWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9Yzg1OGExN2VjYmI3ZDFhNjIwZDk0NTBjOWFlZDIzYzY3MmExYTFiOGZhZjc0NTI1ZTk2YzM3YzdhYzkyYzZlMiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.2-YmmfXrBFuXCrjDCQ_iTgbtbwv9xNFqM6Goc_liDKE">

More details can be found in Nvidia's technical blog for CUDA 12.4: https://developer.nvidia.com/blog/runtime-fatbin-creation-using-the-nvidia-cuda-toolkit-12-4-compiler/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157791
Approved by: https://github.com/malfet, https://github.com/atalman
2025-07-10 15:51:04 +00:00
f85954e043 Update OpenBLAS commit (#151547)
Motivation: Update OpenBLAS and change build script to enable SBGEMM kernels . Update pytorch `jammy` builds for aarch64 to use `install_openblas.sh` instead of `conda_install`

Link to full [TorchInductor Performance Dashboard AArch64](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2006%20Jun%202025%2009%3A46%3A35%20GMT&stopTime=Fri%2C%2013%20Jun%202025%2009%3A46%3A35%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=adi/update_openblas&lCommit=0218b65bcf61971c1861cfe8bc586168b73aeb5f&rBranch=main&rCommit=9d59b516e9b3026948918e3ff8c2ef55a33d13ad)

1. This shows a promising speedup across most of the HF models in benchmark, specifically giving a significant boost to SDPA layers.
2. Overall torch-bench pass-rate (cpp_wrapper mode) increased `[87%, 65/75 → 96%, 72/75]`

<img width="676" alt="Screenshot 2025-06-20 at 17 05 15" src="https://github.com/user-attachments/assets/2ca9c1bc-80c6-464a-8db6-b758f2476582" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151547
Approved by: https://github.com/malfet, https://github.com/snadampal, https://github.com/fadara01

Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>
Co-authored-by: Ryo Suzuki <ryo.suzuki@arm.com>
Co-authored-by: Ye Tao <ye.tao@arm.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-07-10 14:58:12 +00:00
7702855228 [logging] dynamo_timed the synchronize in CachingAutotuner make_launchers (#157747)
Summary: There's some evidence that some very long compile times are actually attributable to the sync. This should make it easier to say for sure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157747
Approved by: https://github.com/aorenste, https://github.com/mlazos
2025-07-10 14:48:51 +00:00
9a5278225f [CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/syed-ahmed, https://github.com/wujingyue, https://github.com/atalman

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-07-10 14:38:18 +00:00
8532033679 RPC tutorial audit (#157938)
Fix [T228333894](https://www.internalfb.com/intern/tasks/?t=228333894)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157938
Approved by: https://github.com/AlannaBurke
2025-07-10 14:15:37 +00:00
8dff457f42 [simple_fsdp] Port fx pass to bucket reduce_scatters (#157780)
Porting fx passes for reduce_scatters bucketing (similar to all_gather bucketing) for simple_fsdp and autoparallel testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157780
Approved by: https://github.com/wconstab
2025-07-10 14:04:43 +00:00
a9537b626c [standalone_compile] Fix single Tensor outputs from split_module (#157803)
We assumed that the output in an FX graph would always just be a
list[Tensor], even in the single tensor return case.
It is possible for the output to be a single Tensor. This can happen
by calling torch.fx.split_module on the module.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157803
Approved by: https://github.com/oulgen
2025-07-10 12:49:03 +00:00
82765dad16 Fix logging of config_suppress_errors and config_inline_inbuilt_nn_modules (#157947)
Currently ~50% of the time we fail or crash before logging metrics, so moving where this is logged will let us have more comprehensive (less-null) data.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157947
Approved by: https://github.com/masnesral, https://github.com/jovianjaison
2025-07-10 12:05:43 +00:00
cd995bfb2a [inductor] re-enable TMA templates w/ AOTI (#157819)
Follow-up from #155896: now that AOTI can codegen non-null TMA workspace args, we can re-enable TMA templates w/ AOTI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157819
Approved by: https://github.com/drisspg
2025-07-10 08:35:29 +00:00
1e8e9f745e Introduce AcceleratorAllocatorConfig as the common class (#149601)
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The another name `AllocatorConfig` is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.

# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env variable config key-value pair

## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format: `key:value`. Use square brackets `[]` for list values Example: `key1:123, key2:[val1,val2]`

## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
2025-07-10 07:05:39 +00:00
af3d069094 [BE][Easy] remove unused build-time dependency astunparse and change astunparse.unparse -> ast.unparse (#157907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157907
Approved by: https://github.com/Skylion007
2025-07-10 07:04:42 +00:00
ba0d0de5e6 Enable set SDPA backend by torch.nn.attention.sdpa_kernel on XPU (#156669)
Introduces support for a new `OVERRIDEABLE` backend in the SDPA module, improves backend selection logic, and adds corresponding tests. In addition, a fallback mechanism was added when a specific backend is unavailable, enhancing user configurability.

### Backend Support and Selection Enhancements:
* Added `at::SDPBackend::overrideable` to the list of available SDPA backends in the `Context` class (`aten/src/ATen/Context.h`).
* Updated the backend selection logic in `select_sdp_backend_xpu` to include the `OVERRIDEABLE` backend and added a fallback mechanism for unsupported `FLASH_ATTENTION` on XPU.
* Adjusted error messaging in `_fused_sdp_choice_xpu` to reflect the inclusion of the `OVERRIDEABLE` backend. (`aten/src/ATen/native/mkldnn/xpu/Attention.cpp`)

### Test Additions for Backend Fallback and Selection:
* Added new unit tests to validate fallback behavior for `FLASH_ATTENTION` to `OVERRIDEABLE` and to verify correct backend selection when `MATH` is enabled. (`test/test_transformers.py`,)

### Codebase Updates for Backend Integration:
* Introduced `OVERRIDEABLE` as a new member of the `_SDPBackend` enum. (`torch/_C/__init__.pyi.in`)
* Extended `_backend_names` and updated related methods to handle the `OVERRIDEABLE` backend. (`torch/nn/attention/__init__.py`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156669
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-07-10 06:52:22 +00:00
4cc13c4af6 [dynamic shapes] avoid unnecessary slices (#157528)
Fixes #157289, by extending optimization to slices where the end index exceeds the size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157528
Approved by: https://github.com/angelayi
2025-07-10 06:34:46 +00:00
565fd07909 [Easy] Make the error message shown by THPUtils_unpackLong to be clearer (#157886)
As the title stated.

The error message of `THPUtils_unpackLong` is the same as `THPUtils_unpackInt`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157886
Approved by: https://github.com/Skylion007
2025-07-10 06:26:13 +00:00
b85f10ea50 [BE] Replace std::runtime_error with TORCH_CHECK [2/N] (#152080)
Part of: #148114

Related commits:

- #151880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152080
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-07-10 06:02:47 +00:00
fadc936fad Updates to build and test on Noble (Ubuntu24.04) and py3.12 (#152240)
This PR enables Ubuntu24.04 testing on CI:
* Builds a base docker image using Noble (Ubuntu24.04) and py3.12 for ROCm N version
* Builds and tests PyTorch on Ubuntu24.04 as part of the `rocm-mi300` workflow

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152240
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2025-07-10 05:55:42 +00:00
b7860c7863 Implement fast exp for AVX2 and AVX512 for the flash attention (#151441)
**Implement fexp for avx2 and avx512**

Cristiano and all propose a clever exp using the IEEE representation with a fine control of the precision, especially useful
for mix computation of the flash attention.

- Implement Fast Exponential Computation on SIMD Architectures
  A. Cristiano I. Malossi, Yves Ineichen, Costas Bekas, and Alessandro Curioni
- AVX2 and AVX512 float only, up to 20% faster for mix precision flash attention
  than the current implementation.
- For the other types legacy implementation.

**Precision**

1 ULP only valid in hybrid mode fp32 -> f16 due to the cast during the
store operation in the flash attention:

**Benchmark**

Machine Xeon 6972P, results in TOPs, Python forward pass flash attention

numhead 16, Head dimension 64

|Seq. L.| PT   | fexp |
|-------|------|------|
| 512   | 0.8  | 1.3  |
| 1024  | 1.7  | 1.7  |
| 2048  | 6    | 6.1  |
| 4096  | 16   | 16.8 |
| 8192  | 30.6 | 32.3 |
| 16384 | 40   | 40.8 |
| 32768 | 44.9 | 51.4 |
| 65536 | 45.8 | 54.4 |

numhead 16, Head dimension 128

|Seq. L.| PT   | fexp |
|-------|------|------|
| 512   | 2.5  | 4.1  |
| 1024  | 3.3  | 4    |
| 2048  | 11.4 | 10.5 |
| 4096  | 27.4 | 28.4 |
| 8192  | 44.4 | 46   |
| 16384 | 64.2 | 68.1 |
| 32768 | 77.8 | 83   |
| 65536 | 82.1 | 88.1 |

numhead 16, Head dimension 256

|Seq. L.| PT   | fexp |
|-------|------|------|
| 512   | 1.7  | 3.4  |
| 1024  | 4.2  | 6.5  |
| 2048  | 14.6 | 16.1 |
| 4096  | 30.1 | 31.1 |
| 8192  | 60   | 62   |
| 16384 | 83.3 | 87.3 |
| 32768 | 98.7 | 106  |
| 65536 | 102.2| 107.1|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151441
Approved by: https://github.com/mingfeima
2025-07-10 05:51:31 +00:00
9222552572 [non-strict export] uncovered cases of select and slice (#157821)
Summary:
`None` and `Ellipsis` in multi-dimensional indexing was previously not covered.

Moreover, we introduce a small optimization for `slice(None)` and a passthrough when symints do not appear in the indexing.

The remaining case is where indexing is by tensor, which is fairly complicated; we passthrough in that case.

Test Plan:
added tests

Rollback Plan:

Differential Revision: D77943929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157821
Approved by: https://github.com/pianpwk
2025-07-10 05:48:12 +00:00
3584e84c24 Fixed the function to get the origin nodes of fused triton kernel. (#157578)
Summary:
This DIFF is to fix the following issue:
In python source code for CompiledFxGraph,the FX graph segment for the Triton kernel is broken. For example, the following function
  def fn(a, b, c):
      x = torch.nn.functional.linear(a, b)
      x = x.sin()
      x = x.t() + c
      return x
Inductor compiled this FX graph into two nodes: the first one is mm, the second one is a triton kernel for sin + transpose + add. The FX graph segment for the triton kernel is like the following:
Graph fragment:
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %arg2_1), kwargs = {})
Basically only "add" node in the FX graph.
The root cause is function caffe2/torch/_inductor/utils.py:gather_origins does not detect the realized node correctly.
To fix this issue, the IRNode is checked if it is one of the following IRNode:
    ir.ComputedBuffer,
    ir.InputsKernel,
    ir.InputBuffer,
    ir.ReinterpretView,
    ir.TemplateBuffer,

If it is one of them, it is realized, otherwise, it is not.

Test Plan:
buck2 run mode/opt caffe2/test/inductor:provenance_tracing -- caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact.test_triton_kernel_to_post_grad_tracing_cuda

Rollback Plan:

Differential Revision: D77748371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157578
Approved by: https://github.com/mlazos
2025-07-10 05:34:50 +00:00
b146ca74f0 docs: add get_default_backend_for_device to distributed documentation (#156783)
`torch.distributed.get_default_backend_for_device()` API was added to torch 2.6, but is still missing in distributed documentation. This commit addresses the gap.

CC: @guangyey, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156783
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-07-10 05:11:30 +00:00
eddddea908 Upgrade MKL in CI (#154198)
This PR is to upgrade MKL in CI as PyTorch release uses MKL 2024.2 while MKL in CI is 2021.4. MKL 2021.4 can't trigger issues like https://github.com/pytorch/pytorch/issues/154477 caused by MKL upgrading in Torch release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154198
Approved by: https://github.com/leslie-fang-intel, https://github.com/malfet
ghstack dependencies: #154585
2025-07-10 05:09:51 +00:00
80bcaa4195 have dynamic sources only apply to sizes and not strides (#157960)
@animesh pointed out using whitelist for strides can result in confusing graphs as follows

```
s60: "Sym(s60)", L_hidden_states_: "bf16[1, 4096, 3072][s60, 3072, 1]cuda:0"
```

We probably want to capture the relationship between sizes and strides anyways so let's make it so the whitelist only makes the sizes dynamic. That same graph now looks lik ethis

```
L_hidden_states_: "bf16[1, 4096, 64][262144, 64, 1]cuda:0"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157960
Approved by: https://github.com/pianpwk
2025-07-10 05:03:51 +00:00
88cd9f34b0 [audio hash update] update the pinned audio hash (#157873)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157873
Approved by: https://github.com/pytorchbot
2025-07-10 04:59:50 +00:00
2b19d85d70 FractionalMaxPool3d add kernel_size check (#155549)
Fixes #96316

## Test Result

```python
>>> import torch
>>> from torch.func import jacrev, grad, vmap
>>>
>>> torch.manual_seed(420)
<torch._C.Generator object at 0x7fe4767810d0>
>>>
>>> input = torch.randn(1, 1, 5, 5, 5, requires_grad=True)
>>>
>>> def func(input):
...     model = torch.nn.FractionalMaxPool3d(kernel_size=0, output_size=(1, 1, 1))
...     output = model(input)
...     return output
...
>>>
>>> func(input).sum().backward()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in func
  File "/home/zong/code/pytorch/torch/nn/modules/pooling.py", line 1054, in __init__
    raise ValueError(f"kernel_size must greater than 0, but got {kernel_size}")
ValueError: kernel_size must greater than 0, but got 0

```

![image](https://github.com/user-attachments/assets/52780ce7-3951-4d1c-95a4-5ce2bf65c727)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155549
Approved by: https://github.com/albanD
2025-07-10 04:55:06 +00:00
06a40b6850 Fix MKL error: Inconsistent configuration parameters (#154585)
Fixes #154477.

PyTorch release uses 2024.2 MKL, which has some changes to the usage of DFTI: if `DFTI_NUMBER_OF_TRANSFORMS > 1`, `DFTI_INPUT_DISTANCE` and `DFTI_OUTPUT_DISTANCE` also needs to be explicitly set to a positive integer. In addition, the requirement "the datasets to be transformed cannot contain common elements" should also be satisfied. This means that we need to avoid the case where the input strides have 0.

See https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2024-2/configuring-data-layouts.html and https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-2/dfti-number-of-transforms.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154585
Approved by: https://github.com/leslie-fang-intel, https://github.com/soumith, https://github.com/malfet
2025-07-10 03:42:38 +00:00
0a624c2dc5 Fix from_node's graph_id in unlift() (#157943)
Summary: We should use the node before deepcopy in NodeSource

Test Plan:
```
buck run fbcode//caffe2/test:test_export -- -r test_from_node_metadata_export
```

Rollback Plan:

Differential Revision: D78022070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157943
Approved by: https://github.com/angelayi, https://github.com/Gasoonjia
2025-07-10 03:23:55 +00:00
4cfc0a3208 [Inductor] Introduce Lookup Table for Overriding Triton Kernel autotune configs post fusion (#157924)
Summary:
Introduce lookup table for kernels post fusion, hashing on inductor generated source code

Rollback Plan:

Differential Revision: D77866885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157924
Approved by: https://github.com/jansel
2025-07-10 03:23:50 +00:00
3232b57cd8 Updates to safetensors checkpoint consolidation script to be faster (#157936)
Summary:
- adding mmap-ing
- more efficient writing in larger chunks

latency from ~150s to ~6s for simple row-wise consolidation of a 7gb model sharded across 4 ranks

Test Plan:
ran consolidation with the following code:

```
from torch.distributed.checkpoint._consolidate_hf_safetensors import consolidate_safetensors_files
import time

start_time = time.time()
consolidate_safetensors_files(base_path, consolidated_path)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
```

With the old code this was taking a couple minutes and this is now down to ~6s.
Internal users can find the tensor shards in the manifold path: manifold://ankita_test_bucket/tree/safetensors

Rollback Plan:

Differential Revision: D77960054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157936
Approved by: https://github.com/teja-rao, https://github.com/pradeepfn
2025-07-10 02:50:20 +00:00
3404c1f0cf [HF][DCP] Upload local consolidated files to remote storage if needed (#157371)
If the final output file is in remote storage, then create a local temp directory to write the files and upload the files to the remotes storage after they are written.
Add a new config to the storage writer, `enable_consolidation`, so we don't need to rely on the presence of the `consolidation_output_path` to decide if consolidation is enabled. If `enable_consolidation` is True and `consolidation_output_path` isn't provided, the consolidated safetensors will be added to the same path as the sharded ones.

Differential Revision: [D77554585](https://our.internmc.facebook.com/intern/diff/D77554585/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157371
Approved by: https://github.com/pradeepfn
2025-07-10 02:40:25 +00:00
aab949aa96 Deprecated pkg_resources and use distributions instead (#151915)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151915
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/albanD
2025-07-10 01:51:26 +00:00
6442ae9256 Make the name assert actually do something, and reserve some more names (#157342)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157342
Approved by: https://github.com/albanD
2025-07-10 01:39:40 +00:00
db188503cb [BE] Remove stale pyre-fixme (#157816)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157816
Approved by: https://github.com/Skylion007, https://github.com/jingsh, https://github.com/albanD
2025-07-10 01:33:32 +00:00
693116f765 [doc] DeviceMesh invariant on DTensorSpec (#157806)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157806
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
ghstack dependencies: #157805
2025-07-10 01:27:40 +00:00
9a4ac71b58 [doc] Document an invariant in OpSpec (#157805)
I am not sure if this is actually true though, please reject this PR if it is not.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157805
Approved by: https://github.com/wanchaol, https://github.com/zpcore
2025-07-10 01:27:40 +00:00
8387984257 Improve error message for torch.binomial enforcing float inputs (#157658)
Fixes #157195
### Summary:
 Fixed Issue 157195 by adding a new error message for torch.binomial in **aten/src/ATen/native/Distributions.cpp**

### Explanation
 According to the issue,
```
import torch
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]))
```
`RuntimeError: Found dtype Float but expected Long`

 It looks like we are getting a Tensor error rather than a binomial function error. Since the error is coming from **pytorch/aten/src/ATen/TensorIterator.cpp**,  it seems like it is trying to align the tensor data to the same datatype for smooth tensor computations instead of giving a binomial function error.

I tried using both arguments as longs and both as ints and got the right binomial function error
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]).long())
NotImplementedError: "binomial_cpu" not implemented for 'Long'
```

```
torch.binomial(torch.tensor([10.0]).int(), torch.tensor([0.5]).int())
NotImplementedError: "binomial_cpu" not implemented for 'Int'
```

But when I have both as different datatypes, the TensorIterator.cpp error comes back trying to align the datatypes.
`RuntimeError: Found dtype Float but expected Long`

I then tried finding where the NotImplementation Error was documented and found it in **pytorch/aten/src/ATen/Dispatch.h** in lines 193 - 211

```
#define AT_DISPATCH_SWITCH(TYPE, NAME, ...)                                 \
  [&] {                                                                     \
    const auto& the_type = TYPE;                                            \
    constexpr const char* at_dispatch_name = NAME;                          \
    /* don't use TYPE again in case it is an expensive or side-effect op */ \
    at::ScalarType _st = ::detail::scalar_type(the_type);                   \
    RECORD_KERNEL_FUNCTION_DTYPE(at_dispatch_name, _st);                    \
    switch (_st) {                                                          \
      __VA_ARGS__                                                           \
      default:                                                              \
        TORCH_CHECK_NOT_IMPLEMENTED(                                        \
            false,                                                          \
            '"',                                                            \
            at_dispatch_name,                                               \
            "\" not implemented for '",                                     \
            toString(_st),                                                  \
            "'");                                                           \
    }                                                                       \
  }()
```
 In the **AT_DISPATCH_SWITCH** function, it picks a tensor and its datatype and checks if the Tensor datatype matches the supported datatypes. If not we get the Not Implemented error. Unfortunately, I think the **AT_DISPATCH_SWITCH** function, uses the `common_dtype` from TensorIterator  in order to run. So TensorIterator.cpp needs to happen before the AT_DISPATCH_SWITCH function.

###  Summary: We are getting the wrong error message because **TensorIterator.cpp** gets called and errors out due to Tensor datatype mismatch before we can get the right error message in **Dispatch.h**  for torch.binomial not supporting that datatype.

### Options for the Fix
**Option 1**: Make the error message in TensorIterator.cpp more general so it applies to torch.binomial. An error message along the lines
`RunTime Error : "Tensor Datatypes", op.target_dtype," and ", common_dtype_, "are different "`

**Option 2**: Add an error message for the binomial function datatype mismatch before the the TensorIterator.cpp error message gets called.

Although Option 1 seemed easier I think Option 2 might be better as it is more specific to the binomial function while Option1 would affect all Tensors with datatype mismatch.

 **This PR applies the fix for Option 2**

After Fix :
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]))
RuntimeError: Binomial function arguments count and prob must have same datatype of type Float, got: count = Long, prob = Float
```
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]).long())
NotImplementedError: "binomial_cpu" not implemented for 'Long'
```
@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157658
Approved by: https://github.com/soulitzer
2025-07-10 00:58:56 +00:00
54a7e5b598 _aot_export_function: allow keeping input mutations in the graph (#157730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157730
Approved by: https://github.com/ezyang
2025-07-10 00:47:51 +00:00
ed03492238 Add check nested_tensor_from_jagged param jagged_dim >= 1 (#157770)
Fixes #157404

## Test Result

```bash
pytest test/test_nestedtensor.py

...............................................s..........ssssss.................................................................................................s.s..sssss..s...ss............................................................. [ 44%]
...........................................................sssss....sss...s.........ss....s....sss.........s.sss...s..s......s............s.sss.ss...............s.....................s....s......................s.s.....s....s..s..ssssssssss [ 59%]
sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss..ssssss.ssssssssssssssssssssssssssssssssssssssssssssssssssssssssss.ssssssss...............................s........................................... [ 74%]
.......sss...................................................................................................................................................................................................................................... [ 89%]
....sss..........................................................................................................................................................                                                                                [100%]

==================================================================================================== 1317 passed, 258 skipped in 2504.27s (0:41:44) ====================================================================================================
```

![image](https://github.com/user-attachments/assets/dcc8e46d-b88f-4580-b4ad-0999bad33ec9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157770
Approved by: https://github.com/soulitzer

Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
2025-07-10 00:34:39 +00:00
752f202ef3 [PGO] include module int attributes in PGO state (#157518)
Dynamo specializes on int module attributes by default. This includes them in PGO state despite specialization, if they're involved in guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157518
Approved by: https://github.com/bobrenjc93
2025-07-09 23:57:54 +00:00
ed051c3084 torch.distributed: add initial _dist2 prototype API (#157841)
This adds the initial dist2 API as proposed in https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89

This is a WIP experimental API and is a sandbox for a number of new features and quality of life improvements/changes to c10d.

Test plan:

```
pytest test/distributed/test_dist2.py
```

Docs

```
cd docs
make html
```

![Screenshot 2025-07-08 at 13-39-23 Object Oriented Distributed API - torch distributed _dist2 — PyTorch main documentation](https://github.com/user-attachments/assets/9c03a7ec-09e5-42b9-8478-1ec28bc2b6bd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157841
Approved by: https://github.com/fduwjj
2025-07-09 23:40:43 +00:00
39456edbba [PT2][memory] mutation size correctness (#157562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157562
Approved by: https://github.com/yf225
2025-07-09 22:14:20 +00:00
a1dad2f2d2 [BE][Ez]: Autotype torch/profiler with ruff ANN (#157923)
Apply ruff autotyping fixes to add annotations to torch profiler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157923
Approved by: https://github.com/albanD, https://github.com/sraikund16
2025-07-09 22:07:50 +00:00
53ab73090e [inductor] support unbacked symint in sdpfa (#157739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157739
Approved by: https://github.com/laithsakka
2025-07-09 22:01:29 +00:00
08e9dd280f [ONNX] Support symbolic arguments in onnx exporter (#157734)
Previous to this PR, torch.onnx.export(..., dynamo=True, veriy=True, report=True) does not support symbolic arguments. Such examples are like follwing:

```python
class M(torch.nn.Module):
    def forward(self, a, x):
        return a + torch.tensor(1) + x

op = torch.onnx.export(M(), (1, torch.ones(2)),
                       dynamic_shapes=(torch.export.Dim.DYNAMIC, {0: torch.export.Dim.DYNAMIC}),
                       dynamo=True, report=True)
```

symbolic arguments are like constant arguments that they don't have tensor_meta wither. Besides, torch.export.export supports model inputs having constants, which is different from the legacy issue: https://github.com/pytorch/pytorch/issues/99534 where we tried to get the FX directly from dynamo export. Thus, `_remove_non_tensor` is deleted from args processing.

NOTE: If the ConstantArugment shows up in exported_program, it was kept to align the length of inputs to nn.Module, but it's irrelevant to the model graph, hwich is why in ONNX model the input is omitted.

The test `test_constant_argument_user_input_is_omitted_in_onnx_graph` needs #157719
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157734
Approved by: https://github.com/justinchuby
2025-07-09 21:15:45 +00:00
163f0d8f2a [BE][Ez]: Auto add return type annotations for methods in torch/nn/module (#157925)
Automatically type a bunch of methods in nn.Module using ruff's type inference rules

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157925
Approved by: https://github.com/albanD
2025-07-09 21:12:25 +00:00
f742b32a2f [dynamo] Avoid recompiling over unused objects (#156891)
Dynamo was aggressively specializing on lazy VTs over `set_name_hint` in
`STORE_FAST`, etc., and `isinstance` in `LOAD_FAST_CHECK`. This causes
regional `torch.compile` from optimizing ComfyUI GGUF + LoRA to either
(1). exceed the recompialtion limit of 8, which results in suboptimal
performance, and (2). even if recompilation limit is increased, the
compilation time gets unnecessarily high (180s v.s. 20s for Flux).

This patch fixes the recompilation issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156891
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2025-07-09 20:14:34 +00:00
317520bf6e Add an ovrsource target for torch/headeronly (#157912)
Summary: no idea how this works

Test Plan:
will things just pass?

Rollback Plan:

Differential Revision: D77965219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157912
Approved by: https://github.com/albanD
2025-07-09 19:32:03 +00:00
dfa2649434 Revert "[Inductor] Fix epilogue fusion decision with 1 Triton caller as choice (#156500)"
This reverts commit c48d0f4643b7a69ebe24069e932ce1465a31cdbe.

Reverted https://github.com/pytorch/pytorch/pull/156500 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156500#issuecomment-3053680762))
2025-07-09 18:56:10 +00:00
52772765e0 Change AOTI_RUNTIME_DEVICE_CHECK to be device device specific (#157818)
Summary:
Change AOTI_RUNTIME_DEVICE_CHECK to the following depending on device:

AOTI_RUNTIME_CUDA_CHECK
AOTI_RUNTIME_XPU_CHECK
AOTI_RUNTIME_CPU_CHECK

Currently in the codebase, only `AOTI_RUNTIME_CUDA_CHECK` is used.

This shouldn't change anything as of now, but we do this to prepare for simultaneouly loading multiple backends (e..g CPU and CUDA) in AOTI standalone.

We don't want people writing `AOTI_RUNTIME_DEVICE_CHECK` for both CPU and CUDA checks. This could cause compilation problems when we statically link both CPU and CUDA models.

Test Plan:
CI

Rollback Plan:

Reviewed By: muchulee8

Differential Revision: D77742977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157818
Approved by: https://github.com/jingsh
2025-07-09 18:34:56 +00:00
c54778625e Update is_sparse doc to mention that it is sparse_coo specific (#157378)
## Issue being addressed
`is_sparse` presents itself as determining if a tensor is sparse. HOWEVER, it only does checks against the tensor for `sparse_coo`. This has lead to confusion from developers as when non-coo sparse tensors are provided it return false, despite those tensors being sparse.

## Considered Remedy
Fixing this is do-able however would result in complexity as existing systems may depend on this behavior remaining consistent, and even inside of pytorch is_sparse is used by `bform` which states that it supports only `sparse_csr and sparse_coo` meaning additional work/thought would have to go into solving for `sparse_csc` and `sparse_bsr`

## Remedy provided in this PR
In lieu of these complications the lowest risk highest gain action was to add clear warning messaging to the function for now to avoid confusion to developers utilizing the function. The rest of the function behavior remains identical

## Issue content
Addresses issue number: #101385
Original issue: https://github.com/pytorch/pytorch/issues/101385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157378
Approved by: https://github.com/soulitzer
2025-07-09 18:22:14 +00:00
81c7445eb9 [FSDP2] Use reduceOpSum for world size 1 (#157529)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157529
Approved by: https://github.com/Skylion007, https://github.com/lw, https://github.com/weifengpy
2025-07-09 18:08:48 +00:00
28aae93f24 [Memory Snapshot] Fix Linter for Global Annotations flag in Snapshot (#157858)
Summary: We added the ability to make Annotating Global or Local based on an input flag in PyTorch but didn't add the args to the linter

Reviewed By: mzzchy

Differential Revision: D77959409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157858
Approved by: https://github.com/mzzchy
2025-07-09 17:28:22 +00:00
b354328ecd [AOTI] add flag AOT_INDUCTOR_ENABLE_LTO (#157773)
Add env var AOT_INDUCTOR_ENABLE_LTO to enable clang's ThinLTO by setting AOT_INDUCTOR_ENABLE_LTO=1. The LTO is disabled by default because it may increase the build time.

Rollback Plan:

Differential Revision: D77899195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157773
Approved by: https://github.com/desertfire
2025-07-09 16:54:19 +00:00
d75d30eeb6 [DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#157216)
This is to unblock "dp2ep" Expert Parallel + TP integration in torchtitan https://github.com/pytorch/torchtitan/pull/1324.

It does two things:
1. Slightly modifies the glue code for FSDP/HSDP + TP to work with FSDP/HSDP + EP and FSDP/HSDP + EP + TP. I kept the name `FSDPParam._tp_spec` to make the change minimal. We can consider renaming it in the future if it confuses people, but I heard @wanchaol has a plan to rewrite DTensor strided sharding entirely.
2. Lifts the check of `_validate_tp_mesh_dim` for `torch.distributed.tensor.parallel.parallelize_module`, as in EP or EP+TP this check is too strict. In particular it assumes a DeviceMesh must have `mesh_dim_names` which is not always true. I'm also removing the file `torch/distributed/tensor/parallel/_utils.py` it belongs entirely, as the other check `_deprecate_warnings`, added two years ago, is not used any more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157216
Approved by: https://github.com/wanchaol, https://github.com/weifengpy
2025-07-09 16:49:34 +00:00
cb711c8fa0 Revert "[BE] always use uv pip if possible in pip_init.py for lintrunner init (#157199)"
This reverts commit 754699610b0abec2fe3f5a73269b1dd09a330445.

Reverted https://github.com/pytorch/pytorch/pull/157199 on behalf of https://github.com/malfet due to It breaks lintrunner init` for default environments, see https://github.com/pytorch/pytorch/issues/152999 ([comment](https://github.com/pytorch/pytorch/pull/157199#issuecomment-3053279711))
2025-07-09 16:26:47 +00:00
981c99fdff Uninstall brew miniconda while running MacOS testing (#156898)
That results in torch.compile being unable to produce working artifacts
But reinstall it later, when done

Should fix https://github.com/pytorch/pytorch/issues/156833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156898
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-07-09 16:02:55 +00:00
054cd4ca28 [CPU Generator] Remove the unused CPUGeneratorImplStateLegacy in set_state (#153934)
As the title stated.

The old state named CPUGeneratorImplStateLegacy in set_state will not been used,
so just remove it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153934
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet, https://github.com/atalman
2025-07-09 15:45:19 +00:00
f4d60a68dd Adding a change to kick off the theme pull (#157732)
Adding a small change so that Docker container is rebuild and reflects the latest changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157732
Approved by: https://github.com/malfet
2025-07-09 15:43:00 +00:00
6defd5084e Revert "[PT2][memory] mutation size correctness (#157562)"
This reverts commit 86670b39fa3df63a652a9a06b59b73f92d70c392.

Reverted https://github.com/pytorch/pytorch/pull/157562 on behalf of https://github.com/xuanzhang816 due to internal_test_failure ([comment](https://github.com/pytorch/pytorch/pull/157562#issuecomment-3053115025))
2025-07-09 15:38:29 +00:00
b4e3c9ea34 [ez][CI][testing] Set upload artifacts while running to default true if in CI (#157868)
I was confused about why the distributed tests weren't showing up quickly on HUD, its because the call of run_tests.py for distributed didn't include upload artifacts while running flag, so set it to default to IS_CI so I don't need to put the flag everywhere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157868
Approved by: https://github.com/huydhn
2025-07-09 15:21:25 +00:00
fcc682be4b [BE][Ez]: Fully type nn.utils.clip_grad (#154801)
Full types clip_grad and exposed typing annotations that were hidden by a bad decorator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154801
Approved by: https://github.com/jansel
2025-07-09 14:27:51 +00:00
ed6ae20cf0 [BE][Ez]: Update mimalloc submodule to 2.2.4 (#157794)
Fixes a few minor bugfixes with the previous release and better compiler support. Should be a NOOP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157794
Approved by: https://github.com/atalman
2025-07-09 14:03:07 +00:00
02a9d9095f [BE] remove commented out code in c10/ovrsource_defs.bzl (#157856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157856
Approved by: https://github.com/swolchok, https://github.com/albanD
2025-07-09 13:28:56 +00:00
86eaf452c3 [Easy][Profiler] Fix pattern matcher of profiler (#157711)
Per title, as it fails with the following error if "+PTX" was used in `TORCH_CUDA_ARCH_LIST`:
```
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/_pattern_matcher.py", line 313, in skip
    has_tf32 = all(int(arch[3:]) >= 80 for arch in torch.cuda.get_arch_list())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/_pattern_matcher.py", line 313, in <genexpr>
    has_tf32 = all(int(arch[3:]) >= 80 for arch in torch.cuda.get_arch_list())
                   ^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'pute_120'
```
Because slicing `arch[3:]` will not end up on having only digits for `compute_120` element of `torch.cuda.get_arch_list()`:
```python
>>> torch.cuda.get_arch_list()
['sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157711
Approved by: https://github.com/Skylion007, https://github.com/sraikund16
2025-07-09 12:09:46 +00:00
297daa1d30 [aarch64] Add sm_80 to CUDA SBSA build (#157843)
related to https://github.com/pytorch/pytorch/issues/152690

This adds sm_80 to CUDA SBSA builds (12.9), so that we will be able to support Ampere family (e.g: sm_86) and Ada family (e.g: sm_89) on CUDA SBSA builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157843
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-07-09 11:46:34 +00:00
a355158fcb [Easy] Fix the compilation warning (#157889)
**Background:**

```Shell
[1376/2332] Building CUDA object caffe2/CMakeFiles/torch_...h/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu.o
/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(450): warning #68-D: integer conversion resulted in a change of sign
      size_t numelIn_ = -1;
                        ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(451): warning #68-D: integer conversion resulted in a change of sign
      size_t numelOut_ = -1;
                         ^

/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(450): warning #68-D: integer conversion resulted in a change of sign
      size_t numelIn_ = -1;
                        ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(451): warning #68-D: integer conversion resulted in a change of sign
      size_t numelOut_ = -1;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157889
Approved by: https://github.com/mlazos
2025-07-09 11:41:02 +00:00
4dce5b71a0 [build] modernize build-frontend: python setup.py develop/install -> [uv ]pip install --no-build-isolation [-e ]. (#156027)
Modernize the development installation:

```bash
# python setup.py develop
python -m pip install --no-build-isolation -e .

# python setup.py install
python -m pip install --no-build-isolation .
```

Now, the `python setup.py develop` is a wrapper around `python -m pip install -e .` since `setuptools>=80.0`:

- pypa/setuptools#4955

`python setup.py install` is deprecated and will emit a warning during run. The warning will become an error on October 31, 2025.

- 9c4d383631/setuptools/command/install.py (L58-L67)

> ```python
> SetuptoolsDeprecationWarning.emit(
>     "setup.py install is deprecated.",
>     """
>     Please avoid running ``setup.py`` directly.
>     Instead, use pypa/build, pypa/installer or other
>     standards-based tools.
>     """,
>     see_url="https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html",
>     due_date=(2025, 10, 31),
> )
> ```

- pypa/setuptools#3849

Additional Resource:

- [Why you shouldn't invoke setup.py directly](https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156027
Approved by: https://github.com/ezyang
2025-07-09 11:24:27 +00:00
fc0376e8b1 [BE][2/6] fix typos in test/ (test/test_*.py) (#157636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157636
Approved by: https://github.com/yewentao256, https://github.com/mlazos
ghstack dependencies: #156311, #156609
2025-07-09 11:02:23 +00:00
ffe11b2bf2 [BE] fix typo in torch/distributed/tensor/: childs -> children (#156609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156609
Approved by: https://github.com/wanchaol, https://github.com/cyyever
ghstack dependencies: #156311
2025-07-09 11:02:23 +00:00
4cc8b60d1b [BE][1/16] fix typos in torch/ (#156311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156311
Approved by: https://github.com/albanD
2025-07-09 11:02:22 +00:00
f5bbaa2253 Fixes typo in nccl_window_registration test (#157293)
As mentioned here: https://github.com/pytorch/pytorch/pull/155134#discussion_r2175605192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157293
Approved by: https://github.com/Skylion007
2025-07-09 11:01:18 +00:00
924fc52e18 [BE] add a linter to check consistency for cmake minimum version in requirements (#156961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156961
Approved by: https://github.com/ezyang, https://github.com/malfet
2025-07-09 10:44:17 +00:00
b83d8827bc Revert "Deprecate DataLoader pin_memory_device param (#146821)"
This reverts commit ab655816b8f76f511fb2262d45276d8d1b13d59c.

Reverted https://github.com/pytorch/pytorch/pull/146821 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146821#issuecomment-3052093902))
2025-07-09 10:29:31 +00:00
6f23f53599 [inductor] fix tensor.to(uint8) error when tensor src type is float (#157267)
The cpu inductor processes .to(torch.uint8) incorrectly, leading to numerical inconsistencies. The convert_float_to_int8 function may return incorrect results for negative inputs, such as -2.xx, when the data type is uint8_t, producing 0 instead of 255. This issue stems from the clamping logic; we should avoid converting min_val to uint8_t too early
Fixes https://github.com/pytorch/pytorch/issues/156788
@leslie-fang-intel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157267
Approved by: https://github.com/leslie-fang-intel
2025-07-09 07:03:38 +00:00
e3f2597b45 [Optimus] Fix normalization pass in the aten IR (#157857)
Summary: We found there's a special case in recent APS model where the input tensor has smaller size compared to the split size. It will be automatically truncated in split.Tensor thus we add extra condition check for split_with_sizes when do the normalization.

Test Plan:
### unit
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_aten_normalization
```

Buck UI: https://www.internalfb.com/buck2/2ecd1ef8-8efe-4245-b4c8-282c23645b3c
Test UI: https://www.internalfb.com/intern/testinfra/testrun/7599824648585787
Network: Up: 3.9GiB  Down: 9.2GiB  (reSessionID-1396c91e-0dd2-457b-a49b-a6ab1f2a7d8f)
Loading targets.   Remaining      0/5344                                                                                                              99617 dirs read, 1074949 targets declared
Analyzing targets. Remaining      0/123279                                                                                                            4988547 actions, 5966764 artifacts declared
Executing actions. Remaining      0/728058                                                                                                            209:52:59.9s exec time total
Command: test.     Finished 12466 local, 209448 remote, 1226 cache (1% hit)                                                                           42:10.5s exec time cached (0%)
Time elapsed: 26:07.6s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### E2E

before fix:
aps-afoc_apop_pt2_v0-db2fe0449a

after fix:
aps-afoc_apop_pt2_v0-755ad0cdc6

Rollback Plan:

Differential Revision: D77961394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157857
Approved by: https://github.com/anijain2305
2025-07-09 05:38:15 +00:00
effe376db0 Adding aoti_standalone config (#157731)
Summary: When `compile_standalone` is True, we set `package_cpp_only` to True as well. We raise an error if  `package_cpp_only` is explicitly set to False in config.

Test Plan:
```
buck2 run  mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r  TestAOTInductorConfig
```

Rollback Plan:

Differential Revision: D77889754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157731
Approved by: https://github.com/desertfire
2025-07-09 04:30:04 +00:00
fcbf7c749a [Windows][Inductor] normalize_path_separator compiler path (#157835)
Fixes #157673

For the call trace:
```
......

  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\common.py", line 2569, in reduction
    return self.kernel.reduction(dtype, src_dtype, reduction_type, value)
  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 2155, in reduction
    self._gen_parallel_reduction_buffers(acc, acc_type, reduction_type, init_dtype)
  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 1942, in _gen_parallel_reduction_buffers
    reduction_prefix_array(
  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 335, in reduction_prefix_array
    if cpp_builder.is_msvc_cl()
  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\cpp_builder.py", line 317, in is_msvc_cl
    return _is_msvc_cl(get_cpp_compiler())
  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\cpp_builder.py", line 240, in _is_msvc_cl
    subprocess.check_output([cpp_compiler, "/help"], stderr=subprocess.STDOUT)
torch._inductor.exc.InductorError: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
```
On non-English language pack msvc environment, compiler path has raised `utf-8` issue. I add the `normalize_path_separator` to normalize the compiler path and avoid the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157835
Approved by: https://github.com/jansel
2025-07-09 04:02:20 +00:00
8bda95228f [autograd] Avoid creating and recording event when unnecessary (#157503)
Today, we always create and record an events in two places:
1) Upon seeing the first producer, we record an event on the producer, and we wait for this event in two places: (1) when the engine goes to run the consumer, the consumer stream waits for this event. (2) prior to doing accumulation, the accumulation stream waits for this event.

2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer.

We do not actually need to record the event in the cases where the 1st producer stream is the same as the consumer and as the accumulation stream, and where the accumulation stream is the same as the consumer stream.

Removing this unnecessary create + record event should save a few us for each instance avoided.

Fixes https://github.com/pytorch/pytorch/issues/157407

----

Manual test plan:
- [x] @eqy to confirm perf is restored
- [x] Running the repro originally reported before/after the patch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503
Approved by: https://github.com/eqy
ghstack dependencies: #155715
2025-07-09 03:36:14 +00:00
8d070187e3 fix type hints for interpolation functions (#157202)
Fixes #129053

Previously interpolate had a bad signature and not correct type hints.
This fixes this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157202
Approved by: https://github.com/ezyang, https://github.com/albanD
2025-07-09 03:11:37 +00:00
c515385b0a Add Intel GPU info collection to the collect env script (#157351)
https://github.com/pytorch/pytorch/pull/137846 was mistakenly closed. Reopen a PR to land the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157351
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-07-09 03:01:41 +00:00
d6237721c0 [Build] Make PyTorch compilable with gcc-14 on ARM (#157867)
Fixes numerous ICEs in vreg allocations for SVE+BF16
```
/pytorch/aten/src/ATen/ParallelOpenMP.h:25:9: error: unrecognizable insn:
   25 | #pragma omp parallel
      |         ^~~
(insn 257 256 258 30 (set (reg:VNx8BF 449 [ bf16_vec1_217 ])
        (unspec:VNx8BF [
                (reg:VNx8BF 455)
                (reg:VNx8BF 456)
            ] UNSPEC_IORF)) "/pytorch/aten/src/ATen/cpu/vec/sve/vec_bfloat16.h":228:31 discrim 1 -1
     (nil))
during RTL pass: vregs
/pytorch/aten/src/ATen/ParallelOpenMP.h:25:9: internal compiler error: in extract_insn, at recog.cc:2812
0xd73c33 internal_error(char const*, ...)
	???:0
0xd73d1f fancy_abort(char const*, int, char const*)
	???:0
0x890053 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
	???:0
0x890087 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
	???:0
0x1379093 extract_insn(rtx_insn*)
	???:0

```
And one in RTL-expand pass while compiling Activation.cpp
```
during RTL pass: expand
In file included from /pytorch/aten/src/ATen/native/cpu/Activation.cpp:12,
                 from /pytorch/build/aten/src/ATen/native/cpu/Activation.cpp.DEFAULT.cpp:1:
/pytorch/aten/src/ATen/native/cpu/Activation.cpp: In lambda function:
/pytorch/aten/src/ATen/native/cpu/Activation.cpp:94:7: internal compiler error: Segmentation fault
   94 |       });
      |       ^
/pytorch/aten/src/ATen/Dispatch.h:201:7: note: in definition of macro 'AT_DISPATCH_SWITCH'
  201 |       __VA_ARGS__                                                           \
      |       ^~~~~~~~~~~
/pytorch/aten/src/ATen/Dispatch.h:72:3: note: in expansion of macro 'AT_PRIVATE_CASE_TYPE_USING_HINT'
   72 |   AT_PRIVATE_CASE_TYPE_USING_HINT(enum_type, scalar_t, __VA_ARGS__)
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/pytorch/aten/src/ATen/Dispatch.h:214:3: note: in expansion of macro 'AT_DISPATCH_CASE'
  214 |   AT_DISPATCH_CASE(at::ScalarType::Double, __VA_ARGS__) \
      |   ^~~~~~~~~~~~~~~~
/pytorch/aten/src/ATen/Dispatch.h:218:34: note: in expansion of macro 'AT_DISPATCH_CASE_FLOATING_TYPES'
  218 |   AT_DISPATCH_SWITCH(TYPE, NAME, AT_DISPATCH_CASE_FLOATING_TYPES(__VA_ARGS__))
      |                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/pytorch/aten/src/ATen/native/cpu/Activation.cpp:70:5: note: in expansion of macro 'AT_DISPATCH_FLOATING_TYPES'
   70 |     AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "log_sigmoid_cpu", [&] {
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~
0xd73c33 internal_error(char const*, ...)
	???:0
0x134f987 rebuild_jump_labels(rtx_insn*)
	???:0
```

Interestingly enough, attempt to compile `Unfold2d.cpp` for `-march=armv8-a+sve` (i.e. without sve+bf16) support also causes ICE
```
/pytorch/aten/src/ATen/native/cpu/Unfold2d.cpp:221:1: error: unrecognizable insn:
  221 | }
      | ^
(insn 2918 2917 2919 296 (set (reg:VNx8BI 5917)
        (unspec:VNx16BI [
                (reg:VNx8BI 5920)
                (reg:VNx8BI 5922)
                (const_vector:VNx4BI [
                        (const_int 0 [0]) repeated x8
                    ])
            ] UNSPEC_TRN1_CONV)) "/usr/include/aarch64-linux-gnu/bits/string_fortified.h":29:33 discrim 1 -1
     (expr_list:REG_EQUAL (const_vector:VNx8BI [
                (const_int 1 [0x1]) repeated x9
                (const_int 0 [0])
                (const_int 1 [0x1]) repeated x2
                (const_int 0 [0]) repeated x4
            ])
        (nil)))
during RTL pass: vregs
```

Which could be worked around by adding
```patch
diff --git a/aten/src/ATen/native/cpu/Unfold2d.cpp b/aten/src/ATen/native/cpu/Unfold2d.cpp
index 8ef0741e77af0a..59c76505dd6246 100644
--- a/aten/src/ATen/native/cpu/Unfold2d.cpp
+++ b/aten/src/ATen/native/cpu/Unfold2d.cpp
@@ -169,6 +169,10 @@ static void unfolded2d_acc_channels_last(

 /* note: due to write issues, this one cannot be parallelized as well as
  * unfolded2d_copy */
+#if defined(__GNUC__) && __GNUC__ == 14 && defined(__ARM_FEATURE_SVE)
+// Workaround for gcc-14.2.0 ICE during RTL pass: vregs when compiling for SVE
+__attribute__((optimize("no-tree-vectorize")))
+#endif
 void unfolded2d_acc_kernel(
     ScalarType dtype,
     void *finput_data,
```

Fixes https://github.com/pytorch/pytorch/issues/157842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157867
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-07-09 02:59:08 +00:00
ab8874bd26 Suppress warning when using native arch for jit loading cuda extensions. (#156923)
Previeusly, if users want to let pytorch determine the cuda arch when jit loading cuda extensions, they should left environment variable `TORCH_CUDA_ARCH_LIST` empty, but which will raise an warning. This commit add an option to set `TORCH_CUDA_ARCH_LIST=native`, to tell pytorch users want to use native cuda arch intentionally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156923
Approved by: https://github.com/ezyang
2025-07-09 02:51:20 +00:00
bc6e0661a6 Fix more H100 CI (#157829)
Follow @d4l3k 's fix in https://github.com/pytorch/pytorch/pull/157826/files. Two more fixes might be needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157829
Approved by: https://github.com/davidberard98, https://github.com/d4l3k
2025-07-09 01:28:05 +00:00
e5edd013ab [AOTI] Skip test_simple_multi_arch_embed_kernel_binary_True_cuda (#157301)
Summary: For https://github.com/pytorch/pytorch/issues/156930, still no clue on what went wrong as it is not reproducible locally, but somehow the problem seems only exists when embed_kernel_binary is True. Let's skip it for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157301
Approved by: https://github.com/yushangdi
2025-07-09 01:18:36 +00:00
75f489d37f [Break XPU][Inductor UT] Align tolerance of newly added case with cuda. (#157702)
Align tolerance with cuda for the newly added case `test_comprehensive_logcumsumexp_xpu_float16` in #157512.

Fixes #157697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157702
Approved by: https://github.com/jansel
2025-07-09 00:55:01 +00:00
3eb7084f7a [ci] fix h100-distributed (#157826)
This was broken by https://github.com/pytorch/pytorch/pull/157341

This should resolve the permission issue
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157826
Approved by: https://github.com/fduwjj, https://github.com/Skylion007, https://github.com/huydhn
2025-07-09 00:27:55 +00:00
86251eff40 Revert "Introduce AcceleratorAllocatorConfig as the common class (#149601)"
This reverts commit 55108074c0795be3b617d3b13b06794f63e1f8ca.

Reverted https://github.com/pytorch/pytorch/pull/149601 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/149601#issuecomment-3050628047))
2025-07-09 00:07:31 +00:00
1b3d69b59f Work: block_current_stream API (#156883)
This implements a new `wait_stream` API in Work that matches how `wait` works for ProcessGroupNCCL for CPU based backends such as Gloo.

The idea is to support Gloo communication overlap in FSDPv2/HSDP with minimal changes to FSDP.

There was a previous attempt to make FSDPv2 use Work.wait but given the extensive stream semantics used it doesn't play nicely. https://github.com/pytorch/pytorch/pull/148780

This uses a "Baton" CUDA kernel which spinlocks on a pinned CPU tensor waiting for it to be set.

Test plan:

```
pytest test/distributed/test_c10d_gloo.py -v -k wait_stream
pytest test/distributed/test_c10d_nccl.py -v -k wait_stream
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156883
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2025-07-08 23:55:46 +00:00
92f41ccc26 [Inductor] Support precomputed size args in the FX backend. (#157758)
# Feature
If a Triton kernel has a complicated indexing expression, Inductor may decide to precompute it on the host and pass it to the kernel as an argument. This happens in situations like broadcasts with dynamic shapes.

This PR adds support for this feature to Inductor's FX IR backend.

We generate FX IR for precomputed size args in 3 steps:
1. In `PythonWrapperCodegen`, this PR refactors the relevant code to use a `SymbolicCallArgLine` instead of raw Python strings. This stores a (symbol, expr) pair. (Prior to this PR, it was (str, expr), but changing this to a symbol makes it easier to do substitutions later on.)
2. In `WrapperFxCodegen`, keep a dict of {symbol: expr} arg defs which gets updated whenever we see a `SymbolicCallArgLine`.
3. When the FX backend sees a `KernelCallLine`, it uses this dict to replace symbolic call args with their definitions.

In the longer run, it might be desirable to emit FX nodes defining these symbolic call args. That way, we could reuse the size computation when the same kernel is called multiple times. However, I wasn't sure if there was an existing way to generate FX nodes from a sympy expression, and implementing that seemed like overkill for the present purposes.

# Test plan
Added a new CI test exercising this feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157758
Approved by: https://github.com/jansel
2025-07-08 23:22:17 +00:00
95bc3da9f8 [c10d] support dynamic shapes for all_to_all_single_autograd (#157521)
`all_to_all_single_autograd` is not an op, all the code executed until the `all_to_all_single` dispatch is visible to the compiler. This means the `all_to_all_single_autograd` wrapper code must support symints in order to be traceable with dynamic shapes.

FIXES https://github.com/pytorch/pytorch/issues/157479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157521
Approved by: https://github.com/wconstab
2025-07-08 23:19:59 +00:00
9f18482d41 [dynamo] removing string literals for weblink generation (#157820)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157820
Approved by: https://github.com/williamwen42
2025-07-08 23:08:06 +00:00
c5b46b5408 [BE] Standardize CPU capabilities name (#157809)
It's weird to call default x86 CPU capability `NO AVX`, when in reality it's something different. Also it's a bit strange to have it assigned different names on different platforms

Fixes https://github.com/pytorch/pytorch/issues/157538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157809
Approved by: https://github.com/Skylion007
2025-07-08 23:06:09 +00:00
179dcc10e4 Add sm_70 arch for linux cuda 12.8 and 12.9 builds (#157558)
Please see: https://github.com/pytorch/pytorch/issues/157517
We would like to keep Volta architectures by default for release 2.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157558
Approved by: https://github.com/Skylion007, https://github.com/Camyll, https://github.com/seemethere, https://github.com/malfet
2025-07-08 23:02:10 +00:00
7a41f20794 [inductor] Quiesce Triton compile worker pool after each dynamo compile (#156187)
For internal usages, keeping the Triton compile worker pool active for the lifetime of the process has caused some challenges, e.g., it slows down and muddies profiling due to the huge number of threads on a box: N threads = 8 ranks * 32 subprocs * M threads started by torch. Also, each subproc can use more than 1GB each. This PR adds the functionality to shutdown worker subprocs after each dynamo compile when using the SubprocPool implementation. The idea is to leave the main sidecar process running, but signal it to tear down its internal ProcessPoolExecutor when compile is finished. Restarting the ProcessPoolExecutor is relatively fast, e.g., 500ms because the ProcessPoolExecutor forks from the sidecar. Changes:
* Do not start the ProcessPoolExecutor automatically when compile_fx is imported. Instead, start the sidecar process only. The sidecar process imports torch, so is still slow to start.
* Introduce wakeup() and quiesce() calls to the implementation to start and stop the ProcessPoolExecutor.
* Add a context manager to automatically quiesce() at the end of dynamo compilation.
* Signal a wakeup() in compile_fx only when we have cuda devices.
* Add a killswitch so we can turn of quiescing.

Testing:
For correctness, the stacked change at https://github.com/pytorch/pytorch/pull/156534 enables the feature for OSS so it's exercised in CI.

For performance, because of recent compile-time variance (see https://github.com/pytorch/pytorch/issues/152566), it's pretty hard to glean whether there's a regression....

* Training: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/masnesral/210/head&lCommit=1b7315031c3bfad66a1a01700167a9ca1a2ae5f1&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801
* Inference: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/masnesral/210/head&lCommit=1b7315031c3bfad66a1a01700167a9ca1a2ae5f1&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801

The wins (mostly for inference) don't make sense, but I'm also skeptical of the losses (mostly for training). I can't repro any of the slowdowns locally. Furthermore, check out the benchmarking results for the stacked diff, which actually enables the quiescing functionality for OSS. That should only slow down compile since there can only be overhead to stop and start the workers. But the results are somehow better:

* Training: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/masnesral/214/head&lCommit=41943253882a019b8ceafcd2bf4cd6acbe0cbca9&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801
* Inference: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/masnesral/214/head&lCommit=41943253882a019b8ceafcd2bf4cd6acbe0cbca9&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156187
Approved by: https://github.com/aorenste, https://github.com/jansel
2025-07-08 22:53:13 +00:00
178fe7aa98 [dynamo][fsdp] Consistent behavior of int attributes (#157262)
Reimpl of https://github.com/pytorch/pytorch/pull/150954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157262
Approved by: https://github.com/bdhirsh
2025-07-08 22:11:33 +00:00
2e14069081 Revert "[DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#157216)"
This reverts commit 777eca9f16aeecd7c362a235cf25e6b8e6eda57f.

Reverted https://github.com/pytorch/pytorch/pull/157216 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a distributed test in trunk ([comment](https://github.com/pytorch/pytorch/pull/157216#issuecomment-3050258896))
2025-07-08 20:48:51 +00:00
391473cca0 [export] Fix lift constants bug (#157719)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157719
Approved by: https://github.com/yushangdi
2025-07-08 20:33:53 +00:00
b9dc2fa4f7 Add legacy note to autograd.profiler doc. (#157459)
Via google search I got to `torch.autograd.profiler` and implemented my code with it. Only to be taken by surprise finding `torch.profile.profiler`, which has a note saying the autograd one is legacy.

This just adds such note to `autograd.profiler` to avoid this confusion and waste of time to future people in my situation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157459
Approved by: https://github.com/sraikund16
2025-07-08 20:33:23 +00:00
a73d9e0aec Fix einsum strategy shard dim > ndim (#157593)
Previously we didn't constrain Shard dim to be <= the tensor's ndim. This cause an invalid strategy like `(RR, RS(2)) -> RS(2),` for einsum `bmk,kn->bmn` on the 2d mesh.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157593
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2025-07-08 20:27:17 +00:00
06b3265cb1 Increase nightly C++ docs build timeout to 6h (#157759)
This job has been timing out since May 261897734a/1, maybe it's time to figure out if this makes sense.

Issues https://github.com/pytorch/pytorch/issues/157763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157759
Approved by: https://github.com/malfet
2025-07-08 19:28:48 +00:00
dea4864ce0 HF loads dcp - don't do a full deserialize on every file (#157715)
Summary: These changes in D76442012 got reverted after the PR landed due to aps_models/ads/launchers/pearl/tests/ne/e2e_deterministic_tests:pearl_e2e_ne_tests failing with `Config not loaded due to no timely response from configerator. Likely configerator_proxy or falcon_proxy are not healthy`, but that test failing is definitely transient and unrelated to my changes, so re-creating the diff

Test Plan:
ensure tests pass

Rollback Plan:

Differential Revision: D77871099

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157715
Approved by: https://github.com/meetv18
2025-07-08 18:13:27 +00:00
4f5be56612 [Pyrefly][Refactor] Replace dict() calls with literal dict syntax for improved readability (#157735)
There are 31 places that I spotted which construct literal dictionaries.

This PR refactors dictionary construction by replacing` dict(...) `calls with `literal {...}` syntax where applicable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157735
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-07-08 18:10:33 +00:00
0f31445139 Add stack trace of exception to MultiProcContinousTest (#157589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157589
Approved by: https://github.com/Skylion007
2025-07-08 17:54:35 +00:00
5b4e0255d7 Check FakeScriptObject in _resolve_name_collision (#157736)
Summary:
Fix https://github.com/pytorch/pytorch/issues/157401

torch.equal cannot handle FakeScriptObject inputs.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r  test_aoti_torchbind_name_collision
```

Rollback Plan:

Differential Revision: D77894081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157736
Approved by: https://github.com/angelayi
2025-07-08 17:51:46 +00:00
44d0800d60 [Intel GPU] Set higher tolerance for squeezenet1_1 with bf16 (#156920)
We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device.
This pull request preserves the original tolerance threshold for CUDA/CPU devices and introduces a new key, higher_bf16_xpu, which only affects the XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156920
Approved by: https://github.com/soulitzer
2025-07-08 17:49:54 +00:00
a5c61eb78d [MPS][BE] Delete as_strided_tensorimpl_mps (#157772)
Because it's just copy-n-paste of `as_strided_tensorimpl` with call to `updateTensorBaseShape`, which is not called/used anywhere else.

Fixes https://github.com/pytorch/pytorch/issues/152701
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157772
Approved by: https://github.com/Skylion007
2025-07-08 17:02:36 +00:00
bbe681ed51 [cutlass backend][BE][ez] Make matmul layouts be row x column (#156656)
Differential Revision: [D77184232](https://our.internmc.facebook.com/intern/diff/D77184232/)

Motivation:
* This is the case we care the most.
* We are caching the kernels for this row x column layout. So testing on them can potentially make ci run faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156656
Approved by: https://github.com/ColinPeppler
2025-07-08 16:57:33 +00:00
ed911747c2 [dtensor] add support for fused optimizer with parameters across multiple meshes (#157682)
We are seeing more and more use cases where parameters in a model (under the same optimizer group) are put on different meshes. E.g.
- when FSDP and TP are both applied, some parameters are sharded only on the FSDP mesh but not TP mesh (see https://github.com/pytorch/pytorch/pull/153268).
- in [dp2ep Expert Parallel](https://github.com/pytorch/torchtitan/pull/1324), the routed experts are sharded on the (global FSDP \ EP) mesh for smaller FSDP and on the EP mesh for EP, whereas other params are sharded on the global FSDP mesh for FSDP.

This PR is, in some sense, a continuation of https://github.com/pytorch/pytorch/pull/147869 to tackle the problem when fused optimizers are used. In such cases, the [`fused_adam`](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L15786) / `fused_adamw` has a scalar tensor arg `state_steps` which gets automatically cast to DTensor on the default [`compute_mesh`](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_dispatch.py#L350) (one of the multiple meshes), even though the it could correspond to different meshes.

To avoid hitting the cross-mesh propagation exception in `common_pointwise_strategy` and followup redistribute problems, we manually set the target mesh and placements to be the same as input mesh and placements, so that no redistribute will be triggered. This also helps bypass the situation where [`generate_redistribute_costs`](https://github.com/pytorch/pytorch/pull/157682/files#diff-eea32a36dd2d4e58307bc5229402e48048b2ecaef64a7c085495fba1ee10ac89R597) returns infinite cost due to cross mesh redistribute.

Moreover, this PR has minimal scope (restricted to the `fused_ops`) and doesn't need to modify other files such as `_sharding_prop.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157682
Approved by: https://github.com/wanchaol
2025-07-08 15:58:30 +00:00
777eca9f16 [DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#157216)
This is to unblock "dp2ep" Expert Parallel + TP integration in torchtitan https://github.com/pytorch/torchtitan/pull/1324.

It does two things:
1. Slightly modifies the glue code for FSDP/HSDP + TP to work with FSDP/HSDP + EP and FSDP/HSDP + EP + TP. I kept the name `FSDPParam._tp_spec` to make the change minimal. We can consider renaming it in the future if it confuses people, but I heard @wanchaol has a plan to rewrite DTensor strided sharding entirely.
2. Lifts the check of `_validate_tp_mesh_dim` for `torch.distributed.tensor.parallel.parallelize_module`, as in EP or EP+TP this check is too strict. In particular it assumes a DeviceMesh must have `mesh_dim_names` which is not always true. I'm also removing the file `torch/distributed/tensor/parallel/_utils.py` it belongs entirely, as the other check `_deprecate_warnings`, added two years ago, is not used any more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157216
Approved by: https://github.com/wanchaol, https://github.com/weifengpy
2025-07-08 15:57:37 +00:00
476874b37f [BE]: Update NCCL to 2.27.5 (#157108)
Update NCCL to 2.27.5. Minor version, improves Blackwell, Symmem FP8 support, and fixes a bug with MNVVL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157108
Approved by: https://github.com/atalman
2025-07-08 15:40:54 +00:00
5dc75f72d4 Simplify the base classes of _PyFutureMeta (#157757)
Summary:

I'm fairly sure the use of a custom metaclass is a holdover from pre-3.7 where Generic used a custom metaclass so we had to use multiple inheritance to avoid import-time failures.

At this point, `type(Generic)` is just `type` so it isn't needed, and we will get the least metaclass from our base classes, which means the `type(torch._C.Future)` isn't needed either, it will happen automatically just by inheritance.

Test Plan:

I'm fairly confident from local testing that this should be a no-op.

But also, Pytorch CI should give us pretty strong signal that this change doesn't break anything in case there's some edge case I missed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157757
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-07-08 15:39:56 +00:00
f88d7a7a34 [BE] Do not add . after troubleshooting_url (#157753)
As it gets included into auto-hrefed URLs in say github logs to point to non existing location

For example from https://github.com/pytorch/pytorch/actions/runs/16130448756/job/45517004735?pr=157749#step:18:27
> W0708 00:23:20.150000 67082 torch/_dynamo/convert_frame.py:1047] [0/8] To diagnose recompilation issues, see [https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.](https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157753
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-07-08 15:38:24 +00:00
98bb0c0e78 [CI][MacOS] Add VENV_PATH to search path (#157749)
When building/testing PyTorch on MacOS

Shoudl prevent some flakiness when conda environment overtakes CI/CD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157749
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-07-08 15:37:45 +00:00
76fe88fa56 Revert "Cleanup leftover miniconda brew installation (#156898)"
This reverts commit 214e2959dcdbf91a999d5c0a5d40c91e4442e8c5.

Reverted https://github.com/pytorch/pytorch/pull/156898 on behalf of https://github.com/malfet due to Breaks TorchVision builds ([comment](https://github.com/pytorch/pytorch/pull/156898#issuecomment-3049281232))
2025-07-08 14:54:42 +00:00
86670b39fa [PT2][memory] mutation size correctness (#157562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157562
Approved by: https://github.com/yf225
2025-07-08 14:02:20 +00:00
c78bbdf410 [BE] Update xpu driver repo for CD used almalinux 8.10 (#157356)
XPU CD docker image built on `quay.io/pypa/manylinux_2_28_x86_64`, which based on almalinux 8.10
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157356
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-07-08 13:59:46 +00:00
b9afdd9bcc Add flag to fx.passes.split_module to normalize input names (#157733)
This is useful for vLLM, which runs AOTAutograd directly on graphs after
they have been split.

I created a new flag for this instead of reusing
`keep_original_node_name` (please let me know if you think I should reuse this).
The reasoning is:
- The names of the placeholder nodes is different from the targets of
  the placehoder nodes. The targets are the actual input names.
- Backwards compatibility: this API has been out for ~4 years, it
  looks public, and it has extensive public use. For example, this change
  would actually be BC-breaking to vLLM (they rely on the subgraph input
  names being different at the moment).

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157733
Approved by: https://github.com/ezyang
2025-07-08 13:47:24 +00:00
cyy
7381c77724 Use CMake wholearchive group (#156393)
Use CMake wholearchive group to simplify code. It may also support more OSes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156393
Approved by: https://github.com/ezyang
2025-07-08 12:20:29 +00:00
ab655816b8 Deprecate DataLoader pin_memory_device param (#146821)
Following [ #131858 suggestion](https://github.com/pytorch/pytorch/pull/131858#pullrequestreview-2517760602) to optimize DataLoader code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146821
Approved by: https://github.com/divyanshk

Co-authored-by: Divyansh Khanna <divyanshkhanna09@gmail.com>
2025-07-08 09:24:53 +00:00
41e8b826d0 S390x update test marks (#157541)
Update s390x test marks

test_logs_out from test/dynamo/test_logging.py is updated
and no longer fails on s390x.

test_qengine from test/test_torch.py doesn't work on s390x:
no QEngine is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157541
Approved by: https://github.com/huydhn
2025-07-08 09:08:33 +00:00
5430990bd7 Added philox based RNG context for HPU device in Dtensor scenarios (#156581)
In this PR, we are enabling `HPU` device-specific function calls for random operations. These calls will manage the setting and unsetting of the `context of Random Number Generator`.
While HPU devices typically utilize a `Mersenne-based RNG`, Dtensor-specific random operations employ an `offset-based (Philox) RNG tracker` which is specifically integrated with `CUDA` in scope.
To integrate a similar offset-based RNG tracker within the `HPU backend`, a backend-specific device handle function is necessary to identify the execution context of these random operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156581
Approved by: https://github.com/jeromean, https://github.com/wanchaol
2025-07-08 08:50:24 +00:00
55108074c0 Introduce AcceleratorAllocatorConfig as the common class (#149601)
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The another name `AllocatorConfig` is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.

# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env variable config key-value pair

## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format: `key:value`. Use square brackets `[]` for list values Example: `key1:123, key2:[val1,val2]`

## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
2025-07-08 08:40:47 +00:00
84b77ec128 [BE] add a minimal linter to check pyproject.toml consistency (#156017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156017
Approved by: https://github.com/ezyang
2025-07-08 08:17:36 +00:00
8134684d44 [inductor collectives] sink waits iterative (#157708)
Differential Revision: [D77861763](https://our.internmc.facebook.com/intern/diff/D77861763)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157708
Approved by: https://github.com/wconstab
ghstack dependencies: #157706
2025-07-08 07:17:10 +00:00
2af7c67e48 Mitigate some flaky tests in trunk (#157756)
(not really fix these issues, but we should be able to close them. This also allows CI from the PR to test them)

Fixes https://github.com/pytorch/pytorch/issues/156579
Fixes https://github.com/pytorch/pytorch/issues/156580
Fixes https://github.com/pytorch/pytorch/issues/126867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157756
Approved by: https://github.com/clee2000
2025-07-08 07:07:11 +00:00
38757d94f1 Enable target-determination (TD) for ROCm CI (#156545)
Target determination sorts the tests in a PR CI run based on heuristics about which tests are more relevant to the PR's changes. This can help provide faster CI signal as well as help alleviate capacity concerns as job durations should decrease due to catching failures earlier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156545
Approved by: https://github.com/jeffdaily, https://github.com/clee2000
2025-07-08 06:27:40 +00:00
1b58e7adab fix storage use_count (#157694)
# Motivation
https://github.com/pytorch/pytorch/pull/155451 decoupled `torch._C._storage_Use_Count` from CUDA and introduced a corresponding unit test:
815545f2dd/test/test_torch.py (L257-L262)
However, this test fails when PyTorch is built with debug assertions enabled. @clee2000 disabled this UT in https://github.com/pytorch/pytorch/pull/156731. The root cause is that `_cdata` is obtained from an `intrusive_ptr`, not a `weak_intrusive_ptr`. As a result, calling `c10::weak_intrusive_ptr::use_count` on it triggers the internal assertion:
815545f2dd/c10/util/intrusive_ptr.h (L912-L917)
For example:
```python
a = torch.randn(10, device=device) # refcount=1, weakcount=1
prev_cf = torch._C._storage_Use_Count(a.untyped_storage()._cdata) # violate the assertation
```
This violates the expected invariant inside `weak_intrusive_ptr::use_count`, which assumes the pointer was originally constructed from a valid `weak_intrusive_ptr`. Actually, `storage_impl` is obtained from an `intrusive_ptr`.
815545f2dd/torch/csrc/Module.cpp (L2105-L2109)

# Solution
Use `c10::intrusive_ptr::use_count` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157694
Approved by: https://github.com/albanD
2025-07-08 05:53:12 +00:00
8186af5a26 [BE][Easy] set end-of-line for .bat file to CRLF in .editorconfig (#156032)
See also:

54976bca10/.gitattributes (L1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156032
Approved by: https://github.com/seemethere, https://github.com/ezyang
2025-07-08 05:40:57 +00:00
bdacf08b86 [BE][Easy] add .editorconfig setting for C/C++/CUDA/ObjC (#157692)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157692
Approved by: https://github.com/ezyang
2025-07-08 05:37:15 +00:00
987314aa96 Split batch-num-heads grid dim between y and z (#157745)
for #157018

doesn't totally fix the problem but should help alot

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157745
Approved by: https://github.com/Chillee
2025-07-08 05:17:43 +00:00
39a8f66d59 [BE] Use simdgroup_size constexpr (#157751)
Instead of every shader defining it separately, move it to `c10/metal/common.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157751
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157746
2025-07-08 03:46:20 +00:00
0b73f7c871 [EZ][BE] Move array def to c10/metal/common.h (#157746)
And use proper type aliasing instead of weird _ARRAY_NS

Also use `uint64_t` instead of `ulong`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157746
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-07-08 03:46:20 +00:00
a4c7e7f983 [PowerPC]: Fixed build issue that occur because of datatype f8 enablement for onednn in qlinear and prepack (#157469)
Getting the build issue because of enablement of data type fp8 for onednn in qlinear and qlinear_prepack file after this commit c2185dc4a5626848df37cad214b73d5ae7dd4f17

Currrently cpuinfo is disable for power system because of that  it is giving below error.

**Error:**
 ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope

Made a required changes and now build issue got fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157469
Approved by: https://github.com/malfet
2025-07-08 03:45:06 +00:00
cyy
3ee8828c87 [1/N] Don't use CUDA.cmake module (#157188)
Small changes before removing CUDA.cmake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157188
Approved by: https://github.com/ezyang
2025-07-08 03:05:35 +00:00
f56bfb3030 [CPU] Fix memory access for sbgemm bf16 (#156585)
Fixes #156022.

1. The original dtype conversion overwrites the whole `n_*ldc_` instead of `n_*m_` with stride `ldc_`, causing the potential memory issue.
2. Fix the None value issue in attention backward UT, as the sbgemm bf16 could be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156585
Approved by: https://github.com/mingfeima, https://github.com/aditew01, https://github.com/ezyang
2025-07-08 02:36:28 +00:00
12f9942b10 Fix slice op redistribute_cost compute (#157178)
For slice op backward, my understanding is that the `redistribute_cost` attribute is incorrectly assigned to previous placement strategy: 0decd966af/torch/distributed/tensor/_ops/_tensor_ops.py (L399-L400)

The mistake is hard to be tested since we didn't enforce the `redistribute_cost` for `strategy.strategies` with size one: 2815ade9a8/torch/distributed/tensor/_sharding_prop.py (L491-L499)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157178
Approved by: https://github.com/XilunWu
2025-07-08 02:28:59 +00:00
c5589074e6 [SymmMem] find_path does not search /usr/local/lib (#157695)
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-08 01:21:59 +00:00
30a1cc11a4 Revert "[CI][MacOS] Add VENV_PATH to search path (#157749)"
This reverts commit 85111cd165f108ffabb4a90083d59d7a867ebd9f.

Reverted https://github.com/pytorch/pytorch/pull/157749 on behalf of https://github.com/huydhn due to It looks like lint was not green, so revert and reland I guess ([comment](https://github.com/pytorch/pytorch/pull/157749#issuecomment-3047032909))
2025-07-08 01:18:16 +00:00
19a01382bc Revert "[SymmMem] find_path does not search /usr/local/lib (#157695)"
This reverts commit 3effe0c293219b00a0eae7e139fe2d9aed84bc03.

Reverted https://github.com/pytorch/pytorch/pull/157695 on behalf of https://github.com/kwen2501 due to Changing it to be landable on 2.8 branch ([comment](https://github.com/pytorch/pytorch/pull/157695#issuecomment-3047020152))
2025-07-08 01:12:01 +00:00
df72078fe1 [dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/torch.py (#157344)
Fixes part of #147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157344
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-07-08 00:46:56 +00:00
85111cd165 [CI][MacOS] Add VENV_PATH to search path (#157749)
When building/testing PyTorch on MacOS

Shoudl prevent some flakiness when conda environment overtakes CI/CD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157749
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-07-08 00:38:37 +00:00
edf7bb4f51 Fix unbound local when an error occurs before pool is initialized (#156750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156750
Approved by: https://github.com/jamesjwu
2025-07-08 00:28:21 +00:00
bbb930aba2 Bump urllib3 from 2.2.2 to 2.5.0 in /tools/build/bazel (#156390)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.2 to 2.5.0.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.2...2.5.0)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-version: 2.5.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-07 17:13:21 -07:00
60b41de0ca remove allow-untyped-defs from torch/ao/nn/quantized/modules/rnn.py (#157234)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157234
Approved by: https://github.com/jingsh
ghstack dependencies: #157231, #157232
2025-07-08 00:11:52 +00:00
e38a335d7f remove allow-untyped-defs from torch/backends/cusparselt/__init__.py (#157232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157232
Approved by: https://github.com/jingsh
ghstack dependencies: #157231
2025-07-08 00:11:52 +00:00
9d8cf24b3b remove allow-untyped-defs from torch/_classes.py (#157231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157231
Approved by: https://github.com/jingsh
2025-07-08 00:11:52 +00:00
be56a8d7ac Automatically load and save dynamo entries via caching_precompile (#155913)
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.

When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile

You can also use PrecompileContext.serialize() to manually serialize a full object.

I've added unit tests to exhibit this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
2025-07-07 23:57:17 +00:00
3effe0c293 [SymmMem] find_path does not search /usr/local/lib (#157695)
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-07 23:16:45 +00:00
2fde2090d0 [inductor_collectives] Make reorder_collectives_preserve_peak pass grouping nodes (#157706)
Differential Revision: [D77861765](https://our.internmc.facebook.com/intern/diff/D77861765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157706
Approved by: https://github.com/wconstab
2025-07-07 23:13:58 +00:00
5d8d126249 Fix einops x torch.compile interaction (#157600)
Fixes https://github.com/pytorch/pytorch/issues/157451

If/when einops releases a version greater than 0.8.1, it will just break
(without this patch).

The history is:
- Between 2.6 and 2.7, we tried to delete the einops import (#142847)
- That didn't work so well, so we applied a hotfix in 2.7.1. (#153925)
- The hotfix wasn't completely correct (0.8.1 is the latest version of
  einops, so the condition in the hotfix just always evaluates to True!)
- It turns out we didn't need to delete the einops import. We already
  do not eagerly import einops.
- I reverted the code back to the state it was in in 2.6.
  https://github.com/pytorch/pytorch/blob/release/2.6/torch/_dynamo/decorators.py

Test Plan:
- We have testing in CI for einops 0.6.1, 0.7.0, and 0.8.1. Wait for CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157600
Approved by: https://github.com/guilhermeleobas, https://github.com/anijain2305
ghstack dependencies: #157416
2025-07-07 23:04:02 +00:00
378c121d5e Remove unnecessary warnings during the ATen compilation process. (#157703)
Comparing uint32_t(num_threads()) with int(kCUDABlockReduceMaxThreads) always results in a compilation warning. Just change the return type of kCUDABlockReduceMaxThreads to uint32_t to avoid it.
Fixes https://github.com/pytorch/pytorch/issues/157701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157703
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-07-07 22:49:38 +00:00
7e83d50845 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```python
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-07-07 22:13:34 +00:00
6f05d58f2b [AOTI] Split aoti_runtime/model.h to prepare for model static linking (#157592)
Summary:
Prepare for https://github.com/pytorch/pytorch/pull/157129.

We split the file so we can re-use `model.h` part for codegen a separate header for each model in static linkage.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77761249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157592
Approved by: https://github.com/desertfire
2025-07-07 22:13:22 +00:00
a7eb153bba [MemoryViz] Add file selector button (#157647)
In some linux desktop environments like mine, there is no drag and dropping of files. Which made the memoryviz impossible for me to use. So this adds a file selector button as an alternative. Tested that it works locally, and also works with multiple files.

![image](https://github.com/user-attachments/assets/dcb61d68-6c6f-42f6-a075-1783d747d1b0)

And the button remains when something is loaded, to allow loading something else, but it moves out of the way to save vertical space:

![image](https://github.com/user-attachments/assets/4239d13c-3d80-4790-9696-0906c75e14e6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157647
Approved by: https://github.com/sraikund16
2025-07-07 22:03:51 +00:00
ed6df0e324 correctly import torch.version (#157584)
The structure is

```
torch/
  __init__.py
  version.py
```

When we import torch, only `torch/__init__.py` is executed by default.

The submodules like `version.py` are not automatically imported or attached to the torch module.

So without anything in `__init__.py`, `torch.version` may not be found. So in this PR, we make the import explicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157584
Approved by: https://github.com/ezyang
2025-07-07 21:43:35 +00:00
5c79a55e7e [oss] Add version to metadata (#155343)
Summary: We want to add versioning to DCP to the metadata so that whenever planner logic changes, we can use the version on save to determine how to load the data

Test Plan:
added a test

Rollback Plan:

Differential Revision: D76135887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155343
Approved by: https://github.com/teja-rao
2025-07-07 20:57:30 +00:00
3d06ff82a8 [release] Triton pin update to 3.4 (#156664)
Triton pin update issue: https://github.com/pytorch/pytorch/issues/154206
Please see post: https://dev-discuss.pytorch.org/t/2-8-final-rc-release-postponed-by-a-week/3101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156664
Approved by: https://github.com/davidberard98
2025-07-07 20:52:25 +00:00
2efa5eaa65 swa avoid stream sync (#157705)
Summary:
When AveragedModel updates_parameters it calls self.n_averaged == 0 for each parameter, where n_averated is a buffer on GPU. Moving check before the cycle to call sync once

It improves update_parameter from 74ms to 57ms ~22% improvement
{F1980011097}
{F1980011111}

Test Plan:
CI

Rollback Plan:

Differential Revision: D77723025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157705
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/janeyx99
2025-07-07 20:47:35 +00:00
c2510fcd86 Fix index_put propagate strategy arg unpack error (#157671)
Fix `index_put` propagate strategy didn't consider optional arg `accumulate`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157671
Approved by: https://github.com/fmassa, https://github.com/wconstab
2025-07-07 20:18:18 +00:00
510c398a4f Add max_pool3d backward pass for MPS (#157498)
Note on backward precision over fp16:

A float16 number has 10 bits of mantissa, 5 bits of exponent, and 1 bit for the sign. If the sign bit is positive, then with a mantissa $m$ and exponent $e$ represented in base 10, the number that the float16 format represents is $(1 + m / 1024)  \exp2(e)$. ([source](https://en.wikipedia.org/wiki/Half-precision_floating-point_format))

Consider adding two numbers $a$ and $b$ which have arbitrary mantissas, and say their exponents are $e_a = 1$ (so $2 \le a \lt 4$) and $e_b=-3$ (so $0.175 \le b \lt 0.25$). Assume that the result has the same exponent as $a$. Since the exponents differ by 4, we'll effectively need to truncate the 4 rightmost bits of $b$'s mantissa, which would introduce a maximum error on the order of $(2^4 / 1024)  \exp2(-3) \approx 0.002$.

The error is nearly the same if $e_b = -2$ (so $0.25 \le b \lt 0.5$), where the 3 rightmost bits are truncated, giving a maximum error on the order of $(2^3 / 1024)  \exp2(-2) \approx 0.002$. Same for $e_b=-1$.

So if we're adding up nine different numbers that all have exponents -3, -2, or -1, and they sum to a number with exponent 1, then we would expect a maximum error of several times greater than 0.002. In my comments above, summing those particular nine numbers in different ways gave results that ranged between 3.1816 and 3.1758, a difference of $0.0058 \approx 2.9  * 0.002$.

That's within the acceptable bounds, and we can safely just increase the error tolerance used in test_output_grad_match for the case of max_pool3d_backward with float16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157498
Approved by: https://github.com/malfet
2025-07-07 19:46:44 +00:00
63a96eaeb8 [DeviceMesh] Add error when users try to slice non contiguous flattened dim submesh (#157523)
With https://github.com/pytorch/pytorch/issues/157393, we want to first throw a clearer error for users and then fix it in the long-term

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157523
Approved by: https://github.com/fegin
ghstack dependencies: #157501
2025-07-07 19:43:51 +00:00
2b8d3b1b2b [DeviceMesh] Use user set backend and pg option even for the global mesh (#157501)
Short term solution to https://github.com/pytorch/pytorch/issues/156593.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157501
Approved by: https://github.com/fegin, https://github.com/lw
2025-07-07 19:43:51 +00:00
bf1ebe0531 Fix typo: 'paramter' → 'parameter' in dynamo variable comment (#157651)
This PR fixes a minor typo in a comment in `torch/_dynamo/variables/torch.py`, changing 'paramter' to the correct spelling 'parameter'.

These small but meaningful changes help improve code readability and maintain the overall quality of the codebase.

Thanks for your time and review!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157651
Approved by: https://github.com/Skylion007
2025-07-07 19:42:44 +00:00
433a247102 [logging] [redo] dynamo_timed for CachingAutotuner.coordinate_descent_tuning (#156840)
Summary: This is a redo of https://github.com/pytorch/pytorch/pull/156517, but with pt2_compile_events logging disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156840
Approved by: https://github.com/jamesjwu
2025-07-07 19:09:48 +00:00
8a47f9d03b [CI] Fix xpu ci test sccache issue (#157693)
With PR #157341 land, it broken the PXU CI test on sccache which has been disabled by #143851. Re-disable it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157693
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-07-07 18:29:38 +00:00
9e5f4a844c [FSDP2] Fix issue with set_reduce_scatter_divide_factor errors and MixedPrecisionPolicy (#155964)
fix https://github.com/pytorch/pytorch/issues/155223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155964
Approved by: https://github.com/weifengpy
2025-07-07 17:09:29 +00:00
cyy
7c1f627828 Fix 'dllimport attribute ignored on inline function' (#157670)
There are lots of warnings in builds:
```
 2025-07-05T16:59:46.9208806Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5043,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209030Z  5043 | inline at::Tensor & Tensor::less_(const at::Scalar & other) const {
2025-07-05T16:59:46.9209104Z       |                             ^
2025-07-05T16:59:46.9209671Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5048,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209860Z  5048 | inline at::Tensor & Tensor::less_(const at::Tensor & other) const
```
This PR has fixed them and turned the warning into an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157670
Approved by: https://github.com/albanD
2025-07-07 16:57:48 +00:00
b3b4d28f4c [submodule][cutlass] Update pin to b995f93 v4.0.0 (#157376)
@Skylion007 seems afk. https://github.com/pytorch/pytorch/pull/153541

https://github.com/NVIDIA/cutlass/releases/tag/v4.0.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157376
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-07-07 16:55:47 +00:00
ae1094b72b Revert "[WIP] Automatically load and save dynamo entries via caching_precompile (#155913)"
This reverts commit e466dab164d9236bfe5817ec8e4d24c7b9d3e392.

Reverted https://github.com/pytorch/pytorch/pull/155913 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/155913#issuecomment-3045914878))
2025-07-07 16:53:35 +00:00
eda0a9cc90 [list] Add list.__delitem__ (#156339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156339
Approved by: https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242, #156270, #156271
2025-07-07 14:51:32 +00:00
d74ccf4ffe [list] Add list.__mul__ and list.__imul__ (#156271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156271
Approved by: https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242, #156270
2025-07-07 14:51:32 +00:00
689fba032d Implement list.__add__ and list.__iadd__ (#156270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156270
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242
2025-07-07 14:51:25 +00:00
c1d69d5dd5 [list] Implement list.remove (#156242)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156242
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #153969, #156148
2025-07-07 14:51:17 +00:00
e49acfc5c5 [list] Raise exception in invalid list method call (#156148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156148
Approved by: https://github.com/zou3519
ghstack dependencies: #153969
2025-07-07 14:51:10 +00:00
034e996d37 [list] Implement list.count (#153969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153969
Approved by: https://github.com/zou3519, https://github.com/XuehaiPan
2025-07-07 14:51:03 +00:00
16c3b4143b [gtest][listing] Enable gtest json listing for the fbcode/caffe2 project (#156816)
***SUMMARY***

The main function in this tests overrides that of the Gtest framework which contains it's `RUN_ALL_TESTS()` function. The main function in this test is called conditionally when conditions apply, in this case, when the C10_MOBILE directive is provided. This is wrong as we always want to call the `RUN_ALL_TEST()` function.

In this PR, we only make the test suite available for cases that apply, i.e if the C10_MOBILE directive exist which represents the caching allocator and is only exposed on mobile

***TEST PLAN***

This tests should run in modes where it applies which should be covered in the CI run.

Below shows a sample run in the dev-nosan mode which do not have the cache allocator

BEFORE
```
buck test fbcode//caffe2:cpu_caching_allocator_test
Discovered 0. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0
⚠ Listing failed: caffe2:cpu_caching_allocator_test
Listing tests failed with error:
Failed to read from /data/users/ysuleiman/fbsource/buck-out/v2/test/buck-out/v2/test_discovery/fbcode/6dcc55a61c1b90b3/default/tpx_execution_dir/gtest_output_file.json. Listing process stdout: , stderr:
```

AFTER
```
buck test '@fbcode//mode/dev-nosan' fbcode//caffe2:cpu_caching_allocator_test
Analyzing targets. Remaining      0/46242                                                                                1871690 actions, 2251668 artifacts declared
Executing actions. Remaining      0/257870                                                                               83:28:24.4s exec time total
Command: test.     Finished 10 remote, 112314 cache (99% hit)                                                            83:22:43.5s exec time cached (99%)
Time elapsed: 2:57.7s
Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0
NO TESTS RAN
```

Rollback Plan:
steps:
  - manual.note:
      content: Revert this diff

Reviewed By: patskovn

Differential Revision: D77229077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156816
Approved by: https://github.com/kimishpatel
2025-07-07 14:16:43 +00:00
54a4d34d10 [fbcode] switch to cutlass-4 (#157579)
Summary: Update cutlass version to 4. For most use cases.

Test Plan:
testing in progress

Rollback Plan:

Differential Revision: D77605011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157579
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-07-07 14:12:33 +00:00
78684e27ac [xla hash update] update the pinned xla hash (#156584)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156584
Approved by: https://github.com/pytorchbot
2025-07-07 12:09:20 +00:00
40e39ae21f Update slow tests (#157696)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157696
Approved by: https://github.com/pytorchbot
2025-07-07 12:09:06 +00:00
e466dab164 [WIP] Automatically load and save dynamo entries via caching_precompile (#155913)
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.

When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile

You can also use PrecompileContext.serialize() to manually serialize a full object.

I've added unit tests to exhibit this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
2025-07-07 11:56:30 +00:00
d27d36136c Don't try installing missing cuda dependencies on s390x (#157540)
Don't try installing missing cuda dependencies on s390x

Fixes #157409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157540
Approved by: https://github.com/seemethere, https://github.com/huydhn
2025-07-07 09:16:38 +00:00
815545f2dd [inductor] enable bf32 for mkldnn linear pointwise/binary in inductor (#127294)
When `torch.backends.mkldnn.matmul.fp32_precision=='bf16'`, we also enabled mkldnn linear in inductor path and allow to run with bf16 computation data type.

Testplan:
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_unary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_fp32
python test/inductor/test_mkldnn_pattern_matcher.py -k test_multi_linear_share_same_input
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127294
Approved by: https://github.com/jgong5, https://github.com/jansel

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-07 06:03:41 +00:00
d26ca5de05 Support transpose and pack for bit8 (#156065)
To be used by CPU INT8 SDPA in torchao. https://github.com/pytorch/ao/pull/2380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156065
Approved by: https://github.com/mingfeima, https://github.com/ezyang
2025-07-07 01:40:47 +00:00
Lei
2022588295 Fix: Ensure writeback handles NO_SHARD correctly by flattening tensors before copying (#154369)
Fixes #151223

Because FSDP stores original parameters as views into a flattened tensor, changing the flattened parameter’s tensor directly can desynchronize the views. With the NO_SHARD strategy this caused a shape mismatch error when writing back modified parameters.

Ensured writeback handles NO_SHARD correctly by flattening tensors before copying. The logic now flattens the source parameter or gradient when the strategy is unsharded to maintain the expected 1‑D shape for writeback operations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154369
Approved by: https://github.com/weifengpy
2025-07-06 09:20:31 +00:00
02715d0876 [BE][5/6] fix typos in test/ (test/dynamo/) (#157639)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157639
Approved by: https://github.com/yewentao256, https://github.com/jansel
ghstack dependencies: #157638
2025-07-06 06:34:25 +00:00
17687eb792 [BE][4/6] fix typos in test/ (test/inductor/) (#157638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157638
Approved by: https://github.com/yewentao256, https://github.com/jansel
2025-07-06 06:34:25 +00:00
7cda4017dd Fix torch.utils.cpp_extension parser for clang version 20.1.7+libcxx (#157666)
When CC and CXX compiler is set to clang, and clang was compiled with libc++, compilation of torchvision fails with:

```
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 585, in build_extensions
    compiler_name, compiler_version = self._check_abi()
                                      ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1034, in _check_abi
    _, version = get_compiler_abi_compatibility_and_version(compiler)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 449, in get_compiler_abi_compatibility_and_version
    if tuple(map(int, version)) >= minimum_required_version:
       ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '7+libcxx'
```

Compiler identification is a valid semantic version:
```
$ clang -dumpfullversion -dumpversion
20.1.7+libcxx
```

After adjusting parser of version, clang is able to compile extensions successfully.

Fixes #157665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157666
Approved by: https://github.com/msaroufim
2025-07-06 01:35:00 +00:00
3e56a9cdfb More testing of Python arithmetic operators between tensors and scalars (see 157266) (#157632)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157632
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-07-05 17:48:27 +00:00
ee9ac36c23 Fixing misspelling in documentation (#157565)
Fixes #157564

Fixes misspelling of the word parameter in documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157565
Approved by: https://github.com/awgu, https://github.com/cyyever
2025-07-05 17:04:13 +00:00
9be5860bc3 [dynamo] Fix dynamic shapes handling in after_aot repro generation (#157136)
Summary:
- Extract symbolic variables directly from graph placeholders and arguments
- Add symbolic variable definitions to generated repro code
- Add unit tests with ToyModel for testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157136
Approved by: https://github.com/xmfan
ghstack dependencies: #157021
2025-07-05 15:38:41 +00:00
548c9d8281 Fix typo: 'paramter' → 'parameter' in quantization model report test (#157646)
This PR addresses a minor typo in the file `test/quantization/fx/test_model_report_fx.py`:

- Corrected the word "paramter" to "parameter" for better readability and accuracy.

While it's a small change, correcting such typographical errors contributes to maintaining the overall quality and professionalism of the codebase.

Thank you for your time and consideration in reviewing this PR. I'm happy to make any further adjustments if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157646
Approved by: https://github.com/yewentao256, https://github.com/ezyang
2025-07-05 12:28:36 +00:00
71a650ad56 Fix typo: 'Intializing' → 'Initializing' in test_parametrization.py (#157362)
This pull request fixes a minor typo in the doc comments of `test/nn/test_parametrization.py`.

- Replaced `'Intializing'` with `'Initializing'` in two docstring comments to improve clarity and maintain consistency across the codebase.

This is a non-functional change and does not impact behavior or test outcomes.

Thank you for maintaining such a high-quality codebase. Please let me know if any adjustments are needed. I'd be happy to help!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157362
Approved by: https://github.com/ezyang
2025-07-05 12:21:15 +00:00
2471cc3355 [pc] verify max autotune is in generated source code (#157650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157650
Approved by: https://github.com/aorenste
ghstack dependencies: #157305, #157614, #157619
2025-07-05 07:55:11 +00:00
db00e1699a [pc] introduce ProgressiveCompilationState and clear callback (#157619)
followup from https://github.com/pytorch/pytorch/pull/157305 where
@aorenste correctly suggested clearing callback. this refactor
introduces a new dataclass so we don't need to check nullability for
each field

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157619
Approved by: https://github.com/aorenste
ghstack dependencies: #157305, #157614
2025-07-05 07:55:11 +00:00
5ea832e5f6 [pc] migrate progression futures from list to deque (#157614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157614
Approved by: https://github.com/aorenste
ghstack dependencies: #157305
2025-07-05 07:55:03 +00:00
a952956d05 Add isnan exit condition to special ops (#157464)
They might have been slow on CUDA-11.3, but this version of CUDA is long gone. More fundamental underlying issue were linear complexity of the recursive polynomial definitions for higher order polynomials, for example see this loop from implementation of Chebyshev polynomial of the first kind
7081b8233a/aten/src/ATen/native/Math.h (L2969-L2973)
which were tested by `test_compare_cpu` using following values (as sample index 16)
7081b8233a/torch/testing/_internal/opinfo/core.py (L2079)

Luckily chebyshev polynomials for absolute values higher than 1 pretty quickly reach infinity, see below
```
python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))"
tensor(nan)
```
Which is not the case for Laguerre polynomials, but it's probably fine to just limit it to 1e7

Before
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)
```
After
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 45.580s

OK (skipped=72, expected failures=8)
```

Fixes https://github.com/pytorch/pytorch/issues/79528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157488
2025-07-05 04:19:50 +00:00
63e87d6d05 [Refactor] Add maybe unused flag to remove warning (#157655)
Fixes #157653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157655
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-07-05 03:23:39 +00:00
f7127b9b94 [Refactor] Remove unused variables (#157654)
Fixes #157653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157654
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-07-05 02:12:15 +00:00
44f5b93122 fix: correct sentence punctuation in cuDNN note (#157623)
Fixes #ISSUE_NUMBER
This PR fixes a small punctuation issue in the PyTorch README.

Specifically:

Added a missing full stop at the end of the sentence:
"Note: You could refer to the cuDNN Support Matrix for cuDNN versions with the various supported CUDA, CUDA driver and NVIDIA hardware."

Added comma for clarity between "CUDA driver" and "NVIDIA hardware".

These edits improve the readability and grammatical correctness of the documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157623
Approved by: https://github.com/Skylion007
2025-07-05 01:37:33 +00:00
e0fd48be7d Fix typo: 'occurances' → 'occurrences' in mobile model test (#157629)
This PR addresses a typo in the file `test/mobile/model_test/gen_test_model.py`.

### Changes:
- Corrected "occurances" to the correct spelling "occurrences"
- Renamed associated variables to reflect this change for consistency and clarity

This is a non-functional, cleanup-only PR to improve code readability.

Thanks to the PyTorch team for maintaining such a high-quality codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157629
Approved by: https://github.com/Skylion007
2025-07-05 01:36:42 +00:00
43f7216327 Fix typo: 'paramters' → 'parameters' in ATen tunable README (#157575)
This PR addresses a minor typo in the documentation file aten/src/ATen/cuda/tunable/README.md, where paramters has been corrected to parameters for improved clarity and consistency.

Context
Accurate and clear documentation is crucial for helping developers and contributors understand PyTorch internals. This small fix contributes to the overall quality and readability of the project.

Thank you to the PyTorch team and maintainers for your continued efforts in building such an incredible framework. I'm happy to contribute in any way I can — even if just with a small doc improvement like this one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157575
Approved by: https://github.com/eqy
2025-07-05 01:14:45 +00:00
8a8fac1131 [SymmMem] Move code to where it is used (#157611)
`maybe_initialize_env_vars` and `initialize_nvshmem_with_store` are only used in `NVSHMEMSymmetricMemory.cu`. Moving them there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157611
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-04 23:37:49 +00:00
bcc98bb2a4 Update _linux-test to support B200 runner (#157341)
This unblocks https://github.com/pytorch/test-infra/issues/6869.  The key changes to call out:

* B200 needs OIDC to access ECR and upload stats to S3, so we need to set `id-token: write` in `_linux-test`.  All workflows calling `_linux-test` also need to be updated accordingly
* Connecting sccache to S3 on B200 doesn't seem to work, so I disable it.  It still works locally though.

### Testing

https://github.com/pytorch/pytorch/actions/runs/16055549292/job/45312298376
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157341
Approved by: https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2025-07-04 23:19:24 +00:00
524e827095 [build] modernize build-backend: setuptools.build_meta:__legacy__ -> setuptools.build_meta (#155998)
Change `build-system.build-backend`: `setuptools.build_meta:__legacy__` -> `setuptools.build_meta`. Also, move static package info from `setup.py` to `pyproject.toml`.

Now the repo can be installed from source via `pip` command instead of `python setup.py develop`:

```bash
python -m pip install --verbose --editable .

python -m pip install --verbose --no-build-isolation --editable .
```

In addition, the SDist is also buildable:

```bash
python -m build --sdist
python -m install dist/torch-*.tar.gz  # build from source using SDist
```

Note that we should build the SDist with a fresh git clone if we will upload the output to PyPI. Because all files under `third_party` will be included in the SDist. The SDist file will be huge if the git submodules are initialized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155998
Approved by: https://github.com/ezyang, https://github.com/cyyever, https://github.com/atalman
ghstack dependencies: #157557
2025-07-04 19:25:14 +00:00
9968edd002 Fix #153942 (#153943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153943
Approved by: https://github.com/malfet
2025-07-04 18:25:18 +00:00
7275f28045 Fix cuda 12.9 aarch64 GPU builds. Update CUDA_STABLE variable. (#157630)
This contains 2 fixes that required in main and will need to be cherry-picked to Release 2.8 branch:
1. The PR https://github.com/pytorch/pytorch/pull/155819 missed to include triton change.
2. CUDA STABLE variable needs to be set to 12.8. Updating CUDA stable updates full static build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157630
Approved by: https://github.com/Skylion007, https://github.com/jeanschmidt
2025-07-04 18:08:31 +00:00
7be862ab8f [dynamo] Relax DUPLICATED_INPUT to be serializable. (#157492)
Since we don't actually rely on any real data while building DUPLICATE_INPUT guard, we can safely serialize it with sources and it should be able to reconstruct the guard correctly in the new process. Therefore we don't really need to prevent serializing it.

Differential Revision: [D77683302](https://our.internmc.facebook.com/intern/diff/D77683302/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157492
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-07-04 15:19:34 +00:00
336f1e2d35 [AOTI] Fix AOT inductor CMake build dependency order (#157557)
compile_model.py -> aoti_custom_class -> torch

The custom command requires `torch` to be installed.

8408522976/test/cpp/aoti_inference/compile_model.py (L1-L7)

Fixes CI failure on trunk:

- https://github.com/pytorch/pytorch/actions/runs/16041370426/job/45275085572#step:22:18348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157557
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-07-04 14:33:36 +00:00
a46ea8a364 Fix typo: 'initalized' → 'initialized' in alias analysis test (#157628)
This PR corrects a small spelling error in `test/jit/test_alias_analysis.py`.

- "initalized" → "initialized"

This is a minor comment correction and does not affect functionality or logic.

Thank you for maintaining this amazing codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157628
Approved by: https://github.com/Skylion007
2025-07-04 13:41:53 +00:00
f41d017aa6 Add device check in mse_loss (#155089)
Fixes #154978

## Test Result

```python
>>> import torch
>>> import numpy as np
>>> import torch.nn as nn
>>> import torch.distributions.normal as norm
>>> device = torch.device(('cuda' if torch.cuda.is_available() else 'cpu'))
>>> print('Using {}'.format(device))
Using cuda
>>> m = nn.Sequential(nn.Linear(1, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh())
>>> m.to(device, dtype=None, non_blocking=False)
Sequential(
  (0): Linear(in_features=1, out_features=128, bias=True)
  (1): Tanh()
  (2): Linear(in_features=128, out_features=128, bias=True)
  (3): Tanh()
  (4): Linear(in_features=128, out_features=128, bias=True)
  (5): Tanh()
)
>>> opt = torch.optim.Adam(m.parameters(), lr=0.001)
>>> print('Number of trainable parameters: ', sum((p.numel() for p in m.parameters() if p.requires_grad)))
Number of trainable parameters:  33280
>>> input_tensor = torch.tensor(77.0, device=device)
>>> target = torch.tensor(66.0)
>>> loss_function = nn.MSELoss()
>>> print('Loss Function: ', loss_function)
Loss Function:  MSELoss()
>>> loss = loss_function(input_tensor, target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 610, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3903, in mse_loss
    return torch._C._nn.mse_loss(
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155089
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-07-04 12:37:48 +00:00
52e4e41cbc [dynamo] do not issue lru_cache warning for functions in the top-level torch namespace (#157598)
`lru_cache` usage warning was being raised for `torch.get_device_module()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157598
Approved by: https://github.com/Sidharth123-cpu
2025-07-04 08:17:50 +00:00
64f2ec77f8 [inductor] Fix fractional_max_pool2d 3D input causing assertion error (#156912)
Fixes #156682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156912
Approved by: https://github.com/angelayi
2025-07-04 06:09:28 +00:00
fdc5b42a8f _broadcast_shapes gso generalizations (#157008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157008
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #155590
2025-07-04 05:56:42 +00:00
d58ed04d89 [async-compile] add progressive compile mode (#157305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157305
Approved by: https://github.com/aorenste
2025-07-04 04:18:50 +00:00
386bc9e2e9 [audio hash update] update the pinned audio hash (#156905)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156905
Approved by: https://github.com/pytorchbot
2025-07-04 04:06:59 +00:00
f2e712ca14 Revert "Fix is_unaligned usage of statically_known_true (#157400)"
This reverts commit b359571c6043b40c4ae4fbb07135fd0f04902e21.

Reverted https://github.com/pytorch/pytorch/pull/157400 on behalf of https://github.com/malfet due to It break tests, see 99c1a6bdd9/1 ([comment](https://github.com/pytorch/pytorch/pull/157400#issuecomment-3034353539))
2025-07-04 03:57:08 +00:00
99c1a6bdd9 [SymmMem] Find NVSHMEM from system installation (#157513)
Previously we only search for NVSHMEM from pip install location.
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157513
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-07-04 03:34:44 +00:00
4ed1b03f72 Add missing graph and memory related symbols to cuda_to_hip_mappings (#157435) (#157573)
Summary: This PR adds missing CUDA symbols in `cuda_to_hip_mappings`.

Test Plan: Tested in D77642700.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157573
Approved by: https://github.com/Skylion007

Co-authored-by: Geon-Woo Kim <gwkim@meta.com>
2025-07-04 03:03:04 +00:00
8f9a191db6 [SymmMem] Fix CI name mismatch; remove TORCH_SYMMMEM requirement (#157597)
Thanks @huydhn for spotting two name mismatches in the CI configs.
We were matching against "test_h100_symm_mem" instead of "h100-symm-mem".

Also, replaced `TORCH_SYMMMEM` env setting with programmatic method:
`symm_mem.set_backend(...)`

Further, skips a hanged test in `test_nvshmem_trion.py`. (#TODO @codingwithsurya )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157597
Approved by: https://github.com/fduwjj, https://github.com/huydhn
2025-07-04 01:43:08 +00:00
ef97bd4713 [torch] Add MTIA to the list of devices supporting foreach/fused kernels (#157583)
Summary: We currently have foreach kernel implementations for MTIA, and for when we don't we internally decompose the ops. Anyone using this list for compatibility checks should be sending through the foreach kernels.

Reviewed By: egienvalue, scottxu0730

Differential Revision: D77751248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157583
Approved by: https://github.com/egienvalue
2025-07-04 01:15:24 +00:00
f0b388665e Add dynamo_timed to bytecode hook (#157587)
Test Plan:
- ran tlparse on vLLM and saw this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157587
Approved by: https://github.com/jingsh, https://github.com/BoyuanFeng
2025-07-04 01:11:03 +00:00
c9a5bf09ba [FP8] FP8 for SwishLayerNorm (#157574)
Summary: Add a pass use_triton_fp8_swish_replace_normal_swish to replace _triton_swish_rms_norm with its counterpart that supports fp8 triton_swish_rms_norm, and turn on fp8 during inference.

Test Plan:
```
buck2 run mode/opt  mode/inplace -c fbcode.platform010_cuda_version=12.4 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --lower-backend=AOT_INDUCTOR   --model-snapshot-id=899072727_0 --node-replacement-dict="{}" --gpu-trace --add-passes=use_triton_fp8_swish_replace_normal_swish
```
The perf improvement on the 100x model with this pass is roughly ~7%, details are recorded [here](https://docs.google.com/document/d/1eIV_OTQyQcf_DlEDxwycTwhyGxT5OJkLzs8cPL6EMYc/edit?tab=t.0)

Rollback Plan:

Reviewed By: frank-wei

Differential Revision: D76531303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157574
Approved by: https://github.com/frank-wei
2025-07-04 01:06:21 +00:00
dfcda613b6 Ensure Dynamo can trace through explicit dunder method call (#154366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154366
Approved by: https://github.com/zou3519
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064, #154065, #154066, #154263
2025-07-04 00:46:05 +00:00
0e7f02fe2e [Dynamo] [FrozensetSubclass] Add support for user defined frozensets (#154263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154263
Approved by: https://github.com/williamwen42
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064, #154065, #154066
2025-07-04 00:46:05 +00:00
308b88bde9 [Dynamo] [Set] Add comparison for set subclass (#154066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154066
Approved by: https://github.com/Skylion007
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064, #154065
2025-07-04 00:45:58 +00:00
c51da57b55 [Dynamo] [Set] Raise TypeError in set.union(...) and "__or__" (#154065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154065
Approved by: https://github.com/williamwen42
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064
2025-07-04 00:45:50 +00:00
f9544f1f0c [Dynamo] [Set] Raise TypeError if object is unhashable (#154064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154064
Approved by: https://github.com/Skylion007
ghstack dependencies: #153150, #152991, #154539, #153553, #154063
2025-07-04 00:45:42 +00:00
11c71053e0 [Dynamo] [Set] Implement some binop operators for dict/set/frozenset/dict_keys (#154063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154063
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #153150, #152991, #154539, #153553
2025-07-04 00:45:34 +00:00
22abe6ded4 [Dynamo] [SetSubclass] Add support for user defined sets (#153553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153553
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #153150, #152991, #154539
2025-07-04 00:45:25 +00:00
2b82c61f04 [Generator] Implement generator.__contains__ (#154539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154539
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #153150, #152991
2025-07-04 00:45:18 +00:00
f651e28f80 [FrozenSet] Fixes for FrozenSet (#152991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152991
Approved by: https://github.com/zou3519
ghstack dependencies: #153150
2025-07-04 00:45:11 +00:00
e7167dbacf [Set] Support sets in VariableBuilder (#153150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153150
Approved by: https://github.com/zou3519
2025-07-04 00:45:03 +00:00
6c42afe196 Introduce sync_cross_rank_decision (#156287)
Summary:
This is an improvement over `_broadcast_rank0_decision` where we uses the rank0's decision to broadcast to every rank. The issue of `_broadcast_rank0_decision` is that we observed large variance on the peak memory usage. One cause is that different ranks receive different dynamic shaped tensors and the hints of those tensors are different in different ranks. If we only rely on rank0's decision and it's unlucky to get unrepresentative hints, then the decision it makes may not be suitable for other ranks.

Here, we introduce `sync_cross_rank_decision` which comes up with the decision after comparing all ranks' local decision, it will:
1. all gather decisions from all ranks;
2. test each decision on the current rank and get its estimated memory usage;
3. all reduce estimated memory usage with ReduceOp.MAX, so that we know the maximum memory usage of each decision on all ranks;
4. pick the decision which gives us minimum maximum memory memory usage;

A graph to show more details
https://internalfb.com/excalidraw/EX484509

After applying sync_cross_rank_decision, we observed that the variance are much smaller

Rollback Plan:

Differential Revision: D76714005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156287
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
2025-07-03 23:43:53 +00:00
f7130c097e [nativert] Move Executor to PyTorch core (#157514)
Test Plan:
CI

Rollback Plan:

Differential Revision: D77693984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157514
Approved by: https://github.com/zhxchen17
2025-07-03 23:31:54 +00:00
ad86c05b78 efficient zero_mask implementation for vec128_*_neon (#155766)
Differential Revision: [D76481039](https://our.internmc.facebook.com/intern/diff/D76481039/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155766
Approved by: https://github.com/malfet
2025-07-03 23:27:03 +00:00
b359571c60 Fix is_unaligned usage of statically_known_true (#157400)
Summary:
- symbolic shapes statically_known_true usage  is wrong, this API is meant to be used for SymNodes. what is needed is V.graph.sizevars.statically_known_true. or  V.graph.sizevars.statically_known_Equals or ideally  V.graph.sizevars.statically_known_multiple_of.

- The construction using == 0 is not symbolic, this used to always return false for symbolic inputs.

Differential Revision: D77619293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157400
Approved by: https://github.com/ColinPeppler
2025-07-03 23:26:36 +00:00
a6fab82b16 [BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest for 2.8RC (#157453)
Fixed our bad builds of nvshmem, (we were not building or testing before) and also updates to the latest version. Newest versions has critical support for things that would actually make it useful, like bfloat16 and float16 support.

This is a proper fix for: https://github.com/pytorch/pytorch/pull/157411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157453
Approved by: https://github.com/kwen2501, https://github.com/atalman
2025-07-03 22:55:18 +00:00
dd3e7170c2 Add async checkpointing impl to experimental checkpointer and add a builder API (#156927)
1. Adds an AsyncCheckpointer with out-of-process checkpointing and state_dict_stager with shared memory, pinned memory and Zero Overhead Support.

2. Adds two conveinient functions to create sync/async checkpointers

Differential Revision: [D77336833](https://our.internmc.facebook.com/intern/diff/D77336833/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156927
Approved by: https://github.com/pradeepfn
2025-07-03 22:49:20 +00:00
7081b8233a [BE] Accelerator agnostic timer.py (#157131)
Farewell to a lot of if statements - benefit is this now also supports mps synchronization

Still need to think of a good test strategy for the privateUse1 removal, granted I'm not sure what the semantics of something like https://docs.pytorch.org/docs/stable/generated/torch.cpu.synchronize.html actually since CPU is probably synchronous?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157131
Approved by: https://github.com/albanD
2025-07-03 22:23:04 +00:00
7b392bac13 all_gather_bucketing fx pass (#157396)
Porting passes to bucket all_gathers

The main logic of the pass is done via
1. Searching for all all_gathers from the buckets

Copying tests from @wconstab PR to test compatibility with reordering.
Test checks only compatibility, as because of (3) the joint all_gather will be scheduled already as early as possible and no space for reordering.

Pass changes:
Using mutation ops to match performance of fsdp, in future the perfect scenario will be to have only functional graph, that inductor does all memory optimizations on its own without mutable ops.

Inductor changes:
Adding foreach_copy_ lowering

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157396
Approved by: https://github.com/wconstab
2025-07-03 22:07:42 +00:00
19ae5afdaa Fix typo: 'recieve' → 'receive' in comments (#157544)
This PR corrects minor typos in developer-facing comments:

- Replaces 'recieve' with 'receive' in:
  - `FunctionalTensorWrapper.cpp`
  - `make_boxed_from_unboxed_functor.h`

These changes improve code readability and maintain comment correctness.

Thank you for reviewing!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157544
Approved by: https://github.com/soulitzer
2025-07-03 19:11:15 +00:00
3fd84a8592 [BE][PYFMT] migrate PYFMT for torch/[a-c]*/ to ruff format (#144554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144554
Approved by: https://github.com/soulitzer
2025-07-03 18:56:07 +00:00
d56f11a1f2 [MPS] Implement logcumsumexp metal kernel (#156858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156858
Approved by: https://github.com/malfet
ghstack dependencies: #157512
2025-07-03 18:16:25 +00:00
794b95d54b Enable Half dtype for logcumsumexp_backward (#157512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157512
Approved by: https://github.com/malfet
2025-07-03 18:13:38 +00:00
e3fe001d9e Add einops x torch.compile testing in PyTorch CI (#157416)
Fixes #146782. This PR adds testing for multiple einops versions in
PyTorch CI. This occurs in a new "einops" CI job that runs for both
Python 3.9 and 3.13 (aka, what we test Dynamo over).

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157416
Approved by: https://github.com/guilhermeleobas, https://github.com/arogozhnikov, https://github.com/anijain2305
2025-07-03 17:36:39 +00:00
660dbea909 [cutlass backend] modify presets ahead of cutlass 4 upgrade (#157522)
Differential Revision: [D77707409](https://our.internmc.facebook.com/intern/diff/D77707409/)

Also asking in https://github.com/NVIDIA/cutlass/issues/2435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157522
Approved by: https://github.com/coconutruben
2025-07-03 17:13:24 +00:00
5cfe4377d6 [dtensor] Rework partial propagation in pointwise op and support mul (#157340)
I am trying to see if I can easily add the linearity support for aten.mul to allow Partial placement to propagate through. But it turns out that I have to completely rework the current linearity propagation.

In short, before this PR, linearity mainly support aten.add and some trival ops. It is done by allowing input Partial to propagate, and in the meanwhile, redistribute Replicate inputs to Partial to preserve the single device semantic, i.e suppose we want to execute `aten.add(lhs, rhs)` on 2 ranks:
* `lhs` is partial, value on rank 0: `r0`, lhs value on rank 1: `r1`
* `rhs` is replicate, value: `a`

Then in order to preserve single device semantic (which should produce the value of `a + r0 + r1`), we do `rhs/world_size` first, then add `rhs` to `lhs`. This means every operand would first need be partial, then we can add them together.

But this become non-true for multiplicative operations, like `aten.mul`, for `aten.mul`, assuming the same `aten.mul(lhs, rhs)` and value, we don't need to divide lhs by world_size to preserve single device semantic, b.c. `a* (r0+r1) = a* r0 + a* r1`

So to accomodate the difference of add/mul, in this PR I:
* change linearity to be a int to support different linearity types, add linearity and multiplicative are separate
* add checks to ensure only a subset of partial types can support linearity (namely partial-sum/avg)
* handle the linearity type plumbing through the pointwise ops.
* add `mul.Tensor/Scalar` to be the multiplicative linearity
* added the tests to show that the partial placements can be propagated with `aten.mul`

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157340
Approved by: https://github.com/zpcore
2025-07-03 17:04:08 +00:00
898179331e [cutlass backend] fix CutlassTensor post-renaming (#157408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157408
Approved by: https://github.com/mlazos
ghstack dependencies: #157402
2025-07-03 17:02:21 +00:00
2e64e45b0b Revert "[build] modernize build-backend: setuptools.build_meta:__legacy__ -> setuptools.build_meta (#155998)"
This reverts commit 404008e3efdabeaf5b140a3aff77131461c33a0a.

Reverted https://github.com/pytorch/pytorch/pull/155998 on behalf of https://github.com/malfet due to Broke inductor_cpp, wrapper see e472daa809/1 ([comment](https://github.com/pytorch/pytorch/pull/155998#issuecomment-3032915058))
2025-07-03 16:47:07 +00:00
e472daa809 [dynamo] Add fx_graph_runnable test coverage (#157021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157021
Approved by: https://github.com/StrongerXi, https://github.com/xmfan

Co-authored-by: Simon Fan <xmfan@meta.com>
2025-07-03 16:42:06 +00:00
ec816d73b4 [MPS] Add shifted_chebyshev_polynomial_[tuvw] (#157488)
For eager and inductor

As for all other chebyshev ops, logic is simply compiled from 94716db222/aten/src/ATen/native/cuda/Math.cuh (L2821)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157488
Approved by: https://github.com/dcci
2025-07-03 15:48:37 +00:00
f17f658125 [profiler] add more CUDA API for kernel launcher (#156016)
Add more kernel detection options, resolving TODO
- References : [NVIDIA - docs](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156016
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-07-03 15:26:42 +00:00
c9174a20f7 Revert "[BE] Unskip special ops (#157464)"
This reverts commit e124a0d88ca2aa04bfaca2dcabf5de6244048e45.

Reverted https://github.com/pytorch/pytorch/pull/157464 on behalf of https://github.com/clee2000 due to caused slow test config to time out [GH job link](https://github.com/pytorch/pytorch/actions/runs/16037776972/job/45254574100) [HUD commit link](e124a0d88c) ([comment](https://github.com/pytorch/pytorch/pull/157464#issuecomment-3032676989))
2025-07-03 15:24:15 +00:00
b6276a425f Revert "[MPS] Add shifted_chebyshev_polynomial_[tuvw] (#157488)"
This reverts commit 9620994067b18e846a097d1e99af85ec2426ef0a.

Reverted https://github.com/pytorch/pytorch/pull/157488 on behalf of https://github.com/clee2000 due to caused slow test config to time out [GH job link](https://github.com/pytorch/pytorch/actions/runs/16037776972/job/45254574100) [HUD commit link](e124a0d88c) ([comment](https://github.com/pytorch/pytorch/pull/157464#issuecomment-3032676989))
2025-07-03 15:24:15 +00:00
a0e0abd037 Fix typo: 'intialized' → 'initialized' in test_modules.py (#157226)
This PR fixes a minor typo in `test/jit/test_modules.py`:

- Before: `intialized`
- After:  `initialized`

There are no functional code changes — this is a comment-only fix to improve clarity and consistency.

Thank you to the PyTorch team for maintaining this outstanding project.
Please let me know if anything else is needed.

With appreciation,
Abhishek Nandy
[@abhitorch81](https://github.com/abhitorch81)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157226
Approved by: https://github.com/Skylion007
2025-07-03 14:56:02 +00:00
b221be9140 Fix typo: 'intial_query_grad' → 'initial_query_grad' in test_transformers.py (#157306)
This is a minor typo fix in `test/test_transformers.py`:

- Renamed `intial_query_grad` to `initial_query_grad` for improved clarity and correctness in test variable naming.

There are **no functional or logic changes** — this PR is aimed purely at improving readability and maintaining code quality.

Thanks to the PyTorch team for their work and review time
Please feel free to suggest if this needs any adjustment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157306
Approved by: https://github.com/Skylion007
2025-07-03 14:08:12 +00:00
8408522976 Remove +PTX from CUDA 12.8 builds (#157516)
Remove +PTX from CUDA 12.8 builds and small refactor in build_cuda.sh.
Removing +PTX reduces binary size required to be able to upload binaries to pypi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157516
Approved by: https://github.com/malfet, https://github.com/ptrblck, https://github.com/tinglvv
2025-07-03 13:19:19 +00:00
c329a8f19c Fix CPU bitwise shifts for out-of-limit values in VSX-vec (#157463)
Similar to #96659 this implements the conditionals handling the out-of-limit values in the shift amounts (rhs) for the vectorized VSX code using the same logic as the scalar code.

Fixes #109777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157463
Approved by: https://github.com/jgong5
2025-07-03 10:41:33 +00:00
5dfd8a9c7a Remove is_jit_trace option (#157387)
Summary: Title

Test Plan:
CI

Rollback Plan:

Differential Revision: D77319249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157387
Approved by: https://github.com/pianpwk
2025-07-03 09:20:27 +00:00
8c2e450082 [PT][FSDP] fail set_allocate_memory_from_process_group if used together with custom comm hooks (#157487)
Summary:
This is a follow up after the PR to add comm override support: https://github.com/pytorch/pytorch/pull/155189

The previous PR loosely checks the allocation mixin classes, which isn't really safe as the actual hook may still override the behavior.
This may lead to unnecessary confusion for no good use case. So for now we just make the 2 sets of APIs largely incompatible:
1. setting custom comms after `set_allocate_memory_from_process_group_for_comm()` is ok.
2. setting `set_allocate_memory_from_process_group_for_comm()` after custom comms is ko.

Basically `set_allocate_memory_from_process_group_for_comm` is like a drop in hammer while the `set_custom_all_gather/reduce_scatter()` are like finer-grained scalpels that require more code crafted.

We can revisit this if there's use case in between but for now they can be largely viewed independent from each other (even tho we do share some of the underlying pieces for now, that could be subject to change and should not be exposed to end users).

Test Plan: added UT

Differential Revision: D77681620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157487
Approved by: https://github.com/weifengpy
2025-07-03 07:00:35 +00:00
2bb33e7a08 Fixed triton kernel in ET due to Triton version change. (#157484)
Summary: Fixed triton kernel in ET due to Triton version change.

Test Plan:
buck2 run mode/opt param_bench/fb/integration_tests:test_et_replay

Rollback Plan:

Differential Revision: D77398841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157484
Approved by: https://github.com/davidberard98
2025-07-03 06:16:23 +00:00
4ce6e6ec88 XCCL changes for DDP (#155497)
Add XCCL documentation for DDP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155497
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-03 05:18:08 +00:00
382598ef87 Fix unsafe collective reorder past wait (#157489)
Covers the case where the output of one collective feeds the input of another collective.
e.g. TP + FSDP - all_gather(tp+dp sharded param on TP dim) -> allgather dp_sharded buffer on DP dim

Fixes a bug where the reordering pass specifically exempted wait nodes from dependencies.
Note:  this exemption was incorrect, so it should be removed. But it was also put there for a reason, to help move collectives past wait nodes that are not related to that collective.  After this fix, reordering performance may be worse and we need to find a smarter way to decide if a particular wait node is a blocker for a given collective.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157489
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #156879
2025-07-03 05:04:19 +00:00
dc524efb4d Move logging into inner method for reorder pass (#156879)
The reason for inner/outer method is to keep the outer method conforming
to the typedef for a comms graph pass which returns one obj, while
allowing unit tests to call the inner method that returns more metadata
useful for testing the pass.  The logs should be in the inner part, so
they are functional also during unit testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156879
Approved by: https://github.com/IvanKobzarev
2025-07-03 05:04:19 +00:00
5d5a5b3501 Fix GITHUB_OUTPUT syntax in create_release.yml workflow (#157466)
#149919 fixed a number of linting issues, however, the conversion of the deprecated `::set-output` command to the new `>> $GITHUB_OUTPUT` redirect syntax went wrong, resulting in [failing uploads of the 2.8.0 rc1-rc3 pre-release tarballs](https://github.com/pytorch/pytorch/actions/runs/15892205745/job/44816789782).

This PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157466
Approved by: https://github.com/clee2000, https://github.com/atalman
2025-07-03 04:57:52 +00:00
404008e3ef [build] modernize build-backend: setuptools.build_meta:__legacy__ -> setuptools.build_meta (#155998)
Change `build-system.build-backend`: `setuptools.build_meta:__legacy__` -> `setuptools.build_meta`. Also, move static package info from `setup.py` to `pyproject.toml`.

Now the repo can be installed from source via `pip` command instead of `python setup.py develop`:

```bash
python -m pip install --verbose --editable .

python -m pip install --verbose --no-build-isolation --editable .
```

In addition, the SDist is also buildable:

```bash
python -m build --sdist
python -m install dist/torch-*.tar.gz  # build from source using SDist
```

Note that we should build the SDist with a fresh git clone if we will upload the output to PyPI. Because all files under `third_party` will be included in the SDist. The SDist file will be huge if the git submodules are initialized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155998
Approved by: https://github.com/ezyang, https://github.com/cyyever, https://github.com/atalman
2025-07-03 04:10:44 +00:00
b642a5c118 [cutlass backend] Add dynamo timed (#157410)
Differential Revision: [D77631592](https://our.internmc.facebook.com/intern/diff/D77631592/)

Before:
![Screenshot 2025-07-01 at 4 08 06 PM](https://github.com/user-attachments/assets/8f6445aa-50c7-456f-b5ac-b2749eb9bf40)

After (different run):
![Screenshot 2025-07-01 at 5 11 09 PM](https://github.com/user-attachments/assets/7513d312-c4dc-4e39-9718-c63eb641bc30)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157410
Approved by: https://github.com/jingsh
2025-07-03 04:03:20 +00:00
493f42a541 [symm_mem] Create a one side get api for symm mem (#157294)
Doing similar like what we did in https://github.com/pytorch/pytorch/pull/156443 so that we can also have a one-side get API for symmetric memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157294
Approved by: https://github.com/kwen2501
2025-07-03 03:52:05 +00:00
662c1cfed2 [c10d][PGNCCL] Add waitcounter for watchdog and heartbeat monitoring thread (#157480)
We want to have a wait counter for both side thread so that we can monitor its lifecycle.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157480
Approved by: https://github.com/d4l3k
2025-07-03 02:47:06 +00:00
5cc4e856fd Add device_id to XPU device properties (#156481)
# Motivation

Some older Intel iGPUs may share the same device name across different hardware products.
(See [device name example](aaa01c06f9/shared/source/dll/devices/devices_base.inl (L190-L199)))
To help disambiguate which specific iGPU product is being used, we introduce the use of a
[device id](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md#device-id). This device id corresponds to the Device ID in [official Intel product specification](https://www.intel.com/content/www/us/en/products/sku/232155/intel-core-i71360p-processor-18m-cache-up-to-5-00-ghz/specifications.html) and enables more accurate identification and troubleshooting for user issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156481
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-07-03 01:22:11 +00:00
7597988f1b [fake tensor] fix issue of no attribute tags (#156689)
Fixes #156688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156689
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman
2025-07-03 01:16:01 +00:00
9620994067 [MPS] Add shifted_chebyshev_polynomial_[tuvw] (#157488)
For eager and inductor

As for all other chebyshev ops, logic is simply compiled from 94716db222/aten/src/ATen/native/cuda/Math.cuh (L2821)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157488
Approved by: https://github.com/dcci
ghstack dependencies: #157464
2025-07-02 23:29:35 +00:00
e124a0d88c [BE] Unskip special ops (#157464)
They were slow on CUDA-11.3, which has long been gone, let's see if they work now

Before
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)
```
After
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................ssssssss................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 42.379s

OK (skipped=80)
```

Fixes https://github.com/pytorch/pytorch/issues/79528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007
2025-07-02 23:16:52 +00:00
7cfd054075 [attempt 2] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#157472)
Summary:
When we compute contiguity for a tensor with dynamic shapes we first:
1) Try to compute it without guarding.
2) If all shapes hinted, compute it with potentially adding guards.
3) if any input is not hinted, compute it symbolically.

sym_is_contiguous return a SymBool that is then either evaluated or guard_or_false can be called
on it to avoid data dependent errors.

ex:
 bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.

In this PR I only handle default contiguity, will follow up with changes for other formats like  channel_last .
We use this patter in this PR for several locations to avoid DDEs.

Test Plan:
contbuild & OSS CI,

Rollback Plan:

Reviewed By: malfet

Differential Revision: D77639021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157472
Approved by: https://github.com/aorenste
2025-07-02 23:12:29 +00:00
d40aaa42ee [BE][16/16] fix typos in torch/ (torch/utils/) (#156606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156606
Approved by: https://github.com/albanD
ghstack dependencies: #156318, #156320, #156602, #156604
2025-07-02 22:55:29 +00:00
11c07c848c [BE][14/16] fix typos in torch/ (torch/fx/) (#156604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156604
Approved by: https://github.com/jingsh
ghstack dependencies: #156318, #156320, #156602
2025-07-02 22:55:29 +00:00
db259bd6b8 [BE][12/16] fix typos in torch/ (#156602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156602
Approved by: https://github.com/justinchuby, https://github.com/albanD
ghstack dependencies: #156318, #156320
2025-07-02 22:55:29 +00:00
d5cdc36943 [BE][10/16] fix typos in torch/ (torch/csrc/jit/) (#156320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156320
Approved by: https://github.com/albanD
ghstack dependencies: #156318
2025-07-02 22:55:29 +00:00
541584d22e [BE][8/16] fix typos in torch/ (torch/csrc/jit/) (#156318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156318
Approved by: https://github.com/albanD
2025-07-02 22:55:29 +00:00
c0e155a8d2 [cutlass backend] Use alignment of D for EVT / Float8 (#157402)
I encountered an C++ compile error from running cutlass backend tests when upgrading cutlass version. It seems like Nvidia added
"static_assert(detail::is_aligned<ElementC_, AlignmentC, ElementD_, AlignmentD>(),"

b995f93317/include/cutlass/epilogue/collective/builders/sm90_builder.inl (L297)

However, it seems codegen have the wrong alignment for D. For C, 1 is okay since it is void. But for D, this is probably wrong.
```
    void, cutlass::layout::ColumnMajor, 1,
    cutlass::bfloat16_t, cutlass::layout::RowMajor, 1,
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157402
Approved by: https://github.com/ColinPeppler, https://github.com/mlazos
2025-07-02 22:55:00 +00:00
48560eef80 [dynamo] Fix bug in dict(mapping_proxy) (#157467)
Fixes https://github.com/pytorch/pytorch/issues/157284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157467
Approved by: https://github.com/jansel, https://github.com/StrongerXi

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-02 22:13:02 +00:00
fd4f704905 [ez][CI] Print set output in CI (#157477)
Print what the output that's getting set is for better debugging

It's probably bad there are 4 of these, but I'm also not sure if imports will behave correctly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157477
Approved by: https://github.com/huydhn
2025-07-02 21:47:19 +00:00
60e66d11ab [CI] Keep-going on main (#157470)
Run an experiment where we turn on keep going on main.  Revert this PR to cancel the experiment

There have been a couple of changes that make it so that HUD will show the failure early even while the job is in progress, so triaging for reverts should still be able to happen quickly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157470
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/malfet
2025-07-02 21:42:46 +00:00
4b4c2a7b1d Support complex numbers in DTensor redistribute (#157329)
Add complex number unwrapping in functional collectives used by DTensor.

Complex tensors are not directly supported by underlying comm kernels
(e.g. nccl) but complex tensors can be viewed as real tensors of a
higher rank (added size-2 tensor dim represents real vs im component).
Collective output is then viewed as complex to restore the
original/expected shape and dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157329
Approved by: https://github.com/XilunWu
2025-07-02 21:37:16 +00:00
af9c92b4cb [CI] Remove redundant accuracy benchmarks for cpp_wrapper (#155966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155966
Approved by: https://github.com/desertfire
2025-07-02 20:58:08 +00:00
c09cf29d7d [ez][BE] Tag deletion script to delete any old ciflow + autorevert tags (#157468)
Change the branch/tag deletion script that runs once per day to delete more tags

Previous: only delete ciflow tags that didn't correspond to an open PR
New: delete ciflow tags attached to commits that are > 7 days old.  Also delete `trunk/<sha>` (I think they are for autorevert) tags that are attached to commits that are > 7 days old

It's hard to figure out when the actual tag was pushed or created, so instead it looks at the commit date, which might lead to unexpected behavior if the tag was pushed much later than the commit (ex triggering periodic later to bisect).  I think it's ok though since you don't really need the tag after the workflow runs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157468
Approved by: https://github.com/izaitsevfb
2025-07-02 20:42:32 +00:00
6f60cfe9b1 [ez] Add super().setUp() in test_ops::TestFakeTensor (#157475)
Noticed some disable issues getting a bunch of comments, so I took a look

One day I'll write a better check for this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157475
Approved by: https://github.com/huydhn
2025-07-02 20:34:00 +00:00
e20784f228 [dynamo] Support BUILTIN_MATCH serialization. (#157016)
Serialize BUILTIN_MATCH since they are all stored in __builtin__ dict.

Also fixed an issue that the wrong global scope is passed to CheckFunctionManager while loading guards. Previously we can always reuse the compile-time global scope for evaluating guards because the compile-time and runtime global scope are always the same.

For precompile, we need to serialize the compile-time global scope for loading only. We need to point the CheckFunctionManager to the new global scope after loading is finished for evaluating guards.

Differential Revision: [D77159313](https://our.internmc.facebook.com/intern/diff/D77159313/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157016
Approved by: https://github.com/jansel, https://github.com/jamesjwu
2025-07-02 20:24:24 +00:00
172853547a [inductor] more size_hint_or_throw usage (#157394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157394
Approved by: https://github.com/jingsh
2025-07-02 20:20:59 +00:00
e0ab1b538a [ez][BE] Remove max jobs override for CI build jobs (#157473)
Basically reverts #147487 since it's not needed anymore

Not an exact revert because some things have already been removed in a different PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157473
Approved by: https://github.com/huydhn
2025-07-02 20:12:28 +00:00
3f569f9af7 [BE] Remove extra semicolon (#157486)
Fixes
```
/Users/nshulga/git/pytorch/pytorch/torch/nativert/executor/GraphExecutorBase.cpp:16:58: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
   16 |       execPlan_(ExecutionPlanner{graph_}.createPlan()) {};
      |                                                          ^
1 warning generated.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157486
Approved by: https://github.com/seemethere, https://github.com/atalman, https://github.com/Skylion007
2025-07-02 19:56:21 +00:00
94716db222 [BE][DCE] eliminate remnants of global gemm cache (#157327)
Summary: The global gemm cache has not been maintained in ~1 year, and the only entry point (`search_autotune_cache`) was recently deprecated. Meaning, this is now dead code that we can remove.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77520979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157327
Approved by: https://github.com/jansel
2025-07-02 19:52:35 +00:00
06f39a71b6 Add Release 2.8 CUDA matrix. Update Release schedule for 2.7.1 and 2.9 (#157482)
This PR:
- Adds Release 2.8 CUDA matrix
- Update Release 2.9 schedule, to make it more similar to 2.5 release schedule. Mid Oct release
- Update 2.7.1 release day
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157482
Approved by: https://github.com/Camyll
2025-07-02 19:52:24 +00:00
36dd598bda layernorm tests: Tweak test thresholds for comparing tensors (#156699)
After I landed this PR: https://github.com/pytorch/pytorch/pull/156600, this test was failing internally on large tensors because the differences were greater than tolerances on some cuda devices.

We now raise the tolerances for larger tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156699
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-07-02 19:33:38 +00:00
32983ea698 [nativert] continue to move generated static dispatch kernels (#157460)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77623080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157460
Approved by: https://github.com/zhxchen17
2025-07-02 19:28:13 +00:00
5e636d664a [BE] @serialTest decorator must be called (#157388)
Otherwise it turns test into a trivial one(that always succeeds), as following example demonstrates
```python
import torch
from torch.testing._internal.common_utils import serialTest, run_tests, TestCase

class MegaTest(TestCase):
    @serialTest
    def test_foo(self):
        if hasattr(self.test_foo, "pytestmark"):
            print("foo has attr and it is", self.test_foo.pytestmark)
        print("foo")

    @serialTest()
    def test_bar(self):
        if hasattr(self.test_bar, "pytestmark"):
            print("bar has attr and it is", self.test_bar.pytestmark)
        print("bar")

if __name__ == "__main__":
    run_tests()
```

That will print
```
test_bar (__main__.MegaTest.test_bar) ... bar has attr and it is [Mark(name='serial', args=(), kwargs={})]
bar
ok
test_foo (__main__.MegaTest.test_foo) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.013s

```

Added assert that arg is boolean in the decorator to prevent such silent skips in the future

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157388
Approved by: https://github.com/clee2000
2025-07-02 19:15:19 +00:00
eaf32fffb7 fixed a tiny typo in torch.compiler.md (#157462)
Fixes #157444

there was a typo in [docs/source/torch.compiler.md](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler.md) : see -> seen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157462
Approved by: https://github.com/Skylion007, https://github.com/svekars
2025-07-02 19:15:15 +00:00
0e9d8032a3 [build] remove cmake cache and reconfigure again if it is invalid (#156958)
See also:

- astral-sh/uv#14269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156958
Approved by: https://github.com/Skylion007
ghstack dependencies: #156742
2025-07-02 18:46:32 +00:00
0105cd89ab [ONNX] Fix conversion of attention - 4D (#157130)
Fixes a wrong conversion to onnx while investigation #149662.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157130
Approved by: https://github.com/gramalingam, https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-07-02 18:05:10 +00:00
d5d14ee823 [nativert] create persistent value helper (#157286)
Summary: att

Test Plan: CI

Reviewed By: georgiaphillips

Differential Revision: D74300519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157286
Approved by: https://github.com/SherlockNoMad
2025-07-02 17:15:52 +00:00
156bc243f0 Back out "Include c++ stack traces when we hit constraint violation (#155603)" (#157406)
Summary:
Original commit changeset: 4b3fdaa8f2c6

Original Phabricator Diff: D76434787

Meta:
https://fb.workplace.com/groups/1286739428954016/permalink/1535462614081695/

Test Plan:
Meta:
Revert D76434787 for S536719

Rollback Plan:

Differential Revision: D77626334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157406
Approved by: https://github.com/bobrenjc93
2025-07-02 16:51:07 +00:00
bd6b5fddbf [Precompile] [easy] Serialize requires_grad for tensors when serializing guards (#157372)
Need to keep requires_grad on the tensor when serializing/deserializing guards. This matters when there's a TENSOR_MATCH guard on a tensor that requires_grad. Added a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157372
Approved by: https://github.com/jansel, https://github.com/zhxchen17
ghstack dependencies: #156433
2025-07-02 16:34:37 +00:00
54701a0c94 Add is_hidden_event method to KinetoEvent Python interface (#155214)
Fixes #155213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155214
Approved by: https://github.com/sraikund16
2025-07-02 16:29:21 +00:00
0edc1b91f7 [Inductor] Disable decompose_k for AMD (#157283)
Differential Revision: D77544250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157283
Approved by: https://github.com/bdhirsh
2025-07-02 15:21:46 +00:00
9f5276dc07 Fix typo: 'Intializes' → 'Initializes' in _distributed_c10d.pyi docst… (#157455)
Description:

This PR fixes a small documentation typo in torch/_C/_distributed_c10d.pyi, correcting:

Intializes → Initializes

This helps improve clarity in internal docstrings for maintainers and contributors.
Let me know if further changes are needed. Thanks for your time and the amazing work on PyTorch!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157455
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-07-02 15:19:05 +00:00
9d175bc7e6 Fixes for CPython int/float tests (#155978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155978
Approved by: https://github.com/zou3519
2025-07-02 15:04:00 +00:00
b096341963 [BE] use pathlib.Path instead of os.path.* in setup.py (#156742)
Resolves:

- https://github.com/pytorch/pytorch/pull/155998#discussion_r2164376634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156742
Approved by: https://github.com/malfet
2025-07-02 14:57:58 +00:00
82eefaedd9 [inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006

Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
2025-07-02 14:02:01 +00:00
c553c55be7 Revert "Fix full_like decomposition to preserve strides (#144765)"
This reverts commit 01b0f09931d47bd2716398a0c335b2807dc3074d.

Reverted https://github.com/pytorch/pytorch/pull/144765 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal tests see [D77652778](https://www.internalfb.com/diff/D77652778), @jansel may you help get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/144765#issuecomment-3027975098))
2025-07-02 13:56:03 +00:00
d5a89178b0 Revert "[dynamo] Add fx_graph_runnable test coverage (#157021)"
This reverts commit 77676753ecabf6a6645bdd3abfe01939e5751e76.

Reverted https://github.com/pytorch/pytorch/pull/157021 on behalf of https://github.com/jeanschmidt due to New tests are red internally, more details on [D77652793](https://www.internalfb.com/diff/D77652793). Maybe codev could be a better strategy to merge this PR faster... ([comment](https://github.com/pytorch/pytorch/pull/157021#issuecomment-3027952946))
2025-07-02 13:48:41 +00:00
bdb7819166 [dynamo, nested graph breaks] remove recursive cell/freevar in instruction tx (#154078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154078
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-07-02 13:36:14 +00:00
34c8033fd3 Fix a div_mod bug in generic_math.h (#157383)
Summary: There is a bug in integer div_mod that when the remainder is 0 and the divisor is negative, mod operation produces a negative number. Fixed in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157383
Approved by: https://github.com/angelayi, https://github.com/jingsh
2025-07-02 12:22:57 +00:00
ab2294d828 [dynamo] fix _torchdynamo_orig_callable naming issues (#156901)
`_torchdynamo_orig_callable` was being used in two distinct places:
- to get the original user function from nested eval_frame.py decorators
- to get the original backend from nested convert_frame.py callbacks

We rename ~the first usage to `_torchdynamo_orig_fn`~ and the second to `_torchdynamo_orig_backend` in order to distinguish these cases.

UPDATE: seems like both internal and OSS users depend on `_torchdynamo_orig_callable`, but it only seems in the first context. We should thus keep the original name for the first case then.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156901
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-07-02 09:53:55 +00:00
3173616532 [nativert] start to move generated static dispatch kernels (#157403)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77622952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157403
Approved by: https://github.com/georgiaphillips
2025-07-02 08:42:01 +00:00
8c0df6fe17 Revert "[dynamo][fsdp] Consistent behavior of int attributes (#157262)"
This reverts commit 42b48ee67229286127390000f103a11dfc8901f5.

Reverted https://github.com/pytorch/pytorch/pull/157262 on behalf of https://github.com/jeanschmidt due to Newly introduced tests are red in internal runs, check D77593713 ([comment](https://github.com/pytorch/pytorch/pull/157262#issuecomment-3026944993))
2025-07-02 08:30:39 +00:00
0364db7cd1 [PT] support custom all_gather and reduce_scatter comms (#155189)
Summary:
This change introduces 2 comm override APIs: `set_custom_all_gather` and `set_custom_reduce_scatter` to allow for custom behavior respectively.

This allow users to control how the comm buffers are allocated and the exact comm implementation for flexibility.
For details, see docstring in `Comm` in `_fsdp_api.py`

Related PR:
https://github.com/pytorch/pytorch/pull/150564

Test Plan: CI

Differential Revision: D75714362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155189
Approved by: https://github.com/weifengpy
2025-07-02 06:58:45 +00:00
f8c0a4bd28 [inductor] enable bf32 test for mkldnn conv (#127293)
Enable more test on inductor conv + bf32
Testplan:
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv2d_unary_cpu
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv3d_unary_cpu
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv_transpose2d_unary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv2d_binary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv3d_binary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127293
Approved by: https://github.com/jgong5
ghstack dependencies: #126050, #126054

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-02 01:49:01 +00:00
4c8eb65efb allow to use bf16 as fp32 internal precision for mkldnn conv backward (#126054)
Used for CI since depends on ideep update.

Allow to use `BF16` as the internal computation data types by `torch.backends.mkldnn.conv.fp32_precision="bf16"`

### TestPlan
python test/test_mkldnn.py -k conv

### Benchmarking

FP32 conv2d backward vs. BF16 internal computation conv backward on SPR

Single core:

Input | fp32 ms | bf16 internal  ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 461.6734| 358.3779| 1.48
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 358.3779 | 247.8631| 1.46
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 4.3783| 3.8513| 1.14

56 cores:
Input | fp32 ms | bf16 internal ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 16.6119 | 12.2047 | 1.38
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 12.0016 | 8.6711 | 1.38
IC:   256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 20.5947 | 15.9366 | 1.29
IC: 1024, OC: 256, kernel: 1, stride: 1,   N: 256, H: 14, W: 14, G: 1, pad: 0 | 40.0952 | 32.2222 | 1.24
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 162.7449 | 142.3054 | 1.14

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126054
Approved by: https://github.com/jgong5
ghstack dependencies: #126050

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-02 01:40:13 +00:00
5a2db5152d allow to use bf16 as fp32 internal precision for mkldnn conv (#126050)
Allow to use `BF16` as the internal computation data types by `torch.backends.mkldnn.conv.fp32_precision="bf16"`

### TestPlan
python test/test_mkldnn.py -k conv

### Benchmarking

FP32 conv2d vs. BF16 internal computation conv2d on SPR

Single core:

Input | fp32 ms | bf16 internal  ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 185.5071 | 83.4749 | 2.22
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 194.7558 | 79.1683| 2.46
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 1.9213 | 1.3690 | 1.40

56 cores:
Input | fp32 ms | bf16 internal ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 6.5804  | 7.4349 | 0.89
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 4.9940  | 3.8093 | 1.31
IC:   256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 8.8359 | 5.5802 | 1.58
IC: 1024, OC: 256, kernel: 1, stride: 1,   N: 256, H: 14, W: 14, G: 1, pad: 0 | 16.5800 | 9.2367 | 1.80
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 79.5436 | 38.3861  | 2.07

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126050
Approved by: https://github.com/jgong5, https://github.com/jansel

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-02 01:31:23 +00:00
0a63053fe9 Don't store flamegraph to tmp folder (#157374)
Where it's accessible(and mutable) by multiple users. Instead use
`~/.cache` folder instead

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157374
Approved by: https://github.com/eqy
ghstack dependencies: #157373
2025-07-02 00:46:51 +00:00
bb476310a4 [dynamo][guards] Stash root guard manager pointer in the LeafGuard (#157325)
Preparing to simplify the recompilation reason codebase. This PR was 95% done by using AI tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157325
Approved by: https://github.com/jansel
2025-07-02 00:42:43 +00:00
fa1c20ae92 Fix test consolidate hf safetensors (#157386)
Need to change an argument name that was changed in the test so that it doesn't throw

Differential Revision: [D77604210](https://our.internmc.facebook.com/intern/diff/D77604210/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157386
Approved by: https://github.com/meetv18
ghstack dependencies: #154743, #156705
2025-07-02 00:16:21 +00:00
77676753ec [dynamo] Add fx_graph_runnable test coverage (#157021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157021
Approved by: https://github.com/StrongerXi, https://github.com/xmfan

Co-authored-by: Simon Fan <xmfan@meta.com>
2025-07-02 00:10:01 +00:00
617e3f69f8 [FP8] Fix Benchmarking for certain Priors (#155722)
Summary: For priors like layer norm, the order of the weight quantization kernel might be different and therefore have a different suffix, so we use regular expression instead.

Test Plan:
Trying this on model id 737772166 with
```
buck2 run mode/opt  mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --lower-backend=AOT_INDUCTOR   --model-snapshot-id=737772166_0 --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024 --node_replacement_dict "{'(autotune)':{'(1000+,1000+)':'fp8_float_model_dynamic_quantization_rowwise'}"
```
will allow more linears to be correctly replaced with fp8.
An example of the gpu trace can be found in https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/hpc/new/models/feed/benchmark/libkineto_activities_773108_f58b57e208c04787acd3bcb01a3e8771.json.gz&bucket=gpu_traces.

Rollback Plan:

Differential Revision: D76092551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155722
Approved by: https://github.com/Skylion007
2025-07-02 00:01:23 +00:00
ab6cb34480 Revert "[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)"
This reverts commit 563fd95563c5edd732ae260b3bd3d0c38822ab57.

Reverted https://github.com/pytorch/pytorch/pull/157322 on behalf of https://github.com/davidberard98 due to fails on rocm ([comment](https://github.com/pytorch/pytorch/pull/157322#issuecomment-3025826951))
2025-07-01 23:21:37 +00:00
c6a27bae36 Revert "[do not revert] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#155590)"
This reverts commit d0a9629435aaceb5acbf31aad70f2109cb8a3ea2.

Reverted https://github.com/pytorch/pytorch/pull/155590 on behalf of https://github.com/laithsakka due to was asked by to land this internally  ([comment](https://github.com/pytorch/pytorch/pull/155590#issuecomment-3025796794))
2025-07-01 22:58:14 +00:00
563fd95563 [inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006

Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
2025-07-01 22:51:11 +00:00
6ef70edd9a Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit 47f10d0ad0dda281c886ff08ac2f938207027316.

Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/malfet due to Looks like it's breaking ROCM tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-3025673908))
2025-07-01 22:11:53 +00:00
3df6360e8c [BE][Easy][setup] use super().method(...) in command subclasses in setup.py (#156044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156044
Approved by: https://github.com/albanD
ghstack dependencies: #156741
2025-07-01 22:09:10 +00:00
d0a9629435 [do not revert] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#155590)
When we compute contiguity for a tensor with dynamic shapes we first:
1) Try to compute it without guarding.
2) If all shapes hinted, compute it with potentially adding guards.
3) if any input is not hinted, compute it symbolically.

sym_is_contiguous return a SymBool that is then either evaluated or guard_or_false can be called
on it to avoid data dependent errors.

ex:
 bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.

In this PR I only handle default contiguity, will follow up with changes for other formats like  channel_last .
We use this patter in this PR for several locations to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
2025-07-01 21:39:38 +00:00
22edb457c9 [invoke_subgraph][partitioner] Add meta val on run_and_save_rng ops (#157319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157319
Approved by: https://github.com/zou3519
2025-07-01 21:02:08 +00:00
e5f6ffd810 [BE] Replace checkcall("chmod") with os.chmod (#157373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157373
Approved by: https://github.com/clee2000, https://github.com/eqy, https://github.com/Skylion007
2025-07-01 20:46:25 +00:00
019e30e3b8 [BE] Decorate LargeTensorTest with serialTests (#157382)
May be it'll help make M2-15 jobs more stable, as that was the last test run before OOM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157382
Approved by: https://github.com/clee2000
2025-07-01 20:35:42 +00:00
4500a4aa50 remove allow-untyped-defs from torch/backends/mps/__init__.py (#157227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157227
Approved by: https://github.com/Skylion007
2025-07-01 20:00:19 +00:00
6bc263809d [SymmMem] Add NVSHMEM_CHECK macro (#157174)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157174
Approved by: https://github.com/fduwjj, https://github.com/fegin
2025-07-01 19:50:28 +00:00
ffac0de07e [export] Remove stack trace from input/output (#157302)
Fixes https://github.com/pytorch/pytorch/issues/157183

https://github.com/pytorch/pytorch/pull/156257 consolidated the path for saving stack traces, but missed the part where stacktraces are not added to placeholder/output nodes in proxy_tensor tracing [(code)](https://github.com/pytorch/pytorch/pull/156257/files#diff-6960ce90e7162c0953b1ca07e92e7f0f2f6ba63b427b42df593e20cc6a096bb7L1107).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157302
Approved by: https://github.com/yushangdi
2025-07-01 19:16:28 +00:00
01b0f09931 Fix full_like decomposition to preserve strides (#144765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144765
Approved by: https://github.com/amjames, https://github.com/jansel
2025-07-01 19:13:22 +00:00
6401d1d53d Revert "Fused RMSNorm implementation (#153666)"
This reverts commit e1aee86646aa6d1b9cb9d34351e43936401c5efc.

Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/davidberard98 due to causing build failures on main branch [GH job link](https://github.com/pytorch/pytorch/actions/runs/16007148842/job/45156382001) [HUD commit link](e1aee86646) ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3025146176))
2025-07-01 18:46:45 +00:00
3a5677a380 Revert "ci: Add ability to test images for build-triton-wheel (#156894)"
This reverts commit 0e47312ae5a687f0aed61db753d03180118cddc4.

Reverted https://github.com/pytorch/pytorch/pull/156894 on behalf of https://github.com/seemethere due to causing issues in downstream builds see https://github.com/pytorch/pytorch/pull/156664 for more info ([comment](https://github.com/pytorch/pytorch/pull/156894#issuecomment-3025137790))
2025-07-01 18:43:34 +00:00
02608e560a [ROCm] Add more shards for inductor dashboard, more frequent runs (#157288)
Also increases regularity of dashboard runs on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157288
Approved by: https://github.com/jeffdaily
2025-07-01 18:27:30 +00:00
e1aee86646 Fused RMSNorm implementation (#153666)
Relevant #72643

Benchmarked versus unfused torch implementation and torch.compile implementation. Around 9x speedup vs unfused implementation on cuda and slightly faster vs inductor compile on 5090.

```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)

    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```

Here are the triton compile tests ran on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)

device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel
2025-07-01 18:22:24 +00:00
1c8844d9e7 [MPS] Switch Cholesky decomp to column wise (#157014)
Everything should go thru a generalized kernels, and Metal kernels should work with the same sizes and strides as CPU or CUDA backends to avoid problems with `torch.compile` that relies on the meta kernels to tell what its ouput going to look like.

To avoid returning tensors with different layout depending on whether upper parameter is true or false, templatize `factorDiagonalBlock`, `applyTRSM` and `applySYRK` to take upper/lower (actually row-wise vs column-wise) as template argument and call appropriate templates from host

TODOs:
 - Rename upper parameter to something more sensible and add comments
 - Use simd_groupsize instead of hardcoded 32 everywhere

Fixes https://github.com/pytorch/pytorch/issues/156658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157014
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157179
2025-07-01 18:00:59 +00:00
720c2c46b1 [Inductor UT][XPU] Reduce the runtime of the test case test_comprehensive_nn_functional_max_pool2d_xpu. (#157357)
This test case has over a thousand input samples, causing it to run for more than 30 minutes, which triggers the timeout mechanism and breaks the XPU CI. This PR limit the sample number as one for this XPU case .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157357
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2025-07-01 17:47:49 +00:00
3bc6bdc866 [BE] add type annotations and run mypy on setup.py (#156741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156741
Approved by: https://github.com/aorenste
2025-07-01 17:09:05 +00:00
47f10d0ad0 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```python
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-07-01 16:51:03 +00:00
0f9c1b374f [dynamo] Ensure global state guard is preserved across serialization. (#157285)
Currently, every time we construct a GLOBAL_STATE guard, we always create a fresh guard based on the current global state. For precompile, we want to create a GLOBAL_STATE guard always based on some external sources, e.g. serialized global states. This can also be applied with the normal case where we just pass in the global state guard from Python.

Differential Revision: [D77400988](https://our.internmc.facebook.com/intern/diff/D77400988/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157285
Approved by: https://github.com/jansel
2025-07-01 15:46:34 +00:00
b146e1a264 [BE] remove duplicates in generated torch._VF.__all__ (#157365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157365
Approved by: https://github.com/Skylion007
2025-07-01 15:43:20 +00:00
c78fce9e79 [dynamo] show frame information when recompilation is triggered on fail_on_recompile (#156433)
adding more information to the error message for debugging.

example error message:
```
Detected recompile when torch.compile stance is 'fail_on_recompile'. filename: 'caffe2/test/dynamo/test_misc.py', function name: 'fn', line number: 0
Failed on the following precompiled guards:

TREE_GUARD_MANAGER:
+- RootGuardManager
| +- LAMBDA_GUARD: isinstance(L['x'], bool)
GuardDebugInfo(
result=0,
verbose_code_parts=["isinstance(L['x'], bool)"],
num_guards_executed=1)
```

Differential Revision: [D76987126](https://our.internmc.facebook.com/intern/diff/D76987126/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156433
Approved by: https://github.com/jamesjwu
2025-07-01 15:15:58 +00:00
023887fc5a Revert "Switch to standard pep517 sdist generation (#152098)"
This reverts commit f16053f0c9a09fa337fbf85aaf64f88712b8dcdb.

Reverted https://github.com/pytorch/pytorch/pull/152098 on behalf of https://github.com/malfet due to IMO this PR needs to be split into few helper ones, with better test plan ([comment](https://github.com/pytorch/pytorch/pull/152098#issuecomment-3024223880))
2025-07-01 14:14:52 +00:00
1586521461 Revert "Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#155590)"
This reverts commit 2c76f31221e117b217b8a6a96a5405f626d2218a.

Reverted https://github.com/pytorch/pytorch/pull/155590 on behalf of https://github.com/jeanschmidt due to Breaking 1000s of internal builds, it cant be properly landed internally, there are no options except revert and codev. ([comment](https://github.com/pytorch/pytorch/pull/155590#issuecomment-3023503929))
2025-07-01 11:23:00 +00:00
534c454e77 Revert "[xla hash update] update the pinned xla hash (#156584)"
This reverts commit b1a54fab9bcb0cc167773f9a885d4170447e1c68.

Reverted https://github.com/pytorch/pytorch/pull/156584 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/155590 ([comment](https://github.com/pytorch/pytorch/pull/156584#issuecomment-3023492421))
2025-07-01 11:20:05 +00:00
13bf2655c1 Revert "HF loads dcp - don't do a full deserialize on every file (#155942)"
This reverts commit 117db5601d78cbc746b35eef71fc815e042e903f.

Reverted https://github.com/pytorch/pytorch/pull/155942 on behalf of https://github.com/jeanschmidt due to Newly introduced tests are red internally, more details on D76442012 ([comment](https://github.com/pytorch/pytorch/pull/155942#issuecomment-3023473036))
2025-07-01 11:15:08 +00:00
0bce390269 Revert "[dynamo] Add fx_graph_runnable test coverage (#157021)"
This reverts commit 20e40492b046b9287726d3ec656117e4dc38f0e2.

Reverted https://github.com/pytorch/pytorch/pull/157021 on behalf of https://github.com/jeanschmidt due to New tests are red internally, more details on D77471538 ([comment](https://github.com/pytorch/pytorch/pull/157021#issuecomment-3023455082))
2025-07-01 11:10:45 +00:00
a767e50adc remove allow-untyped-defs from torch/fx/experimental/migrate_gradual_types/util.py (#157236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157236
Approved by: https://github.com/ezyang
2025-07-01 10:36:48 +00:00
210632fae1 [ROCm] support experimental CU carveout (#149466)
Fixes #149280.  Follow up to #147966, but now available for ROCm.

Since hipblaslt does not support HIPBLASLT_MATMUL_DESC_CU_COUNT_TARGET, we instead create a hipStream that has a CU mask applied.  We pass this masked stream to hipblaslt instead of pytorch's current stream.  We ensure stream ordering between streams using hipEvents and stream synchronization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149466
Approved by: https://github.com/malfet, https://github.com/atalman
2025-07-01 08:54:52 +00:00
0596323c35 Better fix for __index__ SymInt issue (#157201)
This improves on #156928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157201
Approved by: https://github.com/ezyang
2025-07-01 07:06:46 +00:00
c202a7329a Revert "Fixes for CPython int/float tests (#155978)"
This reverts commit 23491519d288dedb2a54cfad5fef7fcb2ad8eade.

Reverted https://github.com/pytorch/pytorch/pull/155978 on behalf of https://github.com/XuehaiPan due to sys.get_int_max_str_digits is not always available ([comment](https://github.com/pytorch/pytorch/pull/155978#issuecomment-3021990027))
2025-07-01 06:16:49 +00:00
754699610b [BE] always use uv pip if possible in pip_init.py for lintrunner init (#157199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157199
Approved by: https://github.com/ezyang
2025-07-01 06:07:29 +00:00
8f0998aafe Check F2C BLAS for OpenBLAS and other vendors (#143846)
This issue came from https://github.com/conda-forge/pytorch-cpu-feedstock/issues/180. MKL follows the F2C convention for returning single precision floats as doubles and uses the G77 convention for returning complex valued scalars. OpenBLAS does the opposite. There is a check for this already, but it's done only when the Generic BLAS vendor code path is used and this PR moves that code to `Dependencies.cmake` to make it work when the BLAS vendor is OpenBLAS and others

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143846
Approved by: https://github.com/rgommers, https://github.com/atalman
2025-07-01 05:56:24 +00:00
04bd7e6850 [ROCm] Remove use of warpsize on host-side compilation (#156979)
Changes needed for ROCm7.0:
* `warpSize` is _not_ a compile-time constant on device-side compilation for ROCm anymore
* `warpSize` is _not_ defined on host-side compilation, hence `at::cuda::warp_size()` must be used to query warpsize at runtime
* Redefining `C10_WARP_SIZE` to be a compile-time constant, with a reasonable value for device-side compilation, but an unreasonable value of 1 for host-side compilation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156979
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-01 04:55:31 +00:00
c811f41cf5 [BE] Remove unused variable from Pooling.metal (#157332)
Fixes following compilation warning
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:101:21: warning: unused variable 'indices_sizes' [-Wunused-variable]
  constant int64_t* indices_sizes = params.indices_sizes.data();
                    ^

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157332
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/dcci
2025-07-01 04:28:04 +00:00
4d5d627e5f Remove super spammy log (#157157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157157
Approved by: https://github.com/davidberard98
2025-07-01 03:51:58 +00:00
b40981c630 Fix incorrect stride handling in adaptive_avg_pool3d (#157326)
Fixes #157248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157326
Approved by: https://github.com/eqy
ghstack dependencies: #157242
2025-07-01 03:03:48 +00:00
b5ce77c1f5 [ROCm] Initial AITER Integration for mha_bwd asm kernels (#152630)
Generates AITER plumbing via cmake. Calls into fav3 asm bwd CK kernels.

Update submodule composable kernel for this change

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152630
Approved by: https://github.com/xw285cornell, https://github.com/yoyoyocmu
2025-07-01 02:53:27 +00:00
f40efde2a4 [CI] Add prebuild command option, set prebuild command option for CI to build flash attention (#156236)
Build flash attention separately in build using 2 jobs since it OOMs on more, then the rest of the job uses 6
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156236
Approved by: https://github.com/malfet
2025-07-01 02:53:22 +00:00
3ed4384f5b [dynamo] temporarily disabling generation of weblinks for torch v2.8 release (#157299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157299
Approved by: https://github.com/williamwen42
2025-07-01 02:31:17 +00:00
c174f3a6a5 [ONNX] Delete deprecated tutorial page link (#157310)
Related to https://github.com/pytorch/tutorials/issues/3420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157310
Approved by: https://github.com/justinchuby
2025-07-01 01:18:26 +00:00
6dc2b22269 [ROCm][SymmetricMemory] Performance improvements for two-shot allreduce (#156746)
The biggest bottleneck that we found with two-shot allreduce was that the compiler was serializing all the load operations for some reason. To avoid these load delays, we've added de-serialization of loads. Along with this improvement, we also found that on AMD GPUs a different block and thread size gives a nice performance boost. Here are the bandwidth numbers I am getting with this PR:
![image](https://github.com/user-attachments/assets/57005856-4cb5-43cd-8e9c-46869f75ab0b)

The rows that are green are the tensor sizes that we are interested in because two-shot is only used for bigger sizes (one-shot is used for smaller sizes). As we can see, our baseline numbers wrt to fbgemm numbers were consistently underperforming. However, with this deserialize change, most of the tensor sizes have a performance boost (positive %) for the green tensors. There's one tensor with negative performance, but that's within error margin.

co-authored by: @amd-hhashemi
https://github.com/pytorch/FBGEMM/issues/4072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156746
Approved by: https://github.com/jeffdaily

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-07-01 00:37:30 +00:00
f860992db5 Add a custom profiler configuration option (#151656)
We aim to pass some configuration options to our custom Kineto backend via ExperimentalConfig,, so we added a `custom_profiler_config` parameter.

Requires https://github.com/pytorch/kineto/pull/1077 ,
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151656
Approved by: https://github.com/sraikund16
2025-07-01 00:36:09 +00:00
b60569ed94 HF - consolidate shards of safetensors files to full tensors in finish step (#156705)
Title - we can consolidate the shards to a full tensors, optionally behind a flag, in the finish step of DCP.save
also adds the thread count argument which is configurable for users, before we were just using the default of 1.
Re-creating https://github.com/pytorch/pytorch/pull/155940 bc it got into a bad detached state

Differential Revision: [D77231774](https://our.internmc.facebook.com/intern/diff/D77231774/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156705
Approved by: https://github.com/saumishr
ghstack dependencies: #154743
2025-07-01 00:30:48 +00:00
4ebd269065 [Testing] Remove duplicate MPSInductor tests (#157328)
They were added there before test_torchinductor were running in CI, but
now the same are covered by `GPUTests.test_pointwise_*_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157328
Approved by: https://github.com/huydhn
2025-07-01 00:21:22 +00:00
7709ff5512 [remove untyped defs] batch 1 (#157011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157011
Approved by: https://github.com/Skylion007
2025-06-30 23:54:40 +00:00
fee2377f9e Reapply D77381084 / #156964: Rename torch::standalone to headeronly (#157251)
Was reverted due to internal failure which should be fixed now. I believe Jane wants this reapplied and picked to release, and she's out this week.

Original summary:

headeronly is more clear, let's change the name before anyone depends on standalone

Differential Revision: [D77520173](https://our.internmc.facebook.com/intern/diff/D77520173/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157251
Approved by: https://github.com/janeyx99, https://github.com/Skylion007, https://github.com/desertfire
2025-06-30 23:25:30 +00:00
3dda80e990 Overload mul_overflows for size_t (#155736)
Partially fixes https://github.com/pytorch/executorch/pull/11537.

We want to extend `mul_overflows` to support `size_t` in ExecuTorch. The current workflow in ET checks that the `c10` mirrors exactly as in PT, so the tests are failing.

See comment: https://github.com/pytorch/executorch/pull/11537#issuecomment-2963821312
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155736
Approved by: https://github.com/swolchok
2025-06-30 22:57:28 +00:00
42b48ee672 [dynamo][fsdp] Consistent behavior of int attributes (#157262)
Reimpl of https://github.com/pytorch/pytorch/pull/150954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157262
Approved by: https://github.com/bdhirsh
2025-06-30 22:32:52 +00:00
a9352bd25e Script for consolidation of sharded safetensor files (#154743)
Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154743
Approved by: https://github.com/saumishr
2025-06-30 22:25:58 +00:00
f096820d0f [precompile] Detect source code changes for save/load. (#156432)
Go through all dynamo traced functions and compute checksum for them. While loading a precompilation back to memory, we will always check the checksum and refuse to load when
source code changes are detected.

Differential Revision: [D76987123](https://our.internmc.facebook.com/intern/diff/D76987123/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156432
Approved by: https://github.com/jansel, https://github.com/jamesjwu
2025-06-30 21:16:15 +00:00
d3efd73234 Revert "[cutlass backend][BE][ez] Make matmul layouts be row x column (#156656)"
This reverts commit 84c588e5eada9e7921608065edc444a15c22cb1c.

Reverted https://github.com/pytorch/pytorch/pull/156656 on behalf of https://github.com/henrylhtsang due to breaking fbcode A100 tests ([comment](https://github.com/pytorch/pytorch/pull/156656#issuecomment-3020769914))
2025-06-30 21:16:04 +00:00
3684be056d [dynamo] Fix source for lru_cache method (#157292)
Fixes - https://github.com/pytorch/pytorch/issues/157273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157292
Approved by: https://github.com/zou3519, https://github.com/malfet, https://github.com/jansel
2025-06-30 20:53:57 +00:00
23491519d2 Fixes for CPython int/float tests (#155978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155978
Approved by: https://github.com/zou3519
2025-06-30 19:42:11 +00:00
f16053f0c9 Switch to standard pep517 sdist generation (#152098)
Generate source tarball with PEP 517 conform build tools instead of the custom routine in place right now.

Closes #150461.

The current procedure for generating the source tarball consists in creation of a source tree by manual copying and pruning of source files.

This PR replaces that with a call to the standard [build tool](https://build.pypa.io/en/stable/), which works with the build backend to produce an sdist. For that to work correctly, the build backend also needs to be configured. In the case of Pytorch, the backend currently is (the legacy version of) the setuptools backend, the source dist part of which is mostly configured via the `MANIFEST.in` file.

The resulting source distribution can be used to install directly from source with `pip install ./torch-{version}.tar.gz` or to build wheels directly from source with `pip wheel ./torch-{version}.tar.gz`; both should be considered experimental for now.

## Issues

### sdist name
According to PEP 517, the name of the source distribution file must coincide with the project name, or [more precisely](https://peps.python.org/pep-0517/#source-distributions), the source distribution of a project that generates `{NAME}-{...}.whl` wheels are required to be named `{NAME}-{...}.tar.gz`. Currently, the source tarball is called `pytorch-{...}.tar.gz`, but the generated wheels and python package are called `torch-{...}`.

### Symbolic Links
The source tree at the moment contains a small number of symbolic links. This [has been seen as problematic](https://github.com/pypa/pip/issues/5919) largely because of lack of support on Windows, but also because of [a problem in setuptools](https://github.com/pypa/setuptools/issues/4937). Particularly unfortunate is a circular symlink in the third party `ittapi` module, which can not be resolved by replacing it with a copy.

PEP 721 (now integrated in the [Source Distribution Format Specification](https://packaging.python.org/en/latest/specifications/source-distribution-format/#source-distribution-archive-features)) allows for symbolic links, but only if they don't point outside the destination directory and if they don't contain `../` in their target.

The list of symbolic links currently is as follows:

<details>

|source|target|problem|solution|
|-|-|-|-|
| `.dockerignore` | `.gitignore` |  ok (individual file) ||
| `docs/requirements.txt` | `../.ci/docker/requirements-docs.txt` |`..` in target|swap source and target[^1]|
| `functorch/docs/source/notebooks` | `../../notebooks/` |`..` in target|swap source and target[^1]|
| `.github/ci_commit_pins/triton.txt` | `../../.ci/docker/ci_commit_pins/triton.txt` |  ok (omitted from sdist)||
| `third_party/flatbuffers/docs/source/CONTRIBUTING.md` | `../../CONTRIBUTING.md` |`..` in target|omit from sdist[^2]|
| `third_party/flatbuffers/java/src/test/java/DictionaryLookup` | `../../../../tests/DictionaryLookup` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/MyGame` | `../../../../tests/MyGame` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceA` | `../../../../tests/namespace_test/NamespaceA` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceC` | `../../../../tests/namespace_test/NamespaceC` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/optional_scalars` | `../../../../tests/optional_scalars` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/union_vector` | `../../../../tests/union_vector` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/kotlin/benchmark/src/jvmMain/java` | `../../../../java/src/main/java` |`..` in target|omit from sdist[^3]|
| `third_party/ittapi/rust/ittapi-sys/c-library` | `../../` |`..` in target|omit from sdist[^4]|
| `third_party/ittapi/rust/ittapi-sys/LICENSES` | `../../LICENSES` |`..` in target|omit from sdist[^4]|
| `third_party/opentelemetry-cpp/buildscripts/pre-merge-commit` | `./pre-commit` | ok (individual file)||
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |`..` in target|omit from sdist[^5]|
| `third_party/XNNPACK/tools/xngen` | `xngen.py` |  ok (individual file)||

</details>

The introduction of symbolic links inside the `.ci/docker` folder creates a new problem, however, because Docker's `COPY` command does not allow symlinks in this way. We work around that by using `tar ch` to dereference the symlinks before handing them over to `docker build`.

[^1]: These resources can be naturally considered to be part of the docs, so moving the actual files into the place of the current symlinks and replacing them with (unproblematic) symlinks can be said to improve semantics as well.

[^2]: The flatbuffers docs already actually use the original file, not the symlink and in the most recent releases, starting from flatbuffers-25.1.21 the symlink is replaced by the actual file thanks to a documentation overhaul.

[^3]: These resources are flatbuffers tests for java and kotlin and can be omitted from our sdist.

[^4]: We don't need to ship the rust bindings for ittapi.

[^5]: These are demonstration examples for how to link to prometheus-cpp using cmake and can be omitted.

### Nccl
Nccl used to be included as a submodule. However, with #146073 (first released in v2.7.0-rc1), the submodule was removed and replaced with a build time checkout procedure in `tools/build_pytorch_libs.py`, which checks out the required version of nccl from the upstream repository based on a commit pin recorded in `.ci/docker/ci_commit_pins/nccl-cu{11,12}.txt`.
This means that a crucial third party dependency is missing from the source distribution and as the `.ci` folder is omitted from the source distribution, it is not possible to use the build time download.
However, it *is* possible to use a system provided Nccl using the `USE_SYSTEM_NCCL` environment variable, which now also is the default for the official Pytorch wheels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152098
Approved by: https://github.com/atalman
2025-06-30 19:07:34 +00:00
2567 changed files with 110880 additions and 54794 deletions

View File

@ -2,7 +2,7 @@ build --cxxopt=--std=c++17
build --copt=-I.
# Bazel does not support including its cc_library targets as system
# headers. We work around this for generated code
# (e.g. c10/macros/cmake_macros.h) by making the generated directory a
# (e.g. torch/headeronly/macros/cmake_macros.h) by making the generated directory a
# system include path.
build --copt=-isystem --copt bazel-out/k8-fastbuild/bin
build --copt=-isystem --copt bazel-out/darwin-fastbuild/bin

View File

@ -4,7 +4,7 @@ set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
if [[ "$GPU_ARCH_VERSION" == *"12.9"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0"
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;12.0"
fi
SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"

View File

@ -5,7 +5,7 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
if [[ ${BUILD_ENVIRONMENT} == *onnx* ]]; then
pip install click mock tabulate networkx==2.0
pip -q install --user "file:///var/lib/jenkins/workspace/third_party/onnx#egg=onnx"
pip -q install "file:///var/lib/jenkins/workspace/third_party/onnx#egg=onnx"
fi
# Skip tests in environments where they are not built/applicable
@ -147,8 +147,8 @@ export DNNL_MAX_CPU_ISA=AVX2
if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then
# TODO(sdym@meta.com) remove this when the linked issue resolved.
# py is temporary until https://github.com/Teemu/pytest-sugar/issues/241 is fixed
pip install --user py==1.11.0
pip install --user pytest-sugar
pip install py==1.11.0
pip install pytest-sugar
# NB: Warnings are disabled because they make it harder to see what
# the actual erroring test is
"$PYTHON" \

View File

@ -36,3 +36,104 @@ See `build.sh` for valid build environments (it's the giant switch).
# Set flags (see build.sh) and build image
sudo bash -c 'TRITON=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
```
## [Guidance] Adding a New Base Docker Image
### Background
The base Docker images in directory `.ci/docker/` are built by the `docker-builds.yml` workflow. Those images are used throughout the PyTorch CI/CD pipeline. You should only create or modify a base Docker image if you need specific environment changes or dependencies before building PyTorch on CI.
1. **Automatic Rebuilding**:
- The Docker image building process is triggered automatically when changes are made to files in the `.ci/docker/*` directory
- This ensures all images stay up-to-date with the latest dependencies and configurations
2. **Image Reuse in PyTorch Build Workflows** (example: linux-build):
- The images generated by `docker-builds.yml` are reused in `_linux-build.yml` through the `calculate-docker-image` step
- The `_linux-build.yml` workflow:
- Pulls the Docker image determined by the `calculate-docker-image` step
- Runs a Docker container with that image
- Executes `.ci/pytorch/build.sh` inside the container to build PyTorch
3. **Usage in Test Workflows** (example: linux-test):
- The same Docker images are also used in `_linux-test.yml` for running tests
- The `_linux-test.yml` workflow follows a similar pattern:
- It uses the `calculate-docker-image` step to determine which Docker image to use
- It pulls the Docker image and runs a container with that image
- It installs the wheels from the artifacts generated by PyTorch build jobs
- It executes test scripts (like `.ci/pytorch/test.sh` or `.ci/pytorch/multigpu-test.sh`) inside the container
### Understanding File Purposes
#### `.ci/docker/build.sh` vs `.ci/pytorch/build.sh`
- **`.ci/docker/build.sh`**:
- Used for building base Docker images
- Executed by the `docker-builds.yml` workflow to pre-build Docker images for CI
- Contains configurations for different Docker build environments
- **`.ci/pytorch/build.sh`**:
- Used for building PyTorch inside a Docker container
- Called by workflows like `_linux-build.yml` after the Docker container is started
- Builds PyTorch wheels and other artifacts
#### `.ci/docker/ci_commit_pins/` vs `.github/ci_commit_pins`
- **`.ci/docker/ci_commit_pins/`**:
- Used for pinning dependency versions during base Docker image building
- Ensures consistent environments for building PyTorch
- Changes here trigger base Docker image rebuilds
- **`.github/ci_commit_pins`**:
- Used for pinning dependency versions during PyTorch building and tests
- Ensures consistent dependencies for PyTorch across different builds
- Used by build scripts running inside Docker containers
### Step-by-Step Guide for Adding a New Base Docker Image
#### 1. Add Pinned Commits (If Applicable)
We use pinned commits for build stability. The `nightly.yml` workflow checks and updates pinned commits for certain repository dependencies daily.
If your new Docker image needs a library installed from a specific pinned commit or built from source:
1. Add the repository you want to track in `nightly.yml` and `merge-rules.yml`
2. Add the initial pinned commit in `.ci/docker/ci_commit_pins/`. The text filename should match the one defined in step 1
#### 2. Configure the Base Docker Image
1. **Add new Base Docker image configuration** (if applicable):
Add the configuration in `.ci/docker/build.sh`. For example:
```bash
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-new1)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
NEW_ARG_1=yes
;;
```
2. **Add build arguments to Docker build command**:
If you're introducing a new argument to the Docker build, make sure to add it in the Docker build step in `.ci/docker/build.sh`:
```bash
docker build \
....
--build-arg "NEW_ARG_1=${NEW_ARG_1}"
```
3. **Update Dockerfile logic**:
Update the Dockerfile to use the new argument. For example, in `ubuntu/Dockerfile`:
```dockerfile
ARG NEW_ARG_1
# Set up environment for NEW_ARG_1
RUN if [ -n "${NEW_ARG_1}" ]; then bash ./do_something.sh; fi
```
4. **Add the Docker configuration** in `.github/workflows/docker-builds.yml`:
The `docker-builds.yml` workflow pre-builds the Docker images whenever changes occur in the `.ci/docker/` directory. This includes the
pinned commit updates.

View File

@ -52,6 +52,8 @@ fi
if [[ "$image" == *-jammy* ]]; then
UBUNTU_VERSION=22.04
elif [[ "$image" == *-noble* ]]; then
UBUNTU_VERSION=24.04
elif [[ "$image" == *ubuntu* ]]; then
extract_version_from_image_name ubuntu UBUNTU_VERSION
fi
@ -89,9 +91,18 @@ tag=$(echo $image | awk -F':' '{print $2}')
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$tag" in
pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11)
CUDA_VERSION=12.4
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
@ -102,7 +113,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
@ -114,7 +124,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
VISION=yes
@ -126,7 +135,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
@ -136,56 +144,18 @@ case "$tag" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
@ -206,32 +176,12 @@ case "$tag" in
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.11-clang12)
ANACONDA_PYTHON_VERSION=3.11
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc9)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=9
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
ROCM_VERSION=6.3
NINJA_VERSION=1.9.0
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-noble-rocm-n-py3)
if [[ $tag =~ "jammy" ]]; then
ANACONDA_PYTHON_VERSION=3.10
else
ANACONDA_PYTHON_VERSION=3.12
fi
GCC_VERSION=11
VISION=yes
ROCM_VERSION=6.4
@ -242,6 +192,19 @@ case "$tag" in
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-noble-rocm-alpha-py3)
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
VISION=yes
ROCM_VERSION=7.0
NINJA_VERSION=1.9.0
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
PYTORCH_ROCM_ARCH="gfx90a;gfx942;gfx950"
;;
pytorch-linux-jammy-xpu-2025.0-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
@ -258,7 +221,7 @@ case "$tag" in
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
VISION=yes
@ -270,7 +233,6 @@ case "$tag" in
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
CLANG_VERSION=12
VISION=yes
TRITON=yes
@ -322,6 +284,8 @@ case "$tag" in
GCC_VERSION=11
ACL=yes
VISION=yes
CONDA_CMAKE=yes
OPENBLAS=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -331,6 +295,8 @@ case "$tag" in
GCC_VERSION=11
ACL=yes
VISION=yes
CONDA_CMAKE=yes
OPENBLAS=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -345,7 +311,6 @@ case "$tag" in
fi
if [[ "$image" == *cuda* ]]; then
extract_version_from_image_name cuda CUDA_VERSION
extract_version_from_image_name cudnn CUDNN_VERSION
fi
if [[ "$image" == *rocm* ]]; then
extract_version_from_image_name rocm ROCM_VERSION
@ -397,9 +362,6 @@ docker build \
--build-arg "PYTHON_VERSION=${PYTHON_VERSION}" \
--build-arg "GCC_VERSION=${GCC_VERSION}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
--build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
@ -417,6 +379,7 @@ docker build \
--build-arg "XPU_VERSION=${XPU_VERSION}" \
--build-arg "UNINSTALL_DILL=${UNINSTALL_DILL}" \
--build-arg "ACL=${ACL:-}" \
--build-arg "OPENBLAS=${OPENBLAS:-}" \
--build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \
--build-arg "SKIP_LLVM_SRC_BUILD_INSTALL=${SKIP_LLVM_SRC_BUILD_INSTALL:-}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \

View File

@ -1 +1 @@
v2.27.3-1
v2.27.5-1

View File

@ -1 +1 @@
c8757738a7418249896224430ce84888e8ecdd79
f7888497a1eb9e98d4c07537f0d0bcfe180d1363

View File

@ -23,6 +23,10 @@ conda_install() {
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION" $*
}
conda_install_through_forge() {
as_jenkins conda install -c conda-forge -q -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION" $*
}
conda_run() {
as_jenkins conda run -n py_$ANACONDA_PYTHON_VERSION --no-capture-output $*
}

View File

@ -15,6 +15,9 @@ install_ubuntu() {
elif [[ "$UBUNTU_VERSION" == "22.04"* ]]; then
cmake3="cmake=3.22*"
maybe_libiomp_dev=""
elif [[ "$UBUNTU_VERSION" == "24.04"* ]]; then
cmake3="cmake=3.28*"
maybe_libiomp_dev=""
else
cmake3="cmake=3.5*"
maybe_libiomp_dev="libiomp-dev"

View File

@ -4,12 +4,8 @@ set -ex
# Optionally install conda
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]] || [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download" # @lint-ignore
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
fi
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download" # @lint-ignore
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)
@ -21,7 +17,6 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
exit 1
;;
esac
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda
@ -70,10 +65,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
fi
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
conda_install "openblas==0.3.29=*openmp*"
else
conda_install "mkl=2021.4.0 mkl-include=2021.4.0"
if [[ $(uname -m) != "aarch64" ]]; then
pip_install mkl==2024.2.0
pip_install mkl-static==2024.2.0
pip_install mkl-include==2024.2.0
fi
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
@ -87,6 +82,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
conda_run ${SCRIPT_FOLDER}/install_magma_conda.sh $(cut -f1-2 -d'.' <<< ${CUDA_VERSION})
fi
if [[ "$UBUNTU_VERSION" == "24.04"* ]] ; then
conda_install_through_forge libstdcxx-ng=14
fi
# Install some other packages, including those needed for Python test reporting
pip_install -r /opt/conda/requirements-ci.txt

View File

@ -66,7 +66,7 @@ function do_cpython_build {
ln -s pip3 ${prefix}/bin/pip
fi
# install setuptools since python 3.12 is required to use distutils
${prefix}/bin/pip install wheel==0.34.2 setuptools==68.2.2
${prefix}/bin/pip install wheel==0.45.1 setuptools==80.9.0
local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")
ln -sf ${prefix} /opt/python/${abi_tag}
}

View File

@ -10,6 +10,8 @@ else
arch_path='sbsa'
fi
NVSHMEM_VERSION=3.3.9
function install_cuda {
version=$1
runfile=$2
@ -40,13 +42,65 @@ function install_cudnn {
rm -rf tmp_cudnn
}
function install_nvshmem {
cuda_major_version=$1 # e.g. "12"
nvshmem_version=$2 # e.g. "3.3.9"
case "${arch_path}" in
sbsa)
dl_arch="aarch64"
;;
x86_64)
dl_arch="x64"
;;
*)
dl_arch="${arch}"
;;
esac
tmpdir="tmp_nvshmem"
mkdir -p "${tmpdir}" && cd "${tmpdir}"
# nvSHMEM license: https://docs.nvidia.com/nvshmem/api/sla.html
filename="libnvshmem_cuda${cuda_major_version}-linux-${arch_path}-${nvshmem_version}"
url="https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/builds/cuda${cuda_major_version}/txz/agnostic/${dl_arch}/${filename}.tar.gz"
# download, unpack, install
wget -q "${url}"
tar xf "${filename}.tar.gz"
cp -a "libnvshmem/include/"* /usr/local/cuda/include/
cp -a "libnvshmem/lib/"* /usr/local/cuda/lib64/
# cleanup
cd ..
rm -rf "${tmpdir}"
echo "nvSHMEM ${nvshmem_version} for CUDA ${cuda_major_version} (${arch_path}) installed."
}
function install_124 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.2"
install_cuda 12.4.1 cuda_12.4.1_550.54.15_linux
install_cudnn 12 $CUDNN_VERSION
CUDA_VERSION=12.4 bash install_nccl.sh
CUDA_VERSION=12.4 bash install_cusparselt.sh
ldconfig
}
function install_126 {
CUDNN_VERSION=9.10.2.21
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
install_cuda 12.6.3 cuda_12.6.3_560.35.05_linux
install_cudnn 12 $CUDNN_VERSION
install_nvshmem 12 $NVSHMEM_VERSION
CUDA_VERSION=12.6 bash install_nccl.sh
CUDA_VERSION=12.6 bash install_cusparselt.sh
@ -56,13 +110,15 @@ function install_126 {
function install_129 {
CUDNN_VERSION=9.10.2.21
echo "Installing CUDA 12.9.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
echo "Installing CUDA 12.9.1 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
# install CUDA 12.9.1 in the same container
install_cuda 12.9.1 cuda_12.9.1_575.57.08_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 12 $CUDNN_VERSION
install_nvshmem 12 $NVSHMEM_VERSION
CUDA_VERSION=12.9 bash install_nccl.sh
CUDA_VERSION=12.9 bash install_cusparselt.sh
@ -70,6 +126,40 @@ function install_129 {
ldconfig
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
function prune_126 {
echo "Pruning CUDA 12.6"
#####################################################################################
@ -106,13 +196,15 @@ function prune_126 {
function install_128 {
CUDNN_VERSION=9.8.0.87
echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
# install CUDA 12.8.1 in the same container
install_cuda 12.8.1 cuda_12.8.1_570.124.06_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 12 $CUDNN_VERSION
install_nvshmem 12 $NVSHMEM_VERSION
CUDA_VERSION=12.8 bash install_nccl.sh
CUDA_VERSION=12.8 bash install_cusparselt.sh
@ -124,6 +216,8 @@ function install_128 {
while test $# -gt 0
do
case "$1" in
12.4) install_124; prune_124
;;
12.6|12.6.*) install_126; prune_126
;;
12.8|12.8.*) install_128;

View File

@ -1,24 +0,0 @@
#!/bin/bash
if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.9" || ${CUDA_VERSION:0:4} == "12.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"
elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"
else
print "Unsupported CUDA version ${CUDA_VERSION}"
exit 1
fi
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
tar xf ${CUDNN_NAME}.tar.xz
cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cudnn
ldconfig
fi

View File

@ -13,6 +13,14 @@ if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-9]$ ]]; then
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.7.1.0-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
else
echo "Not sure which libcusparselt version to install for this ${CUDA_VERSION}"
fi

View File

@ -15,11 +15,37 @@ function install_timm() {
commit=$(get_pinned_commit timm)
pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"
# Clean up
conda_run pip uninstall -y torch torchvision triton
}
function install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout "$commit"
python install.py --continue_on_fail
# TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488
# is regressing speedup metric. This needs to be investigated further
pip install transformers==4.38.1
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
chown -R jenkins torchbench
}
# Pango is needed for weasyprint which is needed for doctr
conda_install pango
# Stable packages are ok here, just to satisfy TorchBench check
pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
install_torchbench
install_huggingface
install_timm
# Clean up
conda_run pip uninstall -y torch torchvision torchaudio triton

View File

@ -4,8 +4,9 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b "${OPENBLAS_VERSION:-v0.3.29}" --depth 1 --shallow-submodules
git clone https://github.com/OpenMathLib/OpenBLAS.git -b "${OPENBLAS_VERSION:-v0.3.30}" --depth 1 --shallow-submodules
OPENBLAS_CHECKOUT_DIR="OpenBLAS"
OPENBLAS_BUILD_FLAGS="
NUM_THREADS=128
USE_OPENMP=1
@ -13,9 +14,8 @@ NO_SHARED=0
DYNAMIC_ARCH=1
TARGET=ARMV8
CFLAGS=-O3
BUILD_BFLOAT16=1
"
OPENBLAS_CHECKOUT_DIR="OpenBLAS"
make -j8 ${OPENBLAS_BUILD_FLAGS} -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} install -C ${OPENBLAS_CHECKOUT_DIR}

View File

@ -8,9 +8,11 @@ ver() {
install_ubuntu() {
apt-get update
if [[ $UBUNTU_VERSION == 20.04 ]]; then
# gpg-agent is not available by default on 20.04
apt-get install -y --no-install-recommends gpg-agent
# gpg-agent is not available by default
apt-get install -y --no-install-recommends gpg-agent
if [[ $(ver $UBUNTU_VERSION) -ge $(ver 22.04) ]]; then
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
| sudo tee /etc/apt/preferences.d/rocm-pin-600
fi
apt-get install -y kmod
apt-get install -y wget
@ -28,16 +30,25 @@ EOF
# we want the patch version of 6.4 instead
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
ROCM_VERSION="${ROCM_VERSION}.1"
ROCM_VERSION="${ROCM_VERSION}.2"
fi
# Default url values
rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu"
# Special case for ROCM_VERSION == 7.0
if [[ $(ver "$ROCM_VERSION") -eq $(ver 7.0) ]]; then
rocm_baseurl="https://repo.radeon.com/rocm/apt/7.0_alpha2"
amdgpu_baseurl="https://repo.radeon.com/amdgpu/30.10_alpha2/ubuntu"
fi
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
# Add rocm repository
wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
local rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
echo "deb [arch=amd64] ${rocm_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/rocm.list
apt-get update --allow-insecure-repositories
@ -71,29 +82,33 @@ EOF
done
# ROCm 6.3 had a regression where initializing static code objects had significant overhead
# CI no longer builds for ROCm 6.3, but
# ROCm 6.4 did not yet fix the regression, also HIP branch names are different
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.3) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
VER_PATCH=.1
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.4) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.2) ]]; then
HIP_TAG=rocm-6.4.2
CLR_HASH=74d78ba3ac4bac235d02bcb48511c30b5cfdd457 # branch release/rocm-rel-6.4.2-statco-hotfix
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
HIP_TAG=rocm-6.4.1
CLR_HASH=efe6c35790b9206923bfeed1209902feff37f386 # branch release/rocm-rel-6.4.1-statco-hotfix
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
HIP_TAG=rocm-6.4.0
CLR_HASH=600f5b0d2baed94d5121e2174a9de0851b040b0c # branch release/rocm-rel-6.4-statco-hotfix
fi
# clr build needs CppHeaderParser but can only find it using conda's python
/opt/conda/bin/python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b $HIP_BRANCH
python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b $HIP_TAG
HIP_COMMON_DIR=$(readlink -f HIP)
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}${VER_PATCH}-statco-hotfix
git clone https://github.com/jeffdaily/clr
pushd clr
git checkout $CLR_HASH
popd
mkdir -p clr/build
pushd clr/build
cmake .. -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR
# Need to point CMake to the correct python installation to find CppHeaderParser
cmake .. -DPython3_EXECUTABLE=/opt/conda/envs/py_${ANACONDA_PYTHON_VERSION}/bin/python3 -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR
make -j
cp hipamd/lib/libamdhip64.so.${VER_STR}.* /opt/rocm/lib/libamdhip64.so.${VER_STR}.*
cp hipamd/lib/libamdhip64.so.6.4.* /opt/rocm/lib/libamdhip64.so.6.4.*
popd
rm -rf HIP clr
fi

View File

@ -56,14 +56,10 @@ function install_ubuntu() {
function install_rhel() {
. /etc/os-release
if [[ "${ID}" == "rhel" ]]; then
if [[ ! " 8.8 8.9 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then
echo "RHEL version ${VERSION_ID} not supported"
exit
fi
elif [[ "${ID}" == "almalinux" ]]; then
# Workaround for almalinux8 which used by quay.io/pypa/manylinux_2_28_x86_64
VERSION_ID="8.8"
if [[ ! " 8.8 8.10 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then
echo "RHEL version ${VERSION_ID} not supported"
exit
fi
dnf install -y 'dnf-command(config-manager)'

View File

@ -41,7 +41,7 @@ case ${DOCKER_TAG_PREFIX} in
rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.1"
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
fi
BASE_TARGET=rocm
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete

View File

@ -27,5 +27,7 @@ COPY ./common/install_linter.sh install_linter.sh
RUN bash ./install_linter.sh
RUN rm install_linter.sh
RUN chown -R jenkins:jenkins /var/lib/jenkins/ci_env
USER jenkins
CMD ["bash"]

View File

@ -41,7 +41,7 @@ case ${image} in
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
OPENBLAS_VERSION="v0.3.29"
OPENBLAS_VERSION="v0.3.30"
;;
manylinuxcxx11-abi-builder:cpu-cxx11-abi)
TARGET=final
@ -77,7 +77,7 @@ case ${image} in
manylinux2_28-builder:rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.1"
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
fi
TARGET=rocm_final
MANY_LINUX_VERSION="2_28"

View File

@ -16,6 +16,7 @@ click
#test that import:
coremltools==5.0b5 ; python_version < "3.12"
coremltools==8.3 ; python_version == "3.12"
#Description: Apple framework for ML integration
#Pinned versions: 5.0b5
#test that import:
@ -49,7 +50,7 @@ flatbuffers==24.12.23
hypothesis==5.35.1
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
#Description: advanced library for generating parametrized tests
#Pinned versions: 3.44.6, 4.53.2
#Pinned versions: 5.35.1
#test that import: test_xnnpack_integration.py, test_pruning_op.py, test_nn.py
junitparser==2.1.1
@ -63,6 +64,7 @@ lark==0.12.0
#test that import:
librosa>=0.6.2 ; python_version < "3.11"
librosa==0.10.2 ; python_version == "3.12"
#Description: A python package for music and audio analysis
#Pinned versions: >=0.6.2
#test that import: test_spectral_ops.py
@ -111,6 +113,7 @@ ninja==1.11.1.3
numba==0.49.0 ; python_version < "3.9"
numba==0.55.2 ; python_version == "3.9"
numba==0.55.2 ; python_version == "3.10"
numba==0.60.0 ; python_version == "3.12"
#Description: Just-In-Time Compiler for Numerical Functions
#Pinned versions: 0.54.1, 0.49.0, <=0.49.1
#test that import: test_numba_integration.py
@ -218,9 +221,9 @@ pygments==2.15.0
#Pinned versions: 2.12.0
#test that import: the doctests
#PyYAML
#pyyaml
#Description: data serialization format
#Pinned versions:
#Pinned versions: 6.0.2
#test that import:
#requests
@ -230,7 +233,7 @@ pygments==2.15.0
#rich
#Description: rich text and beautiful formatting in the terminal
#Pinned versions: 10.9.0
#Pinned versions: 14.1.0
#test that import:
scikit-image==0.19.3 ; python_version < "3.10"
@ -304,7 +307,7 @@ pytest-cpp==2.3.0
#Pinned versions: 2.3.0
#test that import:
z3-solver==4.12.6.0
z3-solver==4.15.1.0
#Description: The Z3 Theorem Prover Project
#Pinned versions:
#test that import:
@ -358,12 +361,11 @@ pwlf==2.2.1
#Pinned versions: 2.2.1
#test that import: test_sac_estimator.py
# To build PyTorch itself
astunparse
PyYAML
pyyaml
pyzstd
setuptools
setuptools>=70.1.0
six
scons==4.5.2 ; platform_machine == "aarch64"
@ -383,6 +385,12 @@ cmake==4.0.0
tlparse==0.3.30
#Description: required for log parsing
cuda-bindings>=12.0,<13.0
cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x"
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
#test that import: test_cuda.py
setuptools-git-versioning==2.1.0
scikit-build==0.18.1
pyre-extensions==0.0.32
tabulate==0.9.0
#Description: These package are needed to build FBGEMM and torchrec on PyTorch CI

View File

@ -1,11 +1,11 @@
sphinx==5.3.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@pytorch_sphinx_theme2#egg=pytorch_sphinx_theme2
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@722b7e6f9ca512fcc526ad07d62b3d28c50bb6cd#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought is probably
# something related to Docker setup. We can investigate this later
# but it doesn't seem to work and hangs around idly. The initial thought that it is probably
# something related to Docker setup. We can investigate this later.
sphinxcontrib.katex==0.8.6
#Description: This is used to generate PyTorch docs
@ -50,8 +50,8 @@ IPython==8.12.0
#Pinned versions: 8.12.0
myst-nb==0.17.2
#Description: This is used to generate PyTorch functorch docs
#Pinned versions: 0.13.2
#Description: This is used to generate PyTorch functorch and torch.compile docs.
#Pinned versions: 0.17.2
# The following are required to build torch.distributed.elastic.rendezvous.etcd* docs
python-etcd==0.4.5

View File

@ -1 +1 @@
3.3.1
3.4.0

View File

@ -98,8 +98,9 @@ COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
COPY ci_commit_pins/torchbench.txt torchbench.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt torchbench.txt
# (optional) Install non-default Ninja version
ARG NINJA_VERSION

View File

@ -98,8 +98,9 @@ COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
COPY ci_commit_pins/torchbench.txt torchbench.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt torchbench.txt
ARG TRITON
ARG TRITON_CPU
@ -147,6 +148,12 @@ RUN if [ -n "${ACL}" ]; then bash ./install_acl.sh; fi
RUN rm install_acl.sh
ENV INSTALLED_ACL ${ACL}
ARG OPENBLAS
COPY ./common/install_openblas.sh install_openblas.sh
RUN if [ -n "${OPENBLAS}" ]; then bash ./install_openblas.sh; fi
RUN rm install_openblas.sh
ENV INSTALLED_OPENBLAS ${OPENBLAS}
# Install ccache/sccache (do this last, so we get priority in PATH)
ARG SKIP_SCCACHE_INSTALL
COPY ./common/install_cache.sh install_cache.sh

View File

@ -97,7 +97,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -q cmake
retry pip install -qUr requirements-build.txt
python setup.py clean
retry pip install -qr requirements.txt
case ${DESIRED_PYTHON} in

View File

@ -51,20 +51,23 @@ else
fi
cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
12.8|12.9)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
#removing sm_50-sm_60 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
#however we would like to keep sm_70 architecture see: https://github.com/pytorch/pytorch/issues/157517
12.8)
TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;9.0;10.0;12.0"
;;
12.9)
TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX"
# WAR to resolve the ld error in libtorch build with CUDA 12.9
if [[ "$DESIRED_CUDA" == "cu129" && "$PACKAGE_TYPE" == "libtorch" ]]; then
if [[ "$PACKAGE_TYPE" == "libtorch" ]]; then
TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"
fi
;;
12.6)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6;9.0"
;;
*)
echo "unknown cuda version $CUDA_VERSION"

View File

@ -92,7 +92,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -q cmake
retry pip install -qUr requirements-build.txt
python setup.py clean
retry pip install -qr requirements.txt
retry pip install -q numpy==2.0.1
@ -104,7 +104,7 @@ if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
export ROCclr_DIR=/opt/rocm/rocclr/lib/cmake/rocclr
fi
echo "Calling setup.py install at $(date)"
echo "Calling 'python -m pip install .' at $(date)"
if [[ $LIBTORCH_VARIANT = *"static"* ]]; then
STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"
@ -120,7 +120,7 @@ fi
# TODO: Remove this flag once https://github.com/pytorch/pytorch/issues/55952 is closed
CFLAGS='-Wno-deprecated-declarations' \
BUILD_LIBTORCH_CPU_WITH_DEBUG=1 \
python setup.py install
python -m pip install --no-build-isolation -v .
mkdir -p libtorch/{lib,bin,include,share}

View File

@ -194,7 +194,7 @@ ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library
ROCBLAS_LIB_DST=lib/rocblas/library
ROCBLAS_ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
ROCBLAS_OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $OTHER_FILES)
ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $ROCBLAS_OTHER_FILES)
# hipblaslt library files
HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library

View File

@ -19,7 +19,7 @@ git config --global --add safe.directory /var/lib/jenkins/workspace
if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then
# TODO: This can be removed later once vision is also part of the Docker image
pip install -q --user --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)"
pip install -q --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)"
# JIT C++ extensions require ninja, so put it into PATH.
export PATH="/var/lib/jenkins/.local/bin:$PATH"
# NB: ONNX test is fast (~15m) so it's ok to retry it few more times to avoid any flaky issue, we

View File

@ -1,34 +0,0 @@
#!/usr/bin/env bash
# DO NOT ADD 'set -x' not to reveal CircleCI secret context environment variables
set -eu -o pipefail
# This script uses linux host toolchain + mobile build options in order to
# build & test mobile libtorch without having to setup Android/iOS
# toolchain/simulator.
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
# Install torch & torchvision - used to download & trace test model.
# Ideally we should use the libtorch built on the PR so that backward
# incompatible changes won't break this script - but it will significantly slow
# down mobile CI jobs.
# Here we install nightly instead of stable so that we have an option to
# temporarily skip mobile CI jobs on BC-breaking PRs until they are in nightly.
retry pip install --pre torch torchvision \
-f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html \
--progress-bar off
# Run end-to-end process of building mobile library, linking into the predictor
# binary, and running forward pass with a real model.
if [[ "$BUILD_ENVIRONMENT" == *-mobile-custom-build-static* ]]; then
TEST_CUSTOM_BUILD_STATIC=1 test/mobile/custom_build/build.sh
elif [[ "$BUILD_ENVIRONMENT" == *-mobile-lightweight-dispatch* ]]; then
test/mobile/lightweight_dispatch/build.sh
else
TEST_DEFAULT_BUILD=1 test/mobile/custom_build/build.sh
fi
print_sccache_stats

View File

@ -11,10 +11,6 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
if [[ "$BUILD_ENVIRONMENT" == *-mobile-*build* ]]; then
exec "$(dirname "${BASH_SOURCE[0]}")/build-mobile.sh" "$@"
fi
echo "Python version:"
python --version
@ -124,26 +120,8 @@ if [[ "$BUILD_ENVIRONMENT" == *libtorch* ]]; then
fi
# Use special scripts for Android builds
if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then
export ANDROID_NDK=/opt/ndk
build_args=()
if [[ "${BUILD_ENVIRONMENT}" == *-arm-v7a* ]]; then
build_args+=("-DANDROID_ABI=armeabi-v7a")
elif [[ "${BUILD_ENVIRONMENT}" == *-arm-v8a* ]]; then
build_args+=("-DANDROID_ABI=arm64-v8a")
elif [[ "${BUILD_ENVIRONMENT}" == *-x86_32* ]]; then
build_args+=("-DANDROID_ABI=x86")
elif [[ "${BUILD_ENVIRONMENT}" == *-x86_64* ]]; then
build_args+=("-DANDROID_ABI=x86_64")
fi
if [[ "${BUILD_ENVIRONMENT}" == *vulkan* ]]; then
build_args+=("-DUSE_VULKAN=ON")
fi
build_args+=("-DUSE_LITE_INTERPRETER_PROFILER=OFF")
exec ./scripts/build_android.sh "${build_args[@]}" "$@"
fi
if [[ "$BUILD_ENVIRONMENT" != *android* && "$BUILD_ENVIRONMENT" == *vulkan* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *vulkan* ]]; then
export USE_VULKAN=1
# shellcheck disable=SC1091
source /var/lib/jenkins/vulkansdk/setup-env.sh
@ -198,10 +176,8 @@ fi
# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of
# memory to build and will OOM
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]] && [ -z "$MAX_JOBS_OVERRIDE" ]; then
echo "WARNING: FlashAttention files require large amounts of memory to build and will OOM"
echo "Setting MAX_JOBS=(nproc-2)/3 to reduce memory usage"
export MAX_JOBS="$(( $(nproc --ignore=2) / 3 ))"
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]]; then
export BUILD_CUSTOM_STEP="ninja -C build flash_attention -j 2"
fi
if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then
@ -227,7 +203,7 @@ if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then
export USE_PRECOMPILED_HEADERS=1
fi
if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
if [[ "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
@ -308,6 +284,22 @@ else
fi
pip_install_whl "$(echo dist/*.whl)"
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *vision* ]]; then
install_torchvision
fi
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *audio* ]]; then
install_torchaudio
fi
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *torchrec* || "${BUILD_ADDITIONAL_PACKAGES:-}" == *fbgemm* ]]; then
install_torchrec_and_fbgemm
fi
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *torchao* ]]; then
install_torchao
fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
echo "Checking that xpu is compiled"
pushd dist/
@ -395,10 +387,8 @@ else
# This is an attempt to mitigate flaky libtorch build OOM error. By default, the build parallelization
# is set to be the number of CPU minus 2. So, let's try a more conservative value here. A 4xlarge has
# 16 CPUs
if [ -z "$MAX_JOBS_OVERRIDE" ]; then
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
fi
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
# NB: Install outside of source directory (at the same level as the root
# pytorch folder) so that it doesn't get cleaned away prior to docker push.

View File

@ -13,6 +13,13 @@ if [[ "$BUILD_ENVIRONMENT" != *win-* ]]; then
fi
if which sccache > /dev/null; then
# Clear SCCACHE_BUCKET and SCCACHE_REGION if they are empty, otherwise
# sccache will complain about invalid bucket configuration
if [[ -z "${SCCACHE_BUCKET:-}" ]]; then
unset SCCACHE_BUCKET
unset SCCACHE_REGION
fi
# Save sccache logs to file
sccache --stop-server > /dev/null 2>&1 || true
rm -f ~/sccache_error.log || true

View File

@ -78,6 +78,34 @@ function pip_install_whl() {
fi
}
function pip_build_and_install() {
local build_target=$1
local wheel_dir=$2
local found_whl=0
for file in "${wheel_dir}"/*.whl
do
if [[ -f "${file}" ]]; then
found_whl=1
break
fi
done
# Build the wheel if it doesn't exist
if [ "${found_whl}" == "0" ]; then
python3 -m pip wheel \
--no-build-isolation \
--no-deps \
--no-use-pep517 \
-w "${wheel_dir}" \
"${build_target}"
fi
for file in "${wheel_dir}"/*.whl
do
pip_install_whl "${file}"
done
}
function pip_install() {
# retry 3 times
@ -124,14 +152,7 @@ function get_pinned_commit() {
function install_torchaudio() {
local commit
commit=$(get_pinned_commit audio)
if [[ "$1" == "cuda" ]]; then
# TODO: This is better to be passed as a parameter from _linux-test workflow
# so that it can be consistent with what is set in build
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install --no-use-pep517 --user "git+https://github.com/pytorch/audio.git@${commit}"
else
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/audio.git@${commit}"
fi
pip_build_and_install "git+https://github.com/pytorch/audio.git@${commit}" dist/audio
}
function install_torchtext() {
@ -139,8 +160,8 @@ function install_torchtext() {
local text_commit
data_commit=$(get_pinned_commit data)
text_commit=$(get_pinned_commit text)
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/data.git@${data_commit}"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/text.git@${text_commit}"
pip_build_and_install "git+https://github.com/pytorch/data.git@${data_commit}" dist/data
pip_build_and_install "git+https://github.com/pytorch/text.git@${text_commit}" dist/text
}
function install_torchvision() {
@ -153,7 +174,14 @@ function install_torchvision() {
echo 'char* dlerror(void) { return "";}'|gcc -fpic -shared -o "${HOME}/dlerror.so" -x c -
LD_PRELOAD=${orig_preload}:${HOME}/dlerror.so
fi
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/vision.git@${commit}"
if [[ "${BUILD_ENVIRONMENT}" == *cuda* ]]; then
# Not sure if both are needed, but why not
export FORCE_CUDA=1
export WITH_CUDA=1
fi
pip_build_and_install "git+https://github.com/pytorch/vision.git@${commit}" dist/vision
if [ -n "${LD_PRELOAD}" ]; then
LD_PRELOAD=${orig_preload}
fi
@ -173,25 +201,71 @@ function install_torchrec_and_fbgemm() {
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then
# install torchrec first because it installs fbgemm nightly on top of rocm fbgemm
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec
pip_uninstall fbgemm-gpu-nightly
# Set ROCM_HOME isn't available, use ROCM_PATH if set or /opt/rocm
ROCM_HOME="${ROCM_HOME:-${ROCM_PATH:-/opt/rocm}}"
# Find rocm_version.h header file for ROCm version extract
rocm_version_h="${ROCM_HOME}/include/rocm-core/rocm_version.h"
if [ ! -f "$rocm_version_h" ]; then
rocm_version_h="${ROCM_HOME}/include/rocm_version.h"
fi
# Error out if rocm_version.h not found
if [ ! -f "$rocm_version_h" ]; then
echo "Error: rocm_version.h not found in expected locations." >&2
exit 1
fi
# Extract major, minor and patch ROCm version numbers
MAJOR_VERSION=$(grep 'ROCM_VERSION_MAJOR' "$rocm_version_h" | awk '{print $3}')
MINOR_VERSION=$(grep 'ROCM_VERSION_MINOR' "$rocm_version_h" | awk '{print $3}')
PATCH_VERSION=$(grep 'ROCM_VERSION_PATCH' "$rocm_version_h" | awk '{print $3}')
ROCM_INT=$((MAJOR_VERSION * 10000 + MINOR_VERSION * 100 + PATCH_VERSION))
echo "ROCm version: $ROCM_INT"
export BUILD_ROCM_VERSION="$MAJOR_VERSION.$MINOR_VERSION"
pip_install tabulate # needed for newer fbgemm
pip_install patchelf # needed for rocm fbgemm
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}"
python setup.py install \
--package_variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
-DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
popd
local wheel_dir=dist/fbgemm_gpu
local found_whl=0
for file in "${wheel_dir}"/*.whl
do
if [[ -f "${file}" ]]; then
found_whl=1
break
fi
done
# Build the wheel if it doesn't exist
if [ "${found_whl}" == "0" ]; then
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}" --recurse-submodules
python setup.py bdist_wheel \
--build-variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
-DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
popd
# Save the wheel before cleaning up
mkdir -p dist/fbgemm_gpu
cp fbgemm/fbgemm_gpu/dist/*.whl dist/fbgemm_gpu
fi
for file in "${wheel_dir}"/*.whl
do
pip_install_whl "${file}"
done
rm -rf fbgemm
else
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec
pip_build_and_install "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#subdirectory=fbgemm_gpu" dist/fbgemm_gpu
fi
}
@ -207,34 +281,10 @@ function clone_pytorch_xla() {
fi
}
function checkout_install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout "$commit"
if [ "$1" ]; then
python install.py --continue_on_fail models "$@"
else
# Occasionally the installation may fail on one model but it is ok to continue
# to install and test other models
python install.py --continue_on_fail
fi
# TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488
# is regressing speedup metric. This needs to be investigated further
pip install transformers==4.38.1
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
}
function install_torchao() {
local commit
commit=$(get_pinned_commit torchao)
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/ao.git@${commit}"
pip_build_and_install "git+https://github.com/pytorch/ao.git@${commit}" dist/ao
}
function print_sccache_stats() {

View File

@ -1,123 +0,0 @@
from datetime import datetime, timedelta, timezone
from tempfile import mkdtemp
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID
temp_dir = mkdtemp()
print(temp_dir)
def genrsa(path):
key = rsa.generate_private_key(
public_exponent=65537,
key_size=2048,
)
with open(path, "wb") as f:
f.write(
key.private_bytes(
encoding=serialization.Encoding.PEM,
format=serialization.PrivateFormat.TraditionalOpenSSL,
encryption_algorithm=serialization.NoEncryption(),
)
)
return key
def create_cert(path, C, ST, L, O, key):
subject = issuer = x509.Name(
[
x509.NameAttribute(NameOID.COUNTRY_NAME, C),
x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),
x509.NameAttribute(NameOID.LOCALITY_NAME, L),
x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),
]
)
cert = (
x509.CertificateBuilder()
.subject_name(subject)
.issuer_name(issuer)
.public_key(key.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.now(timezone.utc) + timedelta(days=10)
)
.add_extension(
x509.BasicConstraints(ca=True, path_length=None),
critical=True,
)
.sign(key, hashes.SHA256())
)
# Write our certificate out to disk.
with open(path, "wb") as f:
f.write(cert.public_bytes(serialization.Encoding.PEM))
return cert
def create_req(path, C, ST, L, O, key):
csr = (
x509.CertificateSigningRequestBuilder()
.subject_name(
x509.Name(
[
# Provide various details about who we are.
x509.NameAttribute(NameOID.COUNTRY_NAME, C),
x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),
x509.NameAttribute(NameOID.LOCALITY_NAME, L),
x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),
]
)
)
.sign(key, hashes.SHA256())
)
with open(path, "wb") as f:
f.write(csr.public_bytes(serialization.Encoding.PEM))
return csr
def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):
cert = (
x509.CertificateBuilder()
.subject_name(csr_cert.subject)
.issuer_name(ca_cert.subject)
.public_key(csr_cert.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.now(timezone.utc) + timedelta(days=10)
# Sign our certificate with our private key
)
.sign(private_ca_key, hashes.SHA256())
)
with open(path, "wb") as f:
f.write(cert.public_bytes(serialization.Encoding.PEM))
return cert
ca_key = genrsa(temp_dir + "/ca.key")
ca_cert = create_cert(
temp_dir + "/ca.pem",
"US",
"New York",
"New York",
"Gloo Certificate Authority",
ca_key,
)
pkey = genrsa(temp_dir + "/pkey.key")
csr = create_req(
temp_dir + "/csr.csr",
"US",
"California",
"San Francisco",
"Gloo Testing Company",
pkey,
)
cert = sign_certificate_request(temp_dir + "/cert.pem", csr, ca_cert, ca_key)

View File

@ -157,6 +157,29 @@ test_jit_hooks() {
assert_git_not_dirty
}
# Shellcheck doesn't like it when you pass no arguments to a function
# that can take args. See https://www.shellcheck.net/wiki/SC2120
# shellcheck disable=SC2120
checkout_install_torchbench() {
local commit
commit=$(cat .ci/docker/ci_commit_pins/torchbench.txt)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout "$commit"
if [ "$1" ]; then
python install.py --continue_on_fail models "$@"
else
# Occasionally the installation may fail on one model but it is ok to continue
# to install and test other models
python install.py --continue_on_fail
fi
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
}
torchbench_setup_macos() {
git clone --recursive https://github.com/pytorch/vision torchvision
git clone --recursive https://github.com/pytorch/audio torchaudio
@ -179,13 +202,11 @@ torchbench_setup_macos() {
USE_OPENMP=0 python setup.py develop
popd
# Shellcheck doesn't like it when you pass no arguments to a function that can take args. See https://www.shellcheck.net/wiki/SC2120
# shellcheck disable=SC2119,SC2120
checkout_install_torchbench
}
pip_benchmark_deps() {
python -mpip install --no-input astunparse requests cython scikit-learn
python -mpip install --no-input requests cython scikit-learn six
}

View File

@ -1,18 +0,0 @@
#!/bin/bash
CREATE_TEST_CERT="$(dirname "${BASH_SOURCE[0]}")/create_test_cert.py"
TMP_CERT_DIR=$(python "$CREATE_TEST_CERT")
openssl verify -CAfile "${TMP_CERT_DIR}/ca.pem" "${TMP_CERT_DIR}/cert.pem"
export GLOO_DEVICE_TRANSPORT=TCP_TLS
export GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=${TMP_CERT_DIR}/pkey.key
export GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=${TMP_CERT_DIR}/cert.pem
export GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=${TMP_CERT_DIR}/ca.pem
time python test/run_test.py --include distributed/test_c10d_gloo --verbose -- ProcessGroupGlooTest
unset GLOO_DEVICE_TRANSPORT
unset GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY
unset GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT
unset GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE

View File

@ -74,12 +74,13 @@ else
fi
# Environment initialization
retry pip install -qUr requirements-build.txt
if [[ "$(uname)" == Darwin ]]; then
# Install the testing dependencies
retry pip install -q future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest setuptools six typing_extensions pyyaml
retry pip install -q future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest
else
retry pip install -qr requirements.txt || true
retry pip install -q hypothesis protobuf pytest setuptools || true
retry pip install -q hypothesis protobuf pytest || true
numpy_ver=1.15
case "$(python --version 2>&1)" in
*2* | *3.5* | *3.6*)

View File

@ -385,6 +385,29 @@ def smoke_test_compile(device: str = "cpu") -> None:
x_pt2 = torch.compile(model, mode="max-autotune")(x)
def smoke_test_nvshmem() -> None:
if not torch.cuda.is_available():
print("CUDA is not available, skipping NVSHMEM test")
return
# Check if NVSHMEM is compiled in current build
try:
from torch._C._distributed_c10d import _is_nvshmem_available
except ImportError:
# Not built with NVSHMEM support.
# torch is not compiled with NVSHMEM prior to 2.9
if torch.__version__ < "2.9":
return
else:
# After 2.9: NVSHMEM is expected to be compiled in current build
raise RuntimeError("torch not compiled with NVSHMEM") from None
print("torch compiled with NVSHMEM")
# Check if NVSHMEM is available on current system.
print(f"NVSHMEM available at run time: {_is_nvshmem_available()}")
def smoke_test_modules():
cwd = os.getcwd()
for module in MODULES:
@ -479,6 +502,8 @@ def main() -> None:
options.pypi_pkg_check,
)
smoke_test_nvshmem()
if __name__ == "__main__":
main()

View File

@ -11,6 +11,8 @@ export TERM=vt100
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
# Do not change workspace permissions for ROCm and s390x CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
@ -163,8 +165,6 @@ elif [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="xpu"
# setting PYTHON_TEST_EXTRA_OPTION
export PYTHON_TEST_EXTRA_OPTION="--xpu"
# Disable sccache for xpu test due to flaky issue https://github.com/pytorch/pytorch/issues/143585
sudo rm -rf /opt/cache
fi
if [[ "$TEST_CONFIG" == *crossref* ]]; then
@ -201,7 +201,7 @@ fi
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
# JIT C++ extensions require ninja.
pip_install --user "ninja==1.10.2"
pip_install "ninja==1.10.2"
# ninja is installed in $HOME/.local/bin, e.g., /var/lib/jenkins/.local/bin for CI user jenkins
# but this script should be runnable by any user, including root
export PATH="$HOME/.local/bin:$PATH"
@ -289,6 +289,12 @@ elif [[ $TEST_CONFIG == 'nogpu_AVX512' ]]; then
export ATEN_CPU_CAPABILITY=avx2
fi
if [[ "${TEST_CONFIG}" == "legacy_nvidia_driver" ]]; then
# Make sure that CUDA can be initialized
(cd test && python -c "import torch; torch.rand(2, 2, device='cuda')")
export USE_LEGACY_DRIVER=1
fi
test_python_legacy_jit() {
time python test/run_test.py --include test_jit_legacy test_jit_fuser_legacy --verbose
assert_git_not_dirty
@ -333,12 +339,18 @@ test_h100_distributed() {
test_h100_symm_mem() {
# symmetric memory test
time python test/run_test.py --include distributed/test_symmetric_memory.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time TORCH_SYMMMEM=NVSHMEM python test/run_test.py --include distributed/test_nvshmem.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time TORCH_SYMMMEM=NVSHMEM python test/run_test.py --include distributed/test_nvshmem_triton.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time TORCH_SYMMMEM=NCCL python test/run_test.py --include distributed/test_nccl.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time python test/run_test.py --include distributed/test_nvshmem.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time python test/run_test.py --include distributed/test_nvshmem_triton.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time python test/run_test.py --include distributed/test_nccl.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_h100_cutlass_backend() {
# cutlass backend tests for H100
TORCHINDUCTOR_CUTLASS_DIR=$(realpath "./third_party/cutlass") python test/run_test.py --include inductor/test_cutlass_backend -k "not addmm" $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
TORCHINDUCTOR_CUTLASS_DIR=$(realpath "./third_party/cutlass") python test/run_test.py --include inductor/test_cutlass_evt $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
}
test_lazy_tensor_meta_reference_disabled() {
export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1
echo "Testing lazy tensor operations without meta reference"
@ -353,7 +365,6 @@ test_dynamo_wrapped_shard() {
exit 1
fi
python tools/dynamo/verify_dynamo.py
python tools/dynamo/gb_id_mapping.py verify
# PLEASE DO NOT ADD ADDITIONAL EXCLUDES HERE.
# Instead, use @skipIfTorchDynamo on your tests.
time python test/run_test.py --dynamo \
@ -368,13 +379,24 @@ test_dynamo_wrapped_shard() {
assert_git_not_dirty
}
test_einops() {
pip install einops==0.6.1
time python test/run_test.py --einops --verbose --upload-artifacts-while-running
pip install einops==0.7.0
time python test/run_test.py --einops --verbose --upload-artifacts-while-running
pip install einops==0.8.1
time python test/run_test.py --einops --verbose --upload-artifacts-while-running
assert_git_not_dirty
}
test_inductor_distributed() {
# Smuggle a few multi-gpu tests here so that we don't have to request another large node
echo "Testing multi_gpu tests in test_torchinductor"
python test/run_test.py -i inductor/test_torchinductor.py -k test_multi_gpu --verbose
python test/run_test.py -i inductor/test_aot_inductor.py -k test_non_default_cuda_device --verbose
python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose
python test/run_test.py -i inductor/test_aot_inductor.py -k test_on_gpu_device1 --verbose
python test/run_test.py -i inductor/test_aot_inductor.py -k test_non_default_gpu_device --verbose
python test/run_test.py -i inductor/test_aot_inductor.py -k test_load_package_multiple_gpus --verbose
python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose
python test/run_test.py -i distributed/tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_micro_pipeline_tp.py --verbose
@ -426,14 +448,21 @@ test_inductor_aoti() {
python3 tools/amd_build/build_amd.py
fi
if [[ "$BUILD_ENVIRONMENT" == *sm86* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 TORCH_CUDA_ARCH_LIST=8.6 USE_FLASH_ATTENTION=OFF python setup.py develop
BUILD_COMMAND=(TORCH_CUDA_ARCH_LIST=8.6 USE_FLASH_ATTENTION=OFF python -m pip install --no-build-isolation -v -e .)
# TODO: Replace me completely, as one should not use conda libstdc++, nor need special path to TORCH_LIB
LD_LIBRARY_PATH=/opt/conda/envs/py_3.10/lib/:${TORCH_LIB_DIR}:$LD_LIBRARY_PATH
CPP_TESTS_DIR="${BUILD_BIN_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference -dist=loadfile
TEST_ENVS=(CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="/opt/conda/envs/py_3.10/lib:${TORCH_LIB_DIR}:${LD_LIBRARY_PATH}")
else
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference -dist=loadfile
BUILD_COMMAND=(python -m pip install --no-build-isolation -v -e .)
TEST_ENVS=(CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}")
fi
# aoti cmake custom command requires `torch` to be installed
# initialize the cmake build cache and install torch
/usr/bin/env "${BUILD_COMMAND[@]}"
# rebuild with the build cache with `BUILD_AOT_INDUCTOR_TEST` enabled
/usr/bin/env CMAKE_FRESH=1 BUILD_AOT_INDUCTOR_TEST=1 "${BUILD_COMMAND[@]}"
/usr/bin/env "${TEST_ENVS[@]}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference cpp/test_vec_half_AVX2 -dist=loadfile
}
test_inductor_cpp_wrapper_shard() {
@ -446,47 +475,26 @@ test_inductor_cpp_wrapper_shard() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
if [[ "$1" -eq "2" ]]; then
# For now, manually put the opinfo tests in shard 2, and all other tests in
# shard 1. Run all CPU tests, as well as specific GPU tests triggering past
# bugs, for now.
python test/run_test.py \
--include inductor/test_torchinductor_opinfo \
-k 'linalg or to_sparse or TestInductorOpInfoCPU' \
--verbose
exit
fi
# Run certain inductor unit tests with cpp wrapper. In the end state, we
# should be able to run all the inductor unit tests with cpp_wrapper.
#
# TODO: I'm pretty sure that "TestInductorOpInfoCPU" is not a valid filter,
# but change that in another PR to more accurately monitor the increased CI
# usage.
python test/run_test.py \
--include inductor/test_torchinductor_opinfo \
-k 'linalg or to_sparse or TestInductorOpInfoCPU' \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
python test/run_test.py \
--include inductor/test_torchinductor inductor/test_max_autotune inductor/test_cpu_repro \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
python test/run_test.py --inductor \
--include test_torch \
-k 'take' \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
python test/run_test.py --inductor --include test_torch -k 'take' --verbose
# Run inductor benchmark tests with cpp wrapper.
# Skip benchmark tests if it's in rerun-disabled-mode.
if [[ "${PYTORCH_TEST_RERUN_DISABLED_TESTS}" == "1" ]]; then
echo "skip dynamo benchmark tests for rerun-disabled-test"
else
echo "run dynamo benchmark tests with cpp wrapper"
python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \
--training --inductor --disable-cudagraphs --only vit_base_patch16_224 \
--output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_timm_training.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_torchbench_inference.csv"
fi
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
@ -499,7 +507,7 @@ DYNAMO_BENCHMARK_FLAGS=()
pr_time_benchmarks() {
pip_install --user "fbscribelogger"
pip_install "fbscribelogger"
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
@ -619,6 +627,8 @@ test_perf_for_dashboard() {
device=cuda_a10g
elif [[ "${TEST_CONFIG}" == *h100* ]]; then
device=cuda_h100
elif [[ "${TEST_CONFIG}" == *b200* ]]; then
device=cuda_b200
elif [[ "${TEST_CONFIG}" == *rocm* ]]; then
device=rocm
fi
@ -793,6 +803,16 @@ test_dynamo_benchmark() {
if [[ "${TEST_CONFIG}" == *perf_compare* ]]; then
test_single_dynamo_benchmark "training" "$suite" "$shard_id" --training --amp "$@"
elif [[ "${TEST_CONFIG}" == *perf* ]]; then
# TODO (huydhn): Just smoke test some sample models
if [[ "${TEST_CONFIG}" == *b200* ]]; then
if [[ "${suite}" == "huggingface" ]]; then
export TORCHBENCH_ONLY_MODELS="DistillGPT2"
elif [[ "${suite}" == "timm_models" ]]; then
export TORCHBENCH_ONLY_MODELS="inception_v3"
elif [[ "${suite}" == "torchbench" ]]; then
export TORCHBENCH_ONLY_MODELS="hf_Bert"
fi
fi
test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
else
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
@ -920,12 +940,6 @@ test_torchbench_gcp_smoketest(){
popd
}
test_python_gloo_with_tls() {
source "$(dirname "${BASH_SOURCE[0]}")/run_glootls_test.sh"
assert_git_not_dirty
}
test_aten() {
# Test ATen
# The following test(s) of ATen have already been skipped by caffe2 in rocm environment:
@ -972,6 +986,8 @@ test_without_numpy() {
if [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then
python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch;torch.compile(lambda x:print(x))('Hello World')"
fi
# Regression test for https://github.com/pytorch/pytorch/pull/157734 (torch.onnx should be importable without numpy)
python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch; import torch.onnx"
popd
}
@ -1316,10 +1332,13 @@ EOF
# Step 2. Make sure that the public API test "test_correct_module_names" fails when an existing
# file is modified to introduce an invalid public API function.
EXISTING_FILEPATH="${TORCH_INSTALL_DIR}/nn/parameter.py"
# The filepath here must not have __all__ defined in it, otherwise the test will pass.
# If your PR introduces __all__ to torch/cuda/streams.py please point this to another file
# that does not have __all__ defined.
EXISTING_FILEPATH="${TORCH_INSTALL_DIR}/cuda/streams.py"
cp -v "${EXISTING_FILEPATH}" "${EXISTING_FILEPATH}.orig"
echo "${BAD_PUBLIC_FUNC}" >> "${EXISTING_FILEPATH}"
invalid_api="torch.nn.parameter.new_public_func"
invalid_api="torch.cuda.streams.new_public_func"
echo "Appended an invalid public API function to existing file ${EXISTING_FILEPATH}..."
check_public_api_test_fails \
@ -1474,8 +1493,8 @@ test_bazel() {
test_benchmarks() {
if [[ "$BUILD_ENVIRONMENT" == *cuda* && $TEST_CONFIG != *nogpu* ]]; then
pip_install --user "pytest-benchmark==3.2.3"
pip_install --user "requests"
pip_install "pytest-benchmark==3.2.3"
pip_install "requests"
BENCHMARK_DATA="benchmarks/.data"
mkdir -p ${BENCHMARK_DATA}
pytest benchmarks/fastrnns/test_bench.py --benchmark-sort=Name --benchmark-json=${BENCHMARK_DATA}/fastrnns_default.json --fuser=default --executor=default
@ -1553,7 +1572,7 @@ test_executorch() {
test_linux_aarch64() {
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops test_cpp_extensions_open_device_registration \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Dynamo tests
@ -1583,7 +1602,7 @@ test_operator_benchmark() {
test_inductor_set_cpu_affinity
cd benchmarks/operator_benchmark/pt_extension
python setup.py install
python -m pip install .
cd "${TEST_DIR}"/benchmarks/operator_benchmark
$TASKSET python -m benchmark_all_test --device "$1" --tag-filter "$2" \
@ -1603,7 +1622,13 @@ if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-baze
fi
if [[ "${TEST_CONFIG}" == *numpy_2* ]]; then
# Install numpy-2.0.2 and compatible scipy & numba versions
python -mpip install --pre numpy==2.0.2 scipy==1.13.1 numba==0.60.0
# Force re-install of pandas to avoid error where pandas checks numpy version from initial install and fails upon import
TMP_PANDAS_VERSION=$(python -c "import pandas; print(pandas.__version__)" 2>/dev/null)
if [ -n "$TMP_PANDAS_VERSION" ]; then
python -m pip install --pre numpy==2.0.2 scipy==1.13.1 numba==0.60.0 pandas=="$TMP_PANDAS_VERSION" --force-reinstall
else
python -m pip install --pre numpy==2.0.2 scipy==1.13.1 numba==0.60.0
fi
python test/run_test.py --include dynamo/test_functions.py dynamo/test_unspec.py test_binary_ufuncs.py test_fake_tensor.py test_linalg.py test_numpy_interop.py test_tensor_creation_ops.py test_torch.py torch_np/test_basic.py
elif [[ "${BUILD_ENVIRONMENT}" == *aarch64* && "${TEST_CONFIG}" != *perf_cpu_aarch64* ]]; then
test_linux_aarch64
@ -1657,52 +1682,40 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then
id=$((SHARD_NUMBER-1))
test_dynamo_benchmark timm_models "$id"
elif [[ "${TEST_CONFIG}" == cachebench ]]; then
install_torchaudio cuda
install_torchaudio
install_torchvision
checkout_install_torchbench nanogpt BERT_pytorch resnet50 hf_T5 llama moco
PYTHONPATH=$(pwd)/torchbench test_cachebench
PYTHONPATH=/torchbench test_cachebench
elif [[ "${TEST_CONFIG}" == verify_cachebench ]]; then
install_torchaudio cpu
install_torchaudio
install_torchvision
checkout_install_torchbench nanogpt
PYTHONPATH=$(pwd)/torchbench test_verify_cachebench
PYTHONPATH=/torchbench test_verify_cachebench
elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
install_torchaudio cpu
else
install_torchaudio cuda
fi
install_torchaudio
install_torchvision
TORCH_CUDA_ARCH_LIST="8.0;8.6" install_torchao
install_torchao
id=$((SHARD_NUMBER-1))
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
PYTHONPATH=/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
functorch_maml_omniglot yolov3 mobilenet_v2 resnext50_32x4d densenet121 mnasnet1_0
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
PYTHONPATH=/torchbench test_inductor_torchbench_cpu_smoketest_perf
elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then
checkout_install_torchbench
TORCHBENCHPATH=$(pwd)/torchbench test_torchbench_gcp_smoketest
TORCHBENCHPATH=/torchbench test_torchbench_gcp_smoketest
else
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any
# nightlies that torchbench may pull in
if [[ "${TEST_CONFIG}" != *cpu* ]]; then
install_torchrec_and_fbgemm
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
PYTHONPATH=/torchbench test_dynamo_benchmark torchbench "$id"
fi
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then
install_torchaudio cuda
install_torchvision
checkout_install_torchbench hf_T5 llama moco
PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"
test_inductor_aoti
PYTHONPATH=/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"
if [[ "$SHARD_NUMBER" -eq "1" ]]; then
test_inductor_aoti
fi
elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor_shard "${SHARD_NUMBER}"
@ -1711,6 +1724,8 @@ elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
test_inductor_distributed
fi
fi
elif [[ "${TEST_CONFIG}" == *einops* ]]; then
test_einops
elif [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then
install_torchvision
test_dynamo_wrapped_shard "${SHARD_NUMBER}"
@ -1760,8 +1775,10 @@ elif [[ "${TEST_CONFIG}" == smoke ]]; then
test_python_smoke
elif [[ "${TEST_CONFIG}" == h100_distributed ]]; then
test_h100_distributed
elif [[ "${TEST_CONFIG}" == test_h100_symm_mem ]]; then
elif [[ "${TEST_CONFIG}" == "h100-symm-mem" ]]; then
test_h100_symm_mem
elif [[ "${TEST_CONFIG}" == h100_cutlass_backend ]]; then
test_h100_cutlass_backend
else
install_torchvision
install_monkeytype

View File

@ -0,0 +1,34 @@
# If you want to rebuild, run this with $env:REBUILD=1
# If you want to build with CUDA, run this with $env:USE_CUDA=1
# If you want to build without CUDA, run this with $env:USE_CUDA=0
# Check for setup.py in the current directory
if (-not (Test-Path "setup.py")) {
Write-Host "ERROR: Please run this build script from PyTorch root directory."
exit 1
}
# Get the script's parent directory
$ScriptParentDir = Split-Path -Parent $MyInvocation.MyCommand.Definition
# Set TMP_DIR and convert to Windows path
$env:TMP_DIR = Join-Path (Get-Location) "build\win_tmp"
$env:TMP_DIR_WIN = $env:TMP_DIR # Already in Windows format, no cygpath needed
# Set final package directory with default fallback
if (-not $env:PYTORCH_FINAL_PACKAGE_DIR) {
$env:PYTORCH_FINAL_PACKAGE_DIR = "C:\w\build-results"
}
# Create the final package directory if it doesn't exist
if (-not (Test-Path $env:PYTORCH_FINAL_PACKAGE_DIR)) {
New-Item -Path $env:PYTORCH_FINAL_PACKAGE_DIR -ItemType Directory -Force | Out-Null
}
# Set script helpers directory
$env:SCRIPT_HELPERS_DIR = Join-Path $ScriptParentDir "win-test-helpers\arm64"
# Run the main build script
& "$env:SCRIPT_HELPERS_DIR\build_pytorch.ps1"
Write-Host "BUILD PASSED"

View File

@ -0,0 +1,24 @@
#!/bin/bash
set -ex -o pipefail
SCRIPT_PARENT_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
# shellcheck source=./common.sh
source "$SCRIPT_PARENT_DIR/common.sh"
run_tests() {
echo Running smoke_test.py...
python ./.ci/pytorch/smoke_test/smoke_test.py --package torchonly
echo Running test_autograd.oy, test_nn.py, test_torch.py...
cd test
CORE_TEST_LIST=("test_autograd.py" "test_nn.py" "test_modules.py")
for t in "${CORE_TEST_LIST[@]}"; do
echo "Running test: $t"
python "$t" --verbose --save-xml --use-pytest -vvvv -rfEsxXP -p no:xdist
done
}
run_tests
echo "TEST PASSED"

View File

@ -0,0 +1,98 @@
# TODO: we may can use existing build_pytorch.bat for arm64
if ($env:DEBUG -eq "1") {
$env:BUILD_TYPE = "debug"
} else {
$env:BUILD_TYPE = "release"
}
# This inflates our log size slightly, but it is REALLY useful to be
# able to see what our cl.exe commands are. (since you can actually
# just copy-paste them into a local Windows setup to just rebuild a
# single file.)
# log sizes are too long, but leaving this here in case someone wants to use it locally
# $env:CMAKE_VERBOSE_MAKEFILE = "1"
$env:INSTALLER_DIR = Join-Path $env:SCRIPT_HELPERS_DIR "installation-helpers"
cd ..
# Environment variables
$env:SCCACHE_IDLE_TIMEOUT = "0"
$env:SCCACHE_IGNORE_SERVER_IO_ERROR = "1"
$env:CMAKE_BUILD_TYPE = $env:BUILD_TYPE
$env:CMAKE_C_COMPILER_LAUNCHER = "sccache"
$env:CMAKE_CXX_COMPILER_LAUNCHER = "sccache"
$env:libuv_ROOT = Join-Path $env:DEPENDENCIES_DIR "libuv\install"
$env:MSSdk = "1"
if ($env:PYTORCH_BUILD_VERSION) {
$env:PYTORCH_BUILD_VERSION = $env:PYTORCH_BUILD_VERSION
$env:PYTORCH_BUILD_NUMBER = "1"
}
$env:CMAKE_POLICY_VERSION_MINIMUM = "3.5"
# Set BLAS type
if ($env:ENABLE_APL -eq "1") {
$env:BLAS = "APL"
$env:USE_LAPACK = "1"
} elseif ($env:ENABLE_OPENBLAS -eq "1") {
$env:BLAS = "OpenBLAS"
$env:OpenBLAS_HOME = Join-Path $env:DEPENDENCIES_DIR "OpenBLAS\install"
}
# Change to source directory
Set-Location $env:PYTORCH_ROOT
# Copy libuv.dll
Copy-Item -Path (Join-Path $env:libuv_ROOT "lib\Release\uv.dll") -Destination "torch\lib\uv.dll" -Force
# Create virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1
where.exe python
# Python install dependencies
python -m pip install --upgrade pip
pip install setuptools pyyaml
pip install -r requirements.txt
# Set after installing psutil
$env:DISTUTILS_USE_SDK = "1"
# Print all environment variables
Get-ChildItem Env:
# Start and inspect sccache
sccache --start-server
sccache --zero-stats
sccache --show-stats
# Build the wheel
python setup.py bdist_wheel
if ($LASTEXITCODE -ne 0) { exit 1 }
# Install the wheel locally
$whl = Get-ChildItem -Path "dist\*.whl" | Select-Object -First 1
if ($whl) {
python -mpip install --no-index --no-deps $whl.FullName
}
# Copy final wheel
robocopy "dist" "$env:PYTORCH_FINAL_PACKAGE_DIR" *.whl
# Export test times
python tools/stats/export_test_times.py
# Copy additional CI files
robocopy ".additional_ci_files" "$env:PYTORCH_FINAL_PACKAGE_DIR\.additional_ci_files" /E
# Save ninja log
Copy-Item -Path "build\.ninja_log" -Destination $env:PYTORCH_FINAL_PACKAGE_DIR -Force
# Final sccache stats and stop
sccache --show-stats
sccache --stop-server
exit 0

View File

@ -42,7 +42,7 @@ call choco upgrade -y cmake --no-progress --installargs 'ADD_CMAKE_TO_PATH=Syste
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
call pip install mkl-include==2021.4.0 mkl-devel==2021.4.0
call pip install mkl==2024.2.0 mkl-static==2024.2.0 mkl-include==2024.2.0
if errorlevel 1 goto fail
if not errorlevel 0 goto fail

View File

@ -41,7 +41,7 @@ fi
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.12.2.0
python -m pip install z3-solver==4.15.1.0
# Install tlparse for test\dynamo\test_structured_trace.py UTs.
python -m pip install tlparse==0.3.30

View File

@ -37,10 +37,10 @@ IF "%CUDA_PATH_V129%"=="" (
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=7.5;8.0;8.6;9.0;10.0;12.0
set TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120
)
set "CUDA_PATH=%CUDA_PATH_V129%"

View File

@ -148,14 +148,7 @@ if "%NVIDIA_GPU_EXISTS%" == "0" (
goto end
)
set BUILD_SPLIT_CUDA=
if exist "%install_root%\lib\torch_cuda_cu.lib" if exist "%install_root%\lib\torch_cuda_cpp.lib" set BUILD_SPLIT_CUDA=ON
if "%BUILD_SPLIT_CUDA%" == "ON" (
cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-cuda.cpp torch_cpu.lib c10.lib torch_cuda_cu.lib torch_cuda_cpp.lib /EHsc /std:c++17 /link /INCLUDE:?warp_size@cuda@at@@YAHXZ /INCLUDE:?_torch_cuda_cu_linker_symbol_op_cuda@native@at@@YA?AVTensor@2@AEBV32@@Z
) else (
cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-cuda.cpp torch_cpu.lib c10.lib torch_cuda.lib /EHsc /std:c++17 /link /INCLUDE:?warp_size@cuda@at@@YAHXZ
)
cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-cuda.cpp torch_cpu.lib c10.lib torch_cuda.lib /EHsc /std:c++17 /link /INCLUDE:?warp_size@cuda@at@@YAHXZ
.\check-torch-cuda.exe
if ERRORLEVEL 1 exit /b 1

View File

@ -127,7 +127,7 @@ export INSTALL_TEST=0 # dont install test binaries into site-packages
export MACOSX_DEPLOYMENT_TARGET=10.15
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
SETUPTOOLS_PINNED_VERSION="=46.0.0"
SETUPTOOLS_PINNED_VERSION="==70.1.0"
PYYAML_PINNED_VERSION="=5.3"
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
@ -135,7 +135,7 @@ RENAME_WHEEL=true
case $desired_python in
3.13t)
echo "Using 3.13 deps"
SETUPTOOLS_PINNED_VERSION=">=68.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.1.0"
CONDA_ENV_CREATE_FLAGS="python-freethreading"
@ -145,31 +145,31 @@ case $desired_python in
;;
3.13)
echo "Using 3.13 deps"
SETUPTOOLS_PINNED_VERSION=">=68.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.1.0"
;;
3.12)
echo "Using 3.12 deps"
SETUPTOOLS_PINNED_VERSION=">=68.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.0.2"
;;
3.11)
echo "Using 3.11 deps"
SETUPTOOLS_PINNED_VERSION=">=46.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
;;
3.10)
echo "Using 3.10 deps"
SETUPTOOLS_PINNED_VERSION=">=46.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
;;
3.9)
echo "Using 3.9 deps"
SETUPTOOLS_PINNED_VERSION=">=46.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
;;
@ -184,7 +184,8 @@ tmp_env_name="wheel_py$python_nodot"
conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS}
source activate "$tmp_env_name"
pip install "numpy=${NUMPY_PINNED_VERSION}" "pyyaml${PYYAML_PINNED_VERSION}" requests ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing_extensions
retry pip install -r "${pytorch_rootdir}/requirements-build.txt"
pip install "numpy=${NUMPY_PINNED_VERSION}" "pyyaml${PYYAML_PINNED_VERSION}" requests ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing-extensions
retry pip install -r "${pytorch_rootdir}/requirements.txt" || true
retry brew install libomp

View File

@ -75,8 +75,8 @@ TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for the all the wheel builds hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"
# CUDA 12.8 builds have triton for Linux and Linux aarch64 binaries.
if [[ "$DESIRED_CUDA" == cu128 ]]; then
# CUDA 12.9 builds have triton for Linux and Linux aarch64 binaries.
if [[ "$DESIRED_CUDA" == "cu129" ]]; then
TRITON_CONSTRAINT="platform_system == 'Linux'"
fi

View File

@ -120,6 +120,7 @@ UseTab: Never
Language: ObjC
ColumnLimit: 120
AlignAfterOpenBracket: Align
IndentWidth: 2
ObjCBlockIndentWidth: 2
ObjCSpaceAfterProperty: false
ObjCSpaceBeforeProtocolList: false

View File

@ -61,8 +61,8 @@ You are now all set to start developing with PyTorch in a DevContainer environme
## Step 8: Build PyTorch
To build pytorch from source, simply run:
```
python setup.py develop
```bash
python -m pip install --no-build-isolation -v -e .
```
The process involves compiling thousands of files, and would take a long time. Fortunately, the compiled objects can be useful for your next build. When you modify some files, you only need to compile the changed files the next time.

View File

@ -1,14 +1,36 @@
root = true
[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
# Python
[*.py]
[*.{py,pyi,py.in,pyi.in}]
indent_style = space
indent_size = 4
# C/C++/CUDA
[*.{cpp,hpp,cxx,cc,c,h,cu,cuh}]
indent_style = space
indent_size = 2
# Objective-C
[*.{mm,m,M}]
indent_style = space
indent_size = 2
# Clang tools
[.clang-{format,tidy}]
indent_style = space
indent_size = 2
# Make
[Makefile]
indent_style = tab
# Batch file
[*.bat]
indent_style = space
indent_size = 2
end_of_line = crlf

View File

@ -7,12 +7,12 @@ max-line-length = 120
# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead
ignore =
E203,E305,E402,E501,E704,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
E203,E305,E402,E501,E704,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,F824,
# shebang has extra meaning in fbcode lints, so I think it's not worth trying
# to line this up with executable bit
EXE001,
# these ignores are from flake8-bugbear; please fix!
B007,B008,B017,B019,B023,B028,B903,B904,B905,B906,B907
B007,B008,B017,B019,B023,B028,B903,B904,B905,B906,B907,B908,B910
# these ignores are from flake8-comprehensions; please fix!
C407,
# these ignores are from flake8-logging-format; please fix!

View File

@ -53,16 +53,12 @@ self-hosted-runner:
- linux.rocm.gpu.mi250
- linux.rocm.gpu.2
- linux.rocm.gpu.4
# MI300 runners
- linux.rocm.gpu.mi300.2
- linux.rocm.gpu.mi300.4
# gfx942 runners
- linux.rocm.gpu.gfx942.2
- linux.rocm.gpu.gfx942.4
- rocm-docker
# Repo-specific Apple hosted runners
- macos-m1-ultra
- macos-m2-14
# Org wise AWS `mac2.metal` runners (2020 Mac mini hardware powered by Apple silicon M1 processors)
- macos-m1-stable
- macos-m1-13
- macos-m1-14
# GitHub-hosted MacOS runners
- macos-latest-xlarge

View File

@ -1,78 +0,0 @@
name: build android
description: build android for a specific arch
inputs:
arch:
description: arch to build
required: true
arch-for-build-env:
description: |
arch to pass to build environment.
This is currently different than the arch name we use elsewhere, which
should be fixed.
required: true
github-secret:
description: github token
required: true
build-environment:
required: true
description: Top-level label for what's being built/tested.
docker-image:
required: true
description: Name of the base docker image to build with.
branch:
required: true
description: What branch we are building on.
outputs:
container_id:
description: Docker container identifier used to build the artifacts
value: ${{ steps.build.outputs.container_id }}
runs:
using: composite
steps:
- name: Build-${{ inputs.arch }}
id: build
shell: bash
env:
BRANCH: ${{ inputs.branch }}
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-${{ inputs.arch-for-build-env }}-build"
AWS_DEFAULT_REGION: us-east-1
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_REGION: us-east-1
DOCKER_IMAGE: ${{ inputs.docker-image }}
MATRIX_ARCH: ${{ inputs.arch }}
run: |
# detached container should get cleaned up by teardown_ec2_linux
set -exo pipefail
export container_name
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \
-e SHA1 \
-e BRANCH \
-e SCCACHE_BUCKET \
-e SCCACHE_REGION \
-e SKIP_SCCACHE_INITIALIZATION=1 \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--tty \
--detach \
--user jenkins \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
git submodule sync && git submodule update -q --init --recursive --depth 1
docker cp "${GITHUB_WORKSPACE}/." "${container_name}:/var/lib/jenkins/workspace"
(echo "sudo chown -R jenkins . && .ci/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete" | docker exec -u jenkins -i "${container_name}" bash) 2>&1
# Copy install binaries back
mkdir -p "${GITHUB_WORKSPACE}/build_android_install_${MATRIX_ARCH}"
docker cp "${container_name}:/var/lib/jenkins/workspace/build_android/install" "${GITHUB_WORKSPACE}/build_android_install_${MATRIX_ARCH}"
echo "container_id=${container_name}" >> "${GITHUB_OUTPUT}"

View File

@ -70,7 +70,7 @@ runs:
set -eux
# PyYAML 6.0 doesn't work with MacOS x86 anymore
# This must run on Python-3.7 (AmazonLinux2) so can't use request=3.32.2
python3 -m pip install requests==2.27.1 pyyaml==6.0.1
python3 -m pip install requests==2.27.1 pyyaml==6.0.2
- name: Parse ref
id: parse-ref
@ -125,7 +125,7 @@ runs:
TAG: ${{ steps.parse-ref.outputs.tag }}
EVENT_NAME: ${{ github.event_name }}
SCHEDULE: ${{ github.event.schedule }}
HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
HEAD_BRANCH: ${{ steps.parse-ref.outputs.branch }}
id: filter
run: |
echo "Workflow: ${GITHUB_WORKFLOW}"

View File

@ -126,7 +126,7 @@ runs:
shell: bash
continue-on-error: true
run: |
python3 -m pip install psutil==5.9.1 nvidia-ml-py==11.525.84
python3 -m pip install psutil==5.9.8 nvidia-ml-py==11.525.84
python3 -m tools.stats.monitor > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

View File

@ -304,8 +304,7 @@ def unzip_artifact_and_replace_files() -> None:
def set_output() -> None:
# Disable for now so we can monitor first
# pass
print("Setting output reuse=true")
if os.getenv("GITHUB_OUTPUT"):
with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env:
print("reuse=true", file=env)

View File

@ -1 +1 @@
4e94321c54617dd738a05bfedfc28bc0fa635b5c
6fbc710b617f79b992ef2ebc7f95e818aa390293

View File

@ -1 +1 @@
5fb5024118e9bb9decf96c2b0b1a8f0010bf56be
7f1de94a4c2d14f59ad4ca84538c36084ea6b2c8

1
.github/ci_commit_pins/vllm.txt vendored Normal file
View File

@ -0,0 +1 @@
6a39ba85fe0f2fff9494b5eccea717c93510c230

View File

@ -1 +1 @@
3d92aa75df4467f41679cd5f544303b033741e29
b6a5b82b9948b610fa4c304d0d869c82b8f17db1

View File

@ -76,6 +76,7 @@
- .github/ci_commit_pins/audio.txt
- .github/ci_commit_pins/vision.txt
- .github/ci_commit_pins/torchdynamo.txt
- .github/ci_commit_pins/vllm.txt
- .ci/docker/ci_commit_pins/triton.txt
approved_by:
- pytorchbot
@ -130,21 +131,6 @@
- Lint
- pull
- name: Mobile
patterns:
- ios/**
- android/**
- test/mobile/**
approved_by:
- linbinyu
- IvanKobzarev
- dreiss
- raziel
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- name: PrimTorch
patterns:
- torch/_meta_registrations.py
@ -491,6 +477,23 @@
- srossross
- chillee
- zou3519
- guilhermeleobas
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- name: Dynamo
patterns:
- torch/_dynamo/**
- torch/csrc/dynamo/**
- test/dynamo/**
- test/dynamo_expected_failures/**
- test/dynamo_skips/**
- test/inductor_expected_failures/**
- test/inductor_skips/**
approved_by:
- guilhermeleobas
mandatory_checks_name:
- EasyCLA
- Lint

View File

@ -31,7 +31,9 @@ ciflow_push_tags:
- ciflow/pull
- ciflow/h100
- ciflow/h100-distributed
- ciflow/win-arm64
- ciflow/h100-symm-mem
- ciflow/h100-cutlass-backend
retryable_workflows:
- pull
- trunk

View File

@ -1,14 +1,15 @@
# This file is to cache other dependencies not specified elsewhere in:
# requirement.txt
# requirements.txt
# requirements-build.txt
# docs/requirements.txt
# docs/cpp/requirements.txt
# functorch/docs/requirements.txt
# .ci/docker/requirements-ci.txt
boto3==1.35.42
jinja2==3.1.6
lintrunner==0.10.7
lintrunner==0.12.7
ninja==1.10.0.post1
nvidia-ml-py==11.525.84
pyyaml==6.0
pyyaml==6.0.2
requests==2.32.4
rich==10.9.0
rich==14.1.0

View File

@ -2,7 +2,7 @@ boto3==1.35.42
cmake==3.27.*
expecttest==0.3.0
fbscribelogger==0.1.7
filelock==3.6.0
filelock==3.18.0
hypothesis==6.56.4
librosa>=0.6.2
mpmath==1.3.0
@ -16,7 +16,7 @@ packaging==23.1
parameterized==0.8.1
pillow==10.3.0
protobuf==5.29.4
psutil==5.9.1
psutil==5.9.8
pygments==2.15.0
pytest-cpp==2.3.0
pytest-flakefinder==1.1.0
@ -33,4 +33,4 @@ tensorboard==2.13.0
typing-extensions==4.12.2
unittest-xml-reporting<=3.2.0,>=2.0.0
xdoctest==1.1.0
z3-solver==4.12.2.0
z3-solver==4.15.1.0

View File

@ -275,7 +275,7 @@ def delete_branches() -> None:
delete_branch(git_repo, branch)
def delete_old_ciflow_tags() -> None:
def delete_old_tags() -> None:
# Deletes ciflow tags if they are associated with a closed PR or a specific
# commit. Lightweight tags don't have information about the date they were
# created, so we can't check how old they are. The script just assumes that
@ -288,23 +288,29 @@ def delete_old_ciflow_tags() -> None:
delete_branch(git_repo, f"refs/tags/{tag}")
tags = git_repo._run_git("tag").splitlines()
open_pr_numbers = [x["number"] for x in get_open_prs()]
CIFLOW_TAG_REGEX = re.compile(r"^ciflow\/.*\/(\d{5,6}|[0-9a-f]{40})$")
AUTO_REVERT_TAG_REGEX = re.compile(r"^trunk\/[0-9a-f]{40}$")
for tag in tags:
try:
if ESTIMATED_TOKENS[0] > 400:
print("Estimated tokens exceeded, exiting")
break
if not tag.startswith("ciflow/"):
if not CIFLOW_TAG_REGEX.match(tag) and not AUTO_REVERT_TAG_REGEX.match(tag):
continue
re_match_pr = re.match(r"^ciflow\/.*\/(\d{5,6})$", tag)
re_match_sha = re.match(r"^ciflow\/.*\/([0-9a-f]{40})$", tag)
if re_match_pr:
pr_number = int(re_match_pr.group(1))
if pr_number in open_pr_numbers:
continue
delete_tag(tag)
elif re_match_sha:
# This checks the date of the commit associated with the tag instead
# of the tag itself since lightweight tags don't have this
# information. I think it should be ok since this only runs once a
# day
tag_info = git_repo._run_git("show", "-s", "--format=%ct", tag)
tag_timestamp = int(tag_info.strip())
# Maybe some timezone issues, but a few hours shouldn't matter
tag_age_days = (datetime.now().timestamp() - tag_timestamp) / SEC_IN_DAY
if tag_age_days > 7:
print(f"[{tag}] Tag is older than 7 days, deleting")
delete_tag(tag)
except Exception as e:
print(f"Failed to check tag {tag}: {e}")
@ -312,4 +318,4 @@ def delete_old_ciflow_tags() -> None:
if __name__ == "__main__":
delete_branches()
delete_old_ciflow_tags()
delete_old_tags()

View File

@ -18,6 +18,7 @@ import yaml
REENABLE_TEST_REGEX = "(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) (#|https://github.com/pytorch/pytorch/issues/)([0-9]+)"
MAIN_BRANCH = "main"
PREFIX = "test-config/"
@ -97,7 +98,7 @@ def parse_args() -> Any:
parser.add_argument(
"--branch",
type=str,
default="main",
default=MAIN_BRANCH,
help="the branch name",
)
return parser.parse_args()
@ -456,6 +457,7 @@ def download_json(url: str, headers: dict[str, str], num_retries: int = 3) -> An
def set_output(name: str, val: Any) -> None:
print(f"Setting output {name}={val}")
if os.getenv("GITHUB_OUTPUT"):
with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env:
print(f"{name}={val}", file=env)
@ -495,13 +497,20 @@ def check_for_setting(labels: set[str], body: str, setting: str) -> bool:
def perform_misc_tasks(
labels: set[str], test_matrix: dict[str, list[Any]], job_name: str, pr_body: str
labels: set[str],
test_matrix: dict[str, list[Any]],
job_name: str,
pr_body: str,
branch: Optional[str] = None,
) -> None:
"""
In addition to apply the filter logic, the script also does the following
misc tasks to set keep-going and is-unstable variables
"""
set_output("keep-going", check_for_setting(labels, pr_body, "keep-going"))
set_output(
"keep-going",
branch == MAIN_BRANCH or check_for_setting(labels, pr_body, "keep-going"),
)
set_output(
"ci-verbose-test-logs",
check_for_setting(labels, pr_body, "ci-verbose-test-logs"),
@ -624,6 +633,7 @@ def main() -> None:
test_matrix=filtered_test_matrix,
job_name=args.job_name,
pr_body=pr_body if pr_body else "",
branch=args.branch,
)
# Set the filtered test matrix as the output

View File

@ -17,7 +17,7 @@ from typing import Optional
# NOTE: Please also update the CUDA sources in `PIP_SOURCES` in tools/nightly.py when changing this
CUDA_ARCHES = ["12.6", "12.8", "12.9"]
CUDA_STABLE = "12.6"
CUDA_STABLE = "12.8"
CUDA_ARCHES_FULL_VERSION = {
"12.6": "12.6.3",
"12.8": "12.8.1",
@ -53,8 +53,8 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'"
@ -70,8 +70,8 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'"
@ -87,7 +87,8 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'"
@ -192,7 +193,7 @@ LIBTORCH_CONTAINER_IMAGES: dict[str, str] = {
"cpu": "libtorch-cxx11-builder:cpu",
}
FULL_PYTHON_VERSIONS = ["3.9", "3.10", "3.11", "3.12", "3.13", "3.13t"]
FULL_PYTHON_VERSIONS = ["3.9", "3.10", "3.11", "3.12", "3.13", "3.13t", "3.14", "3.14t"]
def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:
@ -314,6 +315,11 @@ def generate_wheels_matrix(
# TODO: Enable python 3.13t on cpu-s390x
if gpu_arch_type == "cpu-s390x" and python_version == "3.13t":
continue
# TODO: Enable python 3.14 on non linux OSes
if os != "linux" and (
python_version == "3.14" or python_version == "3.14t"
):
continue
if use_split_build and (
arch_version not in ["12.6", "12.8", "12.9", "cpu"] or os != "linux"

View File

@ -22,6 +22,7 @@ LABEL_CIFLOW_BINARIES = "ciflow/binaries"
LABEL_CIFLOW_PERIODIC = "ciflow/periodic"
LABEL_CIFLOW_BINARIES_LIBTORCH = "ciflow/binaries_libtorch"
LABEL_CIFLOW_BINARIES_WHEEL = "ciflow/binaries_wheel"
LABEL_CIFLOW_ROCM = "ciflow/rocm"
@dataclass
@ -146,13 +147,35 @@ LINUX_BINARY_BUILD_WORFKLOWS = [
),
]
ROCM_SMOKE_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_variant="rocm",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["6.4"],
python_versions=["3.9"],
),
ciflow_config=CIFlowConfig(
labels={
LABEL_CIFLOW_BINARIES,
LABEL_CIFLOW_BINARIES_WHEEL,
LABEL_CIFLOW_ROCM,
},
isolated_workflow=True,
),
branches="main",
),
]
LINUX_BINARY_SMOKE_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["12.6", "12.8", "12.9", "6.4"],
arches=["12.6", "12.8", "12.9"],
python_versions=["3.9"],
),
branches="main",
@ -387,6 +410,11 @@ def main() -> None:
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
S390X_BINARY_BUILD_WORKFLOWS,
),
(
# Give rocm it's own workflow file
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
ROCM_SMOKE_WORKFLOWS,
),
(
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
LINUX_BINARY_SMOKE_WORKFLOWS,

View File

@ -136,10 +136,10 @@ def find_job_id_name(args: Any) -> tuple[str, str]:
def set_output(name: str, val: Any) -> None:
print(f"Setting output {name}={val}")
if os.getenv("GITHUB_OUTPUT"):
with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env:
print(f"{name}={val}", file=env)
print(f"setting {name}={val}")
else:
print(f"::set-output name={name}::{val}")

View File

@ -2,7 +2,7 @@
set -ex
# Use uv to speed up lintrunner init
python3 -m pip install uv==0.1.45 setuptools
python3 -m pip install -U uv==0.8.* setuptools
CACHE_DIRECTORY="/tmp/.lintbin"
# Try to recover the cached binaries

View File

@ -5,6 +5,7 @@ import re
def set_output(name: str, val: str) -> None:
print(f"Setting output {name}={val}")
if os.getenv("GITHUB_OUTPUT"):
with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env:
print(f"{name}={val}", file=env)

View File

@ -6,7 +6,7 @@ set -euxo pipefail
cd llm-target-determinator
pip install -q -r requirements.txt
cd ../codellama
pip install -e .
pip install --no-build-isolation -v -e .
pip install numpy==1.26.0
# Run indexer

View File

@ -0,0 +1,56 @@
import os
import unittest
from datetime import datetime
from unittest.mock import MagicMock, patch
os.environ["GITHUB_TOKEN"] = "test_token"
from delete_old_branches import delete_old_tags
@patch("delete_old_branches.delete_branch")
@patch("gitutils.GitRepo._run_git")
class TestDeleteTag(unittest.TestCase):
def test_delete_tag(
self, mock_run_git: "MagicMock", mock_delete_tag: "MagicMock"
) -> None:
for tag in [
"ciflow/branch/12345",
"ciflow/commitsha/1234567890abcdef1234567890abcdef12345678",
"trunk/1234567890abcdef1234567890abcdef12345678",
]:
mock_run_git.side_effect = [
tag,
str(int(datetime.now().timestamp() - 8 * 24 * 60 * 60)), # 8 days ago
]
delete_old_tags()
mock_delete_tag.assert_called_once()
mock_delete_tag.reset_mock()
# Don't delete if the tag is not old enough
mock_run_git.side_effect = [
tag,
str(int(datetime.now().timestamp() - 6 * 24 * 60 * 60)), # 6 days ago
]
delete_old_tags()
mock_delete_tag.assert_not_called()
def test_do_not_delete_tag(
self, mock_run_git: "MagicMock", mock_delete_tag: "MagicMock"
) -> None:
for tag in [
"ciflow/doesntseemtomatch",
"trunk/doesntseemtomatch",
"doesntseemtomatch",
]:
mock_run_git.side_effect = [
tag,
str(int(datetime.now().timestamp() - 8 * 24 * 60 * 60)), # 8 days ago
]
delete_old_tags()
mock_delete_tag.assert_not_called()
if __name__ == "__main__":
unittest.main()

View File

@ -1891,7 +1891,9 @@ def validate_revert(
else pr.get_comment_by_id(comment_id)
)
if comment.editor_login is not None:
raise PostCommentError("Don't want to revert based on edited command")
raise PostCommentError(
"Halting the revert as the revert comment has been edited."
)
author_association = comment.author_association
author_login = comment.author_login
allowed_reverters = ["COLLABORATOR", "MEMBER", "OWNER"]

View File

@ -70,7 +70,7 @@ jobs:
runner: ${{ inputs.runner_prefix }}linux.12xlarge
# TODO: Nightly cpp docs take longer and longer to finish (more than 3h now)
# Let's try to figure out how this can be improved
timeout-minutes: 240
timeout-minutes: 360
- docs_type: python
runner: ${{ inputs.runner_prefix }}linux.2xlarge
# It takes less than 30m to finish python docs unless there are issues

View File

@ -0,0 +1,43 @@
name: Get Changed Files
on:
workflow_call:
outputs:
changed-files:
description: "List of changed files (space-separated) or '*' if not in a PR"
value: ${{ jobs.get-changed-files.outputs.changed-files }}
jobs:
get-changed-files:
runs-on: ubuntu-latest
outputs:
changed-files: ${{ steps.get-files.outputs.changed-files }}
steps:
- name: Get changed files
id: get-files
env:
GH_TOKEN: ${{ github.token }}
run: |
# Check if we're in a pull request context
if [ "${{ github.event_name }}" = "pull_request" ] || [ "${{ github.event_name }}" = "pull_request_target" ]; then
echo "Running in PR context"
# Get the PR number from the github context
PR_NUMBER="${{ github.event.number }}"
# Use gh CLI to get changed files in the PR with explicit repo
CHANGED_FILES=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/files --paginate --jq '.[] | select(.status != "removed") | .filename' | tr '\n' ' ' | sed 's/ $//')
if [ -z "$CHANGED_FILES" ]; then
echo "No changed files found, setting to '*'"
CHANGED_FILES="*"
fi
echo "Changed files: $CHANGED_FILES"
echo "changed-files=$CHANGED_FILES" >> "$GITHUB_OUTPUT"
else
echo "Not in PR context, setting changed files to '*'"
echo "changed-files=*" >> "$GITHUB_OUTPUT"
fi

View File

@ -16,11 +16,6 @@ on:
type: boolean
default: true
description: If set, upload generated build artifacts.
build-with-debug:
required: false
type: boolean
default: false
description: If set, build in debug mode.
sync-tag:
required: false
type: string
@ -69,11 +64,6 @@ on:
required: false
type: string
default: ""
max-jobs:
description: |
Overwrite the number of jobs to use for the build
required: false
type: string
disable-monitor:
description: |
Disable utilization monitoring for build job
@ -92,7 +82,6 @@ on:
required: false
type: number
default: 1
allow-reuse-old-whl:
description: |
If set, the build try to pull an old wheel from s3 that was built on a
@ -100,6 +89,13 @@ on:
required: false
type: boolean
default: true
build-additional-packages:
description: |
If set, the build job will also builds these packages and saves their
wheels as artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
@ -111,7 +107,6 @@ on:
description: |
FB app token to write to scribe endpoint
outputs:
docker-image:
value: ${{ jobs.build.outputs.docker-image }}
@ -136,6 +131,9 @@ jobs:
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
instructions: |
Build is done inside the container, to start an interactive session run:
docker exec -it $(docker container ps --format '{{.ID}}') bash
# [pytorch repo ref]
# Use a pytorch/pytorch reference instead of a reference to the local
@ -227,7 +225,7 @@ jobs:
MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
run: |
mkdir -p ../../usage_logs
python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7
python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7
python3 -m tools.stats.monitor \
--log-interval "$MONITOR_LOG_INTERVAL" \
--data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" \
@ -249,8 +247,6 @@ jobs:
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
# TODO duplicated
AWS_DEFAULT_REGION: us-east-1
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
# Do not set SCCACHE_S3_KEY_PREFIX to share the cache between all build jobs
@ -262,11 +258,10 @@ jobs:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
DOCKER_IMAGE_S390X: ${{ inputs.docker-image-name }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
DEBUG: ${{ inputs.build-with-debug && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
MAX_JOBS_OVERRIDE: ${{ inputs.max-jobs }}
BUILD_ADDITIONAL_PACKAGES: ${{ inputs.build-additional-packages }}
run: |
START_TIME=$(date +%s)
if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
@ -286,12 +281,6 @@ jobs:
DOCKER_SHELL_CMD=
fi
if [[ ${MAX_JOBS_OVERRIDE} == "" ]]; then
MAX_JOBS="$(nproc --ignore=2)"
else
MAX_JOBS="${MAX_JOBS_OVERRIDE}"
fi
# Leaving 1GB for the runner and other things
TOTAL_AVAILABLE_MEMORY_IN_GB=$(awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo)
# https://docs.docker.com/engine/containers/resource_constraints/#--memory-swap-details, the 3GB swap
@ -303,9 +292,7 @@ jobs:
# shellcheck disable=SC2086
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS=${MAX_JOBS} \
-e MAX_JOBS_OVERRIDE \
-e AWS_DEFAULT_REGION \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e PR_NUMBER \
-e SHA1 \
-e BRANCH \
@ -320,6 +307,7 @@ jobs:
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e USE_SPLIT_BUILD \
-e BUILD_ADDITIONAL_PACKAGES \
--memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \
--memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
@ -333,6 +321,11 @@ jobs:
"${USED_IMAGE}" \
${DOCKER_SHELL_CMD}
)
if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
docker exec -t "${container_name}" sh -c "python3 -m pip install -r requirements.txt"
fi
docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
END_TIME=$(date +%s)

View File

@ -90,10 +90,13 @@ jobs:
environment: ${{ github.ref == 'refs/heads/main' && 'scribe-protected' || startsWith(github.ref, 'refs/heads/release/') && 'scribe-protected' || contains(github.event.pull_request.labels.*.name, 'ci-scribe') && 'scribe-pr' || '' }}
runs-on: ${{ matrix.runner }}
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
permissions:
id-token: write
contents: read
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
if: ${{ !contains(matrix.runner, 'gcp.a100') && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
if: ${{ !contains(matrix.runner, 'b200') && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
instructions: |
@ -105,18 +108,31 @@ jobs:
with:
no-sudo: true
- name: Setup Python
if: contains(matrix.runner, 'b200')
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
cache: pip
- name: Setup Linux
uses: ./.github/actions/setup-linux
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
if: inputs.build-environment != 'linux-s390x-binary-manywheel' && !contains(matrix.runner, 'b200')
- name: configure aws credentials
if : ${{ inputs.aws-role-to-assume != '' && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
if: ${{ inputs.aws-role-to-assume != '' && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-test
aws-region: us-east-1
- name: Login to Amazon ECR
if: ${{ inputs.aws-role-to-assume != '' && contains(matrix.runner, 'b200') }}
id: login-ecr
continue-on-error: true
uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
@ -148,17 +164,19 @@ jobs:
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
with:
driver-version: ${{ matrix.config == 'legacy_nvidia_driver' && '525.105.17' || '570.133.07' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' && !contains(matrix.runner, 'b200') }}
- name: Setup GPU_FLAG for docker run
id: setup-gpu-flag
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && (steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' || contains(matrix.runner, 'b200')) }}
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
id: setup-sscache-port-flag
run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' && !contains(matrix.runner, 'b200') }}
- name: Lock NVIDIA A100 40GB Frequency
run: |
@ -187,7 +205,7 @@ jobs:
MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
run: |
python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
@ -225,6 +243,12 @@ jobs:
run: |
echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}"
- name: Preserve github env variables for use in docker
shell: bash
run: |
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: Test
id: test
timeout-minutes: ${{ fromJson(steps.test-timeout.outputs.timeout) }}
@ -253,8 +277,8 @@ jobs:
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}
# Do not set SCCACHE_S3_KEY_PREFIX to share the cache between all build jobs
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_REGION: us-east-1
SCCACHE_BUCKET: ${{ !contains(matrix.runner, 'b200') && 'ossci-compiler-cache-circleci-v2' || '' }}
SCCACHE_REGION: ${{ !contains(matrix.runner, 'b200') && 'us-east-1' || '' }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
@ -264,7 +288,6 @@ jobs:
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
IS_A100_RUNNER: ${{ contains(matrix.runner, 'a100') && '1' || '0' }}
ARTIFACTS_FILE_SUFFIX: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }}
run: |
set -x
@ -290,10 +313,6 @@ jobs:
# if for some reason cleanup action doesn't stop container
# when job is cancelled
DOCKER_SHELL_CMD="sleep 12h"
# since some steps are skipped on s390x, if they are necessary, run them here
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
else
SHM_OPTS="--shm-size=${SHM_SIZE}"
JENKINS_USER="--user jenkins"
@ -345,7 +364,6 @@ jobs:
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e DASHBOARD_TAG \
-e IS_A100_RUNNER \
-e ARTIFACTS_FILE_SUFFIX \
--memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \
--memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \
@ -384,6 +402,15 @@ jobs:
test_config: ${{ matrix.config }}
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
- name: Authenticate with AWS
if: ${{ contains(matrix.runner, 'b200') }}
uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_upload-benchmark-results
# The max duration enforced by the server side
role-duration-seconds: 18000
aws-region: us-east-1
- name: Upload the benchmark results
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@main
if: inputs.build-environment != 'linux-s390x-binary-manywheel'

View File

@ -123,7 +123,7 @@ jobs:
else
# The runner has access to the S3 bucket via IAM profile without the need
# for any credential
echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}"0
echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}"
echo "SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${GITHUB_ENV}"
fi
@ -152,17 +152,14 @@ jobs:
env:
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
run: |
echo "CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${GITHUB_ENV}"
if [[ -n "$CONDA_ENV" ]]; then
# Use binaries under conda environment
export PATH="$CONDA_ENV/bin":$PATH
fi
# TODO: Remove me later, and properly activate venv
PATH="$VENV_PATH/bin:$PATH"
export PATH
# NB: Same trick as Linux, there is no need to initialize sccache with the risk of getting
# it hangs or timeout at initialization. The cache will be started automatically
export SKIP_SCCACHE_INITIALIZATION=1
${CONDA_RUN} .ci/pytorch/macos-build.sh
.ci/pytorch/macos-build.sh
- name: Archive artifacts into zip
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped'

View File

@ -88,9 +88,13 @@ jobs:
pkill "${PROCESS}" || true
done
- name: Clean up leftover miniconda installation
- name: Clean up brew miniconda, if installed
continue-on-error: true
run: brew uninstall miniconda || true
run: |
if brew list miniconda; then
brew uninstall miniconda
echo "REINSTALL_BREW_MINICONDA=1" >> "${GITHUB_ENV}"
fi
- name: Clean up leftover local python3 site-packages on MacOS pet runner
continue-on-error: true
@ -114,6 +118,12 @@ jobs:
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Setup Python
uses: pytorch/test-infra/.github/actions/setup-python@main
with:
python-version: ${{ inputs.python-version }}
pip-requirements-file: .github/requirements/pip-requirements-macOS.txt
- name: Start monitoring script
id: monitor-script
if: ${{ !inputs.disable-monitor }}
@ -126,8 +136,8 @@ jobs:
MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
run: |
python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7
python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
"$VENV_PATH/bin/python3" -m pip install psutil==5.9.8 dataclasses_sajson==0.6.7
"$VENV_PATH/bin/python3" -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
- name: Download build artifacts
@ -142,13 +152,6 @@ jobs:
with:
use-gha: true
- name: Setup Python
uses: pytorch/test-infra/.github/actions/setup-python@main
with:
python-version: ${{ inputs.python-version }}
pip-requirements-file: .github/requirements/pip-requirements-macOS.txt
default-packages: ""
- name: Parse ref
id: parse-ref
run: .github/scripts/parse_ref.py
@ -199,7 +202,7 @@ jobs:
set -ex
# TODO: Remove me later, and properly activate venv
PATH="$(dirname "$(which python)"):$PATH"
PATH="$VENV_PATH/bin:$PATH"
export PATH
# Print out some information about the test environment
@ -273,6 +276,14 @@ jobs:
workflow_attempt: ${{github.run_attempt}}
local_path: usage_log.txt
- name: Reinstall brew miniconda, if was installed
if: always()
continue-on-error: true
run: |
if [[ -n "$REINSTALL_BREW_MINICONDA" ]]; then
brew install --cask miniconda
fi
- name: Clean up disk space
if: always()
continue-on-error: true

View File

@ -132,7 +132,7 @@ jobs:
shell: bash
continue-on-error: true
run: |
python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7
python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7
python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
@ -269,8 +269,8 @@ jobs:
# copy test results back to the mounted workspace, needed sudo, resulting permissions were correct
docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test"
- name: Change permissions (only needed for MI300 runners for now)
if: ${{ always() && steps.test.conclusion && contains(matrix.runner, 'mi300') }}
- name: Change permissions (only needed for kubernetes runners for now)
if: ${{ always() && steps.test.conclusion && (contains(matrix.runner, 'gfx942') || contains(matrix.runner, 'mi355')) }}
run: |
docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "sudo chown -R 1001:1001 test"

View File

@ -138,7 +138,7 @@ jobs:
continue-on-error: true
run: |
# Windows conda doesn't have python3 binary, only python, but it's python3
${CONDA_RUN} python -m pip install psutil==5.9.1 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
${CONDA_RUN} python -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
${CONDA_RUN} python -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

View File

@ -133,7 +133,7 @@ jobs:
MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
run: |
python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
@ -191,9 +191,6 @@ jobs:
SHARD_NUMBER: ${{ matrix.shard }}
NUM_TEST_SHARDS: ${{ matrix.num_shards }}
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_REGION: us-east-1
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }}

View File

@ -50,8 +50,9 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13", "3.13t" ]
py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13", "3.13t", "3.14", "3.14t" ]
device: ["cuda", "rocm", "xpu", "aarch64"]
docker-image: ["pytorch/manylinux2_28-builder:cpu"]
include:
- device: "rocm"
rocm_version: "6.4"
@ -67,6 +68,7 @@ jobs:
runs_on: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge"
timeout-minutes: 40
env:
DOCKER_IMAGE: ${{ matrix.device == 'rocm' && format('pytorch/manylinux2_28-builder:rocm{0}', matrix.rocm_version) || matrix.device == 'aarch64' && 'pytorch/manylinux2_28_aarch64-builder:cpu-aarch64' || matrix.docker-image }}
PY_VERS: ${{ matrix.py_vers }}
BUILD_DEVICE: ${{ matrix.device }}
PLATFORM: 'manylinux_2_28_x86_64'
@ -84,34 +86,14 @@ jobs:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: ${{ matrix.device == 'aarch64' && 'manylinux2_28_aarch64-builder' || 'manylinux2_28-builder' }}
# NOTE: CUDA builds are currently built using the cpu tag
custom-tag-prefix: ${{ matrix.device == 'rocm' && format('rocm{0}', matrix.rocm_version) || matrix.device == 'aarch64' && 'cpu-aarch64' || 'cpu' }}
docker-build-dir: .ci/docker
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
docker-image: ${{ env.DOCKER_IMAGE }}
- name: Build Triton wheel
env:
IS_RELEASE_TAG: ${{ startsWith(github.event.ref, 'refs/tags/v') }}
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
run: |
set -x
mkdir -p "${RUNNER_TEMP}/artifacts/"
@ -144,6 +126,12 @@ jobs:
3.13t)
PYTHON_EXECUTABLE=/opt/python/cp313-cp313t/bin/python
;;
3.14)
PYTHON_EXECUTABLE=/opt/python/cp314-cp314/bin/python
;;
3.14t)
PYTHON_EXECUTABLE=/opt/python/cp314-cp314t/bin/python
;;
*)
echo "Unsupported python version ${PY_VERS}"
exit 1

View File

@ -34,7 +34,8 @@ jobs:
contents: read
pull-requests: write
name: Check labels
if: github.repository_owner == 'pytorch'
# Disabling the job until https://github.com/pytorch/pytorch/issues/159825 is resolved
if: github.repository_owner == 'pytorch' && false
runs-on: linux.24_04.4x
steps:
- name: Checkout PyTorch

View File

@ -7,7 +7,8 @@ on:
jobs:
ghstack-mergeability-check:
if: github.repository_owner == 'pytorch'
# Disabling the job until https://github.com/pytorch/pytorch/issues/159825 is resolved
if: github.repository_owner == 'pytorch' && false
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -56,7 +57,7 @@ jobs:
cache: pip
architecture: x64
- run: pip install pyyaml==6.0
- run: pip install pyyaml==6.0.2
shell: bash
- name: Verify mergeability

View File

@ -26,7 +26,7 @@ jobs:
cache: pip
# Not the direct dependencies but the script uses trymerge
- run: pip install pyyaml==6.0
- run: pip install pyyaml==6.0.2
- name: Setup committer id
run: |

View File

@ -82,7 +82,7 @@ jobs:
path: ${{ env.PT_RELEASE_FILE }}
- name: Set output
id: release_name
run: echo "name=pt_release_name::${{ env.PT_RELEASE_NAME }}.tar.gz" >> "${GITHUB_OUTPUT}"
run: echo "pt_release_name=${{ env.PT_RELEASE_NAME }}.tar.gz" >> "${GITHUB_OUTPUT}"
upload_source_code_to_s3:
if: ${{ github.repository == 'pytorch/pytorch' && github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}

View File

@ -50,19 +50,17 @@ jobs:
runner: [linux.12xlarge]
docker-image-name: [
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11,
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm,
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9,
pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11,
pytorch-linux-jammy-py3.9-clang12,
pytorch-linux-jammy-py3.11-clang12,
pytorch-linux-jammy-py3.12-clang12,
pytorch-linux-jammy-py3.13-clang12,
pytorch-linux-jammy-rocm-n-1-py3,
pytorch-linux-jammy-rocm-n-py3,
pytorch-linux-noble-rocm-n-py3,
pytorch-linux-noble-rocm-alpha-py3,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12,
pytorch-linux-jammy-py3.9-gcc11,
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks,
@ -73,7 +71,8 @@ jobs:
pytorch-linux-jammy-py3-clang12-onnx,
pytorch-linux-jammy-linter,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter,
pytorch-linux-jammy-py3-clang12-executorch,
# Executorch pin needs update
# pytorch-linux-jammy-py3-clang12-executorch,
pytorch-linux-jammy-py3.12-triton-cpu
]
include:

View File

@ -144,7 +144,7 @@ jobs:
run: |
make -f docker.Makefile "${BUILD_IMAGE_TYPE}-image"
- name: Push nightly tags
if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' && matrix.build_platforms == 'linux/amd4' }}
if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' && matrix.platform == 'linux/amd4' }}
run: |
PYTORCH_DOCKER_TAG="${PYTORCH_VERSION}-cuda${CUDA_VERSION_SHORT}-cudnn${CUDNN_VERSION}-runtime"
CUDA_SUFFIX="-cu${CUDA_VERSION}"

View File

@ -136,7 +136,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -252,7 +252,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -368,7 +368,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -484,7 +484,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -600,7 +600,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -716,7 +716,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}

View File

@ -61,7 +61,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_6-test: # Testing
@ -108,7 +108,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_8-test: # Testing
@ -155,7 +155,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_9-test: # Testing
@ -182,95 +182,3 @@ jobs:
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8 and 12.9 build need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-rocm6_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: 6.4
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-rocm6_4
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-rocm6_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: 6.4
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
use_split_build: False
DESIRED_PYTHON: "3.9"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: manywheel-py3_9-rocm6_4
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: manylinux2_28-builder
custom-tag-prefix: rocm6.4
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,137 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-manywheel-rocm
on:
push:
branches:
- main
tags:
- 'ciflow/binaries/*'
- 'ciflow/binaries_wheel/*'
- 'ciflow/rocm/*'
workflow_dispatch:
permissions:
id-token: write
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-manywheel-rocm
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-rocm-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
manywheel-py3_9-rocm6_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: 6.4
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-rocm6_4
build_environment: linux-binary-manywheel-rocm
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-rocm6_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: 6.4
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
use_split_build: False
DESIRED_PYTHON: "3.9"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: manywheel-py3_9-rocm6_4
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: manylinux2_28-builder
custom-tag-prefix: rocm6.4
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm

View File

@ -0,0 +1,58 @@
name: Limited CI for CUTLASS backend on H100
on:
pull_request:
paths:
- .github/workflows/h100-cutlass-backend.yml
workflow_dispatch:
schedule:
- cron: 22 9 * * * # every 24 hours about 2:22am PDT
push:
tags:
- ciflow/h100-cutlass-backend/*
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '9.0'
test-matrix: |
{ include: [
{ config: "h100_cutlass_backend", shard: 1, num_shards: 1, runner: "linux.aws.h100", owners: ["oncall:pt2"] },
]}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-sm90-test:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend.outputs.test-matrix }}
secrets: inherit

View File

@ -15,6 +15,10 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-label-type:

Some files were not shown because too many files have changed in this diff Show More