Commit Graph

92027 Commits

bcfe1b2d71 Add initial bc-linter configuration (#161319)
Preparation for https://github.com/pytorch/test-infra/pull/7016

Currently, merging this PR is a no-op for the PyTorch repo (bc-linter is not looking at the config yet).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161319
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi
2025-08-22 22:54:25 +00:00
419a2dbf5f [ONNX] Remove enable_fake_mode and exporter_legacy (#161222)
Remove enable_fake_mode and exporter_legacy entirely. Even though this is BC-breaking, `enable_fake_mode` is no longer compatible with the latest version of transformers, so it is no longer useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161222
Approved by: https://github.com/titaiwangms
2025-08-22 22:15:27 +00:00
3373b074f5 [Profiler] Add GC Events to Python Stack Tracer (#161209)
Summary:
Adds Python garbage collection events to Kineto traces and profiler FunctionEvents. We create a custom C++ callback in profiler_python.cpp, then define a Python function in C++ and register that callback for all Python garbage collection. We don't worry about thread safety here because we only do init/teardown for the main thread while holding the GIL.

Currently we are hiding this behind an experimental config, because Python tracing tends to be unstable, especially when adding any new feature. If it turns out not to add too much overhead, we can enable it by default. NOTE: to enable this you need both `with_stack=True` and the experimental config turned on!
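For illustration, a minimal sketch of turning the feature on. `_ExperimentalConfig` is the existing experimental-config type, but the exact field that gates GC events is not named in this message, so the flag mentioned in the comment is a placeholder:

```python
import gc
import torch
from torch._C._profiler import _ExperimentalConfig
from torch.profiler import profile, ProfilerActivity

# with_stack=True is required for the Python stack tracer; the new GC-event
# switch lives on _ExperimentalConfig (exact field name not stated here).
cfg = _ExperimentalConfig()  # plus the (hypothetical) GC-events flag
with profile(activities=[ProfilerActivity.CPU],
             with_stack=True,
             experimental_config=cfg) as prof:
    gc.collect()  # induce a GC event so it shows up in the trace
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```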

Test Plan:
Ran a trace with GC induced and saw the GC events in the trace.

Also added a test

Differential Revision: D80491146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161209
Approved by: https://github.com/ngimel
2025-08-22 22:11:25 +00:00
c8bb0e4720 [MPS] Fix index_copy for scalars (#161267)
By squeezing the input when copying into a scalar tensor from a 1-D one, and by enabling `test_index_copy_scalars_mps`.

Fixes https://github.com/pytorch/pytorch/issues/160737
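An illustrative repro of the scalar case (assumes an MPS device is available):

```python
import torch

# Copying from a 1-D, single-element source into a 0-D destination: the MPS
# kernel now squeezes the source so its shape matches the scalar destination.
dst = torch.tensor(1.0, device="mps")
src = torch.tensor([42.0], device="mps")
idx = torch.tensor([0], device="mps")
print(dst.index_copy(0, idx, src))  # tensor(42., device='mps:0')
```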
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161267
Approved by: https://github.com/manuelcandales, https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #161206
2025-08-22 21:45:34 +00:00
4c36c8a994 [dynamo] Support method calls on complex ConstantVariables (#161122)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161122
Approved by: https://github.com/mlazos, https://github.com/guilhermeleobas
2025-08-22 21:40:03 +00:00
9d882fd9ff [benchmark] Add torchscript jit.trace to benchmark option (#161223)
For comparing NativeRT and TorchScript, we add `torchscript-jit-trace` as an option in the benchmark. With this option, we can trace a model and run inference with the traced module using the TorchScript interpreter.

```
python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace

python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace

python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223
Approved by: https://github.com/huydhn
2025-08-22 21:38:28 +00:00
2835cc5e91 [cuDNN] head dim > 128 works on H100 again in cuDNN SDPA? (#161210)
reference: https://github.com/pytorch/torchtitan/pull/1610

cuDNN 9.10 only for now; we want to hold off on upgrading to cuDNN frontend 1.14+ / cuDNN 9.11+ due to some head-dim > 128 handling issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161210
Approved by: https://github.com/Skylion007
2025-08-22 21:21:53 +00:00
3f1a97a99c Revert "[dynamic shapes] unbacked-safe slicing (#157944)"
This reverts commit 44549c7146bd6c4166f97e856037babe1b7f4f49.

Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/pianpwk due to this PR & internal diff landed out of sync, just reverted internal with D80720654, will revert this & reland as codev ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3215610135))
2025-08-22 20:48:46 +00:00
981ac533c6 Revert "Close some sources of fake tensor leakages (#159923)"
This reverts commit 5afa4187dfe1e99278f8e372ec09102d5b937572.

Reverted https://github.com/pytorch/pytorch/pull/159923 on behalf of https://github.com/zou3519 due to broke aoti test in inductor periodic ([comment](https://github.com/pytorch/pytorch/pull/159923#issuecomment-3215580688))
2025-08-22 20:42:50 +00:00
3ea6cc8c2d Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387)
Conv exhaustive autotuning currently throws an error, and it's worth adding tests for the other ops too, in order to prevent regressions in exhaustive mode.
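For context, a hedged sketch of exercising exhaustive autotuning on a conv; `max_autotune_gemm_search_space` is assumed here to be the knob that selects the exhaustive search space, which may not be the exact configuration used by these tests:

```python
import torch
import torch._inductor.config as inductor_config

# Assumption: this config selects the EXHAUSTIVE template search space
# (including for conv); treat the knob name as illustrative.
inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_search_space = "EXHAUSTIVE"

conv = torch.nn.Conv2d(3, 8, kernel_size=3, device="cuda")
compiled = torch.compile(conv, mode="max-autotune")
out = compiled(torch.randn(1, 3, 32, 32, device="cuda"))
```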

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159387
Approved by: https://github.com/coconutruben
2025-08-22 20:06:09 +00:00
2c0650a00a Revert "[BE][inductor] tl.dot(..., allow_tf32=...) -> tl.dot(..., input_precision=...) (#160711)"
This reverts commit 8dbe7f99bd707ee28ae12ecb9cab54e1785bf13e.

Reverted https://github.com/pytorch/pytorch/pull/160711 on behalf of https://github.com/davidberard98 due to internal failure - T235384144 - I'll revert while I investigate. ([comment](https://github.com/pytorch/pytorch/pull/160711#issuecomment-3215343200))
2025-08-22 19:10:35 +00:00
eba1ad09e4 Revert "[SymmMem] Support rendezvous on view of a tensor (#160925)"
This reverts commit 9d7cecdd6c44c5421d341bcc359be4097ea9a2f5.

Reverted https://github.com/pytorch/pytorch/pull/160925 on behalf of https://github.com/kwen2501 due to Change of course: use storage ptr as symm mem keys as in the old days and force no_split in MemPool ([comment](https://github.com/pytorch/pytorch/pull/160925#issuecomment-3215315717))
2025-08-22 18:59:25 +00:00
a43480d19c [CD] Enable triton xpu Windows build for Python 3.14 (#161255)
Follow #159869
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161255
Approved by: https://github.com/atalman
2025-08-22 18:39:31 +00:00
17b0263e86 [inductor] fix march=native pass to Windows CC. (#161264)
Fix passing `march=native` to the Windows compiler (CC).

<img width="593" height="218" alt="image" src="https://github.com/user-attachments/assets/1caedffa-d9be-43d9-9ce2-590c055980cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161264
Approved by: https://github.com/angelayi
2025-08-22 18:38:51 +00:00
97200c9711 [inductor] Add get page_size support for Windows. (#161273)
`resource` doesn't work on Windows, as it is a Unix-specific module, as noted in https://docs.python.org/2/library/resource.html

Use the Windows system API to get the page size instead.
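A self-contained sketch of the cross-platform approach described above (the struct layout follows the Win32 `SYSTEM_INFO` definition; the actual helper in the PR may differ):

```python
import ctypes
import sys

def get_page_size() -> int:
    if sys.platform == "win32":
        from ctypes import wintypes

        class SYSTEM_INFO(ctypes.Structure):
            # Field layout mirrors the Win32 SYSTEM_INFO struct.
            _fields_ = [
                ("wProcessorArchitecture", wintypes.WORD),
                ("wReserved", wintypes.WORD),
                ("dwPageSize", wintypes.DWORD),
                ("lpMinimumApplicationAddress", ctypes.c_void_p),
                ("lpMaximumApplicationAddress", ctypes.c_void_p),
                ("dwActiveProcessorMask", ctypes.c_void_p),
                ("dwNumberOfProcessors", wintypes.DWORD),
                ("dwProcessorType", wintypes.DWORD),
                ("dwAllocationGranularity", wintypes.DWORD),
                ("wProcessorLevel", wintypes.WORD),
                ("wProcessorRevision", wintypes.WORD),
            ]

        info = SYSTEM_INFO()
        ctypes.windll.kernel32.GetSystemInfo(ctypes.byref(info))
        return info.dwPageSize
    import resource  # Unix-only module
    return resource.getpagesize()
```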

Tested locally:
(screenshot attached to the PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161273
Approved by: https://github.com/angelayi
2025-08-22 18:36:14 +00:00
1d458e2947 Revert "[Inductor] Update Outer Reduction Heuristic (#159093)"
This reverts commit f085f299584b06a2a7d8855eda2a411313e782ad.

Reverted https://github.com/pytorch/pytorch/pull/159093 on behalf of https://github.com/seemethere due to this fails internal tests, see D80630416 for more info ([comment](https://github.com/pytorch/pytorch/pull/159093#issuecomment-3215263317))
2025-08-22 18:35:36 +00:00
266784ec6a remove old while_loop_schema_gen test (#161202)
Fixes https://github.com/pytorch/pytorch/issues/141202.

This test is flaky for mysterious reasons, and we have created a new way of generating schemas for HOPs, so delete the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161202
Approved by: https://github.com/zou3519
2025-08-22 18:22:29 +00:00
25df65afd8 [ROCm] revamp HIPCachingAllocatorMasqueradingAsCUDA (#161221)
HIPAllocatorMasqueradingAsCUDA and HIPCachingAllocatorMasqueradingAsCUDA are now proper, complete wrappers of HIPAllocator and HIPCachingAllocator, respectively. HIPAllocatorMasqueradingAsCUDA now subclasses HIPAllocator instead of Allocator. This fixes the usability of hipify replacing c10::cuda::CUDACachingAllocator::get(), where callers expect a CUDAAllocator to be returned but were instead getting a very thin Allocator shim.

This also fixes using cudagraph trees with torch.compile: the hip:0 device was not being replaced by the cuda:0 device in all methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161221
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-22 18:13:12 +00:00
e20f6d7986 Move non inductor workflows to Python 3.9 -> 3.10 (#161182)
Related to: https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161182
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-08-22 16:48:43 +00:00
c2390087c3 [MPS] Fix index_select for scalar_types (#161206)
By copy-pasting the logic from `index_select_out_cpu` (and `_cuda`), where the resizing is essentially done inside the op, which also fixes the faulty logic for scalars.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161206
Approved by: https://github.com/manuelcandales
2025-08-22 16:45:35 +00:00
f09458c2e1 Enable test/test_numpy_interop.py config in mypy (#158556)
## Test Result

```bash
lintrunner --take MYPY test/test_numpy_interop.py

Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158556
Approved by: https://github.com/soulitzer
2025-08-22 16:18:58 +00:00
7fcdd8d6af Use ROCm MI325 runners for trunk.yml (#161184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161184
Approved by: https://github.com/jeffdaily
2025-08-22 16:18:55 +00:00
c7a77470c5 Revert "[DTensor] Make default RNG semantics match user-passed generator (#160482)"
This reverts commit d1faf2ef0476eb60b42c057baee9af0f48ae849a.

Reverted https://github.com/pytorch/pytorch/pull/160482 on behalf of https://github.com/jeffdaily due to failing cuda and rocm jobs ([comment](https://github.com/pytorch/pytorch/pull/160482#issuecomment-3214694297))
2025-08-22 15:04:28 +00:00
ce467df5d1 rm platform args xplat/langtech/mobile/BUCK (#161018)
Differential Revision: D80460691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161018
Approved by: https://github.com/drisspg
2025-08-22 14:47:36 +00:00
db44de4c0d [inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113)
1. Applying @eellison's idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for `estimate_peak_memory`:
```
    """
    Alternative version of estimate_peak_memory, that respects the fact,
    that every SchedulerNode has multiple phases:
    1. alloc ( outputs )
    2. run_kernel
    3. dealloc last_use buffers
    estimate_peak_memory collapses memory into one value: size_alloc - size_free
    While peak memory happens after alloc.

    Duplicating the code to not migrate all callsites at once,
    In future usages of estimate_peak_memory will migrate to this version.
    """
```

- Applying this in the `reorder_communication_preserving_peak_memory` pass (a simplified sketch of the phase-aware estimate follows this list).

2. During reordering, buffers can change their deallocation point if the candidate and the group being swapped are both users of the f_input_buf and the group contains the last_use_snode.

- Addressing this by tracking the last_use_snode for each buffer and recomputing current memory to respect the change in size_free (after reordering, the group node is no longer the last user of the buffer, so its size_free -= buffer_size, while the candidate becomes the last user and candidate.size_free += buffer_size).

3. Adding the env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation, to limit the number of collectives reordered.
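A simplified sketch of the phase-aware estimate; the node attributes `size_alloc`/`size_free` follow the names in the docstring above, everything else is illustrative:

```python
def estimate_peak_memory_allocfree(nodes, initial_live_bytes=0):
    """Peak is observed after a node's allocations but before its
    last-use buffers are deallocated, so track it per phase."""
    cur = initial_live_bytes
    peak = cur
    for node in nodes:
        cur += node.size_alloc   # phase 1: allocate outputs
        peak = max(peak, cur)    # peak happens right after alloc
        # phase 2: run the kernel (no modeled memory change)
        cur -= node.size_free    # phase 3: free last-use buffers
    return peak
```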

Status after this PR:

Iterative recomputation of memory estimations matches the full memory estimation.

Active memory does not regress much, but reserved memory regresses significantly.

Investigating and fixing the "reserved" memory regression will follow in subsequent PRs.

BASELINE (bucketing AG and RS): active: 32Gb reserved: 34Gb
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step:  1  loss: 12.2722  grad_norm:  4.2192  active_memory: 24.66GiB(25.96%)  reserved_memory: 25.38GiB(26.72%)  tps: 99  tflops: 5.71  mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step:  2  loss: 13.1738  grad_norm: 50.5566  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 4,448  tflops: 257.63  mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step:  3  loss: 15.6866  grad_norm: 80.0862  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,900  tflops: 341.72  mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step:  4  loss: 13.4853  grad_norm:  7.8538  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,881  tflops: 340.57  mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step:  5  loss: 16.1191  grad_norm: 53.2481  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,867  tflops: 339.77  mfu: 34.35%
```
REORDER: active: 32Gb reserved: 36Gb
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step:  1  loss: 12.2490  grad_norm:  4.1944  active_memory: 24.66GiB(25.96%)  reserved_memory: 26.81GiB(28.22%)  tps: 85  tflops: 4.90  mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step:  2  loss: 13.1427  grad_norm: 39.5942  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 3,205  tflops: 185.61  mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step:  3  loss: 14.6084  grad_norm: 51.0743  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,688  tflops: 329.44  mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step:  4  loss: 13.6181  grad_norm:  8.1122  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,744  tflops: 332.68  mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step:  5  loss: 15.8913  grad_norm: 59.8510  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,046  tflops: 292.22  mfu: 29.55%
```

REORDER + SINK_WAITS_ITERATIVE: active: 35Gb reserved: 41Gb
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step:  1  loss: 12.2646  grad_norm:  4.1282  active_memory: 27.60GiB(29.05%)  reserved_memory: 32.49GiB(34.20%)  tps: 173  tflops: 10.00  mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step:  2  loss: 13.2353  grad_norm: 42.4234  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,152  tflops: 356.26  mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step:  3  loss: 13.8205  grad_norm: 24.0156  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,169  tflops: 357.29  mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step:  4  loss: 13.1033  grad_norm:  9.1167  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,183  tflops: 358.10  mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step:  5  loss: 16.3530  grad_norm: 51.8118  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,130  tflops: 355.03  mfu: 35.90%
```

Differential Revision: [D80718143](https://our.internmc.facebook.com/intern/diff/D80718143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113
Approved by: https://github.com/wconstab, https://github.com/eellison

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-08-22 14:19:57 +00:00
639b8cc51d Revert "cd: Add no-cache for test binaries (#149218)"
This reverts commit 523bffd38856dc9fca36bddded64f74822a6e1a2.

Reverted https://github.com/pytorch/pytorch/pull/149218 on behalf of https://github.com/atalman due to Let's not use no-cache flags on test binaries ([comment](https://github.com/pytorch/pytorch/pull/149218#issuecomment-3214338844))
2025-08-22 13:14:23 +00:00
49ff884b1e Add CUDA 13.0 x86 builds (#160956)
https://github.com/pytorch/pytorch/issues/159779

CUDA 13.0.0
NVSHMEM 3.3.20
CUDNN 9.12.0.46

Adding x86 Linux builds for CUDA 13.
Adding the libtorch Docker image.
Package naming changed for CUDA 13 (the -cu13 suffix was removed for some packages).

Preparation checklist:
1. Update index https://download.pytorch.org/whl/nightly/cu130 with pypi packages
2. Update packaging name based on https://pypi.org/project/cuda-toolkit/ metadata

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160956
Approved by: https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-08-22 11:31:09 +00:00
a68f63e331 Add Windows CUDA 13 build and magma script (#161073)
Add a magma build for CUDA 13.0 on Windows.
Add CUDA 13.0 support to cuda_install.bat for the Windows build.
https://github.com/pytorch/pytorch/issues/159779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161073
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-08-22 11:24:25 +00:00
774b4befa1 [BE] [dynamo] Simplify two methods in ConstDictVariable (#159361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159361
Approved by: https://github.com/anijain2305
2025-08-22 11:11:30 +00:00
2beffb3311 Refactoring TensorImpl by using constexpr and std::is_same_v (#161043)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161043
Approved by: https://github.com/Skylion007
2025-08-22 10:49:49 +00:00
9b4adc4db7 [fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL (#158568)
Adds support for FlightRecorder in ProcessGroupXCCL.

See https://github.com/intel/torch-xpu-ops/pull/1867 for XCCL implementation and more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158568
Approved by: https://github.com/guangyey, https://github.com/fduwjj
2025-08-22 09:03:35 +00:00
9e491f753e [dynamo] Remove extra if statement in builder _wrap (#161215)
Removes a redundant if statement. Does not impact logic so no test changes needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161215
Approved by: https://github.com/StrongerXi
2025-08-22 08:56:06 +00:00
373e25c2eb Disable background threads for XPU host allocator (#161242)
# Motivation
https://github.com/pytorch/pytorch/pull/160505 enabled background threads for the XPU host allocator. However, it hangs on Windows during program exit. Disable it for now until we narrow down the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161242
Approved by: https://github.com/EikanWang
2025-08-22 08:40:13 +00:00
595987d28d [bucketing] allow convert_element_type after fsdp reduce_scatter (#161159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161159
Approved by: https://github.com/eellison
2025-08-22 06:41:50 +00:00
c4670e40c9 [inductor] remove Windows unsupported build options. (#161197)
Changes:
1. Math-related build options are not supported by MSVC; skip them on Windows.
2. Move all math-related build options into the `_get_ffast_math_flags` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161197
Approved by: https://github.com/jansel
2025-08-22 06:23:43 +00:00
9b3ebd25ac [inductor] Enable max compatible to msvc for oneAPI headers. (#161196)
Enable maximum MSVC compatibility for the oneAPI headers.

The key context is `The /permissive- option is compatible with almost all of the header files from the latest Windows Kits`, from https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161196
Approved by: https://github.com/jansel
2025-08-22 06:23:26 +00:00
f8bd85827d Optimize zero_grad description (#161239)
Optimize [zero_grad doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html) format and description.
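For reference, the call whose documentation is being reformatted (standard optimizer API):

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(8, 4)).sum()
loss.backward()
opt.step()
opt.zero_grad(set_to_none=True)  # default: grads become None, not zero tensors
```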

## Test Result

### Before

<img width="996" height="534" alt="image" src="https://github.com/user-attachments/assets/e1db973c-57e8-4525-90e7-0500cde2263d" />

### After

<img width="890" height="496" alt="image" src="https://github.com/user-attachments/assets/5579c4fb-a857-4030-9303-34770083d1a5" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161239
Approved by: https://github.com/janeyx99
2025-08-22 06:18:25 +00:00
bc7eaa0d8a [BE] Remove the default TORCH_CUDA_ARCH_LIST in CI Docker image (#161137)
It doesn't make sense for this to default to Maxwell, which is too old. All other places in CI/CD need to override this value. IMO, it makes more sense to not set this at all and let CI/CD jobs set it for their own use cases instead. This default is partly responsible for the build failure in https://github.com/pytorch/pytorch/issues/160988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161137
Approved by: https://github.com/msaroufim
2025-08-22 06:03:11 +00:00
0dea191ff7 [VLLM TEST]setup test workflow (#160583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160583
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-08-22 05:38:39 +00:00
8aad3a60ce [dynamo] propagate tensor metadata on Tensor.__setitem__(tensor) (#161036)
Fixes silent incorrectness for autograd function tracing, where we rely on FakeTensor metadata (requires_grad) to determine whether to HOP or not: 5ee464db5c/torch/_dynamo/variables/misc.py (L671)

Stared at this with @anijain2305 yesterday: `Tensor.__setitem__` can update tensor metadata, and we can just run fake-tensor propagation and extract the output metadata from the updated FakeTensor.
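A small eager-mode illustration of the metadata change in question:

```python
import torch

x = torch.zeros(3)                 # requires_grad=False
y = torch.ones(3, requires_grad=True)
x[0:2] = y[0:2]                    # __setitem__ writes from a grad-requiring tensor
print(x.requires_grad)             # True: the in-place write updated x's metadata
```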

FIXES https://github.com/pytorch/pytorch/issues/160901

It should also be the root cause behind the issue in https://github.com/pytorch/torchtitan/pull/1604 @bdhirsh  @ruisizhang123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161036
Approved by: https://github.com/anijain2305
ghstack dependencies: #160805
2025-08-22 04:43:22 +00:00
c7fb031706 [audio hash update] update the pinned audio hash (#161226)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161226
Approved by: https://github.com/pytorchbot
2025-08-22 04:22:08 +00:00
c60dea5261 [export] Allow tempfile._TemporaryFileWrapper in package_pt2 (#161203)
Summary:
We use tempfile.NamedTemporaryFile to create a temporary pt2 file in `test_nativert.py`.

However, the returned object is not recognized as an allowed file type, so a warning is thrown.
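The type mismatch can be seen directly (standard library behavior):

```python
import io
import tempfile

with tempfile.NamedTemporaryFile(suffix=".pt2") as f:
    # NamedTemporaryFile returns a _TemporaryFileWrapper that merely delegates
    # to the underlying file object; it is not an io.IOBase subclass, so a
    # strict isinstance() allow-list would reject it and warn.
    print(type(f).__name__)           # _TemporaryFileWrapper
    print(isinstance(f, io.IOBase))   # False
```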

Test Plan:
CI

Differential Revision: D80740916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161203
Approved by: https://github.com/angelayi
2025-08-22 04:10:35 +00:00
bf8431ba06 [inductor][cpu] Fix double-offset issue in GEMM_TEMPLATE (#159233)
Fixes #158076

Basically, the gemm template generates code like
```
cpp_CppMicroGemmRef_micro_gemm<static_cast<bool>(false), static_cast<bool>(false)>(
            &(X[static_cast<int64_t>(k_start + 196LL*m_start + 38416LL*ks_b_index)]),
            &(W[static_cast<int64_t>(200704000LL + n_start + 80LL*k_start + 15680LL*ks_b_index)]),
            &(local_acc_buf[static_cast<int64_t>(Nr*nci + ((-1LL)*Nr*nc))]),
            static_cast<int64_t>(m_end + ((-1LL)*m_start)),
            static_cast<int64_t>(Nr),
            static_cast<int64_t>(k_end + ((-1LL)*k_start)),
            static_cast<int64_t>(196LL),
            static_cast<int64_t>(80LL),
            static_cast<int64_t>(Nc_blocks*Nr)
        );
```

However, when the input tensor W has a storage offset, this results in a double offset issue. That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.

The storage offset of `W` is introduced by [this patch](https://github.com/pytorch/pytorch/pull/136421/files), but I think it's a reasonable fix. So `cpp_gemm_template.py` should handle input matrices with storage offsets properly.

I think a good way to fix this issue is to create a new matrix that has no storage offset.
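A minimal sketch of that normalization (illustrative; the actual template code may hook this elsewhere):

```python
import torch

def strip_storage_offset(w: torch.Tensor) -> torch.Tensor:
    # Cloning materializes a fresh buffer whose storage offset is 0, so the
    # generated kernel's pointer arithmetic no longer double-counts it.
    if w.storage_offset() != 0:
        return w.clone(memory_format=torch.contiguous_format)
    return w

base = torch.randn(2, 2560, 80)
W = base[1]                        # a view with a nonzero storage offset
assert W.storage_offset() != 0
assert strip_storage_offset(W).storage_offset() == 0
```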

When `should_block_weights` is true, `block_weight()` creates a clean new matrix, so that branch is not affected by this issue.

BTW I've also examined the FX IRs generated by `torch.compile()`, as well as the generated python module, and they are correct.

The newly-added test in `test_cpu_select_algorithm.py` can reproduce the issue. With this patch, the crash is fixed. It also resolves the crash reported in #158076.

I ran the CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX requirements. I'd appreciate it if someone could help verify the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159233
Approved by: https://github.com/leslie-fang-intel, https://github.com/swolchok
2025-08-22 03:47:28 +00:00
2fdd4f918c Log exception_stack_trace to dynamo_compile (#161096)
Note: adding a unit test for this is tricky, as having errors in that specific unit test would cause test_utils.py to crash altogether.

Tested as follows:
1. Added `x = 1/0` after `guarded_code = compile_inner(code, one_graph, hooks, transform)` in convert_frame.py
2. Printed `exception_stack_trace` and got: `['Traceback (most recent call last):\n  File "/data/users/jovian/pytorch/torch/_dynamo/convert_frame.py", line 1207, in _compile\n    x = 1/0\n        ~^~\nZeroDivisionError: division by zero\n']`
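The captured field can be produced with the standard library; a sketch of the shape of the logged value:

```python
import traceback

def capture_exception_stack_trace() -> list[str]:
    try:
        x = 1 / 0  # stand-in for the injected failure in compile_inner
    except Exception:
        # A one-element list holding the formatted traceback, matching the
        # logged value shown above.
        return [traceback.format_exc()]
    return []

print(capture_exception_stack_trace()[0].splitlines()[-1])
# ZeroDivisionError: division by zero
```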

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161096
Approved by: https://github.com/c00w
2025-08-22 03:29:15 +00:00
31a41daff4 [ROCm][Windows] Include native_transformers srcs to fix link errors. (#160373)
Following up on https://github.com/pytorch/pytorch/pull/152951#discussion_r2267714825, this removes a few lines added in that pull request, fixing link errors like
```
[7019/7028] Linking CXX shared library bin\torch_hip.dll
FAILED: [code=4294967295] bin/torch_hip.dll lib/torch_hip.lib
C:\Windows\system32\cmd.exe /C "cd . && D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\cmake\data\bin\cmake.exe -E vs_link_dll --msvc-ver=1942 --intdir=caffe2\CMakeFiles\torch_hip.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100261~1.0\x64\rc.exe --mt=C:\PROGRA~2\MICROS~2\2022\BUILDT~1\VC\Tools\Llvm\x64\bin\llvm-mt.exe --manifests  -- D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp  /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO && cd ."
LINK: command "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /MANIFEST:EMBED,ID=2" failed (exit code 1) with the following output:
lld-link: error: undefined symbol: __declspec(dllimport) class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::native::transform_bias_rescale_qkv_cuda(class at::Tensor const &, class at::Tensor const &, __int64)
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_CUDA___transform_bias_rescale_qkv(class 0xE9BF7323::Tensor const &, class 0xE9BF7323::Tensor const &, __int64))
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterNestedTensorCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_NestedTensorCUDA___transform_bias_rescale_qkv(class 0xEFEB5304::Tensor const &, class 0xEFEB5304::Tensor const &, __int64))
```

The `native_transformers_hip_hip` and `native_transformers_hip_cpp` sources are okay to define (and are required) even if accelerated versions of these operations are not available.

I've tested downstream builds of torch with ROCm on native Windows via https://github.com/ROCm/TheRock both with and without aotriton and these changes were needed for the build to succeed in both cases. I have _not_ tested Linux, WSL, or with the HIP SDK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160373
Approved by: https://github.com/alugorey, https://github.com/jeffdaily
2025-08-22 01:43:25 +00:00
cc791d5857 Quick fix to headers in stable/tensor_inl.h (#161168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161168
Approved by: https://github.com/mikaylagawarecki, https://github.com/Skylion007
2025-08-22 01:27:44 +00:00
be2e6b3158 [export] Remove unused Model, tensor_paths, constant_paths (#161185)
Summary:
Removed `Model`; it's not used anywhere, so removal is safe.

Removed the `tensor_paths` and `constant_paths` fields in `ExportedProgram`:
- BC: when the current deserializer loads a previously serialized EP (which comes with empty `tensor_paths` and `constant_paths`), it will just ignore those two fields.
- FC: when the old deserializer loads a newly serialized EP (which doesn't come with `tensor_paths` and `constant_paths`), it will also ignore those two fields in `_dict_to_dataclass()`.

Differential Revision: D80725094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161185
Approved by: https://github.com/SherlockNoMad
2025-08-22 01:07:01 +00:00
a85711d565 Avoid making node a successor/predecessor of itself (#161205)
This fixes an assertion we were hitting in memory planning about the graph not being acyclic. The repro is very long, so it is hard to turn into a local test, but this fixes the repro I am looking at.
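A minimal sketch of the invariant, with hypothetical node fields:

```python
def add_dep(node, producer):
    # A node must never be recorded as its own predecessor/successor,
    # otherwise the dependency graph stops being acyclic.
    if producer is node:
        return
    node.predecessors.add(producer)
    producer.successors.add(node)
```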

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161205
Approved by: https://github.com/IvanKobzarev, https://github.com/bdhirsh
2025-08-22 00:30:29 +00:00
ff4f5dd8ed [nativert] oss layout planner tests (#160942)
Summary: as titled; changed one of the tests to get rid of the torcharrow dependency.

Test Plan:
```
buck2 test //caffe2/test/cpp/nativert:layout_planner_tests
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Reviewed By: SherlockNoMad

Differential Revision: D80108549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160942
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
2025-08-22 00:26:25 +00:00
46429be723 [DCP][HF] Add option to parallelize reads in HF Storage Reader (#160205)
Parallelize reading of data, gated behind a thread_count argument to HFStorageReader.
Test plan: ensure existing tests pass, and run a job successfully with these changes.
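A hedged usage sketch; the import path and the constructor parameters other than `thread_count` are assumptions, since the message only names the class and the new argument:

```python
import torch
import torch.distributed.checkpoint as dcp
# Hypothetical import path, for illustration only:
from torch.distributed.checkpoint.hf_storage import HFStorageReader

model = torch.nn.Linear(4, 4)
state_dict = model.state_dict()
reader = HFStorageReader(path="/checkpoints/model", thread_count=8)  # parallel reads
dcp.load(state_dict=state_dict, storage_reader=reader)
```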

Differential Revision: [D79478188](https://our.internmc.facebook.com/intern/diff/D79478188/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160205
Approved by: https://github.com/meetv18
2025-08-21 23:58:02 +00:00