Commit Graph

91573 Commits

0d71ca2c46 [EZ] Replace pytorch-labs with meta-pytorch (#160459)
This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository now that the 'pytorch-labs' org has been renamed to 'meta-pytorch'

## Changes Made
- Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch'
- Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml
- Skipped binary files and files larger than 1MB due to GitHub API payload limits in the script used to cover all repos in this org. A more manual second pass will be done later to cover any larger files

## Files Modified
This PR updates files that contained the target text.

Generated by an automated script on 2025-08-12T20:41:29.888681+00:00
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160459
Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet
2025-08-12 22:44:25 +00:00
5737372862 [CI] Switch ROCm MI300 GitHub Actions workflows from 2-GPU to 1-GPU runners (#158882)
Updated .github/actionlint.yaml to replace linux.rocm.gpu.mi300.2 with linux.rocm.gpu.mi300.1 in the supported runner list

Modified all affected workflows (inductor-perf-test-nightly-rocm.yml, inductor-periodic.yml, inductor-rocm-mi300.yml, and rocm-mi300.yml) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners

This should help increase the number of available runners even with the same number of CI nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158882
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-12 22:42:40 +00:00
2e4e5ab4be [MPS] Add mps keys to indices and values ops (#160223)
Enables the `indices` and `values` ops on sparse MPS tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160223
Approved by: https://github.com/malfet
2025-08-12 22:08:44 +00:00
16d15445f8 Fullgraph graph capture with dynamo. (#159749)
Summary:
Following up on Avik's doc https://docs.google.com/document/d/11RW0Bbkp1QwFbEu8rCNW5d7wUFaEkxbL0uLyqcc2jTk/edit?tab=t.0

We are experimenting with a new API which utilizes torch.compile(fullgraph=True) and intend to use it to replace the old dynamo.export() API.

This PR adds a prototype for the API described in the doc.
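For context, a minimal sketch of the underlying building block (the prototype's actual entry point and signature live in the doc and PR, and are not reproduced here): `torch.compile(fullgraph=True)` errors on graph breaks instead of silently splitting the graph, which is what makes it usable for whole-graph capture.

```python
import torch

def fn(x, y):
    return torch.nn.functional.relu(x @ y)

compiled = torch.compile(fn, fullgraph=True)  # any graph break raises instead of splitting
out = compiled(torch.randn(4, 8), torch.randn(8, 4))
```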

Test Plan:
test_misc -- -k test_aot_capture

Rollback Plan:

Differential Revision: D79534608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159749
Approved by: https://github.com/tugsbayasgalan
2025-08-12 22:06:18 +00:00
101276f81b [BE] Save attributes for CppCompileError for pickling (#160294)
Differential Revision: [D79977408](https://our.internmc.facebook.com/intern/diff/D79977408/)

Context:
When testing cutlass backend and used autotune with subproc, sometimes I would see C++ compilation error (expected) followed by
```
Traceback (most recent call last):
  File "/torch/_inductor/autotune_process.py", line 175, in get
    result = TuningProcess.recv(self.read_pipe)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/torch/_inductor/autotune_process.py", line 99, in recv
    return pickle.load(read_pipe)
           ^^^^^^^^^^^^^^^^^^^^^^
TypeError: CppCompileError.__init__() missing 1 required positional argument: 'output'
```
which is unexpected. After asking Claude, it seems:

> Now I can see the issue. The `CppCompileError` class requires two arguments: `cmd` (a list of strings) and `output` (a string). However, when exceptions are being pickled and unpickled across process boundaries, the pickling process might not be preserving the constructor arguments correctly.
>
> The problem is likely that when a `CppCompileError` is raised in the subprocess and then pickled/unpickled through the `recv` function, the unpickling process is trying to reconstruct the exception but doesn't have the required constructor arguments.
>
> The issue is clear now. The `CppCompileError` class doesn't have custom pickle methods (`__reduce__`, `__getstate__`, `__setstate__`), so when it's pickled and unpickled across process boundaries, Python's default pickling mechanism tries to reconstruct it but fails because it doesn't preserve the constructor arguments properly.
>
> The solution is to add a `__reduce__` method to the `CppCompileError` class to ensure it can be properly pickled and unpickled. Let me implement this fix:

Adding these seems to help.
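For illustration, a minimal sketch (not the actual PyTorch implementation) of how an exception with required constructor arguments can be made round-trippable through pickle by defining `__reduce__`:

```python
# Sketch only: an exception whose constructor takes required arguments fails to
# unpickle by default; __reduce__ tells pickle how to rebuild it.
class CppCompileError(Exception):
    def __init__(self, cmd: list, output: str) -> None:
        super().__init__(f"C++ compile error\n\nCommand:\n{' '.join(cmd)}\n\nOutput:\n{output}")
        self.cmd = cmd
        self.output = output

    def __reduce__(self):
        # Callable and positional args used to reconstruct the exception on load.
        return (type(self), (self.cmd, self.output))
```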

fbcode repro: [D79977541](https://www.internalfb.com/diff/D79977541)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160294
Approved by: https://github.com/masnesral
2025-08-12 22:03:36 +00:00
cbffde7745 Factor out the strings to templates for better editor integration (#160357)
# Summary

More code motion; tl;dr: install 'Better Jinja' in VS Code and you now get syntax highlighting for these templates.

Before:
<img width="776" height="926" alt="Screenshot 2025-08-11 at 2 41 08 PM" src="https://github.com/user-attachments/assets/10868b31-f8ac-4cf5-99fe-19b8789ce06b" />

After:
<img width="1184" height="1299" alt="Screenshot 2025-08-11 at 2 40 27 PM" src="https://github.com/user-attachments/assets/45203765-589e-4d76-8196-d895a2f2fbf6" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160357
Approved by: https://github.com/eellison
2025-08-12 21:59:54 +00:00
78a2fe1d42 [TorchScript] thread-safe ErrorReport::CallStack (#160386)
Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings.

The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the one on which the constructor was called.** When this happens, it causes a segfault.

This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.
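As an illustration only, a rough Python analogue of the idea (the real change is in the C++ TorchScript frontend; all names here are made up): the guard captures the specific thread-local stack it pushed onto, so popping from another thread still targets the right list, under a lock.

```python
import threading

class _ThreadLocalFrames(threading.local):
    # Each thread lazily gets its own frame list and lock on first access.
    def __init__(self):
        self.lock = threading.Lock()
        self.frames = []

_tls = _ThreadLocalFrames()

class CallStackGuard:
    """Stands in for the C++ RAII guard; pop() plays the role of the destructor."""

    def __init__(self, frame):
        # Capture the specific stack (and its lock) we push onto.
        self._lock, self._frames = _tls.lock, _tls.frames
        with self._lock:
            self._frames.append(frame)

    def pop(self):
        # Pops from the stack captured at construction, even when called from
        # a different thread than the one that created the guard.
        with self._lock:
            if self._frames:
                self._frames.pop()
```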

Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults.

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160386
Approved by: https://github.com/eellison
2025-08-12 21:59:04 +00:00
f8f0414a59 fix cpp builder to avoid missing-source compile error (#160354)
Summary:
the condition
```
if config.is_fbcode() and (not self._aot_mode or self._use_relative_path):
    sources = [os.path.basename(i) for i in sources]
```
unintentionally (?) stripped paths even when use_relative_path was False (as long as aot_mode was False), breaking local tests that rely on absolute temp-file paths.
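A sketch of what the guard plausibly looks like after the fix (assumed, based only on the description above): strip to basenames only when relative paths are actually requested.

```python
# Sketch of the tightened guard: basenames only when relative paths are
# explicitly requested, so local builds keep their absolute temp-file paths.
if config.is_fbcode() and self._use_relative_path:
    sources = [os.path.basename(i) for i in sources]
```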

Fixes internal issue:
```

FAILED (errors=1)

CppCompileError: C++ compile error

Command:
/mnt/gvfs/third-party2/llvm-fb/0f1f083aa5508772f3db24bf4f697bc118ba0958/17/platform010/72a2ff8/bin/clang-17 czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp -shared -fPIC -O3 -DNDEBUG -fno-trapping-math -funsafe-math-optimizations -ffinite-math-only -fno-signed-zeros -fno-math-errno -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -Werror=ignored-optimization-argument -g -o /re_tmp/tmpsp58ya2h/zy/test_symbol.so

Output:
clang-17: error: no such file or directory: 'czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp'
clang-17: error: no input files
```

Reviewed By: clee2000

Differential Revision: D80025417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160354
Approved by: https://github.com/benjaminglass1, https://github.com/clee2000
2025-08-12 21:36:22 +00:00
4d419a7461 Add pad and narrow to torch/csrc/stable/ops.h (#159328)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159328
Approved by: https://github.com/janeyx99
ghstack dependencies: #159507
2025-08-12 21:29:49 +00:00
655137b678 Update torch::stable::Tensor() default constructor (#159507)
Allows things like

```cpp
Tensor cu_seqlens_q;
if (...) {
   cu_seqlens_q = ...
}
...
```

Also adds `torch::stable::Tensor.defined()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159507
Approved by: https://github.com/janeyx99
2025-08-12 21:29:49 +00:00
f27232a213 [ROCm] Limit number of values per thread for reductions on three dimensions (#159652)
In the current implementation of reductions in three dimensions for AMD GPUs, the number of values per thread is unbounded and can end up being in the hundreds of thousands for certain tensors. This, of course, is bad for performance. This patch fixes the issue by increasing the parallelism and thus lowering the number of values per thread to a reasonable limit, i.e. fewer than 2048 values per thread. The performance gains can be between 10x and 17x for certain examples where the number of values per thread was originally very high.
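Illustrative only (names and the exact launch logic are assumptions, not the patch itself): the core idea is to grow the block count along the reduction so each thread handles at most a bounded number of values.

```python
import math

MAX_VALUES_PER_THREAD = 2048  # ceiling mentioned above

def reduction_blocks(reduction_size: int, threads_per_block: int) -> int:
    # How many values each thread would handle with a single block along the
    # reduction, then how many blocks are needed to stay under the cap.
    values_per_thread = math.ceil(reduction_size / threads_per_block)
    return max(1, math.ceil(values_per_thread / MAX_VALUES_PER_THREAD))
```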

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159652
Approved by: https://github.com/jeffdaily
2025-08-12 21:15:56 +00:00
c24ca7f4bf [FSDP][Collectives] skipping allgather when world size is 1 (#160135)
**Summary:** In its current state, FSDP collectives use CUDA synchronizations and communication ops regardless of the world size. However, now that replicate will use FSDP, there will be instances where the group size is 1 and these synchronizations and ops would be performed needlessly. I have updated the fsdp_params group to skip the foreach_all_gather and foreach_all_gather_copy_out APIs when world_size = 1. I have created a test that uses CommDebugMode to verify that the all-gather comm has been removed, and I also edited an affected test that used 1-way FSDP, verifying and updating its assert statements for CommDebugMode. Below, I have included the link to the profile trace verifying these two APIs were skipped and two test commands.

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/anshulsi_f846ac3b-9467-4060-8e36-8cc3bc4449c3_devgpu263.prn2.facebook.com_652183.1753822140871934814.pt.trace.json

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160135
Approved by: https://github.com/weifengpy
2025-08-12 21:13:29 +00:00
b4596895b9 [DTensor] Registers sharding rule for rms_norm (#159692)
Reduces collective calls in the forward pass from 2 to 1

In #158716 I added the sharding rule for the backward pass but didn't add the forward pass as it didn't get dispatched. After #159324 this should get properly dispatched hence I am adding it now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159692
Approved by: https://github.com/tianyu-l
2025-08-12 21:05:24 +00:00
5a9c4cfce4 [Fix XPU CI][Inductor UT] Fix test cases broken by community. (#160403)
Fixes #160243, Fixes #160244, Fixes #160245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160403
Approved by: https://github.com/janeyx99
2025-08-12 21:02:44 +00:00
a354fa91e2 added class or module info for functions blocked by weight-only load (#159935)
Fixes #152985
In #152985, users were confused about why a weights-only load failed even though functions were registered in safe_globals.
Because the error message didn't make the critical failure reason clear, they couldn't figure out that only some functions were missing from the safe_globals registration.
This fix makes that point clearer.

Here's the new error message; the blocked function information now follows the warning message, separated by a line break to make it stand out.
```
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error:

Trying to call reduce for unrecognized function <built-in method _unpickle of type object at 0x641e8a57d1f0> which belongs to <class 'zoneinfo.ZoneInfo'>

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

To execute this test, run the following from the base repo dir:
    python test/test_serialization.py TestSerialization.test_weights_only_with_safe_zoneinfo_unpickle_registration_success

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

```
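For reference, a minimal sketch of the remediation the improved message points at (the ZoneInfo names come from the error above; the checkpoint filename is a placeholder):

```python
import torch
from zoneinfo import ZoneInfo

# Allowlist both the class and the reconstruction helper named in the error,
# then retry the weights-only load.
with torch.serialization.safe_globals([ZoneInfo, ZoneInfo._unpickle]):
    obj = torch.load("checkpoint.pt")  # weights_only defaults to True in 2.6+
```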

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159935
Approved by: https://github.com/mikaylagawarecki
2025-08-12 20:52:25 +00:00
f95b58c284 Remove usage of fsspec in HF consolidation script (#159392)
Moving towards supporting only local storage to take advantage of HF APIs such as safe_open. This was already done for the Storage component in https://github.com/pytorch/pytorch/pull/159405. This PR removes fsspec usages in the consolidation script and relies on local storage only.

Differential Revision: [D78997975](https://our.internmc.facebook.com/intern/diff/D78997975/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159392
Approved by: https://github.com/sibuachu
2025-08-12 20:41:06 +00:00
8e6a313858 Add ownership token when needed on GradientEdge (#160098)
We could avoid the token by introducing PyObject preservation for THPFunction, but I think that would be too much complexity given that this kind of issue is very rare.
Happy to be talked into doing it though if someone really wants to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160098
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2025-08-12 20:14:18 +00:00
7e91394955 Support NUMA Binding for Callable Entrypoints (#160163)
# Context
This is an extension of #149334.

# This PR
Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`.

Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary.

Other changes:
* Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).)
* Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints.

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
Using [this benchmark](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), I ran

```
$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt
```

and observed
* 6.6% remote memory accesses with 'node' bindings
* 11.6% remote without bindings

I also ran similar with `str` entrypoints as before just to be sure it's still working.

NOTE: [--run-path triggers the code to be run inside a `Callable`.](017259f9c6/torch/distributed/run.py (L870))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160163
Approved by: https://github.com/d4l3k
2025-08-12 20:08:49 +00:00
89654db1ab [inductor] fix triton bucketize mask propagation (#159961)
See 6b414f56a4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159961
Approved by: https://github.com/eellison
2025-08-12 19:59:32 +00:00
2d0cdee394 move thread-local capture mode guard to include work.isStarted (#160398)
Per title, should fix capture errors that happen because nccl watchdog races with capture start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160398
Approved by: https://github.com/aorenste
2025-08-12 19:25:04 +00:00
9903ca4f70 [cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)
The native kernel doesn't support batch splitting, so the previous check wasn't aggressive enough in dispatching to cuDNN.

https://github.com/pytorch/pytorch/issues/155225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140
Approved by: https://github.com/ngimel, https://github.com/atalman
2025-08-12 18:07:41 +00:00
f341077ce4 Revert "[ROCm] Support large inputs for coalesceValuesKernel (#158281)"
This reverts commit a7abf57aabec0ce686092e2d66e53ba185dbc56b.

Reverted https://github.com/pytorch/pytorch/pull/158281 on behalf of https://github.com/clee2000 due to broke windows cuda build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16915172288/job/47927141460) [HUD commit link](a7abf57aab).  Not caught b/c PR didn't have ciflow/trunk ([comment](https://github.com/pytorch/pytorch/pull/158281#issuecomment-3180408766))
2025-08-12 17:57:57 +00:00
3cec82a7e9 Ensure outer aliasing on DTensor matches inner aliasing (#158954)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158954
Approved by: https://github.com/albanD, https://github.com/wconstab
2025-08-12 17:47:48 +00:00
ee9f8ba11d [ROCm] Use opportunistic fastatomics based on heuristics (#159430)
* Opportunistic fast atomics work better with small sizes, since there is a higher chance of lanes doing atomics on the same address

Co-author: @amd-hhashemi

Reproducer:
```
import time
import torch

x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)

for _ in range(20):  # warm-up
    x.index_add_(0, ind, src)
torch.cuda.synchronize()  # make sure warm-up kernels finish before timing

start_time = time.time()
for i in range(100):
    x.index_add_(0, ind, src)
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time) / 100
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")
```

Perf numbers:
```
Before:
Avg time for index_add_: 25652.16 us

After:
Avg time for index_add_: 2675.15 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159430
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-08-12 17:13:54 +00:00
1f4057c11a [inductor] remove no_x_dim (#159810)
no_x_dim is used to indicate that a reduction operates on a single row, and data loaded for the reduction is 1-dimensional.

no_x_dim was introduced in https://github.com/pytorch/pytorch/pull/102444 - in which there was bad perf in some reductions, and using 1D tensors fixed the perf issue.

However, it appears that this perf issue no longer exists in current Triton versions. https://github.com/pytorch/pytorch/pull/118822 checked this, and we can also check this on H100 benchmarks (linked below). And another motivation for removing this behavior is that it enables larger loads, which we observe is necessary for good performance on certain shapes on Blackwell.

H100 inference benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

H100 training benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

Overall, the benchmarks show minimal change in performance.

Differential Revision: [D79599286](https://our.internmc.facebook.com/intern/diff/D79599286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159810
Approved by: https://github.com/ngimel, https://github.com/eellison
2025-08-12 17:10:31 +00:00
94b91a8763 [redone][pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#160352)
Summary:
Writing torch.compile worker logs to dedicated_log_rank{RANK} if we're running on MAST.
ref: D79456310 (got reverted because of linter)

Testing:
Refer to Differential Revision: D79917440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160352
Approved by: https://github.com/masnesral
2025-08-12 16:49:08 +00:00
a7abf57aab [ROCm] Support large inputs for coalesceValuesKernel (#158281)
# Description

`.coalesce` cannot handle large inputs on ROCm due to the maximum grid size limit.

This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for the original `Y` on ROCm, to avoid this limitation.

Confirmed the new approach can handle large inputs. Correctness needs validation.
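Illustrative only (the per-dimension ceiling below is an assumed constant, not taken from the patch): folding a flat block count that would exceed one grid dimension's limit into two dimensions, with the flat index reconstructed inside the kernel.

```python
import math

MAX_GRID_DIM = 65535  # assumed per-dimension ceiling for illustration

def split_grid(total_blocks: int) -> tuple[int, int]:
    grid_x = min(total_blocks, MAX_GRID_DIM)
    grid_y = math.ceil(total_blocks / grid_x)
    return grid_x, grid_y

# Inside the kernel, the flat block index is then reconstructed as
#   flat = blockIdx.y * gridDim.x + blockIdx.x
print(split_grid(300_000))  # (65535, 5)
```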

# Testing Command

`python torch_spmv.py 22500000 272500000`

## Script `torch_spmv.py`

``` python
import torch
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch"
    )
    parser.add_argument("n", type=int, help="Size of the NxN matrix")
    parser.add_argument("nnz", type=int, help="Number of non-zero entries")
    return parser.parse_args()

def main():
    args = parse_args()
    n = args.n
    nnz = args.nnz
    dtype = torch.float32
    device = torch.device('cuda')

    # Generate random indices for the sparse matrix in COO format.
    torch.manual_seed(42)
    rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    indices = torch.stack([rows, cols], dim=0)

    # Generate random values.
    values = torch.randn(nnz, dtype=torch.float32, device=device)

    # Create the sparse COO matrix and move it to the target device.
    sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device)
    sparse_matrix = sparse_matrix.coalesce()

    # Generate a random dense vector.
    dense_vector = torch.randn(n, dtype=torch.float32, device=device)

    # Perform sparse matrix - dense vector multiplication.
    # Using torch.sparse.mm which expects a 2D tensor for the vector.
    result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze()
    # result = torch.mv(sparse_matrix, dense_vector)

    # Print the result.
    print("Result of the multiplication:")
    print(torch.sum(result))

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-08-12 16:42:55 +00:00
f7b2f3314c Revert "[triton_heuristics] Optimize the triton launcher in pt2 (#160000)"
This reverts commit d0e2240f680ea2a553f7ee8188f52482e130bfd0.

Reverted https://github.com/pytorch/pytorch/pull/160000 on behalf of https://github.com/davidberard98 due to D80054972 failing with test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1_tdlp_1 ([comment](https://github.com/pytorch/pytorch/pull/160000#issuecomment-3180144676))
2025-08-12 16:33:02 +00:00
9d37c960a4 [ROCm][CI] use new benchmark image for dynamo (#160421)
Follow-up to #160047 that separated the rocm image into default CI and benchmarks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160421
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-12 16:07:19 +00:00
b219ca2a00 Revert "Update triton xpu commit to support python 3.14 (#160183)"
This reverts commit 7fbc22855c17741ae016992803b2e147a13aa22d.

Reverted https://github.com/pytorch/pytorch/pull/160183 on behalf of https://github.com/clee2000 due to I'm not sure how, but it seems to have broken inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration [GH job link](https://github.com/pytorch/pytorch/actions/runs/16911267995/job/47917091939) [HUD commit link](7fbc22855c).  Maybe because the docker build changed?  Note to self: not bad TD ([comment](https://github.com/pytorch/pytorch/pull/160183#issuecomment-3179840160))
2025-08-12 15:29:19 +00:00
b7db86600a Fix Tensor illustration, use permalinks for image embedding in Readme.md (#160416)
Fixes the Tensor illustration being broken on pypi.org. Also uses permalinks instead of links to images for embedding, as per this suggestion from Alban: https://github.com/pytorch/pytorch/pull/160187#discussion_r2262978006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160416
Approved by: https://github.com/malfet
2025-08-12 15:15:12 +00:00
9708fcf92d Account for triton kernel source code hidden in custom ops properly in AOTAutogradCache (#160120)
This PR fixes a bug where user-defined triton kernels hidden behind `triton_op` do not register source code changes. If a user *only* changes a triton kernel's source code, dynamo hasn't traced into it yet because triton kernels are hidden under the custom op.

This means that at AOTAutograd time, we don't know the list of triton kernels defined by custom ops. This is an initial fix for the issue: it parses the AST of the custom op looking for triton kernels. This won't catch more degenerate cases where the custom op calls other custom ops/functions that then call triton kernels, in which case the top-level compiled graph still doesn't know about them. To handle that, we'd have to trace through the custom op at dynamo time.

This should handle 99% of cases, though. I added an expectedFailure test to show the limitation.
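For context, a minimal sketch of the pattern the cache now has to see through (kernel and op names here are made up): a Triton kernel whose source is only reachable through a `torch.library.triton_op` custom op.

```python
import torch
import triton
import triton.language as tl
from torch.library import triton_op, wrap_triton

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

@triton_op("mylib::add", mutates_args={})
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # Dynamo does not trace into this body, so the kernel source is "hidden"
    # behind the custom op from the cache's point of view.
    wrap_triton(add_kernel)[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
    return out
```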

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160120
Approved by: https://github.com/zou3519
2025-08-12 14:11:06 +00:00
a288b15ea9 [CI] Reduce XPU Windows build time (#159763)
Reduce the time cost from 2.5 hours to about 1.5 hours.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159763
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-12 14:04:29 +00:00
7fbc22855c Update triton xpu commit to support python 3.14 (#160183)
Follow PR #159725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160183
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-12 14:02:36 +00:00
f33ce40bc0 [bucketing] Bucket only adjacent collectives to prevent reordering (#159983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159983
Approved by: https://github.com/wconstab, https://github.com/eellison
2025-08-12 11:57:00 +00:00
4d5b3f2d5a [dynamo][guards] Install dict watchers for recursive dict tag optimization (#159796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159796
Approved by: https://github.com/jansel
2025-08-12 09:49:11 +00:00
f990490a23 Add label_smoothing param in nn.BCELoss and nn.BCEWithLogitsLoss (#150282)
Fixes #91545

## Changes

- Add `label_smoothing` param and docs
- Add test case for `label_smoothing`
- Remove duplicate description in `nn.BCELoss` and `nn.BCEWithLogitsLoss`

##  Test Result

```bash
pytest -s test/test_nn.py -k test_bce
```

![image](https://github.com/user-attachments/assets/30c0b7fe-fe49-4aa0-9b05-4d70403a7b05)

![image](https://github.com/user-attachments/assets/4fe3fd1c-54b8-4012-afd9-133ce9fb4964)

![image](https://github.com/user-attachments/assets/5cad019a-3a4c-475a-9fde-9c1acad5792d)
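For reference, a minimal usage sketch of the new parameter (assuming it mirrors the `label_smoothing` semantics of `CrossEntropyLoss`, i.e. targets are smoothed toward 0.5):

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 1)
targets = torch.randint(0, 2, (8, 1)).float()

criterion = nn.BCEWithLogitsLoss(label_smoothing=0.1)  # new keyword added in this PR
loss = criterion(logits, targets)
print(loss)
```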

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150282
Approved by: https://github.com/cyyever, https://github.com/mikaylagawarecki
2025-08-12 09:37:03 +00:00
b9003ed3d8 Dynamo Deep Dive Documentation Fix (#158860)
changed SourceBuilder to VariableBuilder

Fixes #158447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158860
Approved by: https://github.com/mlazos
2025-08-12 08:53:33 +00:00
fea7e9dd37 extract shape in _view_has_unbacked_input (#160255)
Summary: We were still getting DDEs (data-dependent errors) on reshape. I looked deeper and found an issue in `_view_has_unbacked_input`: namely, when the input is of the form `[[,,]]` it needs to be normalized to `[..]`.
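Illustrative only (helper name and exact structure assumed): the normalization amounts to unwrapping a singly nested shape argument before the unbacked-input check.

```python
def _normalize_view_shape(shape):
    # A view/reshape target like [[2, s0, 3]] is unwrapped to [2, s0, 3]
    # so the unbacked-symbol check walks the actual dimension entries.
    if len(shape) == 1 and isinstance(shape[0], (list, tuple)):
        return list(shape[0])
    return list(shape)
```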

Test Plan:
existing tests.

Rollback Plan:

Differential Revision: D79951119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160255
Approved by: https://github.com/bobrenjc93
2025-08-12 08:38:19 +00:00
9a0f7a3bb0 [retry-land][pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#160348)
refer: https://github.com/pytorch/pytorch/pull/159655

The earlier PR failed on dynamo/test_utils.py::TestDynamoTimed::test_dynamo_timed.
Updated test_dynamo_timed and re-ran it locally to test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160348
Approved by: https://github.com/masnesral
2025-08-12 06:24:54 +00:00
01bcf9a40d Bump transformers pin (#159291)
Trying to update the HF pin.

Benchmarking run to figure out issues

<img width="1356" height="123" alt="image" src="https://github.com/user-attachments/assets/fbc435f3-a7cb-4280-9636-2ea6d15d7b6d" />

Retrying - https://github.com/pytorch/pytorch/pull/156118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159291
Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-08-12 05:14:17 +00:00
8d3d1c8443 [dynamo] fixes to propagate tag safeness (#159807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159807
Approved by: https://github.com/jansel
2025-08-12 04:50:13 +00:00
0f3b10b8ee [audio hash update] update the pinned audio hash (#160384)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160384
Approved by: https://github.com/pytorchbot
2025-08-12 04:38:04 +00:00
5f1010fbb3 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to a 6.2% speedup on vision_maskrcnn and a 5.8% speedup on yolov3 [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), a 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), and an 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

We ran the same diff on two different days, and both runs show a speedup on average.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-12 04:37:58 +00:00
edaa151d0d [CI] Move CUDA tests to trunk workflow (#160379)
Which is getting run before a PR is merged anyway, but about 3X less frequently than the pull workflow, according to [Flambeau](https://pytorchci.grafana.net/public-dashboards/1c571e79090443eaaa9811db71f8d23b)
<img width="796" height="573" alt="image" src="https://github.com/user-attachments/assets/0235e610-4e1c-4be5-88bf-ea8278d1c656" />

I.e., this will probably result in somewhat longer time to signal, but considering that the frequency of changes to eager PyTorch-on-CUDA has slowed down and Inductor changes are decorated with ciflow/inductor, this looks like an acceptable tradeoff to reduce costs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160379
Approved by: https://github.com/izaitsevfb
2025-08-12 04:23:50 +00:00
10bc36fe84 Get tensor subclasses and torch.library.triton_op to dispatch correctly (#160341)
Short-term fix for https://github.com/pytorch/pytorch/issues/160333

The problem is:
1) `triton_op` adds a decomposition for FunctionalTensorMode for this operation
2) Tensor Subclasses rely on FunctionalTensorMode's `__torch_dispatch__` returning NotImplemented.
3) `triton_op`'s FunctionalTensorMode decomposition takes precedence over FunctionalTensorMode's decomposition.

The easy fix is to copy-paste the FunctionalTensorMode's NotImplemented
return logic into the decomposition.
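A rough sketch of the idea (the decomposition's real signature and the exact subclass check are not shown in this description, so the names here are assumptions):

```python
import torch

def functional_decomp(op, types, args, kwargs):
    # If any argument type is a tensor subclass, step aside exactly like
    # FunctionalTensorMode.__torch_dispatch__ does, so the subclass's own
    # __torch_dispatch__ gets a chance to run first.
    if any(t is not torch.Tensor and issubclass(t, torch.Tensor) for t in types):
        return NotImplemented
    ...  # otherwise, continue with the normal functionalization path
```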

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160341
Approved by: https://github.com/drisspg
2025-08-12 04:09:37 +00:00
32e5e2f596 [vllm hash update] update the pinned vllm hash (#160259)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160259
Approved by: https://github.com/pytorchbot
2025-08-12 04:04:53 +00:00
bfc873d02e [ROCm][Windows] Revert copying hipblaslt and rocblas dirs. (#159083)
This reverts the changes from b367e5f6a6. This will also close https://github.com/pytorch/pytorch/pull/158922.

Since 30387ab2e4, ROCm is bootstrapped using the 'rocm' Python module which contains these files (see https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md), so they do not need to be bundled into torch/lib.

There was also a bug in here - if `ROCM_DIR` is unset, the code crashes:
```
  File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1002, in run_command
    cmd_obj.run()
  File "D:\b\pytorch_main\setup.py", line 853, in run
    rocm_dir_path = Path(os.environ["ROCM_DIR"])
                         ~~~~~~~~~~^^^^^^^^^^^^
  File "<frozen os>", line 714, in __getitem__
KeyError: 'ROCM_DIR'
```
The code could have checked for `ROCM_PATH` too.
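A minimal sketch of the more defensive lookup that last sentence suggests (not the actual setup.py change):

```python
import os
from pathlib import Path

# Fall back to ROCM_PATH and fail with an actionable message instead of a
# raw KeyError when neither variable is set.
rocm_dir = os.environ.get("ROCM_DIR") or os.environ.get("ROCM_PATH")
if rocm_dir is None:
    raise RuntimeError("Set ROCM_DIR or ROCM_PATH to your ROCm installation directory")
rocm_dir_path = Path(rocm_dir)
```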

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159083
Approved by: https://github.com/jeffdaily
2025-08-12 02:45:49 +00:00
eed9dbf70f [ROCm] Add torch/_rocm_init.py to .gitignore. (#159806)
Follow-up to https://github.com/pytorch/pytorch/pull/155285.

Build scripts like https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py generate this file with contents like:

```python
def initialize():
    import rocm_sdk
    rocm_sdk.initialize_process(
        preload_shortnames=['amd_comgr', 'amdhip64', 'hiprtc', 'hipblas', 'hipfft', 'hiprand', 'hipsparse', 'hipsolver', 'hipblaslt', 'miopen'],
        check_version='7.0.0rc20250804')
```

We may also have https://github.com/pytorch/pytorch/blob/main/tools/amd_build/build_amd.py do the same thing as more of that build support moves here into the upstream PyTorch repository itself (see https://github.com/pytorch/pytorch/issues/159520).

This file is then loaded if present here: a7f3bdf550/torch/__init__.py (L145-L157)

Given that the file is generated by build scripts, I think adding it to `.gitignore` makes sense, as that will prevent accidental check-ins and keep local history cleaner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159806
Approved by: https://github.com/jeffdaily
2025-08-12 02:24:21 +00:00
be53f609aa fix retaining multimem in symmetric memory (#160343)
fixes OOM in #160289

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160343
Approved by: https://github.com/eqy
2025-08-12 02:03:20 +00:00