Commit Graph

92886 Commits

8be8b94793 Update SECURITY.md with reporting guidelines (#162608)
Added clarification that all reports will be disclosed within 90 days

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162608
Approved by: https://github.com/seemethere, https://github.com/albanD
2025-09-11 16:30:29 +00:00
fe8cc619b8 [torch][c10d] fix split_group in mixed backend case (#162424) (author: suo)
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.

However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.

This leads to an assertion failure in some user code, where ProcessGroupGloo::split expects gloo options but receives nccl options instead.

Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.

As a quick fix, since user-provided options really only exist for NCCL, just warn and fall back to default options for Gloo if non-Gloo options are detected.
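
For illustration, a rough sketch of the setup that hits the assert (hedged: the exact `split_group` signature and option plumbing are simplified here):

```python
import torch.distributed as dist

# Mixed-backend PG with a single, NCCL-specific options object (simplified sketch).
nccl_opts = dist.ProcessGroupNCCL.Options()
dist.init_process_group(backend="cpu:gloo,cuda:nccl", pg_options=nccl_opts)

# Before this fix, split_group propagated the parent's NCCL options to every backend,
# so ProcessGroupGloo::split asserted on receiving non-gloo options.
subgroup = dist.split_group(split_ranks=[[0, 1]])
```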

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
2025-09-11 16:29:32 +00:00
2f5a24c2a2 Smoke tests don't run nvshmem on Windows (#162646)
nvshmem is only available for Linux x86_64 and aarch64:
https://pypi.org/project/nvidia-nvshmem-cu13/#files

The requirement marker restricts it to Linux:
```
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | "
```
https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L57
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162646
Approved by: https://github.com/kwen2501
2025-09-11 16:09:20 +00:00
24492cbab2 [BE] Cleanup stale comments/copy from gemm (#162001)
Followup after https://github.com/pytorch/pytorch/pull/154012

Since the introduction of `gemm_no_downcast_stub`, it's no longer necessary to allocate a temporary array and then manually implement the `beta` logic in the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162001
Approved by: https://github.com/drisspg
ghstack dependencies: #161999
2025-09-11 15:48:43 +00:00
3f6d88f04c paths to exclude shape guards (#162684)
Summary: Easier to land than https://www.internalfb.com/diff/D82030581

Test Plan:
everything blamed by https://www.internalfb.com/diff/D80713603 (except some old exir tests)

Rollback Plan:

Differential Revision: D82180349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162684
Approved by: https://github.com/tugsbayasgalan
2025-09-11 15:34:06 +00:00
94db2ad51d Revert "Move prioritized text linker optimization code from setup.py to cmake (#160078)"
This reverts commit 26b3ae58908becbb03b28636f7384d2972a8c9a5.

Reverted https://github.com/pytorch/pytorch/pull/160078 on behalf of https://github.com/atalman due to Sorry reverting this broke linux aarch64 CUDA nightlies [pytorch/pytorch/actions/runs/17637486681/job/50146967503](https://github.com/pytorch/pytorch/actions/runs/17637486681/job/50146967503) ([comment](https://github.com/pytorch/pytorch/pull/160078#issuecomment-3281426631))
2025-09-11 15:29:29 +00:00
9f783e172d Revert "Build and Install Arm Compute Library in manylinux docker image (#159737)"
This reverts commit 582d278983b28a91ac0cedd035183f2495bb6887.

Reverted https://github.com/pytorch/pytorch/pull/159737 on behalf of https://github.com/atalman due to Sorry reverting this broke linux aarch64 CUDA nightlies [pytorch/pytorch/actions/runs/17637486681/job/50146967503](https://github.com/pytorch/pytorch/actions/runs/17637486681/job/50146967503) ([comment](https://github.com/pytorch/pytorch/pull/159737#issuecomment-3281398272))
2025-09-11 15:25:24 +00:00
a8432bcaad [dynamo][guards] Fail on an unknown framelocals to dict conversion (#162695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162695
Approved by: https://github.com/williamwen42
ghstack dependencies: #162694
2025-09-11 15:01:00 +00:00
a3a40cb741 [dynamo][guards] Do not construct framelocals to dict on GlobalsGuardAccessor (#162694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162694
Approved by: https://github.com/williamwen42
2025-09-11 15:01:00 +00:00
c924c675d0 Fix persistent buffer bug (#162190)
For non-persistent buffers, we should properly register them.
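
For context, a minimal sketch of the two registration modes involved (standard `nn.Module` API, not the export-specific code this PR touches):

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        # persistent buffer: included in state_dict
        self.register_buffer("running_mean", torch.zeros(4))
        # non-persistent buffer: moves with the module but is excluded from state_dict
        self.register_buffer("mask", torch.ones(4), persistent=False)

m = M()
assert "running_mean" in m.state_dict() and "mask" not in m.state_dict()
```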

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162190
Approved by: https://github.com/zhxchen17
2025-09-11 14:56:26 +00:00
c3f30eca9e Remove tests-to-include from rocm-mi300 workflow (#162721)
Accidentally introduced by https://github.com/pytorch/pytorch/pull/162288 (was meant to be a temporary change)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162721
Approved by: https://github.com/jeffdaily
2025-09-11 14:36:07 +00:00
1e710552c1 [ROCm][CI] benchmark must patch fbgemm_gpu with tbb dep (#162649)
fbgemm adds tbb as a dependency only for ROCm, to avoid missing tbb symbols at import. However, this was done in setup.py by adding the linker flag to CMAKE_CXX_FLAGS, and it wasn't working for reasons unknown to me. What did work was adding tbb as a dependency in the cmake file. [We have a PR against upstream fbgemm](https://github.com/pytorch/FBGEMM/pull/4859) for that. Meanwhile, a much smaller patch is applied here in this PR until the fbgemm ROCm CI commit hash is moved forward to include the tbb patch from upstream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162649
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-11 14:10:51 +00:00
7c39b2ecbe use torch.accelerator and device_module instead of cuda to make DataParallel more device agnostic. (#162573)
use torch.accelerator and `_get_device_module` instead of cuda to make DataParallel more device agnostic.

Fixes #162152

Recently, I've done some work to support my own privateuse1 backend in the DataParallel module, but I found that some CUDA-related APIs exist in parallel_apply.py, which forced me to monkey-patch DataParallel to support DP on my own backend.

So I made some small changes to replace cuda.xxx with accelerator.xxx and to acquire the device module via `_get_device_module`.
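
A minimal sketch of the pattern, assuming `torch.accelerator` and `torch.get_device_module` are available (names here are illustrative, not the exact parallel_apply.py change):

```python
import torch

# Instead of hardcoding torch.cuda, discover the active accelerator and its module
# (assumes an accelerator backend is present).
acc = torch.accelerator.current_accelerator()   # cuda, xpu, or a privateuse1 backend
device_module = torch.get_device_module(acc)    # e.g. torch.cuda or torch.xpu
stream = device_module.current_stream()         # replaces torch.cuda.current_stream()
```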

This is my first time contributing to PyTorch; please let me know if there is any problem with the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162573
Approved by: https://github.com/ezyang, https://github.com/guangyey

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
2025-09-11 10:04:27 +00:00
afdd4247a2 [torchao][pt2e] Make prepare and convert faster by caching (#162550)
Summary: D79674759 tried to fix the expensive prepare and convert steps, since `assert_and_get_unique_device` was called multiple times. This change fixes that issue by using the `functools.cache` decorator.
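
A hedged sketch of the approach (the real helper lives in the pt2e quantization utils; this is just the shape of the change):

```python
import functools
import torch.nn as nn

@functools.cache  # keyed on module identity, so repeated prepare/convert calls reuse the result
def assert_and_get_unique_device(module: nn.Module):
    devices = {p.device for p in module.parameters()} | {b.device for b in module.buffers()}
    assert len(devices) <= 1, f"expected at most one device, got {devices}"
    return next(iter(devices)) if devices else None
```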

Test Plan:
Verified on llm export to QNN.
LLM Quantization prepare time of ~20min reduced to ~3min.

Rollback Plan:

Differential Revision: D82073679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162550
Approved by: https://github.com/andrewor14
2025-09-11 07:59:22 +00:00
22df9332da [serialization] Add pte file to archive (#162520)
Summary:
Add _package_executorch_files to the archive APIs. This allows us to package a PTE file into the archive.

I don't think there's a use-case to have more than one PTE file at the moment, but left it as `EXECUTORCH_FILES` just in case.

Test Plan:
Tested in D81992612

Rollback Plan:

Differential Revision: D81977483

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162520
Approved by: https://github.com/angelayi
2025-09-11 07:59:11 +00:00
6b9b7ce6fe fix torch.sparse.log_softmax on CPU (#161959)
Fix https://github.com/pytorch/pytorch/issues/152293.

**Example:**
```
import torch
from torch.sparse import log_softmax as sparse_log_softmax

def test_bug():
    a = torch.rand(4, 3)
    b = a - 10000000.0
    b_sparse = b.to_sparse()

    cpu_out_sparse = sparse_log_softmax(b_sparse, dim=1).to_dense()
    print('cpu_out_sparse =', cpu_out_sparse)

    b_sparse_double = b.double().to_sparse()
    cpu_out_sparse_double = sparse_log_softmax(b_sparse_double, dim=1).to_dense()
    print('cpu_out_sparse_double =', cpu_out_sparse_double)

if __name__ == '__main__':
    test_bug()
```

**Output:**

- before
```
cpu_out_sparse = tensor([[-2., -1., -2.],
        [-1., -1., -1.],
        [-1., -2., -2.],
        [-1., -1., -2.]])
cpu_out_sparse_double = tensor([[-1.5514, -0.5514, -1.5514],
        [-1.0986, -1.0986, -1.0986],
        [-0.5514, -1.5514, -1.5514],
        [-0.8620, -0.8620, -1.8620]], dtype=torch.float64)
```

- after
```
cpu_out_sparse = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]])
cpu_out_sparse_double = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]], dtype=torch.float64)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161959
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/mingfeima
2025-09-11 07:52:05 +00:00
1274297e06 Remove __torch_dispatch__ check in THPVariable_make_dtensor (#162337)
We control DTensor, so we can just guarantee there isn't a programming error with __torch_dispatch__. (The guard is already less-than-perfect; see the note that the deleted comment refers to.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162337
Approved by: https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219, #162220, #162218, #161596
2025-09-11 06:58:35 +00:00
f68f76d8c7 Remove logger.debug statements in DTensor dispatch (#161596)
These seem to have been costing us 5-10 usec per detach (out of ~95 usec total). If they need to ship, let's talk about requirements and how we can make this more efficient, given that we would prefer an entire DTensor op to finish in 10 usec.

Differential Revision: [D81530106](https://our.internmc.facebook.com/intern/diff/D81530106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161596
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219, #162220, #162218
2025-09-11 06:58:35 +00:00
fa1d409e83 [2/N]Port several test files under test/distributed to Intel GPU (#159473)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR works on some test files under test/distributed. We enable Intel GPU with the following methods, trying our best to keep the original code style (see the sketch after this list):

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl tests
- enable XPU for some test paths
- change the hardcoded world_size according to device_count
- unify some common code under torch/testing/_internal for multiple backends, for example:
  added xpu to Backend.backend_capability and dist.Backend.register_backend()
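
A minimal sketch of the accelerator-agnostic setup, assuming `torch.accelerator` is available (the backend-name mapping is illustrative):

```python
import torch

acc = torch.accelerator.current_accelerator()
device_type = acc.type if acc is not None else "cpu"
# Pick the dist backend from the accelerator instead of hardcoding nccl (illustrative mapping).
backend = {"cuda": "nccl", "xpu": "xccl"}.get(device_type, "gloo")
# Derive world_size from the number of visible devices rather than a hardcoded constant.
world_size = max(torch.accelerator.device_count(), 1)
```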

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-11 06:44:26 +00:00
52d4660ae9 [AOTI] Fix Windows fail to zip opened file. (#162617)
Original issue:
<img width="1767" height="544" alt="Image" src="https://github.com/user-attachments/assets/9de90d50-217f-4049-8f19-77ff1660c8b0" />

reproducer:
```cmd
pytest test\inductor\test_aot_inductor.py -v -k test_weight_on_disk_legacy_cpu
```

Fixed list:
1. `WritableTempFile`'s `__exit__` function automatically unlinks the opened file; on Windows this raises an error while the file is still open, so ignore that error there.
2. Opening the zip file fails if the file is already open. Switch to `_wfsopen` with a shared-access flag, which can open the file with shared access.

Local test passed:
<img width="1101" height="233" alt="image" src="https://github.com/user-attachments/assets/935cbf2e-52db-41f1-80fa-617569b92a96" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162617
Approved by: https://github.com/jansel
2025-09-11 06:22:21 +00:00
7345454e2e compile_kernel: Handle python floats as c double (#162626)
This was an open todo in the code and probably a footgun in waiting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162626
Approved by: https://github.com/malfet
2025-09-11 06:03:25 +00:00
23170dfebc Revert "Move inductor jobs 3.9->3.10 (#162323)"
This reverts commit 0663bdb12383b9717af49d58aed9d88de0dd0ecc.

Reverted https://github.com/pytorch/pytorch/pull/162323 on behalf of https://github.com/huydhn due to Not sure what had happened, but some inductor unit tests start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/162323#issuecomment-3278125192))
2025-09-11 05:57:13 +00:00
12e993f533 compile_kernel large shared memory fix (#162647)
Alternate solution to https://github.com/pytorch/pytorch/pull/162328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162647
Approved by: https://github.com/eqy
2025-09-11 05:52:46 +00:00
07d2531672 [vllm hash update] update the pinned vllm hash (#162551)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162551
Approved by: https://github.com/pytorchbot
2025-09-11 04:56:04 +00:00
6944d4b639 [ROCm] rocblas Aten GEMM overload for FP32 output from FP16/BF16 inputs (#162600)
Fix ROCm GEMM helper to set output type (C/D) based on C_Dtype template parameter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162600
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-09-11 03:34:07 +00:00
f654cff566 [inductor] Add shape to load_input in matmul templates (#162513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162513
Approved by: https://github.com/eellison
ghstack dependencies: #162426
2025-09-11 01:51:15 +00:00
f17c5e0789 [inductor] Add shape for store_output in matmul templates (#162426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162426
Approved by: https://github.com/eellison
2025-09-11 01:51:15 +00:00
435c18fb4a [DTensor] add op support for aten.unbind.int (#162560)
As titled.

It seems unbind returns views of the original tensor. E.g. see https://stackoverflow.com/questions/78910951/does-unbind-return-the-views-of-tensors-in-pytorch

So we error out when `shard_dim == unbind_dim`. This is similar to why we error out in view ops.
https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_view_ops.py#L544-L546
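
A minimal sketch of the resulting behavior after this PR (assumes a 2-GPU mesh; run under torchrun or similar):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (2,))
dt = distribute_tensor(torch.randn(4, 8), mesh, placements=[Shard(0)])

pieces = torch.unbind(dt, dim=1)   # ok: unbind dim differs from the shard dim
# torch.unbind(dt, dim=0)          # errors: unbind would return views along the sharded dim
```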

This PR also refactors some other tensor-ops code by creating two utility functions, `shift_shard_dims_after_insert` and `shift_shard_dims_after_remove`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162560
Approved by: https://github.com/zpcore
2025-09-11 00:58:23 +00:00
612cdc8f48 -ldl for nativert tests (#162643)
Fixes #162640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162643
Approved by: https://github.com/yiming0416, https://github.com/robert-hardwick
2025-09-11 00:35:57 +00:00
da5069f289 Don't include cuh header when USE_NVSHMEM is off (#162635)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162635
Approved by: https://github.com/kwen2501
2025-09-11 00:24:50 +00:00
4fd2a2b273 Add cuda headers automatically for compile_kernel (#162634)
The issue was pointed out before by @ngimel and more recently by @gau-nernst in https://gau-nernst.github.io/nvrtc-matmul/#missing-cuda-and-c-headers-

The benefit is that now we can add

`#include <cuda_fp16.h>` without crapping out
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162634
Approved by: https://github.com/ngimel
2025-09-11 00:20:33 +00:00
bb1d53bc47 [CD] CUDA 13 specific followup changes (#162455)
Follow-up for the CUDA 13 bring-up https://github.com/pytorch/pytorch/issues/159779:
- sm50-70 should not be added to the sbsa build arch list, as those older archs had no ARM support.
- Remove platform_machine from PYTORCH_EXTRA_INSTALL_REQUIREMENTS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162455
Approved by: https://github.com/atalman
2025-09-11 00:03:47 +00:00
36338fc7f2 Relax fences for intrusive ptr's refcnt (#162072)
Summary: Relax fences for intrusive ptr's refcnt dec op for performance testing.

lock needs an acquire fence when the op succeeds and a relaxed one when it does not. In addition, the expire call and the following refcnt reads were merged to remove one extra read.

incref does not need any fences because the caller should already have a valid reference. use_count follows the same reasoning.

decref only needs a release fence to make sure every write op prior to it has finished. When the refcnt goes to zero, there should be an acquire fence to make sure no read op reads stale data before the object is destructed. However, a microbenchmark showed that the optimal fences for decref do not perform noticeably better than the current decref with acq-rel, so we keep decref as-is.

This change should have no material impact on x86, but for Arm64 (and other CPUs with weak memory models), it should boost performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162072
Approved by: https://github.com/swolchok, https://github.com/yfeldblum
2025-09-10 23:17:01 +00:00
e0c910149c Build fbgemm_gpu for TORCH_CUDA_ARCH_LIST=10.0 and CUDA 12.8 and 12.9 (#162544)
## Summary
- PyTorch is not built for the `*a` variants of SM architectures, due to non-portability. However, we need fbgemm_gpu kernels built for sm100a (see #162209)

## Changes
- **Setting USE_FBGEMM_GENAI for CUDA builds**: fbgemm_gpu builds for sm100a if using CUDA 12.8 or 12.9 ([source](2033a0a08f/.github/scripts/nova_dir.bash (L29-L32))), so I follow the same rule here.
- **Extra nvcc flags**: if USE_FBGEMM_GENAI and USE_CUDA are set, we add extra nvcc flags for sm100a

## Test plan

Test build:
```
echo $CUDA_HOME
/usr/local/cuda-12.9

export TORCH_CUDA_ARCH_LIST=10.0
python -m pip install --no-build-isolation -v -e .
```

Check build logs:
```
  CMake Warning at CMakeLists.txt:901 (message):
    Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a
```

Run unit tests:
- `pytest test/test_matmul_cuda.py  -k test_mxfp8_scaled_grouped_mm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162544
Approved by: https://github.com/drisspg
2025-09-10 22:59:41 +00:00
f4aeceaa9d Use upper bound for persistent rblock (#162441)
Previously, we were starting at 128 and increasing to the upper bound. We should instead set it at the upper bound and round up to the next power of 2.
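
A minimal sketch of the sizing rule (hypothetical helper, not Inductor's actual code):

```python
def persistent_rblock(numel_upper_bound: int) -> int:
    # Set at the upper bound, then round up to the next power of 2.
    return 1 << max(numel_upper_bound - 1, 0).bit_length()

assert persistent_rblock(100) == 128
assert persistent_rblock(129) == 256
```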

Differential Revision: [D81984103](https://our.internmc.facebook.com/intern/diff/D81984103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162441
Approved by: https://github.com/PaulZhang12
2025-09-10 22:29:02 +00:00
d8e6b2fddc [Cutlass] Add exp and sigmoid activations (#162536)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162536
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #162535
2025-09-10 21:44:26 +00:00
31c25c7d01 [Cutlass] Add tanh activation and test case for activations (#162535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162535
Approved by: https://github.com/henrylhtsang
2025-09-10 21:44:26 +00:00
5dbee5691c [cuDNN][Convolution][TF32][64bit] Add tf32_on_and_off decorator to conv3d 64bit test (#161004) (author: eqy)
cuDNN has new generated kernels that can use TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161004
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2025-09-10 21:39:35 +00:00
864ffe12d7 Fix some edge cases (#162295)
```
Summary
🔝 Top 5 Performance Differences (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 56.937931         ┆ 58.960459            ┆ 1.035522                  ┆ 3.552163  │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 128) ┆ 89.221306         ┆ 86.295642            ┆ 0.967209                  ┆ -3.27911  │
│ causal         ┆ torch.bfloat16 ┆ (2, 16, 4096, 4, 4096, 128) ┆ 111.552594        ┆ 114.380841           ┆ 1.025353                  ┆ 2.535349  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, 1024, 64) ┆ 74.830149         ┆ 76.685445            ┆ 1.024793                  ┆ 2.479344  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 55.279932         ┆ 56.369312            ┆ 1.019707                  ┆ 1.97066   │
└────────────────┴────────────────┴─────────────────────────────┴───────────────────┴──────────────────────┴───────────────────────────┴───────────┘

🔺 Top 5 Cases Where no_peel (change) is Faster than base (baseline):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 56.937931         ┆ 58.960459            ┆ 1.035522                  ┆ 3.552163  │
│ causal         ┆ torch.bfloat16 ┆ (2, 16, 4096, 4, 4096, 128) ┆ 111.552594        ┆ 114.380841           ┆ 1.025353                  ┆ 2.535349  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, 1024, 64) ┆ 74.830149         ┆ 76.685445            ┆ 1.024793                  ┆ 2.479344  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 55.279932         ┆ 56.369312            ┆ 1.019707                  ┆ 1.97066   │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 4096, 4, 4096, 64)  ┆ 111.08814         ┆ 112.447047           ┆ 1.012233                  ┆ 1.22327   │
└────────────────┴────────────────┴─────────────────────────────┴───────────────────┴──────────────────────┴───────────────────────────┴───────────┘

🔻 Top 5 Cases Where no_peel (change) is Slower than base (baseline):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 128) ┆ 89.221306         ┆ 86.295642            ┆ 0.967209                  ┆ -3.27911  │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 4, 1024, 64)  ┆ 78.23082          ┆ 76.693169            ┆ 0.980345                  ┆ -1.965531 │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 4, 2048, 128) ┆ 96.95663          ┆ 95.573333            ┆ 0.985733                  ┆ -1.426717 │
│ alibi          ┆ torch.bfloat16 ┆ (4, 16, 2048, 4, 2048, 64)  ┆ 93.373473         ┆ 92.294147            ┆ 0.988441                  ┆ -1.155924 │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 2048, 4, 2048, 128) ┆ 96.95147          ┆ 96.105389            ┆ 0.991273                  ┆ -0.872685 │
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162295
Approved by: https://github.com/mlazos, https://github.com/v0i0
2025-09-10 21:33:45 +00:00
4e35594674 [Lowering] Fix the edge case of empty subgraph split due to dataclass node (#161716)
Summary: Fix the edge case by allowing `call_function` nodes with no deps as graph entry (starter_nodes) in the splitter.

Test Plan:
The test shall pass in the current diff (after fix), and fail in the parent diff (before fix)

```
buck test mode/opt //glow/fb/fx/lowering:split_tests -- test_dataclass_as_graph_entry
```

Rollback Plan:

Differential Revision: D81232435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161716
Approved by: https://github.com/ezyang
2025-09-10 21:23:42 +00:00
35d7b32159 Improve device info with new flops and bandwidth formula based on hardware libraries (#162245)
Previously, DeviceInfo provided theoretical hardware information based on a hardcoded list manually created from various datasheets.

This update:
- Attempts to gather the information from a hardware library like `pynvml`, improving accuracy and expanding support to devices that don't have entries in the datasheet list.
- Adjusts the flops and bandwidth calculation based on these hardware values. For example, if the memory or SMs are underclocked, it adjusts the theoretical max flops/bw accordingly.
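
A hedged sketch of the idea using `pynvml` (the formula and constants are illustrative, not the exact DeviceInfo logic):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Live memory clock and bus width let us derive an adjusted peak bandwidth.
mem_clock_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)
bus_width_bits = pynvml.nvmlDeviceGetMemoryBusWidth(handle)

# Two transfers per memory clock (illustrative for GDDR-style memory);
# an underclocked memory bus automatically lowers the estimate.
peak_bw_gb_s = 2 * mem_clock_mhz * 1e6 * (bus_width_bits / 8) / 1e9
print(f"adjusted peak bandwidth ~ {peak_bw_gb_s:.0f} GB/s")

pynvml.nvmlShutdown()
```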

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162245
Approved by: https://github.com/v0i0, https://github.com/shunting314
2025-09-10 21:19:13 +00:00
0663bdb123 Move inductor jobs 3.9->3.10 (#162323)
Related to: https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162323
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-09-10 20:58:41 +00:00
40ea6e418a Revert "Fix decorators skipping NCCL tests (#158846)"
This reverts commit c2388201fc85b0748173212de5a17514c7a71f21.

Reverted https://github.com/pytorch/pytorch/pull/158846 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some inductor tests ([comment](https://github.com/pytorch/pytorch/pull/158846#issuecomment-3276471387))
2025-09-10 20:51:31 +00:00
348303ebd2 [ez] add docstring/typing for codegen_kernel_benchmark (#162609)
```
lintrunner init && lintrunner -m origin/main
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162609
Approved by: https://github.com/coconutruben
ghstack dependencies: #162442
2025-09-10 20:49:38 +00:00
94755e81c4 [inductor] Enable combo kernels with unbacked inputs (#162442)
An internal user tried enabling combo kernels but ran into "Cannot convert symbols to int". This PR enables combo kernels on inputs with data-dependent shapes.

### Example exception

```
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 4997, in benchmark_combo_kernel
    kernel_code_list = self.generate_combo_kernel_code(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/simd.py", line 1849, in generate_combo_kernel_code
    src_code = kernel.codegen_kernel()
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 802, in codegen_kernel
    code.splice(self.codegen_kernel_benchmark(num_gb=0))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 852, in codegen_kernel_benchmark
    var_names.extend(self.kernel_benchmark_extra_args())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 733, in kernel_benchmark_extra_args
    extra_args.append(str(V.graph.sizevars.size_hint(tree.numel)))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 584, in size_hint
    return int(out)
           ^^^^^^^^
  File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/site-packages/sympy/core/expr.py", line 307, in __int__
    raise TypeError("Cannot convert symbols to int")
torch._inductor.exc.InductorError: TypeError: Cannot convert symbols to int
```
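
A self-contained sketch of the failure mode and the direction of the fix (the fallback value is illustrative):

```python
import sympy

u0 = sympy.Symbol("u0", integer=True, positive=True)  # an unbacked size symbol
numel = u0 * 128

try:
    hint = int(numel)      # what size_hint effectively did -> TypeError
except TypeError:
    hint = 8192            # benchmark with a concrete fallback hint instead
```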

Differential Revision: [D82042230](https://our.internmc.facebook.com/intern/diff/D82042230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162442
Approved by: https://github.com/jansel
2025-09-10 20:49:38 +00:00
6d65737aee testing infra and some fixes (#162183)
This PR is quite large in that it covers most of the rough edges in the new strict export flow:

1. Handle nn_module_stack correctly now that we are tracing the wrapper module
2. module_call_spec needs to be queried from the source directly because we are not running the bytecode anymore.
3. Correct input and output handling.

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
2025-09-10 20:48:12 +00:00
053251b98d Revert "Make functorch notebook symlinks PEP 517 valid (#157813)"
This reverts commit b494547f0bd6cb1ce5d8d104cb419802434c9c08.

Reverted https://github.com/pytorch/pytorch/pull/157813 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but this surfaces a weird discrepancy between GitHub and Mecurial used internally ([comment](https://github.com/pytorch/pytorch/pull/157813#issuecomment-3276442242))
2025-09-10 20:45:48 +00:00
7e2e83cdbe [ONNX] Update export docstring (#162622)
Update export docstring to reflect the latest configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162622
Approved by: https://github.com/titaiwangms
2025-09-10 20:29:46 +00:00
d033d11d26 Revert "[torch][c10d] fix split_group in mixed backend case (#162424)"
This reverts commit 2dc26131801a430e030a773c4fbfe874e263259d.

Reverted https://github.com/pytorch/pytorch/pull/162424 on behalf of https://github.com/clee2000 due to failure seems related, maybe a hang/timeout distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_model_diff_shape_across_ranks log classifier is pointing at the wrong line ([comment](https://github.com/pytorch/pytorch/pull/162424#issuecomment-3276360494))
2025-09-10 20:13:44 +00:00
80d4da893c Revert "Put torchao (0.13.0) back to benchmark workflow (#162227)"
This reverts commit 00985970e312c3c5e674e8e14d39fe77c226600e.

Reverted https://github.com/pytorch/pytorch/pull/162227 on behalf of https://github.com/huydhn due to Crashing some inductor jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/162227#issuecomment-3276355034))
2025-09-10 20:11:37 +00:00