2193 Commits

Author SHA1 Message Date
c73f5080de Migrating some more callsites (#163580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163580
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #165582
2025-10-19 15:52:17 +00:00
3255e7872b Enable all flake8-logging-format rules (#164655)
These rules are enabled by removing existing suppressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655
Approved by: https://github.com/janeyx99, https://github.com/mlazos
2025-10-19 00:59:28 +00:00
fdab48a7c1 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules on ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```
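
For illustration, a minimal sketch of the patterns these two rules flag. The `Color` enum and the `range` calls are hypothetical examples, not code touched by this PR.

```python
from enum import Enum


class Color(Enum):
    RED = 1
    GREEN = 2
    # PIE796 would flag this duplicate value: Python silently turns
    # the second name into an alias of the first member.
    CRIMSON = 1


# Aliases are skipped during iteration, so only two members appear.
members = [member.name for member in Color]

# PIE808 flags the redundant start argument: range(0, 3) equals range(3).
with_start = list(range(0, 3))
without_start = list(range(3))
```

The duplicate value is usually an accident, which is why the alias behavior is worth surfacing at lint time.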

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 07:36:18 +00:00
24520b8386 Revert "Enable all PIE rules on ruff (#165814)"
This reverts commit c79dfdc6550e872783aa5cb5fc9e86589bf18872.

Reverted https://github.com/pytorch/pytorch/pull/165814 on behalf of https://github.com/cyyever due to Need to cover more files ([comment](https://github.com/pytorch/pytorch/pull/165814#issuecomment-3417931863))
2025-10-18 07:21:08 +00:00
c79dfdc655 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules on ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 06:40:12 +00:00
e595136187 Enable PLC1802 on ruff (#165813)
This PR enables ruff check `PLC1802`, which detects len calls on sequences in a boolean test context.
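
As a rough sketch of what the check targets (the two helper functions are hypothetical, not taken from the PyTorch code base):

```python
def describe_with_len(items: list) -> str:
    # Pattern PLC1802 flags: len() used only as a truth value.
    if len(items):
        return "non-empty"
    return "empty"


def describe(items: list) -> str:
    # Preferred form: an empty sequence is already falsy.
    if items:
        return "non-empty"
    return "empty"
```

Both functions behave the same for built-in sequences; the second simply drops the redundant call.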

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165813
Approved by: https://github.com/ezyang
2025-10-18 05:44:14 +00:00
bfcdbd0a97 fix wrong accuracy_status when exception. (#165731)
While debugging an `XPU` accuracy issue, I found that the script reported the wrong accuracy_status.
When the `try` block raises an exception, we should handle the exception rather than return `fail_accuracy`.
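
The intended behavior can be sketched as follows. This is only an illustration under assumed names: `check_accuracy`, `run_model`, and the status strings other than `fail_accuracy` are hypothetical and do not mirror the benchmark script's actual code.

```python
def check_accuracy(run_model, expected):
    """Return a status string for one benchmark run (hypothetical helper)."""
    try:
        result = run_model()
    except Exception as exc:
        # Report the exception itself instead of masking it as an
        # accuracy failure.
        return f"exception: {type(exc).__name__}: {exc}"
    # Only a completed run can pass or fail the accuracy comparison.
    return "pass" if result == expected else "fail_accuracy"


def failing_run():
    raise RuntimeError("device error")


status_on_error = check_accuracy(failing_run, expected=1)
status_on_mismatch = check_accuracy(lambda: 2, expected=1)
```

Keeping the two outcomes distinct makes it clear whether a model produced wrong numbers or never finished running.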

Before fixing, it returned as `fail_accuracy`:
<img width="1109" height="216" alt="image" src="https://github.com/user-attachments/assets/385c354f-fbf6-48e4-a1be-3e37e987341b" />

After fixing, it returned the exception message:
<img width="1101" height="292" alt="image" src="https://github.com/user-attachments/assets/f18c0e3c-8358-4ec7-a6bb-c2e01b69d27f" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165731
Approved by: https://github.com/Stonepia, https://github.com/chuanqi129, https://github.com/Lucaskabela
2025-10-17 16:37:06 +00:00
6ece527fc5 [CI] Add aarch64 operator benchmark (#165585)
Running on Graviton4
Skip ConvTranspose1d benchmarks if PyTorch is compiled with ACL, due to https://github.com/pytorch/pytorch/issues/165654
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165585
Approved by: https://github.com/huydhn
2025-10-17 14:42:14 +00:00
e925dfcc6b Enable all SIM rules except disabled ones (#164645)
`SIM` rules are useful for simplifying boolean expressions and enhancing code readability.
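
One example of the kind of simplification this family suggests, using rule SIM103 (return the condition directly). The functions below are hypothetical and only illustrate the pattern.

```python
def is_positive_verbose(x: int) -> bool:
    # Pattern SIM103 flags: an if/else that only returns True or False.
    if x > 0:
        return True
    else:
        return False


def is_positive(x: int) -> bool:
    # Simplified form: the comparison already yields a bool.
    return x > 0
```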

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang, https://github.com/mlazos
2025-10-17 07:27:11 +00:00
b2953f5643 [9/N] Apply ruff UP035 rule (#165515)
This is a follow-up to #165214 that continues applying the ruff UP035 rule to the code base.
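
For context, UP035 flags imports from deprecated locations. A minimal sketch of one typical rewrite, assuming a Python 3.9+ target; `apply_all` is a hypothetical function used only to exercise the imports.

```python
# Flagged by UP035 on recent Python targets:
#     from typing import Callable, Sequence
# Preferred replacement:
from collections.abc import Callable, Sequence


def apply_all(funcs: Sequence[Callable[[int], int]], value: int) -> int:
    # Apply each function in order, feeding the result forward.
    for func in funcs:
        value = func(value)
    return value


result = apply_all([lambda v: v + 1, lambda v: v * 2], 3)
```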

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165515
Approved by: https://github.com/Lucaskabela
2025-10-17 00:09:51 +00:00
cbc08c8993 Add NEON acceleration for Vectorized<int[8|16|32|64]> (#165273)
Summary:
Adding NEON specializations of Vectorized<T> for int8, int16, int32 and int64.

Correctness has been checked using test_ops.py and the comprehensive torch test

operator_benchmark_test.py has been enhanced by adding cases of bitwise operations, boolean ops and integer ops.
The benchmark, which uses the PyTorch API, shows significant enhancements in a wide variety of operations:

Before:

bitwise xor: 779.882us
boolean any: 636.209us
boolean all: 538.621us
integer mul: 304.457us
integer asr: 447.997us

After:

bitwise xor: 680.221us ---> 15% higher throughput
boolean any: 391.468us ---> 63% higher throughput
boolean all: 390.189us ---> 38% higher throughput
integer mul: 193.532us ---> 57% higher throughput
integer asr: 179.929us ---> 149% higher throughput
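
The percentages above follow from treating throughput as the inverse of latency. A small sketch that recomputes them from the reported timings (the helper name `throughput_gain_percent` is made up for this example):

```python
# Latencies in microseconds, copied from the Before/After lists.
before_us = {
    "bitwise xor": 779.882,
    "boolean any": 636.209,
    "boolean all": 538.621,
    "integer mul": 304.457,
    "integer asr": 447.997,
}
after_us = {
    "bitwise xor": 680.221,
    "boolean any": 391.468,
    "boolean all": 390.189,
    "integer mul": 193.532,
    "integer asr": 179.929,
}


def throughput_gain_percent(before: float, after: float) -> float:
    # Throughput is proportional to 1 / latency, so the gain is
    # before / after - 1, expressed as a percentage.
    return (before / after - 1.0) * 100.0


gains = {
    op: round(throughput_gain_percent(before_us[op], after_us[op]))
    for op in before_us
}
```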

Test Plan:
Correctness:

buck2 test @mode/opt //caffe2/test:test_ops
buck2 test @mode/opt //caffe2/test:torch
buck2 test @mode/opt //caffe2/test/distributed/launcher/fb:fb_run_test

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D84424638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165273
Approved by: https://github.com/malfet
2025-10-16 21:35:13 +00:00
23fb7e9f4b [CI] Add arch prefix in front of op benchmark results (#165584)
To be able to run x86 and aarch64 benchmarks later on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165584
Approved by: https://github.com/huydhn
ghstack dependencies: #165583
2025-10-16 01:50:52 +00:00
7a97832585 [ROCm] Add more timm models, forward fix #165381 (#165569)
PR #165381 added timm models to cuda and cpu expected accuracy files. ROCm expected accuracy files were not updated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165569
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-15 18:11:21 +00:00
47524dcc48 [benchmark] Add more timm models (#165381)
Added the following models to timm_models

- [convnextv2_nano.fcmae_ft_in22k_in1k](https://huggingface.co/timm/convnextv2_nano.fcmae_ft_in22k_in1k)
- [vit_base_patch14_dinov2.lvd142m](https://huggingface.co/timm/vit_base_patch14_dinov2.lvd142m)
- [ViT-B-16-SigLIP-i18n-256](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256)
- [deit_tiny_patch16_224.fb_in1k](https://huggingface.co/timm/deit_tiny_patch16_224.fb_in1k)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165381
Approved by: https://github.com/BoyuanFeng
2025-10-15 01:19:10 +00:00
102b7885ff Add option to run AOT Precompile in benchmark (#164906)
Use the existing benchmark infra to get some signals for AOT precompile pass rate on OSS models. Here we also measure and log the loading time.

```
python ./benchmarks/dynamo/huggingface.py --accuracy --inference --aot-precompile

python ./benchmarks/dynamo/timm_models.py --accuracy --inference --aot-precompile

python ./benchmarks/dynamo/torchbench.py --accuracy --inference --aot-precompile
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164906
Approved by: https://github.com/zhxchen17
2025-10-14 20:59:55 +00:00
8de85896e0 Enable ruff rule E721 (#165162)
`E721` checks for object type comparisons using == and other comparison operators. This is useful because it is recommended to use `is` for type comparisons.
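
A short sketch of the difference, with hypothetical helper names. Note that `isinstance` is often the better choice when subclasses should also match.

```python
def is_exact_int_flagged(value: object) -> bool:
    # Pattern E721 flags: comparing types with ==.
    return type(value) == int  # noqa: E721


def is_exact_int(value: object) -> bool:
    # Identity comparison for an exact type match.
    return type(value) is int


def is_int_like(value: object) -> bool:
    # isinstance also accepts subclasses such as bool.
    return isinstance(value, int)
```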

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165162
Approved by: https://github.com/Skylion007
2025-10-13 01:48:55 +00:00
5ad7611b52 Reland vision pinned commit hash update (#164492)
Redo https://github.com/pytorch/pytorch/pull/154694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164492
Approved by: https://github.com/yangw-dev
2025-10-12 04:53:27 +00:00
5171f14064 [inductor] verify determinism with inductor benchmark script (#164904)
Verify the deterministic mode with torch.compile benchmark scripts.

Here is what my testing script does (pasted at the end):
- run a model in default mode and save its result
- run the model again in default mode, but distort the benchmarking results. Compare it with the saved result.
- Do the above again in deterministic mode.

I tested a few models:
- BertForMaskedLM and GoogleFnet: I can repro the numeric change by distorting the benchmark result in the default mode. The non-determinism is gone in the deterministic mode.
- DistillGPT2: I cannot repro the numeric change by distorting the benchmarking result in the default mode. This is not very surprising, since a change in reduction order does not always cause a numeric change.
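
The last point can be illustrated with plain floating-point arithmetic, independent of Inductor. In this sketch the operands are arbitrary examples: one grouping changes the sum, the other does not.

```python
# Floating-point addition is not associative, so a different
# reduction order can change the result...
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left

# ...but it does not have to: these operands are exactly
# representable, so every grouping gives the same sum.
exact_a = (1.0 + 2.0) + 3.0
exact_b = 1.0 + (2.0 + 3.0)
order_irrelevant = exact_a == exact_b
```

This is why a model may show no numeric change even when the chosen kernels, and hence the reduction order, differ between runs.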

```
model=GoogleFnet

export TORCHINDUCTOR_WRITE_ARE_DETERMINISTIC_ALGORITHMS_ENABLED=0
export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  # disable autotune cache
export TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=0
export TORCHINDUCTOR_FX_GRAPH_CACHE=0
export TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting/
export TORCHINDUCTOR_BENCHMARK_KERNEL=1
export TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1
export INDUCTOR_TEST_DISABLE_FRESH_CACHE=1

# Non deterministic mode
# --float32 rather than --amp to make it easier to repro non-deterministic
echo "Save results for non-deterministic mode"
python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-non-deterministic.pkl

echo "Compare results with distorted benchmarking in non-deterministic mode"
TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-non-deterministic.pkl

echo "Save results for deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-deterministic.pkl

echo "Compare results with distorted benchmarking in deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-deterministic.pkl
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164904
Approved by: https://github.com/jansel, https://github.com/v0i0
2025-10-12 00:03:42 +00:00
816fb7f48d Revert "Enable ruff rule E721 (#165162)"
This reverts commit 9e7c19f72b6d0690915c307409c0c0a76b5a3bf0.

Reverted https://github.com/pytorch/pytorch/pull/165162 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165162#issuecomment-3393328271))
2025-10-11 13:25:40 +00:00
9e7c19f72b Enable ruff rule E721 (#165162)
`E721` checks for object type comparisons using == and other comparison operators. This is useful because it is recommended to use `is` for type comparisons.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165162
Approved by: https://github.com/Skylion007
2025-10-11 06:43:53 +00:00
d2cb183344 Revert "[inductor] verify determinism with inductor benchmark script (#164904)"
This reverts commit a3c700656f9a666eb33074b60333a23eb7e99a15.

Reverted https://github.com/pytorch/pytorch/pull/164904 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but there seems to be some failed vLLM failures coming out of this ([comment](https://github.com/pytorch/pytorch/pull/164904#issuecomment-3388443678))
2025-10-10 06:23:07 +00:00
7f2a902ea2 more sizelike deprecation (#164889)
Remove the expect_size C++ bindings and usages.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164889
Approved by: https://github.com/mlazos
ghstack dependencies: #164884, #164885, #164886, #164887, #164888
2025-10-10 03:45:06 +00:00
a3c700656f [inductor] verify determinism with inductor benchmark script (#164904)
Verify the deterministic mode with torch.compile benchmark scripts.

Here is what my testing script does (pasted at the end):
- run a model in default mode and save its result
- run the model again in default mode, but distort the benchmarking results. Compare it with the saved result.
- Do the above again in deterministic mode.

I tested a few models:
- BertForMaskedLM and GoogleFnet: I can repro the numeric change by distorting the benchmark result in the default mode. The non-determinism is gone in the deterministic mode.
- DistillGPT2: I cannot repro the numeric change by distorting the benchmarking result in the default mode. This is not very surprising, since a change in reduction order does not always cause a numeric change.

```
model=GoogleFnet

export TORCHINDUCTOR_WRITE_ARE_DETERMINISTIC_ALGORITHMS_ENABLED=0
export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  # disable autotune cache
export TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=0
export TORCHINDUCTOR_FX_GRAPH_CACHE=0
export TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting/
export TORCHINDUCTOR_BENCHMARK_KERNEL=1
export TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1
export INDUCTOR_TEST_DISABLE_FRESH_CACHE=1

# Non deterministic mode
# --float32 rather than --amp to make it easier to repro non-deterministic
echo "Save results for non-deterministic mode"
python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-non-deterministic.pkl

echo "Compare results with distorted benchmarking in non-deterministic mode"
TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-non-deterministic.pkl

echo "Save results for deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-deterministic.pkl

echo "Compare results with distorted benchmarking in deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-deterministic.pkl
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164904
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #164801, #164532
2025-10-10 00:00:58 +00:00
90b4e130d6 [Benchmark] cleanup torchbench models (#164816)
Prune models from the TorchInductor dashboard to reduce CI cost. This PR prunes torchbench models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), which removes timm and huggingface models from torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164816
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet
2025-10-09 00:31:25 +00:00
83458197d1 [Benchmark] remove old timm models from benchmark (#164805)
Prune models from the TorchInductor dashboard to reduce CI cost. This PR prunes timm models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), which reduces the set from 60 to 14 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164805
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet
2025-10-08 17:14:58 +00:00
1927783aa3 Revert "Reland vision pinned commit hash update (#164492)"
This reverts commit 6861a270624b44954826688f8dad668eb0154452.

Reverted https://github.com/pytorch/pytorch/pull/164492 on behalf of https://github.com/izaitsevfb due to see autorevert msg above, inductor breakage is legit ([comment](https://github.com/pytorch/pytorch/pull/164492#issuecomment-3379537888))
2025-10-08 04:38:26 +00:00
f76fdcaaf8 [Benchmark] cleanup huggingface models (#164815)
Prune models from the TorchInductor dashboard to reduce CI cost. This PR prunes Hugging Face models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), which reduces the set from 46 to 27 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164815
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet
2025-10-08 03:21:04 +00:00
d444384003 [SymmMem] Tiled reduce (#162243)
Added op: `tile_reduce(Tensor input, Tensor(a!) out, int root, str group_name)`

For now supports only:
- NVSHMEM backed symmetric tensor;
- 2D tensor and tile;
- torch.float.

Testing on the bottom-right quadrant:
```
rank 0:
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.]], device='cuda:0')
PASSED
```
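
To make the idea concrete, here is a pure-Python simulation of a tiled reduce on nested lists. It is a sketch under loud assumptions, none of which come from the PR: the reduction is a sum, there are two ranks, each rank's input is filled with its rank index, and `tile_reduce_sum` is a hypothetical helper rather than the real op.

```python
def tile_reduce_sum(inputs, rows, cols):
    """Sum the (rows, cols) tile of every rank's input into a
    zero-filled output, as the root rank would see it.

    Hypothetical stand-in for the real op; assumes a sum reduction.
    """
    size = len(inputs[0])
    out = [[0.0] * size for _ in range(size)]
    for tensor in inputs:
        for i in rows:
            for j in cols:
                out[i][j] += tensor[i][j]
    return out


# Assumed setup: 2 ranks, 8x8 inputs, rank r holds a tensor full of r.
world_size, n = 2, 8
inputs = [[[float(rank)] * n for _ in range(n)] for rank in range(world_size)]

# Reduce only the bottom-right quadrant.
half = n // 2
out_rank0 = tile_reduce_sum(inputs, rows=range(half, n), cols=range(half, n))
```

Under these assumptions the root's output is zero everywhere except the bottom-right quadrant, which holds 0 + 1 = 1, matching the shape of the printout above.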

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162243
Approved by: https://github.com/ngimel
2025-10-08 02:03:04 +00:00
6861a27062 Reland vision pinned commit hash update (#164492)
Redo https://github.com/pytorch/pytorch/pull/154694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164492
Approved by: https://github.com/yangw-dev
2025-10-07 22:45:05 +00:00
afee8062d5 Revert "Fix mesh.get_local_rank when it is > 1d (#164473)"
This reverts commit 83d71dfb2fd993a6242372b8123549acaa85ffdb.

Reverted https://github.com/pytorch/pytorch/pull/164473 on behalf of https://github.com/izaitsevfb due to appears to be causing vision_maskrcnn regression ([comment](https://github.com/pytorch/pytorch/pull/164473#issuecomment-3374738997))
2025-10-07 00:37:41 +00:00
5d7360bb03 Revert "Enable all SIM rules except disabled ones (#164645)"
This reverts commit 321e6026925f6b6e8a36e3a8b7c0295cd7541911.

Reverted https://github.com/pytorch/pytorch/pull/164645 on behalf of https://github.com/izaitsevfb due to causes lint failures ([comment](https://github.com/pytorch/pytorch/pull/164645#issuecomment-3369274351))
2025-10-05 19:32:21 +00:00
321e602692 Enable all SIM rules except disabled ones (#164645)
`SIM` rules are useful for simplifying boolean expressions and enhances code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang
2025-10-05 07:38:25 +00:00
83d71dfb2f Fix mesh.get_local_rank when it is > 1d (#164473)
Previously, we would not take the arguments passed by get_local_rank into account. This means that we wouldn't be able to trace this call if we had a device_mesh > 1d
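
For readers unfamiliar with the call, a minimal sketch of what a local rank means on a 2-D mesh. The nested-list mesh and the standalone `get_local_rank` function are simplifications for illustration, not the `DeviceMesh` implementation.

```python
def get_local_rank(mesh, global_rank, mesh_dim):
    """Return global_rank's coordinate along mesh_dim of a 2-D mesh.

    The mesh is modeled as a nested list of global ranks.
    """
    for row_index, row in enumerate(mesh):
        if global_rank in row:
            coordinate = (row_index, row.index(global_rank))
            return coordinate[mesh_dim]
    raise ValueError(f"rank {global_rank} is not part of the mesh")


# A 2 x 4 mesh over 8 ranks.
mesh = [[0, 1, 2, 3], [4, 5, 6, 7]]

# Rank 6 sits in row 1, column 2, so the answer depends on mesh_dim.
rank6_dim0 = get_local_rank(mesh, 6, mesh_dim=0)
rank6_dim1 = get_local_rank(mesh, 6, mesh_dim=1)
```

The result differs per dimension, which is why ignoring the argument only goes unnoticed on a 1-D mesh.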

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164473
Approved by: https://github.com/xmfan, https://github.com/Skylion007
2025-10-04 11:27:55 +00:00
412c6d28ec [ROCm][CI] additional dynamo benchmarks for inductor-periodic (#164279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164279
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-04 00:55:17 +00:00
0319556a35 Revert "[vision hash update] update the pinned vision hash (#154694)"
This reverts commit bcafea5c92ca2ee1b0dc8f6d8b62ecabb6f40228.

Reverted https://github.com/pytorch/pytorch/pull/154694 on behalf of https://github.com/yangw-dev due to break the unittest for inductor with improved, update benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv, see failure example https://github.com/pytorch/pytorch/actions/runs/18185852421/job/51776537817 ([comment](https://github.com/pytorch/pytorch/pull/154694#issuecomment-3362285901))
2025-10-02 17:32:04 +00:00
bcafea5c92 [vision hash update] update the pinned vision hash (#154694)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154694
Approved by: https://github.com/pytorchbot

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-10-02 07:02:40 +00:00
fa54b08cd5 Replace setup.py install with pip install (#156711)
#156027 already replaced most uses of `python setup.py install`.
This PR only adds a few more occurrences and adds `--no-build-isolation` in a few places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156711
Approved by: https://github.com/atalman
2025-09-29 15:15:10 +00:00
54b38f3b46 Add operator benchmarking run to CI nightly (#162530)
This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus more on different cuda hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.

**Benchmark Configuration and Coverage:**

* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI.
* The CI benchmarking runs on various hardware: H100, A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-29 00:46:38 +00:00
b377c9e365 graph break on tolist if capture_scalar_outputs is false (#163807)
address https://github.com/pytorch/pytorch/issues/163798

It is problematic not to graph break because:
1. It breaks the current contract.
2. If Dynamo traces through, we end up with an .item call; if we ever re-trace later (in autograd, for example), we hit a failure because we do not know where to graph break at that point. See the added unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163807
Approved by: https://github.com/bobrenjc93
2025-09-28 04:02:52 +00:00
254d2864d6 Add runtime_overhead PR Time Benchmark (#163866)
This adds a PR time benchmark that checks for runtime overhead on a very small graph. This will help track regressions in runtime overhead.

Example Results:
```
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
```
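
Results in this `name,metric,value` form are easy to post-process. The sketch below parses the example rows and computes a relative change; the helper names and the choice of comparison are illustrative assumptions, not part of the benchmark tooling.

```python
import csv

# Example rows copied from the results above.
RESULTS = """
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
"""


def parse_results(text: str) -> dict:
    # Each non-empty line is "benchmark_name,metric,value".
    rows = csv.reader(text.strip().splitlines())
    return {
        name: int(value)
        for name, metric, value in rows
        if metric == "instruction_count"
    }


def relative_change(baseline: int, measured: int) -> float:
    return (measured - baseline) / baseline


counts = parse_results(RESULTS)
dynamic_overhead = relative_change(
    counts["runtime_overhead_inductor"],
    counts["runtime_overhead_inductor_dynamic"],
)
```

Here `dynamic_overhead` comes out at roughly 5.5%, the extra instruction count of the dynamic-shape variant relative to the static one in this example.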

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163866
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2025-09-27 03:26:59 +00:00
21a41edd4f Add fake_impl for _native_multi_head_attention (#163700)
Test Plan: See added test in test_export.py

Differential Revision: D83099187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163700
Approved by: https://github.com/angelayi
2025-09-25 19:01:27 +00:00
dad54ca7c0 Add mistral/gpt-oss to benchmarks (#163565)
Potential issues
* gpt-oss-20b is probably too big (I can't run on my devserver)
* Mistral requires HF authentication
* Mistral also takes a while to run the performance checks (need to wait for CI)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163565
Approved by: https://github.com/huydhn
2025-09-24 06:12:36 +00:00
bfe9e60ffb Simplify PrecompileContext to no longer be a CacheArtifactManager (#162886)
Summary:
This diff does a big refactor of PrecompileContext to make it considerably simpler: instead of being a CacheArtifactManager and managing a bunch of bytes, it simply stores two things: dynamo cache entries and backend cache entries. When asked, it stitches them together into PrecompileCacheEntries, which are stored by DynamoCache.

This structure then allows us to register DynamoCache to the regular Megacache API, instead of having two separate APIs that are confusing. It also lets us remove the autotune cache integration, since MegaCache API will automatically store autotune cache entries.

The intent here is that users who want to use caching precompile will simply be able to use torch.compiler.save_cache_artifacts as before, just with `torch._dynamo.config.caching_precompile` set to True. They can also directly interact with PrecompileContext if they wish to specifically only load Precompile entries, using PrecompileContext.create_cache_entries().

Saving single entries and such with DynamoCache still works normally.

Test Plan:
All existing unit tests pass.

Rollback Plan:

Differential Revision: D82380307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162886
Approved by: https://github.com/zhxchen17
2025-09-20 01:24:37 +00:00
33e6c5a93d [Dependabot] Update(deps): Bump transformers from 4.54.0 to 4.56.0 in /.ci/docker/ci_commit_pins (#162063)
* [Dependabot] Update(deps): Bump transformers

Bumps [transformers](https://github.com/huggingface/transformers) from 4.54.0 to 4.56.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.54.0...v4.56.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.56.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Refresh results

Signed-off-by: Huy Do <huydhn@gmail.com>

* Another round of updates

Signed-off-by: Huy Do <huydhn@gmail.com>

* Another round of update

Signed-off-by: Huy Do <huydhn@gmail.com>

* Hopefully the last round of update

Signed-off-by: Huy Do <huydhn@gmail.com>

* Plz

Signed-off-by: Huy Do <huydhn@gmail.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-19 02:50:36 -07:00
ddc56f6f92 [functional] Use the saved device on storage instead for device_custom (#162987)
Trying to reduce the number of __torch_dispatch__ calls of FakeTensorMode in the AOT metadata collection pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162987
Approved by: https://github.com/Lucaskabela, https://github.com/bdhirsh, https://github.com/zou3519
2025-09-18 23:43:20 +00:00
62a746f62c [ROCm] update ci_expected_accuracy for dynamo benchmarks (#163256)
Some tests that were already failing changed status to skipped.  Some model entries were missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163256
Approved by: https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-18 19:05:19 +00:00
c7fa16a05c [ROCm][CI] update _rocm-test.yml based on _linux-test.yml (#163014)
Fixes missing huggingface secrets and aligns _rocm-test.yml with other updates from _linux-test.yml that it was initially based on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163014
Approved by: https://github.com/huydhn
2025-09-16 02:14:38 +00:00
b334a5a379 [ROCm][benchmark] Add HF LLM benchmark expected accuracy (#162965)
PR #156967 added HF LLM benchmarks but did not add the ci expected accuracy files for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162965
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-15 18:04:39 +00:00
972140b7e9 [benchmark] Add HF LLM benchmarks (#156967)
Results in https://docs.google.com/spreadsheets/d/1xXOPg9JjEmPx0zc5QBNdyXQq8-K2_r4ybHaiS-q7pZ0/edit?gid=88695043#gid=88695043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156967
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-14 07:41:06 +00:00
f7ea4975ab update the baseline data for the operator benchmark (#162693)
According to the results of the last four operator benchmark runs, five benchmark cases improved by more than 30% over the baseline. This PR therefore updates the operator benchmark baseline data, using the average of the four runs as the new baseline for those five cases.

It also adds a pull request trigger for the operator benchmark workflow.

Benchmarking Framework | Benchmarking Module Name | Case Name | tag | run_backward | baseline old | r1 | r2 | r3 | r4 | avg | speedup
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
PyTorch | add | add_M1_N1_K1_cpu | short | FALSE | 3.9497 | 2.57 | 2.54 | 2.38 | 2.31 | 2.45 | 1.61
PyTorch | functional.hardtanh | functional.hardtanh_dims(512	512)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 67.118 | 50.02 | 49.80 | 46.78 | 48.94 | 48.88 | 1.37
PyTorch | relu6 | relu6_dims(512	512)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 68.739 | 51.17 | 51.19 | 48.07 | 50.42 | 50.21 | 1.37
PyTorch | relu6 | relu6_dims(256	1024)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 69.1875 | 51.97 | 52.77 | 50.00 | 51.24 | 51.50 | 1.34
PyTorch | functional.hardtanh | functional.hardtanh_dims(256	1024)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 67.436 | 50.98 | 51.69 | 49.06 | 49.87 | 50.40 | 1.34
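
The `avg` and `speedup` columns can be reproduced from the run values. A small sketch for the first row (`add_M1_N1_K1_cpu`), with variable names chosen for this example:

```python
# Values copied from the first table row.
old_baseline = 3.9497
runs = [2.57, 2.54, 2.38, 2.31]

# New baseline: the mean of the four most recent runs.
new_baseline = sum(runs) / len(runs)

# Speedup: how many times faster the new baseline is than the old one.
speedup = old_baseline / new_baseline

summary = (round(new_baseline, 2), round(speedup, 2))
```

A speedup above 1.3 corresponds to the "more than 30% improvement" criterion mentioned in the description.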

@chuanqi129 @huydhn @desertfire @jainapurva

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162693
Approved by: https://github.com/huydhn
2025-09-12 20:53:29 +00:00