Potential issues
* gpt-oss-20b is probably too big (I can't run it on my devserver)
* Mistral requires HF authentication
* Mistral also takes a while to run the performance checks (need to wait for CI)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163565
Approved by: https://github.com/huydhn
Summary:
This diff does a big refactor of PrecompileContext to make it considerably simpler: instead of being a CacheArtifactManager and managing a bunch of bytes, it simply stores two things: dynamo cache entries and backend cache entries. When asked, it stitches them together into PrecompileCacheEntries, which are stored by DynamoCache.
This structure then allows us to register DynamoCache with the regular Megacache API, instead of having two separate, confusing APIs. It also lets us remove the autotune cache integration, since the Megacache API will automatically store autotune cache entries.
The intent here is that users who want to use caching precompile will simply be able to use torch.compiler.save_cache_artifacts as before, just with `torch.dynamo.config.caching_precompile` set to True. They can also directly interact with PrecompileContext if they wish to specifically only load Precompile entries, using PrecompileContext.create_cache_entries().
Saving single entries and such with DynamoCache still works normally.
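A minimal sketch of that intended flow, assuming the flag lives at `torch._dynamo.config.caching_precompile` (the persistence step is illustrative):
```
import torch

# Assumption: the config flag lives at torch._dynamo.config.caching_precompile.
torch._dynamo.config.caching_precompile = True

@torch.compile
def fn(x):
    return x.sin() + x.cos()

fn(torch.randn(8))

# With caching precompile enabled, dynamo/backend precompile entries are bundled
# with the other Megacache artifacts instead of going through a separate API.
result = torch.compiler.save_cache_artifacts()
if result is not None:
    artifact_bytes, cache_info = result
    # ...persist artifact_bytes, then in a fresh process:
    torch.compiler.load_cache_artifacts(artifact_bytes)
```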
Test Plan:
All existing unit tests pass.
Rollback Plan:
Differential Revision: D82380307
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162886
Approved by: https://github.com/zhxchen17
This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to the existing Eager and JIT modes. It also adds peak memory measurement (fwd/bwd pass), improves the JSON output format consumed by the dashboard for reporting, and introduces some new CLI options. The new CLI flags introduced are:
- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit` (see the sketch below)
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name
Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
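A hedged sketch of how the `--use-compile`/`--use-jit` mutual exclusivity can be expressed with argparse (illustrative only; the suite's actual argument parsing may differ):
```
import argparse

parser = argparse.ArgumentParser(description="operator benchmark (sketch)")
group = parser.add_mutually_exclusive_group()
group.add_argument("--use-compile", action="store_true",
                   help="run benchmarks under torch.compile")
group.add_argument("--use-jit", action="store_true",
                   help="run benchmarks under TorchScript JIT")
parser.add_argument("--benchmark-name", type=str, default=None,
                    help="custom benchmark name used in the output")
parser.add_argument("--output-json-for-dashboard", type=str,
                    default="benchmark-results.json",
                    help="JSON output consumed by the dashboard")
args = parser.parse_args(["--use-compile"])  # mirrors the sample command above
```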
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
Update PyTorch to the latest Triton release candidate branch (release/3.5.x in triton-lang/triton)
Notably:
* this does *not* include the version number bump from 3.4 -> 3.5 (we'll do that in a follow-up PR)
* sam_fast is still failing, so we've disabled it temporarily https://github.com/pytorch/pytorch/issues/162282 and we are committed to fixing it, ideally before the branch cut but possibly as a cherry-pick into the release branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162278
Approved by: https://github.com/atalman
ghstack dependencies: #162244, #162309
Reland of https://github.com/pytorch/pytorch/pull/159923
Couple of fixes:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and warn using the FQN of the lifted constant. We warn rather than error because some internal users complained that erroring was regressing their exportability.
2. The previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes a silent incorrectness issue when exporting esm and qwen in non-strict.
3. We modify yolov3 to fix the previously silent incorrect behaviour.
4. We use strict export for levit_128 because it errors in non-strict due to stricter side-effect checking.
When upgrading the torchbench pin, opacus_cifar10 no longer seems to run on eager. I verified this by pushing a temporary PR to master with the new pin, so I added it to the expect_fail list.
Differential Revision: [D81133908](https://our.internmc.facebook.com/intern/diff/D81133908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161589
Approved by: https://github.com/avikchaudhuri
To compare NativeRT and TorchScript, we add `torchscript-jit-trace` as an option in the benchmark. With this option, we can trace a model and run inference with the traced module using the TorchScript interpreter:
```
python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace
```
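Conceptually, the `torchscript-jit-trace` path boils down to the following sketch (hedged; the harness adds its own model loading and timing around it):
```
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(4, 16),)

with torch.no_grad():
    # Trace once, then run inference through the TorchScript interpreter.
    traced = torch.jit.trace(model, example_inputs)
    out = traced(*example_inputs)
```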
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223
Approved by: https://github.com/huydhn
- hf_Reformer: this one starts failing due to increased graph breaks from the transformers pin bump (#159291). We can likely just bump the expected graph break count.
- dla102: this one started timing out on Wed 8/13 between commits 6e8865f and ee1b041. But based on the PT2 dashboard, this model doesn't actually have a compile-time or runtime regression. Will try bumping up the timeout and see if that works.
- hf_BigBird: this one's accuracy status has improved as of today. Will update the hf_BigBird accuracy status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160932
Approved by: https://github.com/zou3519, https://github.com/huydhn, https://github.com/malfet
The GPT2ForSequenceClassification Hugging Face (HF) model fails on ROCm for bfloat16. The failure is numerically small. This PR adds the model to an exception list for small tensors; the exception list already includes two models. For this model, this raises the multiplier factor used in `torch/_dynamo/utils.py` to 10.0 instead of the default of 3.
In the PR comment below, I include a short analysis of the numerics.
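A hedged illustration of the tolerance-multiplier idea; the names and structure below are hypothetical, and the real exception list and comparison live in `torch/_dynamo/utils.py`:
```
import torch

# Hypothetical per-model multipliers: most models use the default of 3.0, while
# a few numerically sensitive cases (like this bfloat16 one) get a larger value.
MULTIPLIERS = {"GPT2ForSequenceClassification": 10.0}

def close_enough(model_name, ref, res, base_tol=1e-2):
    multiplier = MULTIPLIERS.get(model_name, 3.0)
    return torch.allclose(res, ref, rtol=base_tol * multiplier,
                          atol=base_tol * multiplier)
```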
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160001
Approved by: https://github.com/anijain2305, https://github.com/jataylo, https://github.com/jeffdaily
Summary:
Currently, Linear in FP32 dynamic mode (batch_size has free symbols) does not support weight prepacking, since MKL Linear does not support dynamic mode. This PR uses oneDNN Linear to support Linear weight prepacking in FP32 dynamic mode.
I tested the Inductor benchmark in FP32 dynamic mode on CPU using this PR, and saw ~8% improvement in timm_models geomean speedup, ~2% improvement in torchbench geomean speedup, and no change in huggingface. There are about 18 models with different degrees of performance improvement, among which BERT_pytorch, soft_actor_critic, BlenderbotForCausalLM, ElectraForCausalLM, crossvit_9_240, mobilevit_s, twins_pcpvt_base have more than 20% performance improvement.
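A hedged sketch of the targeted workload: a Linear compiled with a symbolic batch size (whether prepacking actually applies depends on Inductor's CPU settings, e.g. freezing, which is not asserted here):
```
import torch

linear = torch.nn.Linear(512, 512).eval()
compiled = torch.compile(linear)

x = torch.randn(32, 512)
# Make batch_size a free symbol so Inductor compiles a dynamic-shape kernel.
torch._dynamo.mark_dynamic(x, 0)

with torch.no_grad():
    compiled(x)
    compiled(torch.randn(64, 512))  # different batch size, same dynamic graph
```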
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157542
Approved by: https://github.com/CaoE, https://github.com/jansel
`get_free_symbol_uses` is used to determine which unbacked symbols are used by a given node.
Not having `get_free_symbol_uses` defined correctly leads to:
- elimination of some nodes because no users are detected (see the added unit test);
- incorrect topological sorts.
This PR fixes `get_free_symbol_uses` for NopKernel, ConcatKernel, InputsKernel, and external kernels.
ComputedBuffer with NonOwningLayout is an interesting case: when the layout is a NonOwningLayout, we need to access the layout of the underlying view op's base and detect the symbols in it, because those symbols are used when we codegen the ComputedBuffer.
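A hedged illustration of the kind of pattern that exercises unbacked symbols (this is not the unit test added here): `nonzero` produces a size that is an unbacked SymInt, and downstream nodes must report that symbol via `get_free_symbol_uses` so they are neither eliminated nor mis-sorted.
```
import torch

# Capture ops with data-dependent output shapes (like nonzero) instead of
# graph-breaking on them, so the example compiles as one graph.
torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(fullgraph=True)
def f(x):
    nz = x.nonzero()            # output size u0 is an unbacked SymInt
    return torch.cat([nz, nz])  # consumer of u0; must not be dropped or reordered

f(torch.tensor([0, 1, 0, 2]))
```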
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160314
Approved by: https://github.com/eellison
#158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark leaves around before it starts, so run a GC before warming up the model.
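A minimal sketch of the idea, with a hypothetical harness function; the point is only the explicit `gc.collect()` before warmup now that automatic GC stays off during recording:
```
import gc
import torch

def warmup(model, example_inputs, iters=3):
    # Automatic GC is off while recording cudagraphs (#158649), so clear out
    # garbage left over from benchmark setup before warming up the model.
    gc.collect()
    with torch.no_grad():
        for _ in range(iters):
            model(*example_inputs)
```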
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670
Approved by: https://github.com/oulgen
Hi @desertfire, according to the latest test [results](https://github.com/pytorch/pytorch/actions/runs/15385952839) from the inductor nightly for max_autotune tests, we plan to update the baseline data:
In the latest nightly test, two models require baseline updates:
- vision_maskrcnn: This model shows improved graph breaks, so I’ve updated the baseline accordingly.
- detectron2_fcos_r_50_fpn: This model has a different number of graph breaks. However, since its accuracy result still shows fail_accuracy, I skipped the graph break check for this model.
```
vision_maskrcnn IMPROVED: graph_breaks=29, expected=30
Improvement: 1 models have fixed dynamo graph breaks:
vision_maskrcnn
```
```
detectron2_fcos_r_50_fpn XFAIL
detectron2_fcos_r_50_fpn FAIL: graph_breaks=24, expected=22
Error: 1 models have new dynamo graph breaks:
detectron2_fcos_r_50_fpn
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154973
Approved by: https://github.com/desertfire
Fixes #154111
Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.
The proposed fix replaces the loop with a simple product instead. A benchmark is currently running: https://hud.pytorch.org/benchmark/compilers
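A hedged illustration of the loop-vs-product distinction (not the actual decomposition code): expressing the (Mx1)x(1xN) result as a broadcasted product keeps the shapes symbolic, whereas a Python-level loop over a symbolic dimension forces it to be evaluated to a concrete value.
```
import torch

# Dynamic-shape-friendly: (M x 1) @ (1 x N) written as a broadcasted product,
# so no Python-level iteration over a SymInt ever happens.
def mm_as_product(a, b):  # a: (M, 1), b: (1, N)
    return a * b          # broadcasts to (M, N), equal to a @ b here

compiled = torch.compile(mm_as_product, dynamic=True)
out = compiled(torch.randn(5, 1), torch.randn(1, 7))
```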
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.
The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PrecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (e.g. autotuning artifacts) if no dynamo compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts, but that's a follow-up.
- log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.
## Test Plan
After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance --inference --backend inductor --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Warm Start (~1 s):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
The ~1 second of warm start here can be improved: the costs are mostly in starting up workers and Triton and initializing CUDA, much of which should not be counted toward compile time in real-world scenarios where these are already loaded before training begins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847
Approved by: https://github.com/zhxchen17
This PR suggests adding some models that are currently run in the TIMM and Torchbench suites to `cpu_skip_list`.
The suggested models take a long time to run, which causes the benchmark runs to time out ([benchmark runs for aarch64](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-aarch64.yml)).
• The issue stems from unoptimized groupwise convolution (BF16/F16 dtype) kernels on aarch64 platforms, which significantly slow down execution and lead to the timeout.
**Action:**
• An optimized BF16 groupwise convolution kernel is currently being developed in oneDNN, targeted for release in Q4 2025.
To maintain dashboard consistency and signal clarity, I’ve skipped the affected tests in:
* timm benchmarks
* torchbench benchmarks
As suggested, the skip is applied at the CPU-arch level, explicitly branching for aarch64 and adding the models that need to be skipped. This keeps the logic clean, but:
• An alternative considered was increasing shard counts for aarch64 runners; given the known performance bottleneck, though, skipping avoids wasted compute cycles. Suggestions around this are appreciated.
Benchmark does not timeout after the suggested change: https://github.com/pytorch/pytorch/actions/runs/16447200138
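A hedged sketch of the arch-level branching described above; the model names are placeholders, and the real entries go into the benchmark suite's `cpu_skip_list`:
```
import platform

# Placeholder model names; the real entries are the slow groupwise-convolution
# (BF16/F16) cases on aarch64.
AARCH64_CPU_SKIPS = {"timm_model_a", "torchbench_model_b"}

def should_skip_on_cpu(model_name):
    return platform.machine() == "aarch64" and model_name in AARCH64_CPU_SKIPS
```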
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158842
Approved by: https://github.com/malfet
Collects some scattershot improvements made while attempting to enable training for AOTInductor. Non-typing changes are:
1. Swapping a few custom searches for the output node in an FX graph for calling `graph.output_node()` (see the sketch after this list).
2. Removing two unused parameters from `torch.export._unlift._unlift`.
3. Switching handles to constants in `cpp_wrapper_cpu` to use C++ references for memory efficiency.
4. Cleaning out unused, unexported imports from `torch/export/__init__.py`, and adding one missing export to `__all__`.
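A hedged before/after sketch of item 1, assuming `graph.output_node()` as referenced above:
```
import torch
from torch.fx import symbolic_trace

def f(x):
    return x + 1

gm = symbolic_trace(f)

# Before: hand-rolled search for the output node.
output = next(n for n in gm.graph.nodes if n.op == "output")

# After: use the dedicated accessor on torch.fx.Graph.
output = gm.graph.output_node()
```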
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158075
Approved by: https://github.com/Skylion007