pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Yiming Zhou	9d882fd9ff	[benchmark] Add torchscript jit.trace to benchmark option (#161223 ) To compare NativeRT and TorchScript, we add `torchscript-jit-trace` as an option in the benchmark. With this option, we can trace a model and run inference with the traced module using the TorchScript interpreter ``` python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223 Approved by: https://github.com/huydhn	2025-08-22 21:38:28 +00:00
Will Feng	8047cde0f3	Try to fix Inductor CI periodic tests (#160932 ) - hf_Reformer: this one starts failing due to increased graph breaks due to transformers pin bump (#159291). We can likely just bump the expected graph break count. - dla102: this one starts timing out on 8/13 Wed between commit 6e8865f and ee1b041. But based on the PT2 dashboard, this model actually doesn't have compile time or runtime regression. Will try to bump up the timeout and see if it can work. - hf_BigBird: this one has its accuracy status improved since today. Will update hf_BigBird accuracy status. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160932 Approved by: https://github.com/zou3519, https://github.com/huydhn, https://github.com/malfet	2025-08-20 20:36:46 +00:00
Nichols A. Romero	0298ebc97a	[ROCm][inductor][dashboard] Add GPT2ForSequenceClassification to use_larger_multiplier_for_smaller_tensor list (#160001 ) The GPT2ForSequenceClassification Hugging Face (HF) model fails on ROCm for bfloat16. The failure is numerically small. This PR adds this model to an exception list for small tensors. The exception list already includes two models. For this model, this increases the multiplier factor used in `torch/_dynamo/utils.py` to 10.0 instead of the default of 3. In the PR comment below, I include a short analysis of the numerics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160001 Approved by: https://github.com/anijain2305, https://github.com/jataylo, https://github.com/jeffdaily	2025-08-18 15:33:30 +00:00
Sun, Jiayi	95e456fcc5	[inductor] pack linear for FP32 dynamic mode (#157542 ) Summary: Currently, Linear in FP32 dynamic mode(batch_size has free symbols) does not support weight prepacking since MKL Linear does not support dynamic mode. This PR uses oneDNN Linear to support Linear weight prepacking in FP32 dynamic mode. I tested the Inductor benchmark in FP32 dynamic mode on CPU using this PR, and saw ~8% improvement in timm_models geomean speedup, ~2% improvement in torchbench geomean speedup, and no change in huggingface. There are about 18 models with different degrees of performance improvement, among which BERT_pytorch, soft_actor_critic, BlenderbotForCausalLM, ElectraForCausalLM, crossvit_9_240, mobilevit_s, twins_pcpvt_base have more than 20% performance improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157542 Approved by: https://github.com/CaoE, https://github.com/jansel	2025-08-18 10:18:46 +00:00
Laith Sakka	6f0f4e0c3e	reduce threshold to suggest changes to expected results (#160463 ) Since we increased the failure threshold to 10%, I would like suggestions to update the expected results to show up at ±2% instead of the current 3.3%. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160463 Approved by: https://github.com/jamesjwu	2025-08-14 09:11:27 +00:00
Laith Sakka	dd21c8a578	refresh expected results (#160537 ) Regression introduced by https://github.com/pytorch/pytorch/pull/160314. Not too worried about it since it did not affect other inductor benchmarks. Could not repro locally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160537 Approved by: https://github.com/eellison	2025-08-14 00:56:14 +00:00
Laith Sakka	96bd33b2de	Fix get_free_symbol_uses for several nodes (#160314 ) get_free_symbol_uses is used to know what unbacked symbols are used by a given node. Not having get_free_symbol_uses defined correctly leads to: - Elimination of some nodes because no users are detected (see the added unit test). - Incorrect topological sort. Fix get_free_symbol_uses for NopKernel, ConcatKernel, InputsKernel, and external kernel. ComputedBuffer with NonOwningLayout is an interesting case: when the layout is NonOwningLayout, we need to access the actual view op's base layout and detect symbols in it, because we use those symbols when we codegen the ComputedBuffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160314 Approved by: https://github.com/eellison	2025-08-13 12:28:29 +00:00
Animesh Jain	01bcf9a40d	Bump transformers pin (#159291 ) Trying to update hf pin. Benchmarking run to figure out issues <img width="1356" height="123" alt="image" src="https://github.com/user-attachments/assets/fbc435f3-a7cb-4280-9636-2ea6d15d7b6d" /> Retrying - https://github.com/pytorch/pytorch/pull/156118 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159291 Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-08-12 05:14:17 +00:00
Yiming Zhou	017259f9c6	[benchmarks] Add nativert benchmark (#159922 ) Add NativeRT as an option in the PT2 OSS benchmark ``` python ./benchmarks/dynamo/huggingface.py --performance --inference --export-nativert python ./benchmarks/dynamo/timm_models.py --performance --inference --export-nativert python ./benchmarks/dynamo/torchbench.py --performance --inference --export-nativert ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159922 Approved by: https://github.com/angelayi	2025-08-08 03:38:32 +00:00
Laith Sakka	1bb5e6c076	update expected results (#159867 ) refresh due to https://github.com/pytorch/pytorch/pull/159696 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159867 Approved by: https://github.com/masnesral	2025-08-07 01:18:36 +00:00
Laith Sakka	978e3a9142	refresh expected results (#159727 ) Just regular update due to recent <10% changes CI is stable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159727 Approved by: https://github.com/anijain2305	2025-08-03 22:47:50 +00:00
Aaron Orenstein	3f86076775	gc before warming up benchmarking (#159670 ) #158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark is leaving around before the benchmark starts - so GC before warming up the model. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670 Approved by: https://github.com/oulgen	2025-08-02 19:37:24 +00:00
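The idea in the commit above (collect garbage before warming up so that leftover allocations do not leak into the measurement) can be sketched in plain Python. This is a minimal illustration, not the benchmark harness's code; the `benchmark` helper and its `warmup`/`iters` parameters are hypothetical names chosen for this sketch.

```python
import gc
import time


def benchmark(fn, warmup=3, iters=10):
    """Return the mean wall-clock seconds per call of fn (hypothetical helper)."""
    # Collect garbage first so objects left over from earlier work are
    # released before warmup and timing start.
    gc.collect()
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


mean_seconds = benchmark(lambda: sum(range(10_000)))
```

Calling `gc.collect()` once up front keeps the collection cost itself outside the timed region.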
Huy Do	465fe4d9f7	Enable sample nightly PT2 benchmark on B200 (#158011 ) Per the discussion with @nWEIdia, this resumes the work on https://github.com/pytorch/pytorch/pull/157870 to enable PT2 benchmark on B200 ### Testing https://github.com/pytorch/pytorch/actions/runs/16615101382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158011 Approved by: https://github.com/nWEIdia, https://github.com/atalman	2025-08-01 23:47:44 +00:00
LifengWang	838924436e	update the baseline for nightly max_autotune tests (#154973 ) Hi @desertfire, according to the latest test [results](https://github.com/pytorch/pytorch/actions/runs/15385952839) from the inductor nightly for max_autotune tests, we plan to update the baseline data: In the latest nightly test, two models require baseline updates: - vision_maskrcnn: This model shows improved graph breaks, so I’ve updated the baseline accordingly. - detectron2_fcos_r_50_fpn: This model has a different number of graph breaks. However, since its accuracy result still shows fail_accuracy, I skipped the graph break check for this model. ``` vision_maskrcnn IMPROVED: graph_breaks=29, expected=30 Improvement: 1 models have fixed dynamo graph breaks: vision_maskrcnn ``` ``` detectron2_fcos_r_50_fpn XFAIL detectron2_fcos_r_50_fpn FAIL: graph_breaks=24, expected=22 Error: 1 models have new dynamo graph breaks: detectron2_fcos_r_50_fpn ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/154973 Approved by: https://github.com/desertfire	2025-07-31 11:38:55 +00:00
Arsh Zahed	24d07b3a67	[inductor] Fix mm decomposition evaluating symints (#158998 ) Fixes #154111 Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor. The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998 Approved by: https://github.com/jansel, https://github.com/BoyuanFeng	2025-07-30 16:34:15 +00:00
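The (Mx1)x(1xN) case mentioned in the commit above is an outer product: the shared dimension has length 1, so each output element is a single multiplication and no summation loop is needed. The sketch below shows only that algebra on nested Python lists. It is not Inductor's decomposition, and `mm_loop` and `mm_outer` are hypothetical names.

```python
def mm_loop(a, b):
    """Generic matrix multiply that sums over the shared dimension k."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def mm_outer(a, b):
    """Special case k == 1: every output element is one product."""
    assert len(b) == 1 and all(len(row) == 1 for row in a)
    return [[row[0] * x for x in b[0]] for row in a]


a = [[1.0], [2.0], [3.0]]  # shape (3, 1)
b = [[4.0, 5.0]]           # shape (1, 2)
loop_result = mm_loop(a, b)
outer_result = mm_outer(a, b)
```

Both functions return the same 3-by-2 result; the second never iterates over the shared dimension.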
Animesh Jain	8c0c5c58c7	[benchmarks] Set model name early to keep warmup and main model same (#159231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159231 Approved by: https://github.com/williamwen42 ghstack dependencies: #159209	2025-07-28 18:18:16 +00:00
Xuehai Pan	f5e2de928b	[BE] fix remaining flake8 v7 warnings (#159044 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159044 Approved by: https://github.com/Skylion007 ghstack dependencies: #159043	2025-07-25 02:56:34 +00:00
James Wu	f55c5d085e	[Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847 ) This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks. The following bugfixes are in this PR to make all of this work: - Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes) - Return None from PRecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming. - log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file. 
## Test Plan After this PR, the following now works: ``` TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance --inference --backend inductor --caching-precompile --warm-start-latency ``` tlparse result (internal): Cold Start (6 seconds): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 Warm Start (~1 s): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847 Approved by: https://github.com/zhxchen17	2025-07-24 14:09:54 +00:00
Aditya Tewari	7001d6fbc9	Skip slow tests for aarch64-inductor-benchmarks (#158842 ) This PR suggests adding some models that are currently run in TIMM and Torchbench to `cpu_skip_list`. The suggested models take a long time, which causes the benchmark runs to time out. [benchmark runs for aarch64](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-aarch64.yml) • The issue stems from unoptimized groupwise convolution (BF16/F16 dtype) kernels for aarch64 platforms, which significantly slow down execution, leading to the timeout. Action: • An optimized BF16 groupwise convolution kernel is currently being developed in oneDNN, targeted for release in Q4 2025. To maintain dashboard consistency and signal clarity, I’ve skipped the affected tests in: * timm benchmarks * torchbench benchmarks As suggested, the skip is applied at the CPU-arch level, explicitly branching for aarch64 and adding the models that need to be skipped. This keeps the logic clean. • An alternative considered was increasing shard counts for aarch64 runners, but given the known performance bottleneck, skipping avoids wasted compute cycles. Suggestions around this will be appreciated. Benchmark does not time out after the suggested change: https://github.com/pytorch/pytorch/actions/runs/16447200138 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158842 Approved by: https://github.com/malfet	2025-07-24 00:21:38 +00:00
Guilherme Leobas	b67f97c166	Correctly handle `OP_CONTAINS` (#158660 ) CPython can fallback to `__iter__` if object doesn't implement `__contains__` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158660 Approved by: https://github.com/zou3519	2025-07-23 22:31:51 +00:00
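The CPython behavior referenced in the commit above can be seen with a class that defines `__iter__` but not `__contains__`: the `in` operator falls back to iterating and comparing each element. A minimal sketch in plain Python (the `OnlyIter` class is a made-up example, not code from the PR):

```python
class OnlyIter:
    """Defines __iter__ but deliberately omits __contains__."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)


box = OnlyIter([1, 2, 3])
found = 2 in box    # membership is resolved by iterating over box
missing = 5 in box  # iteration is exhausted without a match
```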
PyTorch MergeBot	76be282e3a	Revert "[Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847 )" This reverts commit d898d0d437bfdc0719e6c69d5005606c5e64fca8. Reverted https://github.com/pytorch/pytorch/pull/158847 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI jobs on MI200 and MI300 ([comment](https://github.com/pytorch/pytorch/pull/158847#issuecomment-3109664713))	2025-07-23 18:25:46 +00:00
James Wu	d898d0d437	[Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847 ) This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks. The following bugfixes are in this PR to make all of this work: - Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes) - Return None from PRecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming. - log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file. 
## Test Plan After this PR, the following now works: ``` TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance --inference --backend inductor --caching-precompile --warm-start-latency ``` tlparse result (internal): Cold Start (6 seconds): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 Warm Start (~1 s): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847 Approved by: https://github.com/zhxchen17	2025-07-23 15:06:54 +00:00
Boyuan Feng	a155f742ad	[benchmark] allow default mode for compile (#158792 ) Allow default mode for compile when users cannot run "max-autotune-no-cudagraphs" due to compilation time. Overall, "default" mode is slower than "[max-autotune-no-cudagraphs](https://github.com/pytorch/pytorch/pull/158536)" depending on input shapes. <img width="3564" height="2368" alt="CrossEntropyBackward_bench" src="https://github.com/user-attachments/assets/5d25c0e4-6714-42bb-a544-b7ef9cbc1b17" /> <img width="3564" height="2368" alt="CrossEntropyForward_bench" src="https://github.com/user-attachments/assets/40e0bbf9-657f-48f2-ac0c-1f0fd6a0ac1d" /> <img width="3564" height="2368" alt="LayerNormBackward_bench" src="https://github.com/user-attachments/assets/db582bb2-d8d4-414a-9de7-b9af061ad0cd" /> <img width="3564" height="2368" alt="LayerNormForward_bench" src="https://github.com/user-attachments/assets/2ce18bd8-73fc-434a-820f-46aa9ad9ddce" /> <img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/f4cb5f4b-93d3-4d96-973f-37643912325a" /> <img width="3564" height="2368" alt="RMSNormForward_bench" src="https://github.com/user-attachments/assets/231c5805-b156-4587-9c5f-504a33b60883" /> <img width="3564" height="2368" alt="SoftmaxBackward_bench" src="https://github.com/user-attachments/assets/f651c578-813b-4a8e-bffc-b5b34bd879fc" /> <img width="3564" height="2368" alt="SoftmaxForward_bench" src="https://github.com/user-attachments/assets/bfdcc043-4370-4355-af84-9f463426b21a" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158792 Approved by: https://github.com/zou3519	2025-07-22 03:07:22 +00:00
Benjamin Glass	22920c9138	Grab bag of (mostly) typing improvements (#158075 ) Collects some scattershot improvements made while attempting to enable training for AOTInductor. Non-typing changes are: 1. Swapping a few custom searches for the output node in an FX graph for calling `graph.output_node()`. 2. Removing two unused parameters from `torch.export._unlift._unlift`. 3. Switching handles to constants in `cpp_wrapper_cpu` to use C++ references for memory efficiency. 4. Cleaning out unused, unexported imports from `torch/export/__init__.py`, and adding one missing export to `__all__`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158075 Approved by: https://github.com/Skylion007	2025-07-21 19:17:01 +00:00
Boyuan Feng	22d82222c6	GenAI Layer Benchmark (#158536 ) This PR adds GenAI layer benchmark. It compares pytorch eager, pytorch compiler, liger, and quack. It covers all kernels supported by [quack](https://github.com/Dao-AILab/quack?tab=readme-ov-file#kernels-) (CrossEntropy Fwd/Bwd, Softmax Fwd/Bwd, RMSNorm Fwd/Bwd, LayerNorm Fwd) and LayerNormBwd. ## Motivations - Many OSS users asked how to properly benchmark torch.compile generated kernels. One common error is to compile a kernel/layer for one shape (e.g., batch size=1) and benchmark for another shape (e.g., batch size = 1024), which leads to bad performance. This provides an simple & clear example for proper benchmark. - We recently added GenAI model benchmark (based on [vLLM](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm)). But it's usually hard to optimize models directly due to complexity. Layer benchmarks are easier to reason and optimize. ## Key Settings - Avoid reusing a kernel specializing on 1 shape for benchmark on another shape. ```python torch._dynamo.config.automatic_dynamic_shapes = False # Needed since changing args to function causes recompiles torch._dynamo.config.recompile_limit = 1000000 ``` - For forward, people may mark batch size as dynamic to avoid runtime recompilation. We respect the setting in this kernel-level benchmark. ``` torch._dynamo.mark_dynamic(x, 0) ``` GPU: H100 (devvm006.dkl0) Results: [P1874246170](https://www.internalfb.com/phabricator/paste/view/P1874246170) Note: for numerical accuracy, we use the default tolerance of torch.testing.assert_close (i.e., for `torch.bfloat16`, use rtol `1.6e-2` and atol `1e-5`). It shows numerical issues for some backends and kernels. Next step is to add roofline analysis, add to ci for checking regression, cover more GenAI Kernels, and include GenAI Layers for common fusion patterns. 
<img width="3564" height="2368" alt="CrossEntropyBackward_bench" src="https://github.com/user-attachments/assets/7aa77ad1-83eb-41ea-a27d-50fd5b1dd6be" /> <img width="3564" height="2368" alt="CrossEntropyForward_bench" src="https://github.com/user-attachments/assets/a26ec028-3791-4a41-a12a-05e10f60e9aa" /> <img width="3564" height="2368" alt="LayerNormBackward_bench" src="https://github.com/user-attachments/assets/cc6673ed-c148-4dd2-a729-5f02e717ab3e" /> <img width="3564" height="2368" alt="LayerNormForward_bench" src="https://github.com/user-attachments/assets/f71f9f9d-7b45-4ce7-89d0-e9bce727efae" /> <img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/e012821a-b7e6-4e83-a24c-c97fa8cd37b5" /> <img width="3564" height="2368" alt="RMSNormForward_bench" src="https://github.com/user-attachments/assets/2d52ee1e-9a8c-4bd1-a180-97b93f07171d" /> <img width="3564" height="2368" alt="SoftmaxBackward_bench" src="https://github.com/user-attachments/assets/02aad056-3ce1-4b40-8cfe-adae81fd017a" /> <img width="3564" height="2368" alt="SoftmaxForward_bench" src="https://github.com/user-attachments/assets/779f6b0d-a102-4164-8300-86fff0329ddf" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158536 Approved by: https://github.com/yf225, https://github.com/eellison	2025-07-19 05:41:01 +00:00
Laith Sakka	64dabb2cf5	only fail regressions>10% on pr_time benchmarks (#158577 ) Moving to a new framework; maintaining the pr_time benchmark test right now is hard and it often breaks. 1. Only fail PRs with >10% regressions. 2. Post-land, monitor with the pr_time benchmarks dashboard (oncall), and update expected results (frequently or on big changes) (supposed to already be doing https://www.internalfb.com/unidash/dashboard/pt2_diff_time_metrics) 3. Set up some detector warnings that would be triggered at regressions and notify internally post-land https://www.internalfb.com/monitoring/detector/1140915271179237 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158577 Approved by: https://github.com/xmfan, https://github.com/janeyx99	2025-07-19 04:35:31 +00:00
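The policy in the commit above (fail only on regressions larger than 10%) combined with the ±2% suggestion threshold from the later commit 6f0f4e0c3e can be sketched as a small classifier. This is an illustration under stated assumptions, not the pr_time benchmark checker: the function name `classify`, its return labels, and the choice to treat only increases as failures are all hypothetical.

```python
def classify(actual, expected, fail_threshold=0.10, suggest_threshold=0.02):
    """Classify a measured value against its expected result (hypothetical helper)."""
    change = (actual - expected) / expected
    # Assumption: only an increase beyond the fail threshold blocks a PR.
    if change > fail_threshold:
        return "fail"
    # Smaller drifts in either direction only suggest refreshing the expectation.
    if abs(change) > suggest_threshold:
        return "suggest_update"
    return "ok"


results = {
    "regressed": classify(111, 100),
    "drifted": classify(103, 100),
    "stable": classify(101, 100),
    "improved": classify(85, 100),
}
```

Keeping the two thresholds as separate parameters mirrors the way the log adjusts them in separate commits.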
Jack Taylor	7ebbf2cae7	Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563 ) (#158550 ) This reverts commit 8554c8007ddaa8029e7e01bb1af12f358bf597c2 #157563 due to causing a few breakages on ROCm Reverted expected_results.csv to 26807dcf277feb2d99ab88d7b6da526488baea93 > @xuanzhang816 Sorry, but I have to revert this PR yet again because it clearly reintroduced failures on ROCm after the remerge: `f4d8bc46c7/2` and the failures are still showing up on tip-of-tree on HUD Context https://github.com/pytorch/pytorch/pull/157563#issuecomment-3083350857 Needs to be relanded in non bc-breaking way, or sanity checked for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158550 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2025-07-17 19:47:41 +00:00
Laith Sakka	306dd19216	update expected results (#158497 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158497 Approved by: https://github.com/xmfan	2025-07-17 00:02:52 +00:00
Xuan Zhang	8554c8007d	[PT2][fusion] ban fusions with large accumulated reads (#157563 ) Problem: Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet ``` total = torch.rand(N, N) for _ in range(r): x = torch.rand(N, N) total = total + x ``` The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like: ``` x_1 = torch.rand(N, N) x_2 = torch.rand(N, N) ... x_r = torch.rand(N, N) total = x_1 + x_2 + ... + x_r ``` Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient. [internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details Solution: Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile. * During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated. * During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`. Results: For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match. 
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563 Approved by: https://github.com/jansel, https://github.com/mlazos	2025-07-16 01:05:25 +00:00
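The memory argument in the commit message above can be made concrete with a little arithmetic. The sketch below follows the message's own counting model (two N-by-N tensors live in the loop version, all `r` inputs live in the fused version) and additionally assumes float32 buffers and that the fused output is a separate buffer. The function names are hypothetical, and nothing here measures real allocator behavior.

```python
BYTES_PER_FLOAT32 = 4  # torch.rand produces float32 by default


def buffer_bytes(n):
    """Size in bytes of one N-by-N float32 buffer."""
    return n * n * BYTES_PER_FLOAT32


def peak_bytes_loop(n):
    # Per the commit message: the running total plus one temporary x.
    return 2 * buffer_bytes(n)


def peak_bytes_fused(n, r):
    # Assumption: all r inputs stay live until the single fused add
    # writes its output buffer.
    return (r + 1) * buffer_bytes(n)


n, r = 1024, 8
loop_mib = peak_bytes_loop(n) / 2**20
fused_mib = peak_bytes_fused(n, r) / 2**20
```

With these inputs the loop version peaks at 8 MiB and the fused version at 36 MiB, and the gap grows linearly with `r`.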
PyTorch MergeBot	26807dcf27	Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563 )" This reverts commit c062550a3598d27c2d6572db7c0f4ff90a84cc84. Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/clee2000 due to broke test_linear_and_cel on main `c062550a35`, caused OOM? Also broken on PR, Dr. CI classification is wrong (claims the test is disabled by an issue but the issue is for a different test). Also I'm pretty sure the expected results json is supposed to have a ton of empty lines, its to prevent merge conflicts, I will add it to the linter ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3074355331))	2025-07-15 16:35:55 +00:00
PyTorch MergeBot	4f36743f5e	Revert "[simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062 )" This reverts commit 5a54db14e3843cfa87fd8d27487dbf2f2dfb6c47. Reverted https://github.com/pytorch/pytorch/pull/158062 on behalf of https://github.com/clee2000 due to sorry I want to revert something else and this is causing a merge conflict, all you should need to do is rebase and remerged ([comment](https://github.com/pytorch/pytorch/pull/158062#issuecomment-3074342140))	2025-07-15 16:31:13 +00:00
IvanKobzarev	5a54db14e3	[simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062 ) Differential Revision: [D78159013](https://our.internmc.facebook.com/intern/diff/D78159013) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158062 Approved by: https://github.com/wconstab	2025-07-15 14:27:57 +00:00
Xuan Zhang	c062550a35	[PT2][fusion] ban fusions with large accumulated reads (#157563 ) Problem: Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet ``` total = torch.rand(N, N) for _ in range(r): x = torch.rand(N, N) total = total + x ``` The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like: ``` x_1 = torch.rand(N, N) x_2 = torch.rand(N, N) ... x_r = torch.rand(N, N) total = x_1 + x_2 + ... + x_r ``` Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient. [internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details Solution: Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile. * During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated. * During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`. Results: For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match. 
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563 Approved by: https://github.com/jansel, https://github.com/mlazos	2025-07-14 22:27:21 +00:00
PyTorch MergeBot	e90148c91d	Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563 )" This reverts commit 4b9a6f7211123511e856ac8c8524bc332a741241. Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it might contribute to a string of OOM error in trunk ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3064678929))	2025-07-12 04:52:11 +00:00
Xuan Zhang	4b9a6f7211	[PT2][fusion] ban fusions with large accumulated reads (#157563 ) Problem: Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet ``` total = torch.rand(N, N) for _ in range(r): x = torch.rand(N, N) total = total + x ``` The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like: ``` x_1 = torch.rand(N, N) x_2 = torch.rand(N, N) ... x_r = torch.rand(N, N) total = x_1 + x_2 + ... + x_r ``` Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient. [internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details Solution: Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile. * During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated. * During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`. Results: For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match. 
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563 Approved by: https://github.com/jansel, https://github.com/mlazos	2025-07-11 21:07:57 +00:00
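Illustrative sketch (not part of the commit above): the message describes two limits on accumulated reads, one on the number of buffers and one on their total size. The function below is a hypothetical, pure-Python model of that decision. The name `should_realize`, the 1 GiB size limit, and the strict `>` comparisons are all assumptions made for illustration; only the count default of 8 comes from the commit message, and this is not Inductor's actual code.

```python
from typing import Sequence


def should_realize(
    read_sizes_bytes: Sequence[int],
    count_threshold: int = 8,              # default quoted in the commit message
    size_threshold_bytes: int = 1 << 30,   # hypothetical 1 GiB limit, for illustration only
) -> bool:
    """Return True if accumulated reads should be cut off (no further fusion).

    Hypothetical model: stop accumulating when either the number of
    buffers or their combined size exceeds its threshold.
    """
    too_many = len(read_sizes_bytes) > count_threshold
    too_large = sum(read_sizes_bytes) > size_threshold_bytes
    return too_many or too_large


# One float32 N-by-N tensor with N = 8192 occupies 8192 * 8192 * 4 bytes (256 MiB).
n_bytes = 8192 * 8192 * 4

# Five such buffers stay under the count limit but exceed the size limit.
large_few = should_realize([n_bytes] * 5)

# Three tiny buffers pass both checks, so fusion may continue.
small_few = should_realize([1024] * 3)

# Nine tiny buffers exceed the count limit even though they are small.
small_many = should_realize([1024] * 9)
```

The `large_few` case is the one a count-only check would miss: five buffers are below the count limit of 8, yet together they hold 1.25 GiB.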
Xuehai Pan	af3d069094	[BE][Easy] remove unused build-time dependency `astunparse` and change `astunparse.unparse` -> `ast.unparse` (#157907 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157907 Approved by: https://github.com/Skylion007	2025-07-10 07:04:42 +00:00
Ryan Guo	f742b32a2f	[dynamo] Avoid recompiling over unused objects (#156891 ) Dynamo was aggressively specializing on lazy VTs over `set_name_hint` in `STORE_FAST`, etc., and `isinstance` in `LOAD_FAST_CHECK`. This causes regional `torch.compile`, when optimizing ComfyUI GGUF + LoRA, to either (1) exceed the recompilation limit of 8, which results in suboptimal performance, or (2) even if the recompilation limit is increased, incur unnecessarily high compilation time (180s vs. 20s for Flux). This patch fixes the recompilation issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156891 Approved by: https://github.com/williamwen42, https://github.com/mlazos	2025-07-09 20:14:34 +00:00
Weishi.Deng	44d0800d60	[Intel GPU] Set higher tolerance for squeezenet1_1 with bf16 (#156920 ) We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device. This pull request preserves the original tolerance threshold for CUDA/CPU devices and introduces a new key, higher_bf16_xpu, which only affects the XPU device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156920 Approved by: https://github.com/soulitzer	2025-07-08 17:49:54 +00:00
Jason Ansel	41f6acef83	Update pr_time_benchmarks expected results (#157214 ) The job has been unstable Pull Request resolved: https://github.com/pytorch/pytorch/pull/157214 Approved by: https://github.com/laithsakka	2025-06-29 19:12:13 +00:00
Laith Sakka	e6ed4074e8	update expected results (#157010 ) <img width="1490" alt="Screenshot 2025-06-26 at 12 30 46 PM" src="https://github.com/user-attachments/assets/4df626d4-3010-4362-974c-fb96fa68b29f" /> <img width="904" alt="Screenshot 2025-06-26 at 12 28 29 PM" src="https://github.com/user-attachments/assets/42626892-27e1-4e69-9efc-c9baf80c5384" /> <img width="752" alt="Screenshot 2025-06-26 at 12 29 05 PM" src="https://github.com/user-attachments/assets/0b1afb30-5868-4ba6-9985-2cc7994a4227" /> PR https://github.com/pytorch/pytorch/pull/152011 added a slight regression. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157010 Approved by: https://github.com/zou3519	2025-06-26 21:56:57 +00:00
Laith Sakka	85df746892	refresh expected numbers (#156877 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156877 Approved by: https://github.com/huydhn	2025-06-26 00:03:09 +00:00
IvanKobzarev	313a6a8ef9	[pt2][pr_time_benchmarks] Refresh instructions count after disabled test (#156738 ) https://github.com/pytorch/pytorch/issues/153987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156738 Approved by: https://github.com/laithsakka	2025-06-24 23:45:02 +00:00
Tom Ritchford	e2c9d8d641	Fix non-bitwise type annotations for Tensor operators (see #145838 ) (#146845 ) Fix https://github.com/pytorch/pytorch/issues/145838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146845 Approved by: https://github.com/Skylion007	2025-06-24 15:41:34 +00:00
PyTorch MergeBot	e600e044a7	Revert "[aotd] Support mutations of the same input in fw and bw (#155354 )" This reverts commit 3f920f3d8f5bd15d2222758f21f9a5d36e4dad1f. Reverted https://github.com/pytorch/pytorch/pull/155354 on behalf of https://github.com/malfet due to Not sure why CI was green, but it breaks tons of tests, see `930b575389/1` ([comment](https://github.com/pytorch/pytorch/pull/155354#issuecomment-2998780884))	2025-06-24 04:42:14 +00:00
IvanKobzarev	3f920f3d8f	[aotd] Support mutations of the same input in fw and bw (#155354 ) Original issue: https://github.com/pytorch/pytorch/issues/154820 The issue happens when the same input is mutated in forward AND in backward. AOTD emitted copy_ after joint_function tracing. This made this fx-node correspond to the side effects of both mutations (in forward and in backward). After that, the partitioner can put it either in forward or in backward. The fix: 1/ Introduce joint_function.handle, which allows setting a "post_forward" callback, to be able to check the state of the inputs after forward. We do not want to apply the mutation after joint if we already applied it in forward. For that we need a "mutation_counter", and we memorize the version of the mutation that we applied for the forward mutation. 2/ Expose mutation_counter to Python. We want to keep the invariant that copy_ exists only at the end of the joint graph. 3/ We memorize mutation_counter and the state of the inputs after forward, using the post_forward handle. Emit post_forward mutations after the joint graph is fully traced. Add a "must_be_in_forward" tag (similar to the existing "must_be_in_backward") to post_forward mutations to keep them in forward. 4/ Ban recompute of the source of the mutation. Recompute can apply the same op (e.g. add) in forward and backward. For this, set MUST_SAVE for the source of the mutation in forward. proxy_tensor changes: By default, proxy tensor updates tensor_tracker. In this case, applied mutations will be chained. But we want this copy_ to be independent and applied just to primals. For this, we introduce a contextmanager to disable the update of tensor_tracker when adding forward mutations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354 Approved by: https://github.com/bdhirsh	2025-06-23 22:25:45 +00:00
Boyuan Feng	a95504b10f	[torchbench] update environment setup script (#156465 ) Existing torchbench `Makefile` installs all models from torchbench, which could easily take 30 minutes, even if a developer only wants to run 1 model. This PR adds a config to only install the torchbench models we want to run. Example usage: ``` # Install 1 torchbench model make build-deps TORCHBENCH_MODELS="alexnet" # Install 3 torchbench models make build-deps TORCHBENCH_MODELS="alexnet basic_gnn_gcn BERT_pytorch" # Install all models make build-deps # Install all models make build-deps TORCHBENCH_MODELS="" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/156465 Approved by: https://github.com/ezyang	2025-06-23 17:41:29 +00:00
Edward Yang	333e0e6147	Make build-deps drop builds into current venv again (#156200 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/156200 Approved by: https://github.com/malfet	2025-06-22 00:45:02 +00:00
Xuan Zhang	c2d1b225e6	[PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809 ) Problem & Solution: Assume we have something like: ``` x = some_op(...) x0 = x[0] do_something_with_and_is_last_use_of(x0) do_a_bunch_of_other_things() x1 = x[1] ``` In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)`, such as: ``` x = some_op(...) x0 = x[0] x1 = x[1] do_something_with_and_is_last_use_of(x0) do_a_bunch_of_other_things() ``` Results: For instance, for the `res2net101_26w_4s` model in pytorch benchmark, when running with the `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory is * baseline: 7.73GiB * with the change: 6.45GiB As a sanity check, for the same setting with the `inductor` backend, the peak memory is not regressed. cc and credit to @ShatianWang for noticing this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809 Approved by: https://github.com/fmassa, https://github.com/bdhirsh	2025-06-21 19:57:21 +00:00
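Illustrative sketch (not part of the commit above): the reordering described in the message can be modeled on a plain list of nodes. The `Node` type and the `raise_getitems` function are hypothetical names invented for this sketch, and the code works on a toy representation, not on the partitioner's real graph data structures.

```python
from typing import Dict, List, NamedTuple, Tuple


class Node(NamedTuple):
    name: str                # value produced by this node
    op: str                  # "getitem" or any other op name
    inputs: Tuple[str, ...]  # names of the values this node reads


def raise_getitems(nodes: List[Node]) -> List[Node]:
    """Move every getitem to directly after the node that produces its source."""
    # Only non-getitem nodes in this list count as producers; getitems whose
    # source is defined elsewhere (or is itself a getitem) stay where they are.
    producers = {n.name for n in nodes if n.op != "getitem"}

    pending: Dict[str, List[Node]] = {}
    for n in nodes:
        if n.op == "getitem" and n.inputs[0] in producers:
            pending.setdefault(n.inputs[0], []).append(n)

    out: List[Node] = []
    for n in nodes:
        if n.op == "getitem" and n.inputs[0] in producers:
            continue  # already emitted right after its producer
        out.append(n)
        out.extend(pending.get(n.name, []))
    return out


# The program from the commit message, in its original order.
program = [
    Node("x", "some_op", ()),
    Node("x0", "getitem", ("x",)),
    Node("use_x0", "do_something_with_and_is_last_use_of", ("x0",)),
    Node("other", "do_a_bunch_of_other_things", ()),
    Node("x1", "getitem", ("x",)),
]

order = [n.name for n in raise_getitems(program)]
```

After the pass, `order` places `x1` directly after `x0`, matching the second snippet in the commit message. The relative order of all non-`getitem` nodes is unchanged.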
PyTorch MergeBot	754c04aa06	Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564 )" This reverts commit 0aed855b2bde6d9bd045bb20cc24544a9f2fb72b. Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/ezyang due to regresses functorch_maml_omniglot ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2992685744))	2025-06-20 20:18:24 +00:00
PyTorch MergeBot	96d082d06b	Revert "[InductorBench] Fix accuracy validation logic for MPS (#156385 )" This reverts commit 242eb19c8383b4b197963a8a564475d52c85ac66. Reverted https://github.com/pytorch/pytorch/pull/156385 on behalf of https://github.com/malfet due to Has some bug in error handling ([comment](https://github.com/pytorch/pytorch/pull/156385#issuecomment-2992441769))	2025-06-20 18:17:18 +00:00

1211 Commits