pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 05:34:18 +08:00

Author	SHA1	Message	Date
Yiming Zhou	9d882fd9ff	[benchmark] Add torchscript jit.trace to benchmark option (#161223 ) For comparing NativeRT and TorchScript. We add `torchscript-jit-trace` as an option in the benchmark. With this option, we can run trace a model and run inference with the traced module using TorchScript interpreter ``` python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223 Approved by: https://github.com/huydhn	2025-08-22 21:38:28 +00:00
Sun, Jiayi	95e456fcc5	[inductor] pack linear for FP32 dynamic mode (#157542 ) Summary: Currently, Linear in FP32 dynamic mode(batch_size has free symbols) does not support weight prepacking since MKL Linear does not support dynamic mode. This PR uses oneDNN Linear to support Linear weight prepacking in FP32 dynamic mode. I tested the Inductor benchmark in FP32 dynamic mode on CPU using this PR, and saw ~8% improvement in timm_models geomean speedup, ~2% improvement in torchbench geomean speedup, and no change in huggingface. There are about 18 models with different degrees of performance improvement, among which BERT_pytorch, soft_actor_critic, BlenderbotForCausalLM, ElectraForCausalLM, crossvit_9_240, mobilevit_s, twins_pcpvt_base have more than 20% performance improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157542 Approved by: https://github.com/CaoE, https://github.com/jansel	2025-08-18 10:18:46 +00:00
Animesh Jain	01bcf9a40d	Bump transformers pin (#159291 ) Trying to update hf pin. Benchmarking run to figure out issues <img width="1356" height="123" alt="image" src="https://github.com/user-attachments/assets/fbc435f3-a7cb-4280-9636-2ea6d15d7b6d" /> Retrying - https://github.com/pytorch/pytorch/pull/156118 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159291 Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-08-12 05:14:17 +00:00
Yiming Zhou	017259f9c6	[benchmarks] Add nativert benchmark (#159922 ) Add NativeRT as an option in the PT2 OSS benchmark ``` python ./benchmarks/dynamo/huggingface.py --performance --inference --export-nativert python ./benchmarks/dynamo/timm_models.py --performance --inference --export-nativert python ./benchmarks/dynamo/torchbench.py --performance --inference --export-nativert ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159922 Approved by: https://github.com/angelayi	2025-08-08 03:38:32 +00:00
Aaron Orenstein	3f86076775	gc before warming up benchmarking (#159670 ) #158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark is leaving around before the benchmark starts - so GC before warming up the model. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670 Approved by: https://github.com/oulgen	2025-08-02 19:37:24 +00:00
Animesh Jain	8c0c5c58c7	[benchmarks] Set model name early to keep warmup and main model same (#159231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159231 Approved by: https://github.com/williamwen42 ghstack dependencies: #159209	2025-07-28 18:18:16 +00:00
James Wu	f55c5d085e	[Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847 ) This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks. The following bugfixes are in this PR to make all of this work: - Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes) - Return None from PRecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming. - log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file. ## Test Plan After this PR, the following now works: ``` TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance --inference --backend inductor --caching-precompile --warm-start-latency ``` tlparse result (internal): Cold Start (6 seconds): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 Warm Start (~1 s): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847 Approved by: https://github.com/zhxchen17	2025-07-24 14:09:54 +00:00
Aditya Tewari	7001d6fbc9	Skip slow tests for aarch64-inductor-benchmarks (#158842 ) This PR suggests adding some models to `cpu_skip_list` which are currently being run in TIMM and Torchbench. The suggested models takes a long time which leads to the benchmark runs being `timeout`. [benchmark runs for aarch64](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-aarch64.yml) • The issue stems from unoptimized groupwise convolution (BF16 /F16 dtype) kernels for aarch64 platforms , which significantly slow down execution leading to the timeout. Action: • An optimized BF16 groupwise convolution kernel is currently being developed in oneDNN, targeted for release in Q4 2025. To maintain dashboard consistency and signal clarity, I’ve skipped the affected tests in: * timm benchmarks * torchbench benchmarks As suggested, skip is applied at the CPU - arch level, explicitly branching for aarch64 and adding models which needs to be skipped. This keeps the logic clean, but: • An alternative considered was increasing shard counts for aarch64 runners, but given the known performance bottleneck, skipping avoids wasted compute cycles. Suggestions around this will be appreciated. Benchmark does not timeout after the suggested change: https://github.com/pytorch/pytorch/actions/runs/16447200138 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158842 Approved by: https://github.com/malfet	2025-07-24 00:21:38 +00:00
PyTorch MergeBot	76be282e3a	Revert "[Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847 )" This reverts commit d898d0d437bfdc0719e6c69d5005606c5e64fca8. Reverted https://github.com/pytorch/pytorch/pull/158847 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI jobs on MI200 and MI300 ([comment](https://github.com/pytorch/pytorch/pull/158847#issuecomment-3109664713))	2025-07-23 18:25:46 +00:00
James Wu	d898d0d437	[Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847 ) This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks. The following bugfixes are in this PR to make all of this work: - Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes) - Return None from PRecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming. - log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file. ## Test Plan After this PR, the following now works: ``` TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance --inference --backend inductor --caching-precompile --warm-start-latency ``` tlparse result (internal): Cold Start (6 seconds): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 Warm Start (~1 s): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847 Approved by: https://github.com/zhxchen17	2025-07-23 15:06:54 +00:00
Benjamin Glass	22920c9138	Grab bag of (mostly) typing improvements (#158075 ) Collects some scattershot improvements made while attempting to enable training for AOTInductor. Non-typing changes are: 1. Swapping a few custom searches for the output node in an FX graph for calling `graph.output_node()`. 2. Removing two unused parameters from `torch.export._unlift._unlift`. 3. Switching handles to constants in `cpp_wrapper_cpu` to use C++ references for memory efficiency. 4. Cleaning out unused, unexported imports from `torch/export/__init__.py`, and adding one missing export to `__all__`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158075 Approved by: https://github.com/Skylion007	2025-07-21 19:17:01 +00:00
PyTorch MergeBot	96d082d06b	Revert "[InductorBench] Fix accuracy validation logic for MPS (#156385 )" This reverts commit 242eb19c8383b4b197963a8a564475d52c85ac66. Reverted https://github.com/pytorch/pytorch/pull/156385 on behalf of https://github.com/malfet due to Has some bug in error handling ([comment](https://github.com/pytorch/pytorch/pull/156385#issuecomment-2992441769))	2025-06-20 18:17:18 +00:00
Nikita Shulga	242eb19c83	[InductorBench] Fix accuracy validation logic for MPS (#156385 ) As it does not support full fp64, validate against float32 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156385 Approved by: https://github.com/Skylion007	2025-06-19 05:37:51 +00:00
Xuehai Pan	42015db6a9	[BE] fix typos in benchmarks/ (#156077 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156077 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #156069	2025-06-17 13:12:18 +00:00
Oguz Ulgen	a2a75be0f8	Rename inductor cache (#156128 ) Requested by Simon on a different PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128 Approved by: https://github.com/xmfan	2025-06-17 03:57:18 +00:00
Oguz Ulgen	d1947a8707	Migrate from lru_cache to cache (#155613 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613 Approved by: https://github.com/ezyang ghstack dependencies: #155612	2025-06-11 19:44:18 +00:00
bobrenjc93	28f27886eb	Vary batch size when running dynamic shapes benchmarks (#154805 ) This better measures the actual runtime performance of dynamic shapes where we aren't guaranteed to have similar shapes as the original hint. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154805 Approved by: https://github.com/Skylion007 ghstack dependencies: #154802, #154826, #154822, #154823	2025-06-02 18:56:18 +00:00
bobrenjc93	b90fc2ec27	[ez] delete code that died a long time ago (#154802 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154802 Approved by: https://github.com/Skylion007	2025-06-01 14:57:03 +00:00
Boyuan Feng	514409d032	update torchvision pin (#154255 ) Fixes #153985 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154255 Approved by: https://github.com/desertfire	2025-05-27 16:15:25 +00:00
Eddie Yan	76ed9db468	[cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556 ) Also enables unified workspaces by default for non-FBCODE use cases. Default Lt workspace size is also updated to match cuBLAS logic for default, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0). Recommended defaults are documented here: https://docs.nvidia.com/cuda/cublas/#cublassetworkspace Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556 Approved by: https://github.com/Skylion007, https://github.com/ngimel	2025-05-24 03:43:35 +00:00
Huy Do	7509b150af	Don't upload compiler benchmark debug info to the benchmark database (#153769 ) During our debug session, @wdvr and I found out that the benchmark database is growing much faster than we expect. After taking a closer look, the majority of them coming from TorchInductor benchmark and the top 3 are all debug information not used by any dashboard atm. In the period of 7 days, there are close to 6 millions records ([query](https://paste.sh/GUVCBa0v#UzszFCZaWQxh7oSVsZtfZdVE)) ``` Benchmark,Metric,Count "TorchInductor","user_stack","1926014" "TorchInductor","reason","1926014" "TorchInductor","model","1926014" ``` Let's skip uploading them to avoid bloating the database. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153769 Approved by: https://github.com/malfet	2025-05-23 01:18:26 +00:00
PyTorch MergeBot	3443627e07	Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473 )" This reverts commit 4f4ecc583e0f48ad2d062a53bf91c61ab40b4948. Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))	2025-05-16 08:29:26 +00:00
Nikita Shulga	754b758ea1	[BE] Extend empty_gpu_cache to mps (#153657 ) And replace `if: elif:` with `getattr()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/153657 Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/ZainRizvi	2025-05-16 01:08:54 +00:00
Aaron Gokaslan	4f4ecc583e	[BE]: Enable RUFF TRY400 rule - log.exception (#153473 ) Change logging.error to logging.exception to log additional information when relevant. A few places have slipped in logging.errors in try except since I last did a clean up here and the rule is stabilized so I am enabling it codebase wide. I have NOQA'd much of our custom exception stack trace handling for RPC calls and distributed and tried to a fix a few errors based on whether we immediately reraised it or if we didn't print any exception handling where it could be useful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473 Approved by: https://github.com/albanD, https://github.com/cyyever	2025-05-15 13:36:59 +00:00
Benjamin Glass	e8596c291b	Fix misleadingly high AOT Inductor dashboard performance (#153060 ) Fixes misleadingly high AOTInductor performance benchmark numbers in scenarios where a model updates internal parameters during `torch.export.export`. Since `FakeTensorMode` is enabled during export, all such parameters become `FakeTensor`s, slowing down future eager-mode runs using that model substantively. This, in turn, causes misleading performance stats, where the slowness of eager-mode makes `AOTInductor` look _very_ good. An [example benchmark](https://hud.pytorch.org/benchmark/timm_models/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2030%20Apr%202025%2015%3A54%3A04%20GMT&stopTime=Wed%2C%2007%20May%202025%2015%3A54%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=1dd36ad2d440a4f3faf724b3a8e13925e3180c24&rBranch=main&rCommit=cc7346bf19c019255dcb4484694a75850ed74d5a&model=convit_base) with this issue. The equivalent `cpp_wrapper` benchmark run shows a 2x performance gain, not 20x. Only two benchmarks we regularly run are affected by this, both in the TIMM set. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153060 Approved by: https://github.com/desertfire	2025-05-13 20:59:59 +00:00
Aaron Gokaslan	3555ebb63d	[BE]: Update ruff to 0.11.8 (#153249 ) Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249 Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere	2025-05-12 18:30:52 +00:00
Gabriel Ferns	ce00ec7ecf	Enable max autotune for AOTInductor benchmark (#149309 ) With this PR, AOTinductor can choose to run into max-autotune mode when benchmarking. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149309 Approved by: https://github.com/desertfire Co-authored-by: Gabriel Ferns <gabeferns@meta.com>	2025-04-28 06:54:26 +00:00
Eddie Yan	dcc32ff5bf	[CUDA][cuBLAS][cuBLASLt] Opt-in unified cuBLAS + cuBLASLt workspaces (#151163 ) opt-in version of https://github.com/pytorch/pytorch/pull/145130 as there was a lack of repro for the 70% forward issue `TORCH_CUBLASLT_UNIFIED_WORKSPACE=1` @izaitsevfb could you comment if it was repeatable per every forward pass, on startup, or something else? Pull Request resolved: https://github.com/pytorch/pytorch/pull/151163 Approved by: https://github.com/ngimel	2025-04-23 15:24:22 +00:00
Zhang, Jianyi	a756c50315	[Intel GPU] Avoid using fp32 in sdp math path when benchmark performance. (#150996 ) sdp on xpu will fallback to math path in some cases (i.e. training). In dynamo benchmark, we prefer to use fp16 for better performance. Although `allow_fp16_bf16_reduction_math_sdp` is under backends.cuda, its implementation is for all device. I didn't add if device == xpu here, I suppose cuda devices will not run into math path anyway Pull Request resolved: https://github.com/pytorch/pytorch/pull/150996 Approved by: https://github.com/drisspg, https://github.com/EikanWang	2025-04-15 08:08:01 +00:00
Bin Bao	6a8ab902a2	[AOTI][dashboard] Fix mis-calculated memory compression ratio (#150695 ) Summary: https://github.com/pytorch/pytorch/pull/149817 introduced an extra warmup run to compute AOTI memory compression ratio, but since weights are only loaded once in the AOTI run, the peak memory seen in the extra warmup won't include the weight, which causes an aritifically high memory compression ratio. This PR removes that extra warmup run, and calls reset_peak_memory_stats in the proper place instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150695 Approved by: https://github.com/yushangdi	2025-04-06 19:51:22 +00:00
Bin Bao	d4c30b4599	[AOTI][dashboard] Update how peak memory is measured (#150534 ) Summary: In the dashboard measurement script, AOTI needs to run Eager first to register the output pytree, so the peak memory compression ratio on the dashboard is always close to 1. Update AOTI run to use an extra warmup run, so the peak memory compression ratio measures the result at the run time instead of the compile time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150534 Approved by: https://github.com/yushangdi	2025-04-03 12:01:43 +00:00
PyTorch MergeBot	203a27e0ce	Revert "[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 )" This reverts commit 8f7fbe3d7d2cd301df48fcbe8a14f8aa1a9c1e48. Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/clee2000 due to reverted internally by D72140190 ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2770874244))	2025-04-01 23:07:28 +00:00
Xuehai Pan	a10b765bf1	[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 ) Changes in this PR: 1. Add `is_structseq` and `is_structseq_class` functions to determine a object or a class is PyStructSequence. 2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types like `namedtuple` for Named Tuple types. 3. Change `is_namedtuple` to accept subclasses of namedtuple to be namedtuple. Before this PR, only namedtuple class directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple class. Resolves #75982. New tests are included in this PR. - #75982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257 Approved by: https://github.com/zou3519	2025-04-01 10:40:43 +00:00
Zhang, Jianyi	0f12951fc2	[Intel gpu] always set deterministic for xpu accuracy test (#149028 ) On Intel Max 1550, models like Super_SloMo can actually pass accuracy test after set deterministic, because we do not use atomic in upsampling bilinear backward in some cases when running on XPU. Furthermore, I guess the only reason not to set deterministic on these models is just avoiding errors. We should use warn_only = True. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149028 Approved by: https://github.com/guangyey, https://github.com/desertfire Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-04-01 01:00:11 +00:00
Simon Fan	86ee3bf3d5	[ca] use torch.compile ca API for benchmarks (#149647 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149647 Approved by: https://github.com/jansel	2025-03-24 19:06:45 +00:00
eqy	8f7fbe3d7d	[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 ) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925 + fixes behavior broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130 Approved by: https://github.com/ngimel	2025-03-22 05:50:11 +00:00
Pian Pawakapan	e0e8639a10	[torchbench] fix dynamic_shapes spec for moco (#148772 ) Fixes https://github.com/pytorch/pytorch/issues/148333 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148772 Approved by: https://github.com/yushangdi, https://github.com/desertfire	2025-03-18 18:16:54 +00:00
PyTorch MergeBot	f9b4856989	Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 )" This reverts commit c95a6b416b4d1b830535f82e2719c055d077cbad. Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @zou3519 can you please help land this internally? See the sigmoid tests in D71198793 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2725982539))	2025-03-14 23:13:34 +00:00
Xuehai Pan	c95a6b416b	[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 ) Changes in this PR: 1. Add `is_structseq` and `is_structseq_class` functions to determine a object or a class is PyStructSequence. 2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types like `namedtuple` for Named Tuple types. 3. Change `is_namedtuple` to accept subclasses of namedtuple to be namedtuple. Before this PR, only namedtuple class directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple class. Resolves #75982. New tests are included in this PR. - #75982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257 Approved by: https://github.com/zou3519	2025-03-14 08:50:30 +00:00
Rengan Xu	da4bb72a71	Backout D70075331 (#148824 ) Summary: The AOTI lowering for model 699109736 and other new models worked before D70075331, but failed after with error "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 4096 n 10 k 7936 mat1_ld 7936 mat2_ld 7936 result_ld 4096 abcType 2 computeType 68 scaleType 0" So we revert D70075331 as a workaround now. Test Plan: The model could be lowered and published successfully. e.g. 702869739_16 Differential Revision: D70823254 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148824 Approved by: https://github.com/eqy	2025-03-11 12:51:17 +00:00
PyTorch MergeBot	ebd087e4b5	Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 )" This reverts commit f08146b67bab331f7bdc9fa247f526f6e60a7190. Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830))	2025-03-10 17:19:21 +00:00
Xuehai Pan	f08146b67b	[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 ) Changes in this PR: 1. Add `is_structseq` and `is_structseq_class` functions to determine a object or a class is PyStructSequence. 2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types like `namedtuple` for Named Tuple types. 3. Change `is_namedtuple` to accept subclasses of namedtuple to be namedtuple. Before this PR, only namedtuple class directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple class. Resolves #75982. New tests are included in this PR. - #75982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257 Approved by: https://github.com/zou3519	2025-03-06 18:59:02 +00:00
Bin Bao	d10bacd4ce	[AOTI][dashboard] Skip torchbench models not supported by export (#148359 ) Summary: Certain models fail in export because of data-dependent ops. Skip them so that oncall can better track the AOTInductor dashboard. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148359 Approved by: https://github.com/angelayi, https://github.com/ysiraichi	2025-03-06 18:08:17 +00:00
Boyuan Feng	6e10471966	[ci] disable cudagraph for tts_angular on dashboard (#148221 ) tts_angular with cudagraph is flaky. Its speedup varies from .05 to 1.01. This PR disables cudagraph for tts_angular to avoid the noise. Since tts_angular shows ~1x speedup while other torchbench models show ~2x speedup, skipping tts_angular would wrongly bump the cudagraph speedup. So this PR only disables cudagraph for tts_angular instead of skipping tts_angular. [Dashboard ](https://github.com/pytorch/pytorch/actions/runs/13597394087) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148221 Approved by: https://github.com/eellison	2025-03-02 03:31:19 +00:00
Xuehai Pan	c73a92fbf5	[BE][CI] bump `ruff` to 0.9.2: multiline `assert` statements (#144546 ) Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements > Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target: > > ```python > # Input > assert ( > len(policy_types) >= priority + num_duplicates > ), f"This tests needs at least {priority+num_duplicates} many types." > > > # Black > assert ( > len(policy_types) >= priority + num_duplicates > ), f"This tests needs at least {priority+num_duplicates} many types." > > # Ruff > assert len(policy_types) >= priority + num_duplicates, ( > f"This tests needs at least {priority + num_duplicates} many types." > ) > ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546 Approved by: https://github.com/malfet	2025-02-27 20:46:16 +00:00
Katarzyna Fojcik	edaf9ddeb5	Add basic Gaudi support to benchmarks/dynamo (#145920 ) This PR adds basic Gaudi support to benchmarks/dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/145920 Approved by: https://github.com/eellison	2025-02-26 14:50:22 +00:00
eqy	718cf68aee	[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 ) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925 + fixes behavior broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130 Approved by: https://github.com/ngimel	2025-02-23 22:01:39 +00:00
Aaron Orenstein	086d146f6f	Update ruff linter for PEP585 (#147540 ) This turns on PEP585 enforcement in RUFF. - Updates the target python version - Stops ignoring UP006 warnings (PEP585) - Fixes a few issues which crept into the tree in the last day Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540 Approved by: https://github.com/justinchuby, https://github.com/Skylion007	2025-02-22 04:45:17 +00:00
Animesh Jain	71484a2106	[pt2-benchmarks] Compiler reset on every run (#147313 ) Internal benchmarks call `run` in a loop. Compiler reset gives a clean env Pull Request resolved: https://github.com/pytorch/pytorch/pull/147313 Approved by: https://github.com/jansel	2025-02-18 02:09:19 +00:00
PyTorch MergeBot	80a1696679	Revert "[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 )" This reverts commit 5f0901e57341eb9865102c1caa3d986a0c4ae3bd. Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/atalman due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2644122846))	2025-02-07 21:04:23 +00:00

1 2 3 4 5 ...

521 Commits