39df901b2a
introduce definitely_contiguous and use it for reshape and tensor meta data computation. ( #153432 )
...
When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors.
In that case we can't really evaluate is_contiguous. In many places in the code base we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors; in those cases we probably want
to use the definitely_contiguous API.
This PR applies that to reshape and also to tensor metadata computation: the metadata now has an attribute that marks a tensor as contiguous only when it is always contiguous, and we store that only if definitely_contiguous is true.
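The distinction can be sketched in plain Python. This is an illustrative stand-in, not the actual symbolic-shapes implementation; strings play the role of unbacked symbols here:

```python
# Illustrative sketch (not PyTorch's implementation) of the difference
# between a guarding is_contiguous check and a definitely_contiguous-style
# check that never guards. Strings stand in for unbacked symbols.

def is_contiguous(sizes, strides):
    # Classic check: compare strides against the packed (row-major) strides.
    # With unbacked symbols this comparison would have to guard, i.e. commit
    # to one branch even though both are possible.
    expected = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True

def definitely_contiguous(sizes, strides):
    # Returns True only when contiguity can be proven for all possible
    # values of the symbolic dims; otherwise falls back to False so callers
    # take the general path that works for both layouts.
    expected = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if isinstance(size, str) or isinstance(stride, str):
            # Unbacked symbol: undecidable, so "not definitely contiguous".
            return False
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True
```

Callers that previously branched on `is_contiguous` for a fast path can use the `definitely_contiguous` answer safely: a `False` merely routes them to the general path.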
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-28 03:41:26 +00:00
514409d032
update torchvision pin ( #154255 )
...
Fixes #153985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154255
Approved by: https://github.com/desertfire
2025-05-27 16:15:25 +00:00
3f64502c98
Revert "Re-enable FakeTensor caching for SymInts ( #152662 )"
...
This reverts commit 7d11c61c26c596076613aa0111892f7cbccae32e.
Reverted https://github.com/pytorch/pytorch/pull/152662 on behalf of https://github.com/malfet due to Looks like it broke a bunch of inductor tests, see 187d38185e/1
([comment](https://github.com/pytorch/pytorch/pull/152662#issuecomment-2910293593 ))
2025-05-26 17:13:22 +00:00
7d11c61c26
Re-enable FakeTensor caching for SymInts ( #152662 )
...
Summary:
This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.
There have been a lot of dynamic shape fixes this year and tests pass, so I'm assuming some of that work fixed what was breaking previously.
Test Plan: Reran the tests listed in T196779132 and they pass.
## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-26 04:17:56 +00:00
76ed9db468
[cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt ( #153556 )
...
Also enables unified workspaces by default for non-FBCODE use cases.
The default Lt workspace size is also updated to match the cuBLAS default logic, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).
Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007 , https://github.com/ngimel
2025-05-24 03:43:35 +00:00
9e089bb5b6
change guard_or impl for better perf and simplicity ( #153674 )
...
PR time benchmarks have been showing regressions as we move to guard_or_false; the reason is that the previous implementation did not cache.
The new approach propagates the fallback value down to eval and returns it, allowing eval to cache and reducing log spam and complexity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153674
Approved by: https://github.com/bobrenjc93
2025-05-23 15:24:28 +00:00
7509b150af
Don't upload compiler benchmark debug info to the benchmark database ( #153769 )
...
During our debug session, @wdvr and I found out that the benchmark database is growing much faster than we expect. After taking a closer look, the majority of the records come from the TorchInductor benchmark, and the top 3 are all debug information not used by any dashboard atm. In the period of 7 days, there are close to 6 million records ([query](https://paste.sh/GUVCBa0v#UzszFCZaWQxh7oSVsZtfZdVE ))
```
Benchmark,Metric,Count
"TorchInductor","user_stack","1926014"
"TorchInductor","reason","1926014"
"TorchInductor","model","1926014"
```
Let's skip uploading them to avoid bloating the database.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153769
Approved by: https://github.com/malfet
2025-05-23 01:18:26 +00:00
768cb734ec
cpp_wrapper: build non-performance-sensitive code at O1 ( #148773 )
...
Builds on #148212 , applying the same improvements to `cpp_wrapper` mode.
Benchmark results:
* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13 )
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-23 00:51:20 +00:00
6cd9d66b7f
Allow higher fp16 tolerance for phlippe_resnet on CUDA 12.8 ( #154109 )
...
After https://github.com/pytorch/pytorch/pull/154004 , one of the models, `phlippe_resnet`, needs a higher tolerance for fp16 on CUDA 12.8. I can reproduce it locally with:
```
python benchmarks/dynamo/torchbench.py --accuracy --timing --explain --print-compilation-time --inductor --device cuda --training --amp --only phlippe_resnet
E0522 02:47:12.392000 2130213 site-packages/torch/_dynamo/utils.py:2949] RMSE (res-fp64): 0.00144, (ref-fp64): 0.00036 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000, use_larger_multiplier_for_smaller_tensor: 0
```
I'm not sure what exactly happens behind the scenes, but this should help fix the CI failure.
Also remove some leftover expected accuracy results for CUDA 12.4, which we no longer use on CI for benchmark jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154109
Approved by: https://github.com/Skylion007 , https://github.com/malfet
2025-05-22 14:25:12 +00:00
254293b777
Add flag _metrics_log_runtime to disable runtime metric logging by default ( #153506 )
...
https://github.com/pytorch/pytorch/pull/152708 expanded support of `get_estimated_runtime` to many more types of `SchedulerNodes`. This caused an increase in compile time because we're always calling `get_estimated_runtime` to populate the metrics table. This PR adds a flag for this logging, which reduces the instruction count by 8%. Long term, we should probably merge metrics.py with TORCH_LOGS/tlparse (suggestion from @xmfan).
Update: added support for TORCH_LOGS for the metrics logging.
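The compile-time win comes from gating the expensive estimate behind the flag. A generic sketch of the pattern; the flag name comes from the PR, while the helper and its signature are hypothetical:

```python
# Sketch of gating an expensive computation behind a logging flag. Passing a
# callable means the runtime estimate is only computed when logging is on,
# which is where the compile-time savings come from.

def maybe_log_runtime(node_name, get_estimated_runtime, enabled=False):
    # `enabled` plays the role of the _metrics_log_runtime flag (off by
    # default). When disabled, the estimator is never called.
    if not enabled:
        return None
    return (node_name, get_estimated_runtime())
```

With the flag off, even a very slow `get_estimated_runtime` costs nothing, since the callable is never invoked.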
Test Plan:
mm_loop.py and many existing tests cover.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153506
Approved by: https://github.com/eellison
2025-05-22 01:02:11 +00:00
996c4d803d
Removing conda references from PyTorch Docs ( #152702 )
...
Addresses #148339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152702
Approved by: https://github.com/svekars , https://github.com/albanD , https://github.com/atalman
2025-05-20 20:33:28 +00:00
3443627e07
Revert "[BE]: Enable RUFF TRY400 rule - log.exception ( #153473 )"
...
This reverts commit 4f4ecc583e0f48ad2d062a53bf91c61ab40b4948.
Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075 ))
2025-05-16 08:29:26 +00:00
4d073af58c
Revert "[inductor][dynamo] Include operator name in size/stride/alignment assertion ( #152353 )"
...
This reverts commit 725bbb6b5fffa2f2d219a0692ed27e376c9dd48a.
Reverted https://github.com/pytorch/pytorch/pull/152353 on behalf of https://github.com/jeanschmidt due to seems to have broken a few internal tests, @jansel may you help the author get his PR merged? ([comment](https://github.com/pytorch/pytorch/pull/152353#issuecomment-2885997862 ))
2025-05-16 08:20:39 +00:00
754b758ea1
[BE] Extend empty_gpu_cache to mps ( #153657 )
...
And replace `if: elif:` with `getattr()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153657
Approved by: https://github.com/atalman , https://github.com/wdvr , https://github.com/ZainRizvi
2025-05-16 01:08:54 +00:00
4f4ecc583e
[BE]: Enable RUFF TRY400 rule - log.exception ( #153473 )
...
Change logging.error to logging.exception to log additional information when relevant. A few logging.error calls have slipped into try/except blocks since I last cleaned this up, and the rule is now stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack-trace handling for RPC calls and distributed, and tried to fix a few errors based on whether we immediately re-raised the exception or didn't print any exception info where it could be useful.
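The rule's payoff in a nutshell: inside an except block, logging.exception records the traceback that logging.error drops. A small self-contained example using only the standard library (the function and path are made up for illustration):

```python
# logging.exception logs at ERROR level *and* appends the current
# traceback; logging.error with the same arguments omits it.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

def load_config(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        # Before: log.error("failed to load %s", path)   -- no traceback
        # After (TRY400): full stack trace is logged automatically
        log.exception("failed to load %s", path)
        return None
```

`log.exception(...)` is only valid from within an exception handler; outside one, `log.error(..., exc_info=True)` is the equivalent.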
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD , https://github.com/cyyever
2025-05-15 13:36:59 +00:00
d2f6c6df1d
unbreak fb:operator_benchmark_test ( #152049 )
...
Summary: unbreak fb:operator_benchmark_test
Test Plan: works on my machine
Differential Revision: D73540912
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152049
Approved by: https://github.com/hl475
2025-05-15 03:38:48 +00:00
725bbb6b5f
[inductor][dynamo] Include operator name in size/stride/alignment assertion ( #152353 )
...
Fixes #151930
This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp ) to accept an optional `op_name` argument and includes it in the error messages.
The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi ) are updated to match the new function arg.
[inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py ) now extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.
Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py ).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.
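An illustrative Python sketch of the assertion shape described above; the real implementation lives in C++ in guards.cpp, so this helper and its exact messages are hypothetical:

```python
# Sketch: threading an optional op_name into assertion messages so a
# size/stride failure identifies the offending operator.

def assert_size_stride(actual_size, actual_stride,
                       expected_size, expected_stride, op_name=None):
    prefix = f"{op_name}: " if op_name else ""
    if tuple(actual_size) != tuple(expected_size):
        raise AssertionError(
            f"{prefix}expected size {tuple(expected_size)}, "
            f"got {tuple(actual_size)}")
    if tuple(actual_stride) != tuple(expected_stride):
        raise AssertionError(
            f"{prefix}expected stride {tuple(expected_stride)}, "
            f"got {tuple(actual_stride)}")
```

The error message now starts with the operator name, so a failing generated assertion points straight at the op rather than an anonymous tensor.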
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel
2025-05-15 02:33:57 +00:00
03d01860fd
[dynamo][compile-time] Compute logging related flags once ( #153426 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153426
Approved by: https://github.com/jansel
2025-05-14 21:19:06 +00:00
8f3d7972ad
[dynamo][compile-time] Cache the function signature to speedup inlining ( #153396 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153396
Approved by: https://github.com/jansel , https://github.com/StrongerXi
ghstack dependencies: #153333
2025-05-14 14:01:46 +00:00
864a5f4434
[dynamo][compile-time] Cache the cleaned instructions while inlining ( #153333 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153333
Approved by: https://github.com/StrongerXi , https://github.com/jansel , https://github.com/williamwen42
2025-05-14 09:26:26 +00:00
11c64b7cf8
[dynamo][compile-time] Cache whether a function is inlineable ( #153192 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153192
Approved by: https://github.com/StrongerXi , https://github.com/jansel , https://github.com/williamwen42
ghstack dependencies: #153458
2025-05-14 05:40:25 +00:00
e8596c291b
Fix misleadingly high AOT Inductor dashboard performance ( #153060 )
...
Fixes misleadingly high AOTInductor performance benchmark numbers in scenarios where a model updates internal parameters during `torch.export.export`. Since `FakeTensorMode` is enabled during export, all such parameters become `FakeTensor`s, substantially slowing down future eager-mode runs using that model. This, in turn, causes misleading performance stats, where the slowness of eager mode makes `AOTInductor` look _very_ good.
An [example benchmark](https://hud.pytorch.org/benchmark/timm_models/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2030%20Apr%202025%2015%3A54%3A04%20GMT&stopTime=Wed%2C%2007%20May%202025%2015%3A54%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=1dd36ad2d440a4f3faf724b3a8e13925e3180c24&rBranch=main&rCommit=cc7346bf19c019255dcb4484694a75850ed74d5a&model=convit_base ) with this issue. The equivalent `cpp_wrapper` benchmark run shows a 2x performance gain, not 20x.
Only two benchmarks we regularly run are affected by this, both in the TIMM set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153060
Approved by: https://github.com/desertfire
2025-05-13 20:59:59 +00:00
ff039d39ec
[Dynamo] Optimize dedupe region ancestor tracking ( #152589 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389 , #152505 , #152410 , #152506 , #152570 , #152572
2025-05-13 12:17:59 +00:00
c4fb0b6f33
refresh expected results ( #150166 )
...
@huydhn when do you think we will have the APIs to access results on OSS storage available, so we do not
have to worry about this racing again?
Also, is there a way to accelerate marking this as unstable after we land it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150166
Approved by: https://github.com/bobrenjc93 , https://github.com/eellison , https://github.com/anijain2305
2025-05-13 04:04:42 +00:00
3555ebb63d
[BE]: Update ruff to 0.11.8 ( #153249 )
...
Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249
Approved by: https://github.com/cyyever , https://github.com/albanD , https://github.com/seemethere
2025-05-12 18:30:52 +00:00
aa7fe6af41
Revert "[Dynamo] Optimize dedupe region ancestor tracking ( #152589 )"
...
This reverts commit b5f1345f72ec6d1b004b05284e9553e65ee03abc.
Reverted https://github.com/pytorch/pytorch/pull/152589 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503 ) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679 ))
2025-05-12 07:15:09 +00:00
b5f1345f72
[Dynamo] Optimize dedupe region ancestor tracking ( #152589 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389 , #152505 , #152410 , #152506 , #152570 , #152572
2025-05-10 08:27:56 +00:00
d808a3e203
[dynamic shapes] guard_or_false for computeStorageNbytes ( #150483 )
...
Removes the fast path for computing storage and fixes some adjacent tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150483
Approved by: https://github.com/laithsakka
2025-05-09 19:31:19 +00:00
8ea95d2e73
[inductor] dtype promotion error in cat decomp ( #152995 )
...
Cloning a single tensor wasn't following dtype promotion rules
for SAM model: https://github.com/pytorch/pytorch/issues/152606
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152995
Approved by: https://github.com/yushangdi , https://github.com/eellison
2025-05-09 16:58:58 +00:00
ab829ec629
[dynamo][pr_time_benchmark] Add dynamo benchmark to stress test inlining ( #153159 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153159
Approved by: https://github.com/laithsakka
ghstack dependencies: #152883 , #153105
2025-05-09 00:09:19 +00:00
4166373908
[dynamic shapes] guard_or_false for infer_size ( #152146 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152146
Approved by: https://github.com/laithsakka
2025-05-08 21:27:22 +00:00
ecd74c953f
[dynamo] Recursively realize the stack_values ( #152853 )
...
Might also fix - https://github.com/pytorch/pytorch/issues/135696
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela , https://github.com/mlazos , https://github.com/jansel
2025-05-07 02:36:44 +00:00
07a29dbe81
[BE]: Update cutlass submodule to 3.9.2 ( #152779 )
...
A lot of last-minute bugfixes for CUTLASS Blackwell that we should upstream. It's a header-only library and a minor release, so this should strictly improve compiler support and fix some bugs. Needed to update some instruction counts in torch.compile baselines for the new kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152779
Approved by: https://github.com/henrylhtsang
2025-05-06 16:08:24 +00:00
fcd5e49138
Revert "[dynamo] Recursively realize the stack_values ( #152853 )"
...
This reverts commit 460888f908ea4b634ecc863a6da6b2132108bc79.
Reverted https://github.com/pytorch/pytorch/pull/152853 on behalf of https://github.com/malfet due to Looks like it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/152853#issuecomment-2854897485 ))
2025-05-06 15:02:57 +00:00
460888f908
[dynamo] Recursively realize the stack_values ( #152853 )
...
Might also fix - https://github.com/pytorch/pytorch/issues/135696
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela , https://github.com/mlazos , https://github.com/jansel
2025-05-06 06:30:31 +00:00
45efa1aaa8
[3/N] Use internal linkage in C++ files ( #151297 )
...
Follows #151070 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151297
Approved by: https://github.com/Skylion007
2025-05-05 17:48:39 +00:00
2b37a726e0
Refactor layout constraint selection logic ( #148104 )
...
This PR:
- cleans up some existing comments that no longer make sense
- hooks the "custom_op_default_layout_constraint" back up (that seems to
have broken)
- cleans up the "lazy registration path", which never seems to get hit
anymore
- adds dislike_padding to nodes that require exact strides
Test Plan:
- tests + CI
disable padding
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314 , https://github.com/eellison
2025-05-03 00:02:24 +00:00
64957db6c9
Fix some inductor periodic benchmarks ( #152605 )
...
Some were reporting "pass" consistently on https://hud.pytorch.org/
Those are fine to flip.
I filed a separate issue for the now-regressions for AOTI:
https://github.com/pytorch/pytorch/issues/152606 . These should be looked
at.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152605
Approved by: https://github.com/eellison , https://github.com/huydhn
2025-05-01 22:18:30 +00:00
15a3f58f91
Return ConstantVariable(None) from WithExitFunctionVariable.exit to prevent NoneType crash inside autocast exception path ( #152503 )
...
Copy of #152013 with PR time benchmarks updated (regressions seem unrelated)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152503
Approved by: https://github.com/anijain2305 , https://github.com/Skylion007
Co-authored-by: Witold Dziurdz <wdziurdz@habana.ai >
2025-05-01 04:01:24 +00:00
3f10091d3c
Clean up conda usage in benchmark scripts ( #152552 )
...
Fixes https://github.com/pytorch/pytorch/issues/152123 .
* Switch `benchmarks/dynamo/Makefile` to use uv. Note that these scripts are only used locally, so it's kind of ok to keep conda here IMO. But switching to uv is probably nicer to most folks.
* Delete some files that are outdated and not used anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152552
Approved by: https://github.com/atalman , https://github.com/albanD
2025-04-30 21:27:29 +00:00
00ebbbb701
[cutlass backend] add addmm and bmm for cutlass backend benchmark ( #152163 )
...
Copying what @kadeng did.
```
FINAL results...
Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 44.454172253608704 | 3.0991086587309837 | NA |
| triton | 44.06978189945221 | 0.07496077567338943 | -0.8646890374284049 |
| triton_persistent_tma | 43.598245829343796 | 0.06154991965740919 | -1.9254130284597197 |
| cutlass_lvl_default | 39.91834074258804 | 0.056073310784995556 | -10.20338762612423 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+-------------------+----------------------+---------------------+
| aten | 49.05610531568527 | 0.160279156640172 | NA |
| triton | 43.97720843553543 | 0.0660805031657219 | -10.353241145961718 |
| triton_persistent_tma | 43.94153505563736 | 0.061738294549286366 | -10.425960697724962 |
| cutlass_lvl_default | 40.2066633105278 | 0.034127906896173954 | -18.039430460713596 |
+-----------------------+-------------------+----------------------+---------------------+
Average edge over aten (max(-edge, 0), higher is better):
triton: 5.608965091695062 (from 2 valid values)
triton_persistent_tma: 6.175686863092341 (from 2 valid values)
cutlass_lvl_default: 14.121409043418913 (from 2 valid values)
```
Differential Revision: [D73625766](https://our.internmc.facebook.com/intern/diff/D73625766/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152163
Approved by: https://github.com/jingsh
2025-04-28 20:16:17 +00:00
ce00ec7ecf
Enable max autotune for AOTInductor benchmark ( #149309 )
...
With this PR, AOTinductor can choose to run into max-autotune mode when benchmarking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149309
Approved by: https://github.com/desertfire
Co-authored-by: Gabriel Ferns <gabeferns@meta.com >
2025-04-28 06:54:26 +00:00
e2f9759bd0
Fix broken URLs ( #152237 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn , https://github.com/malfet
2025-04-27 09:56:42 +00:00
3c1a17a08b
[Dynamo] Use LazyVariableTracker in base VT ( #151847 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151847
Approved by: https://github.com/StrongerXi
2025-04-23 18:18:01 +00:00
dcc32ff5bf
[CUDA][cuBLAS][cuBLASLt] Opt-in unified cuBLAS + cuBLASLt workspaces ( #151163 )
...
Opt-in version of https://github.com/pytorch/pytorch/pull/145130 , since there was no repro for the 70% forward issue
`TORCH_CUBLASLT_UNIFIED_WORKSPACE=1`
@izaitsevfb could you comment if it was repeatable per every forward pass, on startup, or something else?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151163
Approved by: https://github.com/ngimel
2025-04-23 15:24:22 +00:00
09e8ff92cc
refresh benchmark results ( #151622 )
...
Updating due to <1.5% increases in https://github.com/pytorch/pytorch/pull/151469 .
Not all benchmarks were updated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151622
Approved by: https://github.com/oulgen
2025-04-18 02:39:13 +00:00
ef64beb232
Include post grad gm and fx runnable in cache artifacts for tlparse ( #151469 )
...
Fixed #151462
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151469
Approved by: https://github.com/bdhirsh
2025-04-17 17:14:13 +00:00
41b82611ee
Revert "[Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device ( #144756 )"
...
This reverts commit 300e0ee13c08ef77e88f32204a2e0925c17ce216.
Reverted https://github.com/pytorch/pytorch/pull/144756 on behalf of https://github.com/malfet due to Broke rocm torch bench runs with TypeError: unsupported operand type(s) for |: 'set' and 'list' ([comment](https://github.com/pytorch/pytorch/pull/144756#issuecomment-2812525970 ))
2025-04-17 11:09:01 +00:00
300e0ee13c
[Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device ( #144756 )
...
Reopen the previous stale closed PR https://github.com/pytorch/pytorch/pull/134192
We need to increase the tolerance slightly to ensure that certain models pass accuracy check on the XPU device.
This pull request preserves the original tolerance threshold for the CUDA device and introduces a new key higher_fp16_bf16_xpu, which only impacts the XPU device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144756
Approved by: https://github.com/chuanqi129 , https://github.com/EikanWang , https://github.com/desertfire
2025-04-17 00:26:55 +00:00
eea4a7b424
update expected results for comptime benchmark ( #151319 )
...
This PR https://github.com/pytorch/pytorch/pull/150594 bumped the benchmark up by ~1%, a bit under our 1.5% "regression" mark.
Modeled this PR after https://github.com/pytorch/pytorch/pull/144274
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151319
Approved by: https://github.com/jamesjwu , https://github.com/laithsakka
2025-04-15 19:40:13 +00:00