1211 Commits

Author SHA1 Message Date
0aed855b2b [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-20 07:03:29 +00:00
ebab279942 Forward fix inductor benchmark after #150287 (#156455)
Looks like https://github.com/pytorch/pytorch/pull/150287 stack fixed some inductor tests
HUD: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor-periodic%20%2F%20linux-jammy-cpu-py3.9-gcc11-inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156455
Approved by: https://github.com/huydhn
2025-06-20 00:04:15 +00:00
242eb19c83 [InductorBench] Fix accuracy validation logic for MPS (#156385)
As it does not support full fp64, validate against float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156385
Approved by: https://github.com/Skylion007
2025-06-19 05:37:51 +00:00
42015db6a9 [BE] fix typos in benchmarks/ (#156077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156077
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #156069
2025-06-17 13:12:18 +00:00
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
4bb936d8b7 refresh expected results (#155817)
some changes landed when the test is recently unstable with out updating the results.
<img width="564" alt="Screenshot 2025-06-12 at 9 26 32 AM" src="https://github.com/user-attachments/assets/9a83f18b-f2a8-485d-a58e-67d8c161eb18" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155817
Approved by: https://github.com/yushangdi
2025-06-12 19:14:21 +00:00
cc09d3a5ba remove float args benchmark (#155674)
This benchmark very sensitive. removing it for now until we make it better .

<img width="755" alt="Screenshot 2025-06-11 at 12 01 25 AM" src="https://github.com/user-attachments/assets/01a45ae5-2028-42a2-b819-c30d4db3b5d4" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155674
Approved by: https://github.com/bdhirsh, https://github.com/bobrenjc93
2025-06-11 20:34:58 +00:00
d1947a8707 Migrate from lru_cache to cache (#155613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613
Approved by: https://github.com/ezyang
ghstack dependencies: #155612
2025-06-11 19:44:18 +00:00
c881f2ddf3 [reland][dynamo] Mark a vt unspecialized nn module variable source earlier (#155099)
Reland of https://github.com/pytorch/pytorch/pull/154780

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155099
Approved by: https://github.com/williamwen42
2025-06-04 23:05:36 +00:00
3ce5102927 [ROCm] fix CI failures from inductor periodic (#154896)
Similar idea as https://github.com/pytorch/pytorch/pull/154497, but for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154896
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-04 15:28:43 +00:00
a99a01a677 Revert "[dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)"
This reverts commit cc96febb979da16b0a0b758020b330a49c72b7e7.

Reverted https://github.com/pytorch/pytorch/pull/154780 on behalf of https://github.com/seemethere due to This fails internal testing see, https://fburl.com/diff/b0yuxk4w ([comment](https://github.com/pytorch/pytorch/pull/154780#issuecomment-2940381691))
2025-06-04 15:03:34 +00:00
40a8770154 Incorporate coalesce analysis in codegen (#153751)
This pr uses the coalescing information in generating a tiling. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes.

In triton heuristics, for generating 3d tiled reductions, we take the same total block size that the 2d reduction would use, then distribute the block according to whichever block coalesces the most memory.

The motivating kernel is in https://github.com/pytorch/pytorch/issues/149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor.

While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
2025-06-04 00:22:57 +00:00
cc96febb97 [dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)
I am working on providing some skip guard helper functions to allow users to reduce guard overhead. This is a refactor to allow that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154780
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-03 19:19:47 +00:00
28f27886eb Vary batch size when running dynamic shapes benchmarks (#154805)
This better measures the actual runtime performance of dynamic shapes
where we aren't guaranteed to have similar shapes as the original hint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154805
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802, #154826, #154822, #154823
2025-06-02 18:56:18 +00:00
b90fc2ec27 [ez] delete code that died a long time ago (#154802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154802
Approved by: https://github.com/Skylion007
2025-06-01 14:57:03 +00:00
6a781619bf Temporarily disable sparse tensor validation when loading from external storage. (#154758)
As in the title per https://github.com/pytorch/pytorch/issues/153143#issuecomment-2917793067 .

The plan is to workout a solution that will allow (1) disabling pinned memory check to fix the original issue and (2) switching off the sparse tensor validation for maximal performance in loading sparse tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154758
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-05-31 19:45:44 +00:00
fc0135ca11 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There has been a lot of dynamic shape fixes done this year and tests pass so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Differential Revision: [D75467694](https://our.internmc.facebook.com/intern/diff/D75467694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-30 17:23:36 +00:00
9ba67e99bb [dynamo] keep C++ symbolic shape guards disabled for benchmarks (#151225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151225
Approved by: https://github.com/anijain2305
2025-05-29 23:29:39 +00:00
39df901b2a introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
when a tensor has unbacked symbols it can be general enough to represent both contiguous and non contiguous tensors.
in that case we cant really evaluate is_contiguous. In many places in the code base, we check for is_contiguous to take a fast path. but the general path usually works for both contiguous and not contiguous in that case we probably want
to use definitely _contiguous API.

This is appleid for reshape in this PR and also to  tensor meta data computation, the meta data now will have an attribute that says that its contiguous when its always contiguous. We would store that only if definitely _contiguous is true  now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-28 03:41:26 +00:00
514409d032 update torchvision pin (#154255)
Fixes #153985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154255
Approved by: https://github.com/desertfire
2025-05-27 16:15:25 +00:00
3f64502c98 Revert "Re-enable FakeTensor caching for SymInts (#152662)"
This reverts commit 7d11c61c26c596076613aa0111892f7cbccae32e.

Reverted https://github.com/pytorch/pytorch/pull/152662 on behalf of https://github.com/malfet due to Looks like it broke bunch of inductor tests, see 187d38185e/1 ([comment](https://github.com/pytorch/pytorch/pull/152662#issuecomment-2910293593))
2025-05-26 17:13:22 +00:00
7d11c61c26 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There has been a lot of dynamic shape fixes done this year and tests pass so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-26 04:17:56 +00:00
76ed9db468 [cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)
Also enables unified workspaces by default for non-FBCODE use cases.
Default Lt workspace size is also updated to match cuBLAS logic for default, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).

Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-24 03:43:35 +00:00
9e089bb5b6 change guard_or impl for better perf and simplicity (#153674)
PR time benchmarks has been showing regressions as we move to guard_or_false, reason is that prev implementation do not cache.
This new approach will propagate the fallback value to eval and return it. allowing eval to cache and reducing scamming logs and complexity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153674
Approved by: https://github.com/bobrenjc93
2025-05-23 15:24:28 +00:00
7509b150af Don't upload compiler benchmark debug info to the benchmark database (#153769)
During our debug session, @wdvr and I found out that the benchmark database is growing much faster than we expect.  After taking a closer look, the majority of them coming from TorchInductor benchmark and the top 3 are all debug information not used by any dashboard atm.  In the period of 7 days, there are close to 6 millions records ([query](https://paste.sh/GUVCBa0v#UzszFCZaWQxh7oSVsZtfZdVE))

```
Benchmark,Metric,Count
"TorchInductor","user_stack","1926014"
"TorchInductor","reason","1926014"
"TorchInductor","model","1926014"
```

Let's skip uploading them to avoid bloating the database.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153769
Approved by: https://github.com/malfet
2025-05-23 01:18:26 +00:00
768cb734ec cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-23 00:51:20 +00:00
6cd9d66b7f Allow higher fp16 tolerance for phlippe_resnet on CUDA 12.8 (#154109)
After https://github.com/pytorch/pytorch/pull/154004, one of the model `phlippe_resnet` needs higher tolerance for fp16 on CUDA 12.8.  I can reproduce it locally with:

```
python benchmarks/dynamo/torchbench.py --accuracy --timing --explain --print-compilation-time --inductor --device cuda --training --amp --only phlippe_resnet

E0522 02:47:12.392000 2130213 site-packages/torch/_dynamo/utils.py:2949] RMSE (res-fp64): 0.00144, (ref-fp64): 0.00036 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000, use_larger_multiplier_for_smaller_tensor: 0
```

I'm not sure what exactly happens behind the scene, but this should help fix the CI failure.

Also remove some left over expected accuracy results for CUDA 12.4 which we are not using anymore on CI for benchmark jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154109
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-22 14:25:12 +00:00
254293b777 Add flag _metrics_log_runtime to disable runtime metric logging by default (#153506)
https://github.com/pytorch/pytorch/pull/152708 expanded support of `get_estimated_runtime` to many more types of `SchedulerNodes`. This caused an increase in compile time because we're always calling `get_estimated_runtime` to populate the metrics table. This PR adds a flag for this logging, which reduces the instruction count by 8%. Long term, we should probably merge metrics.py with TORCH_LOGS/tlparse (suggestion from @xmfan).

Update: added support for TORCH_LOGS for the metrics logging.

Test Plan:
mm_loop.py and many existing tests cover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153506
Approved by: https://github.com/eellison
2025-05-22 01:02:11 +00:00
3443627e07 Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473)"
This reverts commit 4f4ecc583e0f48ad2d062a53bf91c61ab40b4948.

Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))
2025-05-16 08:29:26 +00:00
4d073af58c Revert "[inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)"
This reverts commit 725bbb6b5fffa2f2d219a0692ed27e376c9dd48a.

Reverted https://github.com/pytorch/pytorch/pull/152353 on behalf of https://github.com/jeanschmidt due to seems to have broken a few internal tests, @jansel may you help the author get his PR merged? ([comment](https://github.com/pytorch/pytorch/pull/152353#issuecomment-2885997862))
2025-05-16 08:20:39 +00:00
754b758ea1 [BE] Extend empty_gpu_cache to mps (#153657)
And replace `if: elif:` with `getattr()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153657
Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/ZainRizvi
2025-05-16 01:08:54 +00:00
4f4ecc583e [BE]: Enable RUFF TRY400 rule - log.exception (#153473)
Change logging.error to logging.exception to log additional information when relevant.  A few places have slipped in logging.errors in try except since I last did a clean up here and the rule is stabilized so I am enabling it codebase wide. I have NOQA'd much of our custom exception stack trace handling for RPC calls and distributed and tried to a fix a few errors based on whether we immediately reraised it or if we didn't print any exception handling where it could be useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-15 13:36:59 +00:00
725bbb6b5f [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

In [inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel
2025-05-15 02:33:57 +00:00
03d01860fd [dynamo][compile-time] Compute logging related flags once (#153426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153426
Approved by: https://github.com/jansel
2025-05-14 21:19:06 +00:00
8f3d7972ad [dynamo][compile-time] Cache the function signature to speedup inlining (#153396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153396
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #153333
2025-05-14 14:01:46 +00:00
864a5f4434 [dynamo][compile-time] Cache the cleaned insturctions while inlining (#153333)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153333
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
2025-05-14 09:26:26 +00:00
11c64b7cf8 [dynamo][compile-time] Cache whether a function is inlineable (#153192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153192
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #153458
2025-05-14 05:40:25 +00:00
e8596c291b Fix misleadingly high AOT Inductor dashboard performance (#153060)
Fixes misleadingly high AOTInductor performance benchmark numbers in scenarios where a model updates internal parameters during `torch.export.export`. Since `FakeTensorMode` is enabled during export, all such parameters become `FakeTensor`s, slowing down future eager-mode runs using that model substantively. This, in turn, causes misleading performance stats, where the slowness of eager-mode makes `AOTInductor` look _very_ good.

An [example benchmark](https://hud.pytorch.org/benchmark/timm_models/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2030%20Apr%202025%2015%3A54%3A04%20GMT&stopTime=Wed%2C%2007%20May%202025%2015%3A54%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=1dd36ad2d440a4f3faf724b3a8e13925e3180c24&rBranch=main&rCommit=cc7346bf19c019255dcb4484694a75850ed74d5a&model=convit_base) with this issue. The equivalent `cpp_wrapper` benchmark run shows a 2x performance gain, not 20x.

Only two benchmarks we regularly run are affected by this, both in the TIMM set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153060
Approved by: https://github.com/desertfire
2025-05-13 20:59:59 +00:00
ff039d39ec [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-13 12:17:59 +00:00
c4fb0b6f33 refresh expected results (#150166)
@huydhn when do you think we will have the APIs to access results on oss storage available so we do not
have to worry about this racing again?
Also is there a way to accelerate unstability in this after we land it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150166
Approved by: https://github.com/bobrenjc93, https://github.com/eellison, https://github.com/anijain2305
2025-05-13 04:04:42 +00:00
3555ebb63d [BE]: Update ruff to 0.11.8 (#153249)
Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere
2025-05-12 18:30:52 +00:00
aa7fe6af41 Revert "[Dynamo] Optimize dedupe region ancestor tracking (#152589)"
This reverts commit b5f1345f72ec6d1b004b05284e9553e65ee03abc.

Reverted https://github.com/pytorch/pytorch/pull/152589 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
b5f1345f72 [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-10 08:27:56 +00:00
d808a3e203 [dynamic shapes] guard_or_false for computeStorageNbytes (#150483)
removes fast path for computing storage, fixes some adjacent tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150483
Approved by: https://github.com/laithsakka
2025-05-09 19:31:19 +00:00
8ea95d2e73 [inductor] dtype promotion error in cat decomp (#152995)
cloning single tensor wasn't following dtype promotion rules
for SAM model: https://github.com/pytorch/pytorch/issues/152606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152995
Approved by: https://github.com/yushangdi, https://github.com/eellison
2025-05-09 16:58:58 +00:00
ab829ec629 [dynamo][pr_time_benchmark] Add dynamo benchmark to stress test inlining (#153159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153159
Approved by: https://github.com/laithsakka
ghstack dependencies: #152883, #153105
2025-05-09 00:09:19 +00:00
4166373908 [dynamic shapes] guard_or_false for infer_size (#152146)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152146
Approved by: https://github.com/laithsakka
2025-05-08 21:27:22 +00:00
ecd74c953f [dynamo] Recursively realize the stack_values (#152853)
Might also fix - https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/jansel
2025-05-07 02:36:44 +00:00
07a29dbe81 [BE]: Update cutlass submodule to 3.9.2 (#152779)
A lot of last minute bugfixes for CUTLASS blackwell that we should upstream. It's a header only library and a minor release so this should strictly improve compiler support and fix some bugs. Needed to update some instruction numbers in torch compile baselines for the new kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152779
Approved by: https://github.com/henrylhtsang
2025-05-06 16:08:24 +00:00
fcd5e49138 Revert "[dynamo] Recursively realize the stack_values (#152853)"
This reverts commit 460888f908ea4b634ecc863a6da6b2132108bc79.

Reverted https://github.com/pytorch/pytorch/pull/152853 on behalf of https://github.com/malfet due to Looks like it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/152853#issuecomment-2854897485))
2025-05-06 15:02:57 +00:00