Commit Graph

94597 Commits

Author SHA1 Message Date
e925dfcc6b Enable all SIM rules except disabled ones (#164645)
`SIM` rules are useful for simplifying boolean expressions and enhances code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang, https://github.com/mlazos
trunk/e925dfcc6b4fd76d744d04ecaa451fc2936155a8
2025-10-17 07:27:11 +00:00
f1d882212a [annotate] add annotate_fn function decorator (#165703)
Example usage:

```
        @fx_traceback.annotate_fn({"pp_stage": 1})
        def example_function(x):
            return x * x

        class SimpleLinear(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(3, 2)

            def forward(self, x):
                with fx_traceback.annotate({"pp_stage": 0}):
                    y = self.linear(x)
                y = example_function(y)
                return y - 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165703
Approved by: https://github.com/SherlockNoMad
trunk/f1d882212afc3a73ce1e319d80b6406f9dc4a0c8
2025-10-17 07:18:47 +00:00
24879f0de9 [dynamo] Use Variable Builder to build the property fget object (#165683)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165683
Approved by: https://github.com/ezyang, https://github.com/williamwen42
trunk/24879f0de97e0caaafa083ddc5ee28d6079fb1c0
2025-10-17 06:29:24 +00:00
9e94ec76b8 Revert "Turn some const variables into constexpr in C++ code (#165401)"
This reverts commit 5b2afe4c5dc87786ca65bf22ca9a78f7c21a33a4.

Reverted https://github.com/pytorch/pytorch/pull/165401 on behalf of https://github.com/seemethere due to This is breaking test/distributions/test_distributions.py::TestDistributions::test_binomial_sample on HUD, see 5b2afe4c5d ([comment](https://github.com/pytorch/pytorch/pull/165401#issuecomment-3414023134))
viable/strict/1760696731 trunk/9e94ec76b8b29812a1c9dcbb46f00b44e8c3719d
2025-10-17 06:14:09 +00:00
364624e209 [codemod][lowrisk] Remove unused exception parameter from some files (#165700)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Differential Revision: D84868162

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165700
Approved by: https://github.com/Skylion007
trunk/364624e2091749d34aecbad843262643ad9a366f
2025-10-17 05:30:06 +00:00
7e150467f7 allow providing full fr trace path (#165639)
Summary:
- allow users to specify the full path instead of fr suffixing the rank id
- this will be used by torchft to provide the global rank id accross all replicas
- we can't just prefix the replica id because analysis tool expects the file name to provide a unique integer

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/165639).
* #165638
* #165640
* #165677
* #165642
* __->__ #165639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165639
Approved by: https://github.com/fduwjj
trunk/7e150467f753360277c00585e4e689f91f3aef63
2025-10-17 04:43:44 +00:00
43d78423ac Pyrefly suppressions 2 (#165692)
This is the last directory to opt in for the regular mypy.ini file. Will put up a diff to remove unused ignores before making sure we're also type checking all the files in the mypy strict configurations

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165692
Approved by: https://github.com/oulgen
trunk/43d78423ac224cce432bf34ed9627035169d5433
2025-10-17 04:15:25 +00:00
fcbde24c1c [ONNX] Remove common imports from torchlib (#165156)
The Rank and IsScalar functions are no longer used in the torchlib. Requires onnxscript v0.5.4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165156
Approved by: https://github.com/Skylion007, https://github.com/cyyever
trunk/fcbde24c1cb54f3e0417e123bdb9ae09da134c8d
2025-10-17 03:25:34 +00:00
861cdb887b use statically_known_leq & *=2 instead of bound_sympy in persistent rblock (#165657)
While these should be equivalent, we've found instances where they are not, and an error was caused. update until we figure out underlying issue.

Differential Revision: [D84835898](https://our.internmc.facebook.com/intern/diff/D84835898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165657
Approved by: https://github.com/bobrenjc93
trunk/861cdb887b73818a7e96dc07c5aa6a308216daa4
2025-10-17 02:48:03 +00:00
3154482072 [CUDA][cuBLAS] Only xFail addmm with reduced precision reductions on non-RTX skus (#165379)
RTX Blackwells don't behave quite like their datacenter counterparts here

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165379
Approved by: https://github.com/Skylion007
trunk/3154482072cefc49b69bd377a0774707b021fea7
2025-10-17 02:45:07 +00:00
9fccbdd4f0 Fix incorrect function signature in template (#165567)
Summary:
In https://github.com/pytorch/pytorch/pull/148305 we refactored the grid
argument out, but it's not reflected in our template.

Test Plan:
Included in commit.
python test/inductor/test_aot_inductor.py
AOTInductorTestABICompatibleGpu.test_cond_symint_input_disable_one_pass_cuda

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165567
Approved by: https://github.com/desertfire
trunk/9fccbdd4f05820fed8ccf66422b056c932649d62
2025-10-17 02:40:56 +00:00
7dabfb07cb [torchfuzz] add support for --stop-at-first-failure flag (#165529)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165529
Approved by: https://github.com/pianpwk
ghstack dependencies: #164749
trunk/7dabfb07cb896e9c31734c17d215e59418e071e0
2025-10-17 02:18:07 +00:00
d0add0be43 [torchfuzz] check in some more ignore regexes (#164749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164749
Approved by: https://github.com/pianpwk
2025-10-17 02:18:07 +00:00
11e2084308 Revert "[Mem Snapshot] Add Metadata Field (#165490)"
This reverts commit 5b3ea758951558e7d9f681ae784acb57eaa07910.

Reverted https://github.com/pytorch/pytorch/pull/165490 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165490#issuecomment-3413491091))
trunk/11e20843086cf58b3976ed3ac75ac1bbbebd715d
2025-10-17 02:01:53 +00:00
9726553653 [BE][Ez]: Use sys.executable instead of hardcoded Python (#165679)
Handles edgecase to ensure proper interpreter is called. Inspired by #165633
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165679
Approved by: https://github.com/FindHao
trunk/9726553653ee1c53fc9a1f436a92b29f456082ca
2025-10-17 01:07:40 +00:00
d82527b32a [Windows] Add AOTI cross-compilation CI (#165573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165573
Approved by: https://github.com/malfet
ghstack dependencies: #165560
trunk/d82527b32ad0e09309ff874458139ecf6994e030
2025-10-17 01:05:35 +00:00
5d9b024276 Add mingw to docker (#165560)
Add mingw to `pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11` docker image to support AOTI cross-compilation

This PR will make docker container rebuild, and upgrade python version from 3.13.7 to 3.13.8. and it relies on https://github.com/pytorch/pytorch/pull/165667
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165560
Approved by: https://github.com/malfet
trunk/5d9b0242762e7a416a789365e987b63dfe6b030a
2025-10-17 00:47:01 +00:00
5b2afe4c5d Turn some const variables into constexpr in C++ code (#165401)
This PR checks the C++ code and turns some const variables into constexpr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165401
Approved by: https://github.com/Skylion007
trunk/5b2afe4c5dc87786ca65bf22ca9a78f7c21a33a4
2025-10-17 00:40:11 +00:00
b2953f5643 [9/N] Apply ruff UP035 rule (#165515)
This is follow-up of #165214 to continue applying ruff UP035 rule to the code base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165515
Approved by: https://github.com/Lucaskabela
trunk/b2953f5643c6627d2bd0ceb9d2ccb32e2545c549
2025-10-17 00:09:51 +00:00
470e2f61c3 Revert "[Fix] Use sys.executable instead of hardcoded python (#165633)"
This reverts commit 37f3ba274a8ccebc6b3409f52cf068a8b23617d4.

Reverted https://github.com/pytorch/pytorch/pull/165633 on behalf of https://github.com/malfet due to Looks like it broke test_collect_callgrind in slow workflows, see e0fe37fa68/1 ([comment](https://github.com/pytorch/pytorch/pull/165633#issuecomment-3413290813))
trunk/470e2f61c3b2083e8d895b6aae5ede198bba5696
2025-10-17 00:06:40 +00:00
e0fe37fa68 [MPS] Move torch.cat impl to Metal (#165373)
After this change, all of the cases tested in [this performance measurement script](10de64c5ac/cat/perf0.py) take either roughly the same runtime or less.

Before:

```
idx: cpu time, mps time, speedup, op, args, kwargs
-----------------------------------------
0: 0.000857 ms, 0.016098 ms, 0.05, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': -1}
1: 0.000858 ms, 0.014861 ms, 0.06, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': 1}
2: 0.000806 ms, 0.015145 ms, 0.05, cat, [[tensor(shape[10, 5]), tensor(shape[5, 5])]], {'dim': 0}
3: 0.000829 ms, 0.015355 ms, 0.05, cat, [[tensor(shape[1, 2, 3]), tensor(shape[1, 2, 3])]], {'dim': -2}
4: 0.000591 ms, 0.000582 ms, 1.02, cat, [[tensor(shape[0]), tensor(shape[0])]], {'dim': 0}
5: 0.001076 ms, 0.022387 ms, 0.05, cat, [[tensor(shape[0]), tensor(shape[5, 5])]], {'dim': 1}
6: 0.000708 ms, 0.022300 ms, 0.03, cat, [[tensor(shape[0, 5]), tensor(shape[5, 5])]], {'dim': 0}
7: 0.000640 ms, 0.014367 ms, 0.04, cat, [[tensor(shape[1]), tensor(shape[1])]], {}
8: 0.000777 ms, 0.027506 ms, 0.03, cat, [[tensor(shape[2, 2, 2, 2])], 1], {}
9: 0.003383 ms, 0.269277 ms, 0.01, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1}
10: 0.526138 ms, 0.650852 ms, 0.81, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1}
11: 0.444091 ms, 0.628630 ms, 0.71, cat, "[[tensor(shape[1, 3, 2]), tensor(shape[2, 3, 2]), tensor(shape[3, 3, 2]), tensor(shape[1, 3, 2]), te...", {'dim': 0}
12: 2.011870 ms, 0.989525 ms, 2.03, cat, [[tensor(shape[1000000, 3, 2]), tensor(shape[1000000, 3, 2])]], {'dim': 0}
13: 3.100653 ms, 0.948178 ms, 3.27, cat, [[tensor(shape[3, 1000000, 2]), tensor(shape[3, 1000000, 2])]], {'dim': 1}
14: 3.112174 ms, 0.954174 ms, 3.26, cat, [[tensor(shape[3, 2, 1000000]), tensor(shape[3, 2, 1000000])]], {'dim': 2}
```

After:

```
idx: cpu time, mps time, speedup, op, args, kwargs
-----------------------------------------
0: 0.000790 ms, 0.013111 ms, 0.06, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': -1}
1: 0.000800 ms, 0.014419 ms, 0.06, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': 1}
2: 0.000748 ms, 0.010019 ms, 0.07, cat, [[tensor(shape[10, 5]), tensor(shape[5, 5])]], {'dim': 0}
3: 0.000767 ms, 0.010063 ms, 0.08, cat, [[tensor(shape[1, 2, 3]), tensor(shape[1, 2, 3])]], {'dim': -2}
4: 0.000591 ms, 0.000591 ms, 1.00, cat, [[tensor(shape[0]), tensor(shape[0])]], {'dim': 0}
5: 0.001220 ms, 0.009763 ms, 0.12, cat, [[tensor(shape[0]), tensor(shape[5, 5])]], {'dim': 1}
6: 0.000739 ms, 0.006203 ms, 0.12, cat, [[tensor(shape[0, 5]), tensor(shape[5, 5])]], {'dim': 0}
7: 0.000647 ms, 0.009905 ms, 0.07, cat, [[tensor(shape[1]), tensor(shape[1])]], {}
8: 0.000753 ms, 0.007818 ms, 0.10, cat, [[tensor(shape[2, 2, 2, 2])], 1], {}
9: 0.003823 ms, 0.192723 ms, 0.02, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1}
10: 0.576564 ms, 0.733920 ms, 0.79, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1}
11: 0.462957 ms, 0.692799 ms, 0.67, cat, "[[tensor(shape[1, 3, 2]), tensor(shape[2, 3, 2]), tensor(shape[3, 3, 2]), tensor(shape[1, 3, 2]), te...", {'dim': 0}
12: 2.017181 ms, 0.968345 ms, 2.08, cat, [[tensor(shape[1000000, 3, 2]), tensor(shape[1000000, 3, 2])]], {'dim': 0}
13: 3.203508 ms, 0.986382 ms, 3.25, cat, [[tensor(shape[3, 1000000, 2]), tensor(shape[3, 1000000, 2])]], {'dim': 1}
14: 3.181249 ms, 1.007773 ms, 3.16, cat, [[tensor(shape[3, 2, 1000000]), tensor(shape[3, 2, 1000000])]], {'dim': 2}
```

Fixes #165350
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165373
Approved by: https://github.com/kulinseth, https://github.com/malfet
trunk/e0fe37fa687a39e42ddeeb5c03986ffd5c40e662
2025-10-17 00:03:04 +00:00
d2c82bafb7 Revert "158232 Fix autocast cache incorrectly retaining no_grad state (#165068)"
This reverts commit 5daef30b26b794d237fbbc399c1d47ec0380200a.

Reverted https://github.com/pytorch/pytorch/pull/165068 on behalf of https://github.com/jeffdaily due to This broke ROCm CI. test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_fastpath_use_torchscript_False_enable_nested_tensor_True_use_autocast_True_d_model_256_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/18572589089/job/52952074008) [HUD commit link](5daef30b26) ([comment](https://github.com/pytorch/pytorch/pull/165068#issuecomment-3413184445))
trunk/d2c82bafb7086a1dd109a0a6407ca7fed27337f4
2025-10-16 23:08:27 +00:00
98a488c9aa Start recording inductor provenance (#162669)
Summary:
This stores information on where fx graphs come from, which makes it
significantly easier to debug.

One outstanding question

1) I only stored the kernel stack traces, do we also want the node mappings?

Test Plan:
I wrote a explicit logging test which makes a module, fx traces it, compiles it, and makes sure the logging infomration shows up.

```
clr@devvm17763 ~/fbsource/fbcode/caffe2/test/dynamo
 % buck2 test @//mode/opt fbcode//caffe2/test/dynamo:test_dynamo -- test_utils

File changed: fbsource//xplat/caffe2/test/dynamo/test_utils.py
File changed: fbcode//caffe2/test/dynamo/test_utils.py
Buck UI: https://www.internalfb.com/buck2/528dea32-2416-4a62-a1ec-39f3c0efdd2e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13229324015574003
Network: Up: 0B  Down: 0B
Executing actions. Remaining     0/2
Command: test.
Time elapsed: 17.3s
Tests finished: Pass 16. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Differential Revision: D82037582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162669
Approved by: https://github.com/yushangdi
trunk/98a488c9aaadd4b137b7a63dad31543aee75c454
2025-10-16 23:05:31 +00:00
5b3ea75895 [Mem Snapshot] Add Metadata Field (#165490)
Summary:
The implementation adds the ability to:

Set custom metadata strings that will be attached to all subsequent allocations
Clear or change the metadata at any point
View the metadata in memory snapshots via _dump_snapshot()

Test Plan: Added test in test_cuda.py and check manually in snapshot to see that metadata was added.

Differential Revision: D84654933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165490
Approved by: https://github.com/yushangdi
trunk/5b3ea758951558e7d9f681ae784acb57eaa07910
2025-10-16 22:54:27 +00:00
556fc09a9f [DebugMode][1/N] refactor logs into _DebugCalls (#165376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165376
Approved by: https://github.com/SherlockNoMad
trunk/556fc09a9f67f24ca5591ec049c5d0c347c5f62a viable/strict/1760675352
2025-10-16 22:43:52 +00:00
ce109b3f79 Add torch.backends.mkldnn.is_acl_available() method (#165678)
That tells whether or not PyTorch was compiled with Arm Compute Library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165678
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/albanD
ghstack dependencies: #165583, #165584, #165676
trunk/ce109b3f79d47618c37d11fa7066d05e9158f803 viable/strict/1760673121
2025-10-16 22:34:21 +00:00
4d833f859b [BE] [CI] Fix aarch64 arch checks (#165676)
Instead of relying on `TEST_CONFIG` environment variable  to contain `aarch64`, which is prone to errors,  use output of  `$(uname -m)` that is equal to `aarch64` on Linux ARM systems
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165676
Approved by: https://github.com/huydhn, https://github.com/atalman
ghstack dependencies: #165583, #165584
trunk/4d833f859b89f3401e4b395dc1b31cc683ab6019 viable/strict/1760671704
2025-10-16 22:19:53 +00:00
d7e275d4b4 [CI][CUDA] Add periodic b200 distributed job (#159323)
1. Run distributed job with B200 runner, periodically.
2. discovered generic distributed test issue that certain unit test hard-coded ranks, calling for require_exact_world_size(world_size) API instead of require_world_size(world_size).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159323
Approved by: https://github.com/eqy

Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
trunk/d7e275d4b43105a23db9c958b1675b543584747f
2025-10-16 21:54:04 +00:00
d5db3aee0d [CI] Use 1-GPU runners for rocm-mi355.yml (#165658)
Should only need 1-GPU runners for rocm-mi355.yml since it runs `default` test config which only needs 1 GPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165658
Approved by: https://github.com/jeffdaily
trunk/d5db3aee0d22db69968d32a7190e775d7120de81 viable/strict/1760670600
2025-10-16 21:53:22 +00:00
5641de7b6b Add suppressions for _inductor/codegen (#165659)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165659
Approved by: https://github.com/oulgen
trunk/5641de7b6b3470f8e7709554dd582f8a0b1a4ee0
2025-10-16 21:37:37 +00:00
cbc08c8993 Add NEON acceleration for Vectorized<int[8|16|32|64> (#165273)
Summary:
Adding NEON specializations of Vectorized<T> for int8, int16, int32 and int64.

Correcness has been checked using test_ops.py and the comprehensive torch test

operator_benchmark_test.py has been enhanced by adding cases of bitwise operations, boolean ops and integer ops.
The benchmark, which uses the PyTorch API, shows significant enhancements in a wide variety of operations:

Before:

bitwise xor: 779.882us
boolean any: 636.209us
boolean all: 538.621us
integer mul: 304.457us
integer asr: 447.997us

After:

bitwise xor: 680.221us ---> 15% higher throughput
boolean any: 391.468us ---> 63% higher throughput
boolean all: 390.189us ---> 38% higher throughput
integer mul: 193.532us ---> 57% higher throughput
integer asr: 179.929us---> 149% higher throughput

Test Plan:
Correctness:

buck2 test @mode/opt //caffe2/test:test_ops
buck2 test @mode/opt //caffe2/test:torch
buck2 test @mode/opt //caffe2/test/distributed/launcher/fb:fb_run_test

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D84424638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165273
Approved by: https://github.com/malfet
trunk/cbc08c899310f1e24251ee56060b79721829d212
2025-10-16 21:35:13 +00:00
1a54d3333d [easy] Fix graph_capture in aot_joint_with_descriptors test (#165660)
when `with_export=True`, `aot_export_joint_with_descriptors` should take the graph produced by `_dynamo_graph_capture_for_export`

```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_simple
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_flex_attention
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165660
Approved by: https://github.com/yushangdi
trunk/1a54d3333de6b9d2e8aa785b3d791c87201be45a
2025-10-16 21:10:11 +00:00
4c1c341fa0 FakeTensorMode shouldn't cache syms when tracing (#164718)
Improve FakeTensor cache to handle SymNode and tracing properly.

For now, when we're proxy tracing just don't bother caching operations that contain SymNodes in the output. The problem is that the proxy tracer relies on SymNode identity and our cache doesn't preserve that. It can be fixed (and I left some notes in _validate_symbolic_output_for_caching() how) but it's not worth it for now.

If we aren't proxy tracing then caching is fine.

Thus these changes:

1. Our cache key needs to include whether we were actively tracing or not - this way if we create a cache entry when we weren't tracing and then we try to use it when we ARE tracing it gets rerun.

2. If there's a SymNode in the output then bypass tracing.

3. Some general cleanup of the output validation - we were unnecessarily doing it as a two-step process when it could just be a single step (it's still two parts internally but only a single outer try/except).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164718
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #165266, #164717
trunk/4c1c341fa06e6ac2cf7d9089e9628529f89b0b62
2025-10-16 20:57:07 +00:00
5f21cc786a Teach ProxyTorchDispatchMode how to decompose sympy.Expr into known inputs (#164717)
In a training library we hit a weird conflict between dtensor, dynamic shapes, and proxy tensor.

The problem is occuring because in sharding_prop we use FakeTensors to compute an operation size (so we don't have to  use the full "real" data). We turn off proxy tracing while we're doing that because we don't want the FakeTensor ops to end up in the graph.  We then use that size when doing later operations.

Normally this is no problem - but when those sizes are dynamic shapes then we have a problem - the proxy tracer wants to track the provenance of all shape operations (`s1*s2`) but since tracing is disabled it doesn't see the operation and when we then use the result shape later on the proxy tracer gets all confused (because the SymNode appeared out of nowhere).

At first we were thinking to never disable shape tracing - but that caused a slew of other downstream problems (lots of code that actually needs the shape tracing to be disabled) so instead we enable having a "sym tracing override" and surgically when we disable proxy tracing we leave shape tracing enabled.

After this change the dtensor embedding is "fixed" but then runs afoul of a FakeTensor cache bug - which is fixed in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164717
Approved by: https://github.com/bobrenjc93, https://github.com/ezyang
ghstack dependencies: #165266
2025-10-16 20:57:06 +00:00
e86942f422 minor proxy_tensor reorg (#165266)
Moving some code around in proxy_tensor in preparation for the next PR. There we
no actual changes (other than simple relabeling such as `self.tracer` ->
`tracer`):

- Move _compute_proxy() out of ProxyTorchDispatchMode.

- Give `sympy_expr_tracker` a structured type instead of `object`.

- Split SymNode registration out of ProxyTorchDispatchMode.__sym_dispatch__() so
  it can be reused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165266
Approved by: https://github.com/ezyang, https://github.com/mlazos
2025-10-16 20:57:06 +00:00
2cd5fd1588 Enable local tensor mode on DTensor view ops test (#165596)
While enabling this test discovered lack of support for sub meshes. Added limited support
for sub meshes by properly computing rank coordinates for a given sub mesh. The implementation
follows similar approach to collectives. We infer all sub meshes for the given dimensions and
compute each rank's coordinates with respect to is sub mesh.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165596
Approved by: https://github.com/ezyang
trunk/2cd5fd15882ad940caa28346bac23b9f5ff2c893 viable/strict/1760668990
2025-10-16 20:52:06 +00:00
7d0f872cb3 Use union syntax in torch/_inductor runtime and fx_passes (#165652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165652
Approved by: https://github.com/aorenste
trunk/7d0f872cb36841e9e975002bcee16aa3177c7f46 viable/strict/1760666101
2025-10-16 20:51:59 +00:00
fb06e49ce8 Revert "[inductor] print 0.0 as 0 for triton (#164291)"
This reverts commit 99b32a6750bfd0cfe2bc84a47823e1da34802b7b.

Reverted https://github.com/pytorch/pytorch/pull/164291 on behalf of https://github.com/malfet due to Broke slow job, see aba8c43594/1  ([comment](https://github.com/pytorch/pytorch/pull/164291#issuecomment-3412768915))
trunk/fb06e49ce86c120cb070b0b28c7bd49785a68e43
2025-10-16 20:44:29 +00:00
27a98e6ae9 Revert "[DeviceMesh] Prefer using _layout over _mesh for all sorts of things (#165554)"
This reverts commit d61a9b88cf3be04a29c5a7d6e9622ae5e8d51de3.

Reverted https://github.com/pytorch/pytorch/pull/165554 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see aba8c43594/1 ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681))
trunk/27a98e6ae97a0f82c2deba225b1142b73be2e639
2025-10-16 20:41:37 +00:00
b10f463b1a Revert "[DeviceMesh] Introduce private constructor instead of _create_mesh_from_ranks (#165555)"
This reverts commit 99097b6d89c927c15180ff4683c38be01f9955f6.

Reverted https://github.com/pytorch/pytorch/pull/165555 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see aba8c43594/1 ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681))
2025-10-16 20:41:37 +00:00
431c13cf61 Revert "[DeviceMesh] Simplify unflatten method (#165556)"
This reverts commit 86fd4fc23e697e275d37c36e3cbe521f156434fd.

Reverted https://github.com/pytorch/pytorch/pull/165556 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see aba8c43594/1 ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681))
2025-10-16 20:41:37 +00:00
aead9270f5 12/n : Remove fbandroid_compiler_flags (#165558)
Summary:
Currently `get_c2_fbandroid_xplat_compiler_flags()` is reading the `caffe2.strip_glog` buckconfig which we want to get rid of.
This diff removes the `fbandroid_compiler_flags` arg and merges it with compiler_flags with a nested select and the select version of the method

The goal is to get rid of all the usages of `get_c2_fbandroid_xplat_compiler_flags()` so that we can get rid of the `caffe2.strip_glog` buckconfig

Test Plan: CI

bifferential Revision: D84626885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165558
Approved by: https://github.com/malfet
trunk/aead9270f56ebc7302c7f5fa7e5dff959f26608e
2025-10-16 20:41:24 +00:00
9bf5b38c14 [Inductor][Triton][FP8] Refactor scaled_mm template to accept scaling mode (#164318)
Summary: Refactor `scaled_mm` Inductor template to support template choice based on scaling mode. This modification sets up the infrastructure for adding new templates based on new scaling modes, such as deepseek-style scaling (a follow-up diff), as new scaling modes (deepseek, block, group) scale before the accumulation (as opposed to per-tensor and per-row scaling, which apply scaling after accumulation). This modification also further enables Inductor to infer a scaling type based on the shape of the scaling tensors, which makes existing infrastructure more extensible to new scaling modes.

Test Plan:
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 256 --n 768 --k 512 --output="/home/jananisriram/personal/random_bench.csv" --scaling_rowwise --atol=20 --rtol=2 2>&1 | tee ~/personal/random.log
```

bifferential Revision: D83591083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164318
Approved by: https://github.com/drisspg, https://github.com/slayton58
trunk/9bf5b38c14f7f7c627d2c8775b203f1b3d61597e
2025-10-16 20:40:45 +00:00
aba8c43594 Register var for MTIA (#165382)
Summary: Registers variance kernel

Reviewed By: srsuryadev

Differential Revision: D84546250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165382
Approved by: https://github.com/malfet
trunk/aba8c43594a83772281a62a7961c0b6ddcff321d
2025-10-16 20:35:15 +00:00
37f3ba274a [Fix] Use sys.executable instead of hardcoded python (#165633)
Replace hardcoded "python" string with sys.executable to ensure correct Python interpreter is used. This fixes failures on systems with multiple Python runtimes or where "python" is not in PATH.

Similar to pytorch/pytorch#155918

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165633
Approved by: https://github.com/Skylion007
trunk/37f3ba274a8ccebc6b3409f52cf068a8b23617d4
2025-10-16 20:26:10 +00:00
585b9dbb5e [async_tp] Support ag+mm with gather_dim lastdim of mat_A (#163068)
Adding ag+mm support for the case, when gather_dim is last dim of matmul (reduction dim).

When we decompose matmul by reduction dimension we result in partials that needs additional reduction,
we allocate memory for accumulator.

Decomposition should not produce small (thin) mms that can not efficiently load the GPU. Limiting for minimal size of the shard 1024 (found empirically by testing in torchtitan).

scaled_mm is not supported yet for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163068
Approved by: https://github.com/ngimel
trunk/585b9dbb5ed8a149e2c6a196537aafe065b61ec4
2025-10-16 20:14:39 +00:00
d795fb225a [RFC] Add pyrefly to lintrunner (#165179)
This will add pyrefly to lint runner as a warning only - and allow us to collect feedback about the tool before switching to pyrefly as the main type checker.

References the steps outlined here: : https://github.com/pytorch/pytorch/issues/163283:

test plan:
`lintrunner init`
`lintrunner`
confirm when pyrefly errors are present results look like: https://gist.github.com/maggiemoss/e6cb2d015dd1ded560ae1329098cf33f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165179
Approved by: https://github.com/ezyang
trunk/d795fb225ace717f692ceb3f1d20dfb35afbdf8a
2025-10-16 20:07:09 +00:00
7df9aca529 [ROCm][Windows] Enable AOTriton runtime compile on Windows (#165538)
AOTriton uses prebuilt runtime binaries if the user's ROCm version matches the ones used to generate the prebuilt runtime. However, since there's no prebuilt runtime available for Windows, this check needs to be bypassed for Windows. This PR enables it by changing condition to always build AOTriton runtime from source on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165538
Approved by: https://github.com/xinyazhang, https://github.com/jeffdaily
trunk/7df9aca52946ae47ca4d98dbe0685a412fbc77b8
2025-10-16 19:51:43 +00:00
d4a713cd9c Change forkserver test to only run below 3.13.8 (#165667)
A multiprocessing bug is fixed in 3.13.8, see [https://docs.python.org/3.13/whatsnew/changelog.html](https://l.workplace.com/l.php?u=https%3A%2F%2Fdocs.python.org%2F3.13%2Fwhatsnew%2Fchangelog.html&h=AT0qUhHJq5c2UJvQaq9_MrSo0mVhwn1VOfq1nDQl2C1UOhDI80RMbzVayhG7LSAT1uYHKtkftKnBDwiGMhbw0YRvQLe5vwE01qejpPFautHvU3LXeOE1KChPykqz3qnCRzk7czu_iNzQ05shR4F1N_qYOzR5YxejA52ZZQ), [gh-126631](https://github.com/python/cpython/issues/126631)

So this test will fail when we update to python 3.13.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165667
Approved by: https://github.com/malfet
trunk/d4a713cd9c8ea1dc13917d3311d73c13914306a6
2025-10-16 19:34:10 +00:00
5daef30b26 158232 Fix autocast cache incorrectly retaining no_grad state (#165068)
Fixes #158232
The autocast caching heuristic in `aten/src/ATen/autocast_mode.cpp:139` did not account for gradient mode state when deciding whether to cache. FSDP2 is not directly related.

~~This PR adds `GradMode::is_enabled()` check to caching condition. Caching is now disabled in `no_grad()` contexts to prevent storing tensors with incorrect gradient state. Ensures correctness at the cost of using cache.~~
This PR proposes separate caches for gradient-enabled and gradient-disabled modes.
Adds tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165068
Approved by: https://github.com/ngimel, https://github.com/janeyx99
trunk/5daef30b26b794d237fbbc399c1d47ec0380200a
2025-10-16 19:32:01 +00:00