pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Yuanyuan Chen	e925dfcc6b	Enable all SIM rules except disabled ones (#164645 ) `SIM` rules are useful for simplifying boolean expressions and enhances code readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645 Approved by: https://github.com/ezyang, https://github.com/mlazos trunk/e925dfcc6b4fd76d744d04ecaa451fc2936155a8	2025-10-17 07:27:11 +00:00
Shangdi Yu	f1d882212a	[annotate] add annotate_fn function decorator (#165703 ) Example usage: ``` @fx_traceback.annotate_fn({"pp_stage": 1}) def example_function(x): return x * x class SimpleLinear(nn.Module): def __init__(self): super().__init__() self.linear = nn.Linear(3, 2) def forward(self, x): with fx_traceback.annotate({"pp_stage": 0}): y = self.linear(x) y = example_function(y) return y - 1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165703 Approved by: https://github.com/SherlockNoMad trunk/f1d882212afc3a73ce1e319d80b6406f9dc4a0c8	2025-10-17 07:18:47 +00:00
Animesh Jain	24879f0de9	[dynamo] Use Variable Builder to build the property fget object (#165683 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165683 Approved by: https://github.com/ezyang, https://github.com/williamwen42 trunk/24879f0de97e0caaafa083ddc5ee28d6079fb1c0	2025-10-17 06:29:24 +00:00
PyTorch MergeBot	9e94ec76b8	Revert "Turn some const variables into constexpr in C++ code (#165401 )" This reverts commit 5b2afe4c5dc87786ca65bf22ca9a78f7c21a33a4. Reverted https://github.com/pytorch/pytorch/pull/165401 on behalf of https://github.com/seemethere due to This is breaking test/distributions/test_distributions.py::TestDistributions::test_binomial_sample on HUD, see `5b2afe4c5d` ([comment](https://github.com/pytorch/pytorch/pull/165401#issuecomment-3414023134)) viable/strict/1760696731 trunk/9e94ec76b8b29812a1c9dcbb46f00b44e8c3719d	2025-10-17 06:14:09 +00:00
Richard Barnes	364624e209	[codemod][lowrisk] Remove unused exception parameter from some files (#165700 ) Summary: `-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it. This: ``` try { ... } catch (exception& e) { // no use of e } ``` should instead be written as ``` } catch (exception&) { ``` If the code compiles, this is safe to land. Test Plan: Sandcastle Differential Revision: D84868162 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165700 Approved by: https://github.com/Skylion007 trunk/364624e2091749d34aecbad843262643ad9a366f	2025-10-17 05:30:06 +00:00
Tushar Jain	7e150467f7	allow providing full fr trace path (#165639 ) Summary: - allow users to specify the full path instead of fr suffixing the rank id - this will be used by torchft to provide the global rank id accross all replicas - we can't just prefix the replica id because analysis tool expects the file name to provide a unique integer --- [//]: # (BEGIN SAPLING FOOTER) Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/165639). * #165638 * #165640 * #165677 * #165642 * __->__ #165639 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165639 Approved by: https://github.com/fduwjj trunk/7e150467f753360277c00585e4e689f91f3aef63	2025-10-17 04:43:44 +00:00
Maggie Moss	43d78423ac	Pyrefly suppressions 2 (#165692 ) This is the last directory to opt in for the regular mypy.ini file. Will put up a diff to remove unused ignores before making sure we're also type checking all the files in the mypy strict configurations Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165692 Approved by: https://github.com/oulgen trunk/43d78423ac224cce432bf34ed9627035169d5433	2025-10-17 04:15:25 +00:00
Justin Chu	fcbde24c1c	[ONNX] Remove common imports from torchlib (#165156 ) The Rank and IsScalar functions are no longer used in the torchlib. Requires onnxscript v0.5.4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165156 Approved by: https://github.com/Skylion007, https://github.com/cyyever trunk/fcbde24c1cb54f3e0417e123bdb9ae09da134c8d	2025-10-17 03:25:34 +00:00
eellison	861cdb887b	use statically_known_leq & *=2 instead of bound_sympy in persistent rblock (#165657 ) While these should be equivalent, we've found instances where they are not, and an error was caused. update until we figure out underlying issue. Differential Revision: [D84835898](https://our.internmc.facebook.com/intern/diff/D84835898) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165657 Approved by: https://github.com/bobrenjc93 trunk/861cdb887b73818a7e96dc07c5aa6a308216daa4	2025-10-17 02:48:03 +00:00
Eddie Yan	3154482072	[CUDA][cuBLAS] Only `xFail` `addmm` with reduced precision reductions on non-RTX skus (#165379 ) RTX Blackwells don't behave quite like their datacenter counterparts here Pull Request resolved: https://github.com/pytorch/pytorch/pull/165379 Approved by: https://github.com/Skylion007 trunk/3154482072cefc49b69bd377a0774707b021fea7	2025-10-17 02:45:07 +00:00
Mu-Chu Lee	9fccbdd4f0	Fix incorrect function signature in template (#165567 ) Summary: In https://github.com/pytorch/pytorch/pull/148305 we refactored the grid argument out, but it's not reflected in our template. Test Plan: Included in commit. python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_cond_symint_input_disable_one_pass_cuda Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/165567 Approved by: https://github.com/desertfire trunk/9fccbdd4f05820fed8ccf66422b056c932649d62	2025-10-17 02:40:56 +00:00
bobrenjc93	7dabfb07cb	[torchfuzz] add support for --stop-at-first-failure flag (#165529 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165529 Approved by: https://github.com/pianpwk ghstack dependencies: #164749 trunk/7dabfb07cb896e9c31734c17d215e59418e071e0	2025-10-17 02:18:07 +00:00
bobrenjc93	d0add0be43	[torchfuzz] check in some more ignore regexes (#164749 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164749 Approved by: https://github.com/pianpwk	2025-10-17 02:18:07 +00:00
PyTorch MergeBot	11e2084308	Revert "[Mem Snapshot] Add Metadata Field (#165490 )" This reverts commit 5b3ea758951558e7d9f681ae784acb57eaa07910. Reverted https://github.com/pytorch/pytorch/pull/165490 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165490#issuecomment-3413491091)) trunk/11e20843086cf58b3976ed3ac75ac1bbbebd715d	2025-10-17 02:01:53 +00:00
Aaron Gokaslan	9726553653	[BE][Ez]: Use sys.executable instead of hardcoded Python (#165679 ) Handles edgecase to ensure proper interpreter is called. Inspired by #165633 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165679 Approved by: https://github.com/FindHao trunk/9726553653ee1c53fc9a1f436a92b29f456082ca	2025-10-17 01:07:40 +00:00
Shangdi Yu	d82527b32a	[Windows] Add AOTI cross-compilation CI (#165573 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165573 Approved by: https://github.com/malfet ghstack dependencies: #165560 trunk/d82527b32ad0e09309ff874458139ecf6994e030	2025-10-17 01:05:35 +00:00
Shangdi Yu	5d9b024276	Add mingw to docker (#165560 ) Add mingw to `pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11` docker image to support AOTI cross-compilation This PR will make docker container rebuild, and upgrade python version from 3.13.7 to 3.13.8. and it relies on https://github.com/pytorch/pytorch/pull/165667 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165560 Approved by: https://github.com/malfet trunk/5d9b0242762e7a416a789365e987b63dfe6b030a	2025-10-17 00:47:01 +00:00
Yuanyuan Chen	5b2afe4c5d	Turn some const variables into constexpr in C++ code (#165401 ) This PR checks the C++ code and turns some const variables into constexpr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165401 Approved by: https://github.com/Skylion007 trunk/5b2afe4c5dc87786ca65bf22ca9a78f7c21a33a4	2025-10-17 00:40:11 +00:00
Yuanyuan Chen	b2953f5643	[9/N] Apply ruff UP035 rule (#165515 ) This is follow-up of #165214 to continue applying ruff UP035 rule to the code base. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165515 Approved by: https://github.com/Lucaskabela trunk/b2953f5643c6627d2bd0ceb9d2ccb32e2545c549	2025-10-17 00:09:51 +00:00
PyTorch MergeBot	470e2f61c3	Revert "[Fix] Use sys.executable instead of hardcoded python (#165633 )" This reverts commit 37f3ba274a8ccebc6b3409f52cf068a8b23617d4. Reverted https://github.com/pytorch/pytorch/pull/165633 on behalf of https://github.com/malfet due to Looks like it broke test_collect_callgrind in slow workflows, see `e0fe37fa68/1` ([comment](https://github.com/pytorch/pytorch/pull/165633#issuecomment-3413290813)) trunk/470e2f61c3b2083e8d895b6aae5ede198bba5696	2025-10-17 00:06:40 +00:00
Kurt Mohler	e0fe37fa68	[MPS] Move `torch.cat` impl to Metal (#165373 ) After this change, all of the cases tested in [this performance measurement script](`10de64c5ac/cat/perf0.py`) take either roughly the same runtime or less. Before: ``` idx: cpu time, mps time, speedup, op, args, kwargs ----------------------------------------- 0: 0.000857 ms, 0.016098 ms, 0.05, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': -1} 1: 0.000858 ms, 0.014861 ms, 0.06, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': 1} 2: 0.000806 ms, 0.015145 ms, 0.05, cat, [[tensor(shape[10, 5]), tensor(shape[5, 5])]], {'dim': 0} 3: 0.000829 ms, 0.015355 ms, 0.05, cat, [[tensor(shape[1, 2, 3]), tensor(shape[1, 2, 3])]], {'dim': -2} 4: 0.000591 ms, 0.000582 ms, 1.02, cat, [[tensor(shape[0]), tensor(shape[0])]], {'dim': 0} 5: 0.001076 ms, 0.022387 ms, 0.05, cat, [[tensor(shape[0]), tensor(shape[5, 5])]], {'dim': 1} 6: 0.000708 ms, 0.022300 ms, 0.03, cat, [[tensor(shape[0, 5]), tensor(shape[5, 5])]], {'dim': 0} 7: 0.000640 ms, 0.014367 ms, 0.04, cat, [[tensor(shape[1]), tensor(shape[1])]], {} 8: 0.000777 ms, 0.027506 ms, 0.03, cat, [[tensor(shape[2, 2, 2, 2])], 1], {} 9: 0.003383 ms, 0.269277 ms, 0.01, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1} 10: 0.526138 ms, 0.650852 ms, 0.81, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1} 11: 0.444091 ms, 0.628630 ms, 0.71, cat, "[[tensor(shape[1, 3, 2]), tensor(shape[2, 3, 2]), tensor(shape[3, 3, 2]), tensor(shape[1, 3, 2]), te...", {'dim': 0} 12: 2.011870 ms, 0.989525 ms, 2.03, cat, [[tensor(shape[1000000, 3, 2]), tensor(shape[1000000, 3, 2])]], {'dim': 0} 13: 3.100653 ms, 0.948178 ms, 3.27, cat, [[tensor(shape[3, 1000000, 2]), tensor(shape[3, 1000000, 2])]], {'dim': 1} 14: 3.112174 ms, 0.954174 ms, 3.26, cat, [[tensor(shape[3, 2, 1000000]), tensor(shape[3, 2, 1000000])]], {'dim': 2} ``` After: ``` idx: cpu time, mps time, speedup, op, args, kwargs ----------------------------------------- 0: 0.000790 ms, 0.013111 ms, 0.06, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': -1} 1: 0.000800 ms, 0.014419 ms, 0.06, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': 1} 2: 0.000748 ms, 0.010019 ms, 0.07, cat, [[tensor(shape[10, 5]), tensor(shape[5, 5])]], {'dim': 0} 3: 0.000767 ms, 0.010063 ms, 0.08, cat, [[tensor(shape[1, 2, 3]), tensor(shape[1, 2, 3])]], {'dim': -2} 4: 0.000591 ms, 0.000591 ms, 1.00, cat, [[tensor(shape[0]), tensor(shape[0])]], {'dim': 0} 5: 0.001220 ms, 0.009763 ms, 0.12, cat, [[tensor(shape[0]), tensor(shape[5, 5])]], {'dim': 1} 6: 0.000739 ms, 0.006203 ms, 0.12, cat, [[tensor(shape[0, 5]), tensor(shape[5, 5])]], {'dim': 0} 7: 0.000647 ms, 0.009905 ms, 0.07, cat, [[tensor(shape[1]), tensor(shape[1])]], {} 8: 0.000753 ms, 0.007818 ms, 0.10, cat, [[tensor(shape[2, 2, 2, 2])], 1], {} 9: 0.003823 ms, 0.192723 ms, 0.02, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1} 10: 0.576564 ms, 0.733920 ms, 0.79, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1} 11: 0.462957 ms, 0.692799 ms, 0.67, cat, "[[tensor(shape[1, 3, 2]), tensor(shape[2, 3, 2]), tensor(shape[3, 3, 2]), tensor(shape[1, 3, 2]), te...", {'dim': 0} 12: 2.017181 ms, 0.968345 ms, 2.08, cat, [[tensor(shape[1000000, 3, 2]), tensor(shape[1000000, 3, 2])]], {'dim': 0} 13: 3.203508 ms, 0.986382 ms, 3.25, cat, [[tensor(shape[3, 1000000, 2]), tensor(shape[3, 1000000, 2])]], {'dim': 1} 14: 3.181249 ms, 1.007773 ms, 3.16, cat, [[tensor(shape[3, 2, 1000000]), tensor(shape[3, 2, 1000000])]], {'dim': 2} ``` Fixes #165350 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165373 Approved by: https://github.com/kulinseth, https://github.com/malfet trunk/e0fe37fa687a39e42ddeeb5c03986ffd5c40e662	2025-10-17 00:03:04 +00:00
PyTorch MergeBot	d2c82bafb7	Revert "158232 Fix autocast cache incorrectly retaining no_grad state (#165068 )" This reverts commit 5daef30b26b794d237fbbc399c1d47ec0380200a. Reverted https://github.com/pytorch/pytorch/pull/165068 on behalf of https://github.com/jeffdaily due to This broke ROCm CI. test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_fastpath_use_torchscript_False_enable_nested_tensor_True_use_autocast_True_d_model_256_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/18572589089/job/52952074008) [HUD commit link](`5daef30b26`) ([comment](https://github.com/pytorch/pytorch/pull/165068#issuecomment-3413184445)) trunk/d2c82bafb7086a1dd109a0a6407ca7fed27337f4	2025-10-16 23:08:27 +00:00
Colin L Reliability Rice	98a488c9aa	Start recording inductor provenance (#162669 ) Summary: This stores information on where fx graphs come from, which makes it significantly easier to debug. One outstanding question 1) I only stored the kernel stack traces, do we also want the node mappings? Test Plan: I wrote a explicit logging test which makes a module, fx traces it, compiles it, and makes sure the logging infomration shows up. ``` clr@devvm17763 ~/fbsource/fbcode/caffe2/test/dynamo % buck2 test @//mode/opt fbcode//caffe2/test/dynamo:test_dynamo -- test_utils File changed: fbsource//xplat/caffe2/test/dynamo/test_utils.py File changed: fbcode//caffe2/test/dynamo/test_utils.py Buck UI: https://www.internalfb.com/buck2/528dea32-2416-4a62-a1ec-39f3c0efdd2e Test UI: https://www.internalfb.com/intern/testinfra/testrun/13229324015574003 Network: Up: 0B Down: 0B Executing actions. Remaining 0/2 Command: test. Time elapsed: 17.3s Tests finished: Pass 16. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Rollback Plan: Differential Revision: D82037582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162669 Approved by: https://github.com/yushangdi trunk/98a488c9aaadd4b137b7a63dad31543aee75c454	2025-10-16 23:05:31 +00:00
Shivam Raikundalia	5b3ea75895	[Mem Snapshot] Add Metadata Field (#165490 ) Summary: The implementation adds the ability to: Set custom metadata strings that will be attached to all subsequent allocations Clear or change the metadata at any point View the metadata in memory snapshots via _dump_snapshot() Test Plan: Added test in test_cuda.py and check manually in snapshot to see that metadata was added. Differential Revision: D84654933 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165490 Approved by: https://github.com/yushangdi trunk/5b3ea758951558e7d9f681ae784acb57eaa07910	2025-10-16 22:54:27 +00:00
Pian Pawakapan	556fc09a9f	[DebugMode][1/N] refactor logs into _DebugCalls (#165376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165376 Approved by: https://github.com/SherlockNoMad trunk/556fc09a9f67f24ca5591ec049c5d0c347c5f62a viable/strict/1760675352	2025-10-16 22:43:52 +00:00
Nikita Shulga	ce109b3f79	Add `torch.backends.mkldnn.is_acl_available()` method (#165678 ) That tells whether or not PyTorch was compiled with Arm Compute Library Pull Request resolved: https://github.com/pytorch/pytorch/pull/165678 Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/albanD ghstack dependencies: #165583, #165584, #165676 trunk/ce109b3f79d47618c37d11fa7066d05e9158f803 viable/strict/1760673121	2025-10-16 22:34:21 +00:00
Nikita Shulga	4d833f859b	[BE] [CI] Fix aarch64 arch checks (#165676 ) Instead of relying on `TEST_CONFIG` environment variable to contain `aarch64`, which is prone to errors, use output of `$(uname -m)` that is equal to `aarch64` on Linux ARM systems Pull Request resolved: https://github.com/pytorch/pytorch/pull/165676 Approved by: https://github.com/huydhn, https://github.com/atalman ghstack dependencies: #165583, #165584 trunk/4d833f859b89f3401e4b395dc1b31cc683ab6019 viable/strict/1760671704	2025-10-16 22:19:53 +00:00
Wei Wang	d7e275d4b4	[CI][CUDA] Add periodic b200 distributed job (#159323 ) 1. Run distributed job with B200 runner, periodically. 2. discovered generic distributed test issue that certain unit test hard-coded ranks, calling for require_exact_world_size(world_size) API instead of require_world_size(world_size). Pull Request resolved: https://github.com/pytorch/pytorch/pull/159323 Approved by: https://github.com/eqy Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com> trunk/d7e275d4b43105a23db9c958b1675b543584747f	2025-10-16 21:54:04 +00:00
Jithun Nair	d5db3aee0d	[CI] Use 1-GPU runners for rocm-mi355.yml (#165658 ) Should only need 1-GPU runners for rocm-mi355.yml since it runs `default` test config which only needs 1 GPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/165658 Approved by: https://github.com/jeffdaily trunk/d5db3aee0d22db69968d32a7190e775d7120de81 viable/strict/1760670600	2025-10-16 21:53:22 +00:00
Maggie Moss	5641de7b6b	Add suppressions for _inductor/codegen (#165659 ) Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165659 Approved by: https://github.com/oulgen trunk/5641de7b6b3470f8e7709554dd582f8a0b1a4ee0	2025-10-16 21:37:37 +00:00
Nicolas De Carli	cbc08c8993	Add NEON acceleration for `Vectorized<int[8\|16\|32\|64>` (#165273 ) Summary: Adding NEON specializations of Vectorized<T> for int8, int16, int32 and int64. Correcness has been checked using test_ops.py and the comprehensive torch test operator_benchmark_test.py has been enhanced by adding cases of bitwise operations, boolean ops and integer ops. The benchmark, which uses the PyTorch API, shows significant enhancements in a wide variety of operations: Before: bitwise xor: 779.882us boolean any: 636.209us boolean all: 538.621us integer mul: 304.457us integer asr: 447.997us After: bitwise xor: 680.221us ---> 15% higher throughput boolean any: 391.468us ---> 63% higher throughput boolean all: 390.189us ---> 38% higher throughput integer mul: 193.532us ---> 57% higher throughput integer asr: 179.929us---> 149% higher throughput Test Plan: Correctness: buck2 test @mode/opt //caffe2/test:test_ops buck2 test @mode/opt //caffe2/test:torch buck2 test @mode/opt //caffe2/test/distributed/launcher/fb:fb_run_test Performance: buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test Differential Revision: D84424638 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165273 Approved by: https://github.com/malfet trunk/cbc08c899310f1e24251ee56060b79721829d212	2025-10-16 21:35:13 +00:00
Yiming Zhou	1a54d3333d	[easy] Fix graph_capture in aot_joint_with_descriptors test (#165660 ) when `with_export=True`, `aot_export_joint_with_descriptors` should take the graph produced by `_dynamo_graph_capture_for_export` ``` python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_simple python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_flex_attention ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165660 Approved by: https://github.com/yushangdi trunk/1a54d3333de6b9d2e8aa785b3d791c87201be45a	2025-10-16 21:10:11 +00:00
Aaron Orenstein	4c1c341fa0	FakeTensorMode shouldn't cache syms when tracing (#164718 ) Improve FakeTensor cache to handle SymNode and tracing properly. For now, when we're proxy tracing just don't bother caching operations that contain SymNodes in the output. The problem is that the proxy tracer relies on SymNode identity and our cache doesn't preserve that. It can be fixed (and I left some notes in _validate_symbolic_output_for_caching() how) but it's not worth it for now. If we aren't proxy tracing then caching is fine. Thus these changes: 1. Our cache key needs to include whether we were actively tracing or not - this way if we create a cache entry when we weren't tracing and then we try to use it when we ARE tracing it gets rerun. 2. If there's a SymNode in the output then bypass tracing. 3. Some general cleanup of the output validation - we were unnecessarily doing it as a two-step process when it could just be a single step (it's still two parts internally but only a single outer try/except). Pull Request resolved: https://github.com/pytorch/pytorch/pull/164718 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #165266, #164717 trunk/4c1c341fa06e6ac2cf7d9089e9628529f89b0b62	2025-10-16 20:57:07 +00:00
Aaron Orenstein	5f21cc786a	Teach ProxyTorchDispatchMode how to decompose sympy.Expr into known inputs (#164717 ) In a training library we hit a weird conflict between dtensor, dynamic shapes, and proxy tensor. The problem is occuring because in sharding_prop we use FakeTensors to compute an operation size (so we don't have to use the full "real" data). We turn off proxy tracing while we're doing that because we don't want the FakeTensor ops to end up in the graph. We then use that size when doing later operations. Normally this is no problem - but when those sizes are dynamic shapes then we have a problem - the proxy tracer wants to track the provenance of all shape operations (`s1*s2`) but since tracing is disabled it doesn't see the operation and when we then use the result shape later on the proxy tracer gets all confused (because the SymNode appeared out of nowhere). At first we were thinking to never disable shape tracing - but that caused a slew of other downstream problems (lots of code that actually needs the shape tracing to be disabled) so instead we enable having a "sym tracing override" and surgically when we disable proxy tracing we leave shape tracing enabled. After this change the dtensor embedding is "fixed" but then runs afoul of a FakeTensor cache bug - which is fixed in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164717 Approved by: https://github.com/bobrenjc93, https://github.com/ezyang ghstack dependencies: #165266	2025-10-16 20:57:06 +00:00
Aaron Orenstein	e86942f422	minor proxy_tensor reorg (#165266 ) Moving some code around in proxy_tensor in preparation for the next PR. There we no actual changes (other than simple relabeling such as `self.tracer` -> `tracer`): - Move _compute_proxy() out of ProxyTorchDispatchMode. - Give `sympy_expr_tracker` a structured type instead of `object`. - Split SymNode registration out of ProxyTorchDispatchMode.__sym_dispatch__() so it can be reused. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165266 Approved by: https://github.com/ezyang, https://github.com/mlazos	2025-10-16 20:57:06 +00:00
Dzmitry Huba	2cd5fd1588	Enable local tensor mode on DTensor view ops test (#165596 ) While enabling this test discovered lack of support for sub meshes. Added limited support for sub meshes by properly computing rank coordinates for a given sub mesh. The implementation follows similar approach to collectives. We infer all sub meshes for the given dimensions and compute each rank's coordinates with respect to is sub mesh. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165596 Approved by: https://github.com/ezyang trunk/2cd5fd15882ad940caa28346bac23b9f5ff2c893 viable/strict/1760668990	2025-10-16 20:52:06 +00:00
Oguz Ulgen	7d0f872cb3	Use union syntax in torch/_inductor runtime and fx_passes (#165652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165652 Approved by: https://github.com/aorenste trunk/7d0f872cb36841e9e975002bcee16aa3177c7f46 viable/strict/1760666101	2025-10-16 20:51:59 +00:00
PyTorch MergeBot	fb06e49ce8	Revert "[inductor] print 0.0 as 0 for triton (#164291 )" This reverts commit 99b32a6750bfd0cfe2bc84a47823e1da34802b7b. Reverted https://github.com/pytorch/pytorch/pull/164291 on behalf of https://github.com/malfet due to Broke slow job, see `aba8c43594/1` ([comment](https://github.com/pytorch/pytorch/pull/164291#issuecomment-3412768915)) trunk/fb06e49ce86c120cb070b0b28c7bd49785a68e43	2025-10-16 20:44:29 +00:00
PyTorch MergeBot	27a98e6ae9	Revert "[DeviceMesh] Prefer using _layout over _mesh for all sorts of things (#165554 )" This reverts commit d61a9b88cf3be04a29c5a7d6e9622ae5e8d51de3. Reverted https://github.com/pytorch/pytorch/pull/165554 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see `aba8c43594/1` ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681)) trunk/27a98e6ae97a0f82c2deba225b1142b73be2e639	2025-10-16 20:41:37 +00:00
PyTorch MergeBot	b10f463b1a	Revert "[DeviceMesh] Introduce private constructor instead of _create_mesh_from_ranks (#165555 )" This reverts commit 99097b6d89c927c15180ff4683c38be01f9955f6. Reverted https://github.com/pytorch/pytorch/pull/165555 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see `aba8c43594/1` ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681))	2025-10-16 20:41:37 +00:00
PyTorch MergeBot	431c13cf61	Revert "[DeviceMesh] Simplify unflatten method (#165556 )" This reverts commit 86fd4fc23e697e275d37c36e3cbe521f156434fd. Reverted https://github.com/pytorch/pytorch/pull/165556 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see `aba8c43594/1` ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681))	2025-10-16 20:41:37 +00:00
Ketan Ambati	aead9270f5	12/n : Remove fbandroid_compiler_flags (#165558 ) Summary: Currently `get_c2_fbandroid_xplat_compiler_flags()` is reading the `caffe2.strip_glog` buckconfig which we want to get rid of. This diff removes the `fbandroid_compiler_flags` arg and merges it with compiler_flags with a nested select and the select version of the method The goal is to get rid of all the usages of `get_c2_fbandroid_xplat_compiler_flags()` so that we can get rid of the `caffe2.strip_glog` buckconfig Test Plan: CI bifferential Revision: D84626885 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165558 Approved by: https://github.com/malfet trunk/aead9270f56ebc7302c7f5fa7e5dff959f26608e	2025-10-16 20:41:24 +00:00
Janani Sriram	9bf5b38c14	[Inductor][Triton][FP8] Refactor scaled_mm template to accept scaling mode (#164318 ) Summary: Refactor `scaled_mm` Inductor template to support template choice based on scaling mode. This modification sets up the infrastructure for adding new templates based on new scaling modes, such as deepseek-style scaling (a follow-up diff), as new scaling modes (deepseek, block, group) scale before the accumulation (as opposed to per-tensor and per-row scaling, which apply scaling after accumulation). This modification also further enables Inductor to infer a scaling type based on the shape of the scaling tensors, which makes existing infrastructure more extensible to new scaling modes. Test Plan: ``` TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 256 --n 768 --k 512 --output="/home/jananisriram/personal/random_bench.csv" --scaling_rowwise --atol=20 --rtol=2 2>&1 \| tee ~/personal/random.log ``` bifferential Revision: D83591083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164318 Approved by: https://github.com/drisspg, https://github.com/slayton58 trunk/9bf5b38c14f7f7c627d2c8775b203f1b3d61597e	2025-10-16 20:40:45 +00:00
Tristan Trouwen	aba8c43594	Register var for MTIA (#165382 ) Summary: Registers variance kernel Reviewed By: srsuryadev Differential Revision: D84546250 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165382 Approved by: https://github.com/malfet trunk/aba8c43594a83772281a62a7961c0b6ddcff321d	2025-10-16 20:35:15 +00:00
linhaifeng	37f3ba274a	[Fix] Use sys.executable instead of hardcoded python (#165633 ) Replace hardcoded "python" string with sys.executable to ensure correct Python interpreter is used. This fixes failures on systems with multiple Python runtimes or where "python" is not in PATH. Similar to pytorch/pytorch#155918 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/165633 Approved by: https://github.com/Skylion007 trunk/37f3ba274a8ccebc6b3409f52cf068a8b23617d4	2025-10-16 20:26:10 +00:00
IvanKobzarev	585b9dbb5e	[async_tp] Support ag+mm with gather_dim lastdim of mat_A (#163068 ) Adding ag+mm support for the case, when gather_dim is last dim of matmul (reduction dim). When we decompose matmul by reduction dimension we result in partials that needs additional reduction, we allocate memory for accumulator. Decomposition should not produce small (thin) mms that can not efficiently load the GPU. Limiting for minimal size of the shard 1024 (found empirically by testing in torchtitan). scaled_mm is not supported yet for this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163068 Approved by: https://github.com/ngimel trunk/585b9dbb5ed8a149e2c6a196537aafe065b61ec4	2025-10-16 20:14:39 +00:00
Maggie Moss	d795fb225a	[RFC] Add pyrefly to lintrunner (#165179 ) This will add pyrefly to lint runner as a warning only - and allow us to collect feedback about the tool before switching to pyrefly as the main type checker. References the steps outlined here: : https://github.com/pytorch/pytorch/issues/163283: test plan: `lintrunner init` `lintrunner` confirm when pyrefly errors are present results look like: https://gist.github.com/maggiemoss/e6cb2d015dd1ded560ae1329098cf33f Pull Request resolved: https://github.com/pytorch/pytorch/pull/165179 Approved by: https://github.com/ezyang trunk/d795fb225ace717f692ceb3f1d20dfb35afbdf8a	2025-10-16 20:07:09 +00:00
tvukovic-amd	7df9aca529	[ROCm][Windows] Enable AOTriton runtime compile on Windows (#165538 ) AOTriton uses prebuilt runtime binaries if the user's ROCm version matches the ones used to generate the prebuilt runtime. However, since there's no prebuilt runtime available for Windows, this check needs to be bypassed for Windows. This PR enables it by changing condition to always build AOTriton runtime from source on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165538 Approved by: https://github.com/xinyazhang, https://github.com/jeffdaily trunk/7df9aca52946ae47ca4d98dbe0685a412fbc77b8	2025-10-16 19:51:43 +00:00
Shangdi Yu	d4a713cd9c	Change forkserver test to only run below 3.13.8 (#165667 ) A multiprocessing bug is fixed in 3.13.8, see [https://docs.python.org/3.13/whatsnew/changelog.html](https://l.workplace.com/l.php?u=https%3A%2F%2Fdocs.python.org%2F3.13%2Fwhatsnew%2Fchangelog.html&h=AT0qUhHJq5c2UJvQaq9_MrSo0mVhwn1VOfq1nDQl2C1UOhDI80RMbzVayhG7LSAT1uYHKtkftKnBDwiGMhbw0YRvQLe5vwE01qejpPFautHvU3LXeOE1KChPykqz3qnCRzk7czu_iNzQ05shR4F1N_qYOzR5YxejA52ZZQ), [gh-126631](https://github.com/python/cpython/issues/126631) So this test will fail when we update to python 3.13.8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165667 Approved by: https://github.com/malfet trunk/d4a713cd9c8ea1dc13917d3311d73c13914306a6	2025-10-16 19:34:10 +00:00
Sean McGovern	5daef30b26	158232 Fix autocast cache incorrectly retaining no_grad state (#165068 ) Fixes #158232 The autocast caching heuristic in `aten/src/ATen/autocast_mode.cpp:139` did not account for gradient mode state when deciding whether to cache. FSDP2 is not directly related. ~~This PR adds `GradMode::is_enabled()` check to caching condition. Caching is now disabled in `no_grad()` contexts to prevent storing tensors with incorrect gradient state. Ensures correctness at the cost of using cache.~~ This PR proposes separate caches for gradient-enabled and gradient-disabled modes. Adds tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165068 Approved by: https://github.com/ngimel, https://github.com/janeyx99 trunk/5daef30b26b794d237fbbc399c1d47ec0380200a	2025-10-16 19:32:01 +00:00

1 2 3 4 5 ...

94597 Commits