Summary:
As title
On Windows, we cannot modify the .dll to append weights at the end; the Windows .dll loader will complain it's not a valid .dll file. So we store the weight blob as a separate file.
1. We add the following APIs, which allow querying the size of the weight blob and loading the weights from a pointer to the blob.
```cpp
AOTI_API AOTIRuntimeError AOTInductorModelContainerGetConstantsBlobSize(
AOTInductorModelContainerHandle container_handle,
uint64_t* ret_size);
// Load weights from a single blob in weight_blob_ptr
AOTI_API AOTIRuntimeError AOTInductorModelUpdateConstantsFromBlob(
AOTInductorModelContainerHandle container_handle,
const uint8_t* weight_blob_ptr);
```
2. We also add a method in ModelContainerRunner to load the weights: if the runner sees a `.blob` file in the package, it will mmap the .blob file and use its content to load the constants.
3. We also add the `USE_MMAP_EXTERNAL` macro. When this macro is defined, the model expects to load its weights from an external mmap'd weight blob.
Test Plan:
```
buck run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_large_mmaped_weights_on_disk
```
Also tested Windows cross-compilation with 6542566585/demo/main_voxtral.cpp
```
Loaded model.dll
audio_encoder loaded
C:\Users\shangdiy\source\repos\torchnative\demo\token_embedding\data\aotinductor\model\model.wrapper.so
Loaded model.dll
token_embedding loaded
C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper.so
Loaded model.dll
Loading weights from C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper_weights.blob
text_decoder loaded
Load latency (ms):
audio_encoder: 1011.234
archive extraction: 0.000
.so loading: 1011.197
token_embedding: 525.773
archive extraction: 0.000
.so loading: 525.704
text_decoder: 3324.130
archive extraction: 0.000
.so loading: 3323.979
Run latency (ms):
audio_encoder: 285.958
audio_encoder output: dtype=bfloat16, shape=[1, 1125, 3072], numel=3456000
token_embedding: 6.676
token_embedding output: dtype=bfloat16, shape=[1, 1138, 3072], numel=3495936
text_decoder: 576.519
text_decoder output: dtype=bfloat16, shape=[1, 1138, 131072], numel=149159936
```
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben
Differential Revision: D84093310
Pulled By: yushangdi
Previously we hardcoded the assumption in cuDNN that the inputs would be dense, which breaks when, e.g., the user is chunking tensors, yielding noncontiguous inputs.
New test added to check this when `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1` is set in `test/test_transformers.py`
One issue I noticed was that the old gating of nested tensor in `sdp_utils.cpp` seems to be a no-op? All of the inputs are reported as "dense" by the time that function is called in the nested tensor tests in `test/test_nestedtensor.py -k sdpa`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164958
Approved by: https://github.com/Skylion007, https://github.com/drisspg
Summary: We have an internal user for whom caching broke because the unzipped paths are probably different per host. We can't think of a use case where a path change matters when the file content has not changed, so we remove this part.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165020
Approved by: https://github.com/oulgen
In aot_stage2_autograd:
Before calling fw_compiler, we run pre_compile for the following wrappers:
* FakifiedOutWrapper
* FunctionalizedRngRuntimeWrapper
After, we run post_compile for the following wrappers:
* EffectTokensWrapper
* AOTDispatchSubclassWrapper
* FunctionalizedRngRuntimeWrapper
* FakifiedOutWrapper
In aot_stage2_inference:
Before calling inference compiler, we run pre_compile for the following wrappers (same as above):
* FakifiedOutWrapper
* FunctionalizedRngRuntimeWrapper
After, we run post_compile for the following wrappers (different than above):
* FunctionalizedRngRuntimeWrapper
* FakifiedOutWrapper
* EffectTokensWrapper
* AOTDispatchSubclassWrapper
This PR makes both do the post_compiles in the same order.
Differential Revision: D84213657
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165016
Approved by: https://github.com/zhxchen17, https://github.com/bdhirsh
This commit makes several cleanup changes to MHA.cpp, the main
one of which is removal of shared_ptr from MHAGraphCache as the
cache does not actually intend to share ownership. The changes are:
1. Remove shared_ptr from MHAGraphCache
2. Remove template arguments from MHAGraphCache
3. Remove unnecessary optional<shared_ptr<...>> vars
4. Change some functions with auto return type to the actual type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164895
Approved by: https://github.com/eqy
The Windows cpp tests take ~1 hour according to logs. Each had run_test called on it individually, so I batched them together into a single run_test call for all of them. I believe it now takes 30 min. I turned off TD since I don't think cpp tests are included in TD stuff.
As always with batch scripts, I'm not sure if the errorlevel/error-surfacing handling is correct.
This code is written with a lot of help from ChatGPT and Copilot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164861
Approved by: https://github.com/huydhn
This PR updates build jobs that currently use linux.12xlarge to the
c7i variant, which should improve build times by 15% - 20% depending
on the job and reduce the costs of these jobs by 10% - 15%.
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
Summary:
* Add `torch.nn.functional.scaled_mm` as an abstraction around the C++
methods
* Wraps `torch._scaled_mm_v2` API by default, but user can force use of
the older `torch._scaled_mm` interface.
* Scaled MM tests now run on the new API
Test Plan:
`pytest test/test_scaled_matmul_cuda.py`
Signed-off-by: Simon Layton <simonlaytonmeta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164142
Approved by: https://github.com/drisspg
ghstack dependencies: #164141
Summary:
* Add new scaled-MM API to future-proof / clean-up existing code.
* Scaling is explicitly described rather than inferred
* Swizzling of scales must now be defined (vs. inferred)
* Adds API support for multi-level scaling
* Refactor dispatch logic to make it easier to add new implementations
Signed-off-by: Simon Layton <simonlaytonmeta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164141
Approved by: https://github.com/drisspg
This PR includes a couple of changes to extend the FlightRecorder dump triggered by the PyTorch watchdog:
- New knobs to control FR dump as suggested in the public documentation even for watchdog
(TORCH_INCLUDE_STACK_TRACE, TORCH_INCLUDE_ONLY_ACTIVE)
- Trigger the flight recorder dump on exceptions which could be triggered by any CUDA / host side error
(TORCH_NCCL_EXTRA_DUMP_ON_EXEC)
-> Can be used as a snapshot of the workload progress for post-mortem analysis
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164591
Approved by: https://github.com/fduwjj
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition. You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now preserve the fused call by default.
This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.
Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
ghstack dependencies: #164573
In https://github.com/pytorch/pytorch/pull/106824, export decided to slow-path for MultiHeadAttention module (look into the PR description as to why). But that PR eventually caused a divergence between Dynamo and export.
Today, strict-export does not inline into builtin modules (like MultiHeadAttention), and therefore make_fx sees the original nn.Module and takes the slow path. But compile inlines into the nn module, and at this time the condition `_is_make_fx_tracing` is False. As a result, Dynamo takes a fast path, resulting in a different op being called.
This divergence is undesirable. There are 2 ways to fix it
1) Make export take the fast path - As explained in https://github.com/pytorch/pytorch/pull/106824, this might be difficult. So, we go to (2).
2) Make compile take the slow path as well - This is easy to implement. The con here is that PyTorch eager and compile will use different operators, which can cause numerics issues, etc.
Since (2) is easy to do, we will follow this path. We are tracking the issue in https://github.com/pytorch/pytorch/issues/164062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164721
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
Summary: Noticed that sometimes the combo kernel partition contains an empty group. Skip kernel generation in this case to unblock head model launching. The change in this diff is safe, but it's better to root-cause why the empty group is being created.
Test Plan:
Lowering passed after applying the diff
Differential Revision: D84134471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164918
Approved by: https://github.com/mlazos
We also want to have a Python-side API for users to reset FR recording of FR entries. We don't need to reset PGNCCL's member counter since we are creating a new PGNCCL anyway, but FR is a global ring buffer, so we need to reset it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164988
Approved by: https://github.com/tushar00jain
ghstack dependencies: #164752
Builds on top of https://github.com/pytorch/pytorch/pull/163673 and https://github.com/pytorch/pytorch/pull/164174. This will be used in the followup PRs to apply regional inductor compilation.
The existing implementation let Dynamo trace into `torch.fx.traceback.annotate`, but that's not what we want. We want Dynamo to essentially run the torch.fx.traceback.annotate function in eager, so that every FX node created in the Dynamo FX graph carries the custom metadata (a usage sketch follows the notes below).
What does not work?
* We still have to set the context manager `torch.fx.traceback.preserve_node_meta()` in the user code because CI was unhappy. This can be fixed but with some perseverance.
* This does not work with graph breaks yet. But we can solve that problem, if needed, in a separate PR.
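Below is a minimal sketch (not from the PR) of how the annotation might be exercised with `torch.compile` after this change, assuming the `annotate` API from #163673; note the explicit `preserve_node_meta()` context mentioned above.
```python
import torch
import torch.fx.traceback as fx_traceback

def fn(x):
    # Eager-executed annotation: every FX node Dynamo creates for this
    # region should carry {"pp_stage": 0} in its custom metadata.
    with fx_traceback.annotate({"pp_stage": 0}):
        x = x + 1
    return x * 2

# Still required in user code for now, per the note above.
with fx_traceback.preserve_node_meta():
    compiled = torch.compile(fn, backend="aot_eager", fullgraph=True)
    out = compiled(torch.randn(4))
```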
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164678
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel, https://github.com/xmfan
- Update the Memory Estimator to use node storages for analysis, which simplifies bookkeeping compared to manually inspecting operator schemas. This will also allow me to reuse this component elsewhere.
- Factor it out into a separate class, so that the same logic can be used in scheduling (node allocations / aliasing / uses).
- Add tests for correctness - right now only on fwd/bwd individually, not with both.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164783
Approved by: https://github.com/ruisizhang123
ghstack dependencies: #164738
Turns out codegen'ing a nested step graph break is significantly more complicated than first thought. The optimized function should actually:
- call the graph / load values / apply side effects, etc.
- call into the leaf's resume function, but skipped (this is essentially a step graph break function for just the leaf function)
- call into all the other resume functions, traced.
This PR also adds `torch._dynamo.step_unsupported()`, which can be used for internal testing purposes to better test step graph break handling.
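A minimal sketch of how `torch._dynamo.step_unsupported()` could be used in a test, purely illustrative:
```python
import torch

def fn(x):
    x = x + 1
    # Forces a step graph break at this point (internal testing helper).
    torch._dynamo.step_unsupported()
    return x * 2

out = torch.compile(fn, backend="eager")(torch.randn(3))
```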
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162737
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #160601
This is needed because if we codegen cells for nested frames AFTER side effects, then reconstruction could get messed up. From below:
>The added test case demonstrates the reconstruction failure if we kept cell codegen at the original place (only happens with nested graph breaks since we reconstruct nested frame cells from VariableTracker rather than directly using LOAD_CLOSURE).
>At a high level, what happened before this change was that side_effects was pruning the cells (I don't recall exactly why this happens), and because cells were codegen'd after the side effects were applied, we were unable to properly reconstruct the cell. The error I was seeing was a list/tuple IndexError.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160601
Approved by: https://github.com/mlazos
## MOTIVATION
To generalize Distributed checkpoint test cases for non-CUDA devices
## CHANGES
18 test files with minimal device abstraction changes updated in
test/distributed/checkpoint/
- Use device_type from DTensorTestBase wherever appropriate
- Replaced hard coded device names with torch.accelerator.current_accelerator()
- Extend the multi-GPU decorator to other devices
test/distributed/checkpoint/test_state_dict_stager.py has a large diff; that's because I renamed cuda_obj to gpu_obj. The functional change is minimal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159242
Approved by: https://github.com/guangyey, https://github.com/d4l3k
**Summary:** Created a test so that we can verify that a model that has been pipelined + replicated has the same gradients as a reference model. To do this, I mapped the layers and their parameters in each partial model to the original full model and then compared the gradients.
**Test Case**
1. pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp_grads
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164890
Approved by: https://github.com/H-Huang
This is mostly a mechanical change which makes all DeviceMesh members private and uses public property APIs instead. This is not a BC-breaking change since the new API still guarantees BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164954
Approved by: https://github.com/fegin
ghstack dependencies: #164750
This is going to be used in https://github.com/pytorch/torchtitan/issues/1682
Add a `register_custom_function` to the `_PipelineScheduleRuntime` which allows users to implement any custom function to replace the runtime operation dynamically.
The signature of the callback should look like:
```python
class _CustomFunctionProtocol(Protocol):
def __call__(self, action: _Action, ctx: _PipelineContext) -> None: ...
```
`_PipelineContext` contains a reference to the schedule which is executing the operations.
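As an illustration (not from this PR), a callback conforming to the protocol above might look like the sketch below; the exact `register_custom_function` signature is an assumption here.
```python
def my_forward(action, ctx):
    # ctx is a _PipelineContext; ctx.schedule is assumed to reference the
    # executing runtime, and action describes the scheduled FORWARD op.
    print(f"custom FORWARD handler for {action}")

# schedule: an existing _PipelineScheduleRuntime instance.
# Assumed registration shape: map a computation type to the callback.
# schedule.register_custom_function(FORWARD, my_forward)
```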
### Testing
Added a test which adds custom methods for `FORWARD` and `OVERLAP_F_B` which are just the same implementations as those used in the default schedule runtime. Check that the schedule can still run, numerics are correct, and the callbacks are executed the correct number of times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162016
Approved by: https://github.com/fegin
We allowed passing in PG options via https://github.com/pytorch/pytorch/pull/159371 and we cleaned up Meta-internal usage of `_set_mesh_dim_group_options`. Since this is a private API, we don't have any BC guarantee, so we remove it directly so that people use the new behavior from now on.
Also, since we now allow passing a PG in both the DeviceMesh constructor and the flatten API, we also want to get rid of the global PG option override variable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164750
Approved by: https://github.com/lw, https://github.com/fegin
## Summary
- add a CuBLASReductionOption enum so the CUDA context can track reduced-precision and split-K options
- extend the Python bindings, backend helpers, and docs to accept an optional allow_splitk argument for fp16/bf16 matmul controls
- update cuBLAS/cuBLASLt call sites plus dynamo guards and tests to respect the new combinations
## Testing
- python test/test_cuda.py TestCuda.test_cublas_allow_fp16_reduced_precision_reduction_get_set -v *(fails: ModuleNotFoundError: No module named 'psutil')*
------
https://chatgpt.com/codex/tasks/task_e_68e404623178832f8a3e1d34e1e175da
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164766
Approved by: https://github.com/malfet, https://github.com/albanD
Fixes #141884
This fixes the issue for all optimizers and parameter options.
A member function `overwrite_from` is added to the optimizer base class. Each optimizer then implements this function for comparing their accepted parameters to defaults. An SFINAE approach to handle the different optimizer parameters generically (in optimizer.h only) was evaluated, but I think this is easier to review and maintain.
This mirrors the Python API up to one edge case. An example of the edge case is provided below.
Python can distinguish between 1) Key not present in dict = "not specified" and 2) Key present in dict = "explicitly set". The C++ implementation cannot.
The issue hinges on whether or not to track if a particular parameter was set by the user explicitly or not (discrepancy in the case when the constructor default is explicitly passed in).
To track this seems like it will take more intervention than would be worth it (modify TORCH_ARG to keep track, use std::optional for the parameter types, use bitset tracking) and was not pursued in the current PR. I'm happy to alter the design if appropriate.
### Example of edge case hinging on CONSTRUCTOR DEFAULTS vs OPTIMIZER DEFAULTS
1. CONSTRUCTOR DEFAULTS:
These are the values you get when calling AdamOptions()
AdamOptions().lr() = 0.001
AdamOptions().weight_decay() = 0
AdamOptions().eps() = 1e-08
2. OPTIMIZER DEFAULTS:
These are the values the user chose when creating the optimizer
User's optimizer defaults:
optimizer.lr() = 0.005
optimizer.weight_decay() = 0.1
optimizer.eps() = 1e-07
3. THE PROBLEM SCENARIO:
User wants to add a parameter group with explicit weight_decay=0.0
User sets: weight_decay(0)
4. THE CONFUSION:
Constructor default weight_decay: 0
User's explicit weight_decay: 0
Are they equal? YES
Since they're equal, our overwrite_from() logic thinks:
"User didn't set weight_decay explicitly, use optimizer default"
5. CURRENT BEHAVIOR:
Final weight_decay: 0.1
User expected: 0
Match? ❌ NO
=== KEY INSIGHT ===
Constructor defaults are built into the C++ class definition.
Optimizer defaults are chosen by the user at runtime. We want to respect the user intention.
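For contrast, a small Python-side sketch of the behavior the C++ change mirrors: in Python, `add_param_group` only falls back to the optimizer defaults for keys that are absent from the dict, so an explicit 0.0 survives even though it equals the constructor default.
```python
import torch

params = [torch.nn.Parameter(torch.randn(2))]
opt = torch.optim.Adam(params, lr=0.005, weight_decay=0.1)

# Key absent -> group inherits the optimizer default (weight_decay=0.1).
opt.add_param_group({"params": [torch.nn.Parameter(torch.randn(2))]})

# Key present, even though 0.0 equals the constructor default -> kept as 0.0.
opt.add_param_group({"params": [torch.nn.Parameter(torch.randn(2))],
                     "weight_decay": 0.0})

print([g["weight_decay"] for g in opt.param_groups])  # [0.1, 0.1, 0.0]
```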
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161825
Approved by: https://github.com/janeyx99
Seems that we can release input activations' gradients early in `stage_backward()` in PP, which helps to reduce the peak memory.
I tested this using `1F1B` and `Interleaved1F1B` PP strategy (for simplicity, I use 4 decoder layers of llama3, set PP size to 2 and set num_microbatches to 128) based on torchtitan
run command using torchtitan:
```bash
CUDA_VISIBLE_DEVICES=4,5 LOG_RANK=0,1 NGPU=2 CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml ./run_train.sh --metrics.log_freq 1 --training.seq_len 8192 --training.steps 10 --parallelism.data_parallel_shard_degree 1 --activation_checkpoint.mode full --model.tokenizer_path /workspace/torchtitan-v0.1.0/torchtitan/torchtitan/datasets/tokenizer/original/tokenizer.model --training.dataset wikipedia --parallelism.pipeline_parallel_degree 2 --training.local_batch_size 128 --parallelism.pipeline_parallel_microbatch_size 1 --training.dataset_path /workspace/wikipedia_subset --training.seed 42 --parallelism.pipeline_parallel_schedule 1F1B
```
## 1F1B torchtitan train results
### before fix
<img width="1526" height="606" alt="b8e281cce1dac15e827c216e7d83f402" src="https://github.com/user-attachments/assets/545c0a80-6276-40c0-893f-fd2df0a53b8d" />
### after fix
<img width="1526" height="594" alt="70d5ceba311a8398d041189bf8897cfc" src="https://github.com/user-attachments/assets/0d606e08-238a-4115-a1c0-b40df101d867" />
After the fix, the memory usage on rank1, i.e., the non-first stage, saves 6.9GB compared to before the fix. The memory usage on rank0 remains unchanged (rank0 represents stage0).
## Interleaved1F1B torchtitan train results
### before fix
<img width="1514" height="601" alt="a28b7f9704b9234870619c43194e8a72" src="https://github.com/user-attachments/assets/2c28565f-ffff-4747-a8f5-722b5c65dc7e" />
### after fix
<img width="1526" height="621" alt="2d8d6d956b72885186f8c7059146c41a" src="https://github.com/user-attachments/assets/8c4a4ff2-336b-4e0b-8ac4-014ae22c2ed1" />
After the fix, rank1 saves 14.57GB (rank1 holds layer1 and layer3) and rank0 saves 7.5GB (rank0 holds layer0 and layer2).
## Memory snapshot results
also, I have dumped the memory snapshot to observe the memory under the 1F1B PP strategy.
### before fix
<img width="1906" height="918" alt="6fd4e4ba82b8bacf9ca6edee4f3d5581" src="https://github.com/user-attachments/assets/d1b9245c-b09f-43c5-87ce-87ba48533a70" />
We can see the memory increasing as the PP step_microbatches run (the lifetime of the input activation's gradient, i.e., the output of `FusedRMSNormBackward`, lasts too long).
### after fix
<img width="1903" height="918" alt="2e415f25af6750d06e5e647683b212b9" src="https://github.com/user-attachments/assets/b657c8f6-5a56-46bd-8743-f3b8375c81b0" />
After the fix, we get steadier memory usage during training (the input activation's gradient is released back to the allocator soon).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164329
Approved by: https://github.com/H-Huang
This is a bit weird, but author_login is not a unique field, but author_url is.
Explicitly allow https://github.com/apps/pytorch-auto-revert to issue revert commands
Update mocks by running
```
sed -i -e s/8e262b0495bd934d39dda198d4c09144311c5ddd6cca6a227194bd48dbfe7201/47860a8f57a214a426d1150c29893cbc2aa49507f12b731483b1a1254bca3428/ gql_mocks.json
```
Test plan: Run
```python
from trymerge import GitHubPR
pr=GitHubPR("pytorch", "pytorch", 164660)
print(pr.get_last_comment().author_url, pr.get_comment_by_id(3375785595).author_url)
```
that should produce
```
https://github.com/pytorch-auto-revert https://github.com/apps/pytorch-auto-revert
```
Plus added a regression test that checks two particular comments for revert validity
`pytorch-auto-revert` user is my alter ego :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164911
Approved by: https://github.com/jeanschmidt
If there is a single autotuner choice, the wrong type of input node is used to instantiate `TritonTemplateBuffer` through `TritonTemplateCaller.output_node`. This PR distinguishes the input nodes used in `AlgorithmSelectorCache.__call__` between the actual inputs passed to the kernel at runtime, vs the possibly viewed inputs that influence scheduling behaviour (e.g. `MemoryDeps`) and codegen. See the added unit test for more detail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163752
Approved by: https://github.com/eellison
Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.
The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
Summary:
1. Certain checkpoint load use cases are not aware of the properties of the data/tensors they want to load.
2. These use cases include data loader checkpoints and reading data for post-processing (when the original model definition is not available).
3. There, we have to use the saved checkpoint (metadata) as our source of truth.
4. This RFC proposal exposes the checkpoint metadata using a public API.
In this proposal we expose the stored state-dict metadata (minus the associated storage/chunk metadata).
Chunk/storage details should not be exposed to users; they are an implementation detail of the storage writer/reader.
Test Plan:
UT.
Rollback Plan:
Differential Revision: D80231457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160610
Approved by: https://github.com/saumishr
Summary: We have an internal request to help understand why the hash of `post_grad_custom_post_pass` is changing between attempts. We don't get useful info from the debug output, because we just print "<bytes>". Instead, attempt to print at least _some_ of the value in case it contains readable characters.
Test Plan:
Registered a dummy post_grad_custom_pass and printed codecache debug output
`TORCH_LOGS=+torch._inductor.codecache python ~/foo.py`
Yields something like:
```
V1007 16:41:19.024000 3546009 /data/users/slarsen/pytorch-3.10_4/torch/_inductor/codecache.py:989] [0/0] [law2ujt2wzjb5tyiu6jh64r2lxpvl62yvxcsmdouhg3qyelhhdv] post_grad_custom_post_pass: HelloWorld!����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������...
```
Differential Revision: [D84108770](https://our.internmc.facebook.com/intern/diff/D84108770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164898
Approved by: https://github.com/oulgen
First fix for https://github.com/pytorch/pytorch/issues/164756
In the pipeline IR we call `UNSHARD` and `RESHARD`, but there is a bug: calling `module.unshard()` does not recurse into the child FSDP modules, so the all-gather sometimes ends up being issued right before the module forward.
Since we want the pipeline IR to explicitly handle this, we can call `group.unshard` instead which ensures that all the modules are unsharded.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164775
Approved by: https://github.com/weifengpy
**Summary**
This PR provides an interface for users to specify how to load-balance the attention
input. The load-balance is essentially a rearrangement of the input tensor(s) over the
seq_dim before sharding and can be specified via an index tensor `rearrange` such
that Q[rearrange] is the balanced Q users want (i.e. `rearrange[i] == j` where `i` is the new
index of `Q[j]` in the balanced Q). An example is the `_generate_round_robin_indices()` added
in https://github.com/pytorch/pytorch/pull/155442.
**New `_LoadBalancer` classes**
New `_LoadBalancer` class (defined in `torch/distributed/tensor/experimental/_load_balancer.py`)
provides one interface for defining load-balance behavior: `_generate_indices(self, restore: bool = False)`.
When `restore == False`, this method should output an index Tensor (namely `rearrange_idx`) such
that QKV will be transformed into Q' K' V' in a way that `Q'[i] == Q[rearrange_idx[i]]` (same applies
to K and V).
When `restore == True`, this method outputs an index Tensor (namely `restore_idx`) such that
`Q'[restore_idx] == Q` (same applies to K and V).
**Impact**
2 public CP APIs and 1 private CP API are modified. This PR should be backward-compatible because:
- For uses with SDPA, existing users must be using the `context_parallel()` API, which does not
take in the extra `load_balancer` argument and is governed solely by the global var
`_cp_options.enable_load_balance`.
- For new users including who want to try `flex_attention()`, we require to use the new API
`_context_parallel_buffers` to explicitly shard the QKV input instead of using `context_parallel()`
because we no longer rely on TorchDispatchMode nor TorchFunctionMode for op replacement. And
we also require users to explicitly pass in a `load_balancer` argument if load-balancing is demanded.
**Load-Balance Behavior**
`context_parallel_unshard()`, and `create_cp_block_mask()` APIs now take an extra optional argument
`load_balancer`. This argument is optional because of backward compatibility but we require new users
to explicitly pass in a `load_balancer` if load-balancing is demanded:
- if `load_balancer == None` and `_cp_options.enable_load_balance == False`, CP performs
no load-balancing on input Tensors.
- if `load_balancer == None` and `_cp_options.enable_load_balance ==True`, CP performs
head-tail load-balancing (e.g. split a Tensor into 2*N chunks, where the first N are called head chunks and
the rest tail chunks. Place the first head chunk and the last tail chunk on rank 0, the second
head chunk along with the second-to-last tail chunk on rank 1, and so on).
`_context_parallel_buffers()` also takes the extra optional argument `load_balancer`, but the behavior
is slightly different from the other 2 APIs -- it doesn't branch on `_cp_options.enable_load_balance` :
- if `load_balancer == None`, no load-balancing will be performed
- otherwise, apply load-balancing using `load_balancer._generate_indices()` before sharding.
**Changes**
This PR moves the index Tensor generation logic into a set of LoadBalancer classes and
makes LoadBalancer the common interface for the Context Parallel APIs that leverage
load-balancing:
* _context_parallel_buffers
* context_parallel_unshard
* create_cp_block_mask
The `_LoadBalancer` classes added are:
- `_LoadBalancer`: the abstract base class that provides the `_generate_indices` interface for index Tensor generation.
- `_HeadTailLoadBalancer`: Implements head-tail balancing logic.
- `_PerDocumentHeadTailLoadBalancer`: Supports per-document head-tail balancing for batched sequences.
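As an illustration only (not part of this PR), a custom balancer implementing the `_generate_indices` interface described above might look like this; the reversal strategy is arbitrary and the import path follows the file location given above:
```python
import torch
from torch.distributed.tensor.experimental._load_balancer import _LoadBalancer

class _ReverseLoadBalancer(_LoadBalancer):
    """Toy balancer that simply reverses the sequence dimension."""

    def __init__(self, seq_len: int):
        self.seq_len = seq_len

    def _generate_indices(self, restore: bool = False) -> torch.Tensor:
        idx = torch.arange(self.seq_len - 1, -1, -1)
        # Reversal is its own inverse, so the rearrange and restore index
        # tensors happen to coincide here; in general they differ.
        return idx
```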
**Test**
`pytest test/distributed/tensor/test_attention.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161062
Approved by: https://github.com/fegin
BlockMask has batch-dimension information, so PP has to split it as well, just like all other tensors. All the tensors in BlockMask have the batch dimension, so we can split them without too many issues. However, `mask_mod` takes the batch index as an input, and that index changes after the split, so we have to wrap it inside a closure that adjusts the batch index, as sketched below.
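A hedged sketch of the closure trick described above (names are illustrative, not the PR's actual helpers); it assumes the standard flex_attention `mask_mod(b, h, q_idx, kv_idx)` signature:
```python
def make_local_mask_mod(mask_mod, batch_offset):
    # After splitting BlockMask along the batch dim, the split mask only sees
    # local batch indices, so translate them back to global ones.
    def local_mask_mod(b, h, q_idx, kv_idx):
        return mask_mod(b + batch_offset, h, q_idx, kv_idx)
    return local_mask_mod
```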
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164111
Approved by: https://github.com/H-Huang
During 2.9 rc testing I am seeing an issue on Amazon Linux 2023 with CUDA 13.0 builds
This is related to:
https://github.com/pytorch/pytorch/issues/152756
Workflow: https://github.com/pytorch/test-infra/actions/runs/18324074610/job/52184079262
Error:
```
WARNING: There was an error checking the latest version of pip.
+ python3.11 .ci/pytorch/smoke_test/smoke_test.py --package torchonly
Traceback (most recent call last):
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 333, in _load_global_deps
ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
File "/usr/lib64/python3.11/ctypes/__init__.py", line 376, in __init__
self._handle = _dlopen(self._name, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libcudart.so.13: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/pytorch/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 12, in <module>
import torch
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 425, in <module>
_load_global_deps()
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 383, in _load_global_deps
_preload_cuda_deps(lib_folder, lib_name)
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 317, in _preload_cuda_deps
raise ValueError(f"{lib_name} not found in the system path {sys.path}")
Traceback (most recent call last):
ValueError: libnvToolsExt.so.*[0-9] not found in the system path ['/pytorch/pytorch/.ci/pytorch/smoke_test', '/usr/lib64/python311.zip', '/usr/lib64/python3.11', '/usr/lib64/python3.11/lib-dynload', '/usr/local/lib64/python3.11/site-packages', '/usr/local/lib/python3.11/site-packages', '/usr/lib64/python3.11/site-packages', '/usr/lib/python3.11/site-packages']
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
main()
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
run_cmd_or_die(f"docker exec -t {container_name} /exec")
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 7d9c5bd403cac9a9ee824d63a1d6f6057ecce89a7daa94a81617dbf8eff0ff2e /exec failed with exit code 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164870
Approved by: https://github.com/Camyll
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
Add source_get_cache to the AOT compile case as well. Since the guard manager loader code can be shared between AOT and caching, we added a new function load_guard_manager for loading guards, to avoid code duplication between the two workflows.
Test Plan: test_guard_serialization.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164773
Approved by: https://github.com/yiming0416, https://github.com/dolpm
# TLDR
This PR removes the regression in torch.topk introduced from torch 2.7.0 and delivers much better performance for large inputs.
The table below reports execution times on H20 for various input sizes with float32 data, extracting the top-100 values. Results indicate that this PR restores and improves performance, especially on large inputs.
| Input Shape | torch2.6.0 (ms) | torch2.8.0 (ms) | 2.8.0+this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B) | 36.6 | 1564.1 | 25.6 |
| (1, 100M) | 3.56 | 17.4 | 2.54 |
| (1, 1000,000) | 0.135 | 0.145 | 0.098 |
| (512, 128000) | 1.33 | 1.33 | 1.32 |
| (8192, 128000) | 19.6 | 19.6 | 19.4 |
# Background
After upgrading PyTorch from 2.6.0 to 2.7.0, we observed a significant GPU performance regression in `torch.topk` on NVIDIA GPUs. For instance, extracting the top-1000 largest values from one billion floats on an NVIDIA H20 increased from **36 ms** to **1.6 s**.
Profiling with Nsight Compute indicates that the slowdown is caused by redundant memory accesses introduced in [PR #145536](https://github.com/pytorch/pytorch/pull/145536).
# Analysis
`torch.topk` relies on **RadixSelect** to find the target values. Each radix pass requires computing a histogram of the input values. For large inputs, histogram computation is split into two stages:
1. **Local histogram**: Each CUDA block processes a subset of the input and writes its local histogram to global memory.
2. **Global reduction**: A single CUDA block reads all local histograms from global memory and reduces them into the final global histogram.
Before [PR #145536](https://github.com/pytorch/pytorch/pull/145536), both stages ran inside a single kernel (`radixFindKthValues`), using a semaphore to ensure that all local histograms were completed before reduction.
In PR #145536, the global histogram computation was merged with subsequent top-k calculations into a single kernel (`computeBlockwiseKthCounts`) to avoid the semaphore. While this simplifies synchronization, it introduces **redundant memory reads**:
- `computeBlockwiseKthCounts` launches `numInputSlices * blocks_per_slice` blocks.
- For each row (slice), `blocks_per_slice` CUDA blocks redundantly reload the same local histograms from global memory.
# This PR
To address this inefficiency, we introduce the following optimizations:
1. **Dedicated kernel**: Refactor global histogram and cumsum computation into a separate GPU kernel, `computeDigitCumSum`.
2. **Loop unrolling**: Apply loop unrolling in `computeDigitCumSum` to speed up local histogram reads.
# Performance
We benchmarked torch.topk on NVIDIA H20 with float32 inputs, extracting the top-100 values across different input sizes. The results in the table below demonstrate that this PR effectively eliminates the performance regression introduced in 2.7.0 and delivers substantial improvements on large inputs.
| Input Shape | torch2.6.0 (ms) | torch2.8.0 (ms) | 2.8.0+this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B) | 36.6 | 1564.1 | 25.6 |
| (1, 100M) | 3.56 | 17.4 | 2.54 |
| (1, 1000,000) | 0.135 | 0.145 | 0.098 |
| (512, 128000) | 1.33 | 1.33 | 1.32 |
| (8192, 128000) | 19.6 | 19.6 | 19.4 |
Besides, I have verified the correctness of this PR with different inputs.
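A rough sketch of the kind of measurement behind the tables above (the exact harness is not from this PR); it assumes a CUDA device and uses CUDA events for timing:
```python
import torch

def bench_topk(shape, k=100, iters=10):
    x = torch.randn(*shape, device="cuda")
    # Warm up once so kernel compilation/caching doesn't skew the timing.
    torch.topk(x, k, dim=-1)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.topk(x, k, dim=-1)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

print(bench_topk((1, 100_000_000)))
```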
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164459
Approved by: https://github.com/ngimel, https://github.com/Skylion007
Summary:
the context module provides configurable context selection + isolation key hashing;
context selection is broken into runtime and compile context. runtime context is decided at call time (inductor configs, precision configs, etc.) and compile context is decided at compile time (hardware type, software hashes).
callees will be given access to SelectedRuntimeContext and SelectedCompileContext, which they can use to determine and select what context is necessary with regards to the function which is being cached.
these selected contexts are wrapped in an IsolationSchema, which denotes what context should be taken into consideration when producing an isolation key. The isolation key is essentially a salt of the function signature key, which says that some function signature key result is valid under a given context (isolation schema)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```
Reviewed By: aorenste
D83714689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164549
Approved by: https://github.com/aorenste
This PR is to temporarily unblock various experiments to re-use the fake mode that Dynamo creates. Note that this is still not the end state we want. The end state should look something like:
```
out = fulllgraph_capture(mod, inputs)
fake_mode = out.backend_inputs.fake_mode
gm = out.module()
```
This doesn't work today because export requires wrapping the original module to set up a flat module to trace, for easier handling of pytree inputs. As a result, we would need to carry an export-specific flag in fullgraph_capture, which seems not ideal.
Regardless, the end state is that we need to give downstream users a graph module and a fake mode in some form, so I think it is reasonable for _dynamo_graph_capture_for_export to return the fake mode within the graph module itself via gm.meta.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164730
Approved by: https://github.com/avikchaudhuri
Fixes #89034
Updated tensor_to_numpy() function in tensor_numpy.cpp to handle ZeroTensors by throwing an error if force=False and returning an array full of zeros if force=True.
@ngimel, I just saw that you mentioned PyTorch is not too concerned with this issue but I had already worked on it so I figured I would push it anyways and see what you thought. Feel free to close the PR if you think it is not worth merging.
@albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164487
Approved by: https://github.com/izaitsevfb
Fixes some tests that seemed to start flaking out as reported in #163202, due to cuBLASLt workspaces becoming persistent following that change.
It's relatively obvious why the workspaces/allocations corresponding to them should be cleaned up for `test_memory_snapshot_script` but less obvious for `test_memory_plots_free_segment_stack`? Why does not cleaning up workspace prevent `empty_cache` from showing up?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163299
Approved by: https://github.com/albanD
Fixes #163330
I tried to reproduce the bug with my 4-GPU setup (the original issue used 8 GPUs). I created several different test scenarios, trying to trigger the bug by:
- creating two different device meshes
- slicing them in various ways
- checking if get_root_mesh() would get confused
but the bug didn't show up! Everything worked correctly in `2.10`. I found that there was a massive refactoring of the `DeviceMesh` code (PR #163213) that landed on October 2nd. That PR completely rewrote how `DeviceMesh` tracks relationships between parent meshes and submeshes. It seems like this refactoring fixed the bug! But I added a regression test to make sure it doesn't come back. The test (`test_get_root_mesh_multiple_independent_meshes`) does exactly what the bug report described:
- creates two independent meshes
- slices them both
- verifies that each submesh correctly points back to its real parent
- makes sure submeshes from mesh1 don't incorrectly claim mesh2 as their parent
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164731
Approved by: https://github.com/fduwjj
Python 3.13 added PyObject_GetOptionalAttrString. I'm not 100% certain that it is strictly better than the old approach in all cases, but based on documentation/comments it seems to be meant for this type of use, and it's faster when I profile torchtitan training (which gets to the "check for the `__torch_function__` attr on some object" part of maybe_has_torch_function frequently enough to notice, but wastes a bunch of time generating exceptions that we then suppressed here).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164624
Approved by: https://github.com/Skylion007
Summary:
Previously, weight deduplication was done by simply grouping tensors with their untyped storage and saving the first tensor in the group.
A more rigorous approach would be to find a complete tensor that covers the storage and store that tensor. This is particularly important for GPU weights because when saving to raw bytes, we move the weight to CPU first, and if the weight being saved is not a complete one, it will lose the storage information during the copy to CPU.
In this diff, we reuse code in `_package_weights.py` for better weights and constants deduplication in `torch.export.save`.
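A small sketch of the "complete tensor" criterion described above (illustrative, not the actual helper in `_package_weights.py`): among tensors sharing an untyped storage, prefer one that starts at offset 0 and spans the whole storage.
```python
import torch

def covers_storage(t: torch.Tensor) -> bool:
    # A contiguous tensor at offset 0 whose bytes span the full storage
    # can be moved to CPU without losing aliased data.
    return (
        t.is_contiguous()
        and t.storage_offset() == 0
        and t.numel() * t.element_size() == t.untyped_storage().nbytes()
    )

base = torch.randn(10)
view = base[2:5]          # shares storage with base, but only a slice
print(covers_storage(base), covers_storage(view))  # True False
```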
Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_weight_sharing_gpu
Differential Revision: D83523690
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164196
Approved by: https://github.com/angelayi
Summary: Relax absolute tolerance from 1e-2 to 1e-1 for `test_non_contiguous_input_mm_plus_mm` in `test_max_autotune.py`.
Test Plan: `test_max_autotune.py`
Differential Revision: D83391942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164022
Approved by: https://github.com/eellison
Summary:
Fix decompose_k test failure (`test_max_autotune_decompose_k `) in `test_max_autotune.py` on B200s by setting `torch._inductor.config` patches for variables `comprehensive_padding` and `shape_padding`. Initial failure was `AssertionError: False is not true : Could not find a split in {3, 9, 2187, 81, 243, 729, 27} in # AOT ID: ['6_forward']`.
Refactor decompose_k test to follow patch semantics when setting all environment variables within a test.
Test Plan:
`test_max_autotune.py`:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -c fbcode.re_gpu_tests=False -- test_max_autotune_decompose_k
```
Differential Revision: D83390563
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164021
Approved by: https://github.com/njriasan, https://github.com/mlazos, https://github.com/eellison
Adds [control_deps](https://en.wikipedia.org/wiki/Control_dependency) higher-order operator to enforce explicit scheduling dependencies in FX graphs. This prevents unwanted operation reordering/fusion by giving nodes additional dependencies, which we also respect in inductor by adding weakdeps on the additional dependencies.
This can be generally useful (such as for ordering collectives) but in this case I am using it so that fusions do not interfere with aten planned comm-compute overlap.
There's definitely some similarity with the `with_effects` hop. Talked with @angelayi - when @zou3519 is back we will figure out how we want to consolidate.
The implementation needs to be a subgraph (as opposed to `with_effects`) because inductor relies on `V.graph.current_node`. Changing the signature of the node with `with_effects` breaks this, and additionally, also breaks striding constraints on the wrapped node - see this [TODO](aed66248a0/torch/fx/experimental/proxy_tensor.py (L1246-L1249)). By maintaining the node with its original calling structure in subgraph this all works.
Example transformation:
Before:
```
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg0_1, 1), kwargs = {})
%mm : [num_users=1] = call_function[target=torch.ops.aten.mm.default](args = (%arg1_1, %arg1_1), kwargs = {})
%mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 2), kwargs = {})
```
After:
```
add: "f32[256, 256]" = torch.ops.aten.add.Tensor(arg0_1, 1)
mm: "f32[256, 256]" = torch.ops.higher_order.control_deps((add,), subgraph_mm, arg1_1, arg1_1)
mul: "f32[256, 256]" = torch.ops.higher_order.control_deps((mm,), subgraph_mul, add)
```
The mm operation now explicitly depends on add completing first, and mul depends on mm, with original operations preserved in subgraphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164568
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
```bash
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG CaptureId_t capture_id_ = -1;
DEBUG ^
DEBUG
DEBUG Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
DEBUG
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG CaptureId_t capture_id_ = -1;
DEBUG ^
DEBUG
DEBUG Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
DEBUG
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG CaptureId_t capture_id_ = -1;
DEBUG ^
```
CUDA won't use 0 as a capture id, so it is safe to initialize with 0, which also matches the initialization in `pytorch/aten/src/ATen/native/cudnn/RNN.cpp:2362`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163898
Approved by: https://github.com/houseroad
Fixes #89034
Updated tensor_to_numpy() function in tensor_numpy.cpp to handle ZeroTensors by throwing an error if force=False and returning an array full of zeros if force=True.
@ngimel, I just saw that you mentioned PyTorch is not too concerned with this issue but I had already worked on it so I figured I would push it anyways and see what you thought. Feel free to close the PR if you think it is not worth merging.
@albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164487
Approved by: https://github.com/ngimel, https://github.com/albanD
Fixes #162812
Adds support for either passing a device directly into get_rng_state, or passing in a string or int (which is then wrapped into a device inside, as in torch.cuda.get_rng_state).
I wasn't exactly sure where tests for this should go, please let me know. I used this script for testing:
```python
import torch
# note: when running with CUDA GPU, first three tests will give the same result,
# as will the last two
# test with no device specified
print(torch.get_rng_state())
# test with CPU
cpu_device = torch.device("cpu")
print(torch.get_rng_state(cpu_device))
# test with direct name
print(torch.get_rng_state("cpu"))
# test with CUDA
cuda_device = torch.device("cuda:0")
print(torch.get_rng_state(cuda_device))
# test with integer
print(torch.get_rng_state(0))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163034
Approved by: https://github.com/ezyang, https://github.com/cyyever
# Propagate custom meta data to backward
Support propagating the user annotation tags to backward graph, by extending the `copy_fwd_metadata_to_bw_nodes` utils (recommended by @xmfan , thanks!).
Example annotation API (added in https://github.com/pytorch/pytorch/pull/163673):
```
class M(torch.nn.Module):
def forward(self, x):
with fx_traceback.annotate({"pp_stage": 0}):
with fx_traceback.annotate({"fdsp_bucket": 0}):
x = x + 1
x = x - 2
with fx_traceback.annotate({"cuda_stream": 2, "fsdp_bucket": 1}):
x = x * 2
x = x / 3
return x
```
Assumptions (some inherited from https://github.com/pytorch/pytorch/pull/126573):
- I am trusting the seq_nr mapping introduced to aot_autograd nodes in https://github.com/pytorch/pytorch/pull/103129
- I am also trusting that the forward is single threaded, since seq_nr is thread local. If this isn't always true, we'll need to also plumb thread_id through the same machinery which is populating seq_nr.
- **(This is changed in this PR!) I assume all backward graph nodes has "is_backward" for 'partitioner_tag', and all other nodes are forward graph nodes**. If we don't run export before `aot_export_join_with_descriptors`, then none of the nodes has "nn_module_stack" in node meta. If we do run export first, then we don't need this change.
- I copy "custom" node meta from forward to backward graph nodes.
Question:
- Is it a good idea to copy all "custom" node meta? Or should we create a dedicated key in custom node meta to be copied? @SherlockNoMad
- Do we expect people to run export before using `aot_export_join_with_descriptors`?
- Can we assume the following for graph produced by `aot_export_join_with_descriptors`? "all backward graph nodes has "is_backward" for 'partitioner_tag', and all other nodes are forward graph nodes". Maybe this is a question for @ezyang
```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_
python test/export/test_export.py -k preserve_anno
python test/distributed/tensor/test_dtensor_export.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164174
Approved by: https://github.com/xmfan, https://github.com/SherlockNoMad
Even though `offset_intragraph_` only tracks RNG consumption within a single graph replay, we have observed that the 32bit storage for these offsets is easy to overshoot, especially for cases with big CUDA graph captures including kernels that are generating a large amount of random numbers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164515
Approved by: https://github.com/eee4017, https://github.com/eqy
In reality we found that the current mismatch-check order does not match the actual error distribution, so we reorder it a bit as follows:
1. We do collective type check first
2. Then size check (excluding all2all)
3. dtype check
4. state check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164606
Approved by: https://github.com/VieEeEw
Add a deterministic mode that skips the on-device benchmarking that we know can affect numerics. This includes:
- pad-mm
- dynamic rblock scaling
- template autotuning
- coordinate descent tuning for reduction
- reduction config autotuning in CachingAutotuner. For reductions, both RBLOCK and num_warps can affect numerics; XBLOCK does not, so we can still autotune XBLOCK for reductions.
- benchmarking for the computation-communication reordering pass
The mode definitely has a perf hit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163589
Approved by: https://github.com/v0i0
Summary: Minor refactor where we push some args in the aot joint with descriptors workflow that are not used in export stage to the compile stage where they are actually used.
Test Plan: existing tests should pass
Differential Revision: D83850316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164584
Approved by: https://github.com/tugsbayasgalan
1. The DTensor abstraction on its own does not support arbitrary-length shards in its distributed tensor representation. It supports a single uneven shard, but it has to be the last shard in the sharding dimension.
2. However, DCP supports an API called Checkpointable. This API allows you to define your own custom shardable tensor structure. I have given a UT example (look for CheckpointableDistTensor). Therefore, one option is to use CheckpointableDistTensor to save/load uneven shards.
3. While exploring this path, I also noticed that the torch.rec module encountered a similar problem while working with DTensor. They worked around it by implementing the Checkpointable API in DTensor and introducing an auxiliary structure called LocalShardsWrapper. This is the second option we can use to unblock the data loader resharding work.
In summary:
Use LocalShardsWrapper + DTensor as the first option to unblock.
The second preference is to use a new implementation of the Checkpointable API (similar to the CheckpointableDistTensor I have introduced in this example).
Differential Revision: D80182564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160533
Approved by: https://github.com/saumishr
This PR enables the following tests on ROCm.
test/test_jit.py::TestBackends::test_save_load
test/test_jit.py::TestBackends::test_execution
test/test_jit.py::TestBackends::test_errors
test/test_jit.py::TestCUDA::test_current_stream
Verified that the tests pass on AMD gfx90a and gfx942 arch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164582
Approved by: https://github.com/jeffdaily
Increase the tolerance for the following UTs as there was a slight mismatch seen on MI200.
- test_data_parallel.py:test_strided_grad_layout
- test_c10d_nccl.py:test_grad_layout_1devicemodule_1replicaperprocess
Skip for MI200:
- test_fully_shard_training.py:test_2d_mlp_with_nd_mesh
- test_2d_composability.py:test_train_parity_2d_mlp
- test_fully_shard_overlap.py:test_fully_shard_training_overlap
Fixes #159489 Fixes #159488 Fixes #152700 Fixes #125555 Fixes #134139
Working as is on both MI200 and MI300:
Fixes #125991 Fixes #125918
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164390
Approved by: https://github.com/jeffdaily
Summary: Added a set of fixes triggered by fm training job. Overall the theme here is that we should get rid of saved objects as much as possible when they are not used in guard reconstruction. Sometimes for objects that cannot be saved (like local functions) we still try our best to save their closures.
Test Plan:
test_guard_serialization.py
test_lazy_awatiable.py
Differential Revision: D83766926
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164490
Approved by: https://github.com/jamesjwu
Summary:
This diff adds the feature of allocating a large pinned memory segment upfront based on the provided config. This large segment is then used to serve all the small pinned memory requests to avoid expensive device level APIs (slow paths).
Example:
PYTORCH_CUDA_ALLOC_CONF=pinned_reserve_segment_size_mb:2048
This reserves a 2GB pinned memory segment for the process and then all incoming small requests are just served from this segment and no cudaHostAlloc/cudaHostRegister apis are being called.
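A minimal usage sketch, assuming a CUDA build and the config string shown above; the allocator config must be set before the first pinned allocation in the process.
```python
import os

# Reserve a 2GB pinned segment up front (config string from the example above).
# Must be set before torch makes its first pinned-memory allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "pinned_reserve_segment_size_mb:2048"

import torch

# Small pinned requests are now served from the reserved segment instead of
# going through cudaHostAlloc/cudaHostRegister each time (requires CUDA).
buffers = [torch.empty(4096, dtype=torch.uint8).pin_memory() for _ in range(32)]
```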
Differential Revision: D83779074
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164501
Approved by: https://github.com/yangw-dev
So this fixes at least two issues:
1) When we are invoking the inductor backend, we apply pre-grad passes which try to find the correct fake mode to use. In the nested case, we will run into a clash when there is a closure variable in the inductor region, because non-strict would have fakified this variable beforehand while the inner torch.compile would have created a fresh fake mode. This is not a problem in regular torch.compile because the inner torch.compile gets ignored. I don't know if we are supposed to inherit the fake mode from the parent context in this case. But we can avoid this problem if we just default to the eager backend, which is fine here because the point of export is to capture aten operators. Going to inductor would mean we lose inner torch.compile ops.
2) There are custom torch function modes in export that track the number of torch functions executed, and the inner compile itself doesn't work because of a guard failure as this mode state gets changed. I noticed torch.cond fixes this problem by carefully stashing the torch function mode and deferring it to the backend. So the correct thing to do here is just reuse the torch.cond implementation unconditionally.
So the things I did to fix the above were:
1) Always default to the eager backend when compile is invoked inside export. I needed to turn the way torch.cond sets up a fresh tracing env into a util that can be shared.
2) The previous eager backend for torch.cond was wrong because the context managers didn't actually persist until the backend was invoked.
3) torch.cond used to only disable the TorchFunctionMetadata torch function mode and stash it for later, but in fact we should do this for both TorchFunctionMetadata and PreDispatchTorchFunctionMode.
With above fixes, we are able to export flex attention in export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164171
Approved by: https://github.com/ydwu4
This PR moves the call to copy the generated code from `/tmp/...` so that it is still called if attempting to compile the generated code fails. In both cases now, the generated code will be copied across to `torch_compile_debug/run_.../torchinductor/output_code.py` which makes debugging bad generated code easier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161615
Approved by: https://github.com/eellison
This pull request adds support for running operator microbenchmarks on ROCm (AMD GPU) environments in the CI workflow. The main changes involve introducing new build and test jobs for ROCm in the `.github/workflows/operator_microbenchmark.yml` file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164173
Approved by: https://github.com/huydhn
We want to refactor the internal bookkeeping of DeviceMesh so that:
Simplify the bookkeeping logic and make it generic enough that it is easy to support new transformations like flattening noncontiguous dims, reshape, and unflatten (we leveraged the CuTe layout). This new layout also makes non-contiguous slicing, flatten, and transpose possible.
Concretely, in this PR, we do the following:
1. Use the `_MeshLayout` to handle all index operations rather than using a map to record mesh dims.
2. Removed `flatten_name_to_root_dims`, because now we can directly get layout from a flattened device mesh.
3. Replaced `_get_slice_mesh_dims` with `_get_slice_mesh_layout`.
4. Use the newly added function `check_overlap` to check layout overlap.
5. Use a new function `to_remapping_tensor` to use layout ranks as indices when the mesh tensor is not representable as CuTe. The reason is that the layout acts as the backend of mesh tensor bookkeeping (indexing), so it needs to be used as indices to remap back to the mesh tensor for new DeviceMesh generation and backend init. For example, in the case of ranks 2K to 4K, the underlying layout is (2K, 1) but the actual values of the mesh tensor are [2K, 2K+1, ...]. While flattening and slicing, we need to remap the layout back to the new mesh tensor so it maps to the actual device allocation. For example, in the 2K-to-4K case, if the shape is (1K, 1K) with dim_names ("dp", "tp"), then when slicing "tp", the mesh tensor should be (2K, 2K+1, ..., 3K-1) or (3K, 3K+1, ..., 4K-1), not the global ranks generated from the layout (1K, 1).
Verified that loss curve is very close for DeepSeekV3 on torchtitan, note that exact same match is challenging because even if we run the baseline twice, the loss curve does not exactly match.
<img width="1113" height="490" alt="image" src="https://github.com/user-attachments/assets/7877b5a4-337e-4ad8-b878-2378f4f0f38d" />
The PR looks big indeed but we don't change any existing behavior of DeviceMesh, so it is a pure refactor.
With this refactoring we also enabled slicing and flattening of non-contiguous dims of a device mesh, which is hard to implement without the CuTe layout.
This is a continuation of https://github.com/pytorch/pytorch/pull/161106 (the original one got messed up by EasyCLA).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163213
Approved by: https://github.com/lw, https://github.com/fegin
Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.
The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
`grad_dtype` is a new attribute on Tensor to control gradient dtype:
- Access/setting is leaf-only.
- grad_dtype is respected (1) when assigning to .grad, and (2) in the engine after the previous node produces incoming gradients for AccumulateGrad. (See the table below for details; a usage sketch follows the table.)
- Not setting grad_dtype preserves the current behavior. Accessing it returns `t.dtype`
- `grad_dtype` cannot be set when there is already a `.grad` present and the dtypes conflict.
| `grad_dtype` setting | Setting `.grad` manually | Incoming gradient from autograd engine |
|-----------------------|--------------------------|-----------------------------------------|
| **Default (tensor’s dtype)** | `.grad` must match tensor’s dtype | Engine casts incoming grad to tensor’s dtype |
| **Set to specific dtype** | `.grad` must match that dtype | Engine casts incoming grad to the specified dtype |
| **Set to `None`** | `.grad` may be any dtype | Engine does not cast; accepts incoming grad dtype as-is |
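A minimal usage sketch, assuming the `grad_dtype` attribute behaves as described in the table above (this is the API introduced by this PR, not a pre-existing attribute):
```python
import torch

w = torch.randn(4, requires_grad=True)  # leaf tensor; grad_dtype is leaf-only
w.grad_dtype = torch.bfloat16           # engine casts incoming grads to bf16

loss = (w * w).sum()
loss.backward()
print(w.grad.dtype)                     # expected: torch.bfloat16 per the table

w.grad = None
w.grad_dtype = None                     # engine now accepts incoming grad dtype as-is
```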
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162815
Approved by: https://github.com/albanD
Fixes #162129. Added validation in `_rank_not_in_group()` to check whether `FakeProcessGroup` is properly initialized before use, raising a clear error message if `torch.distributed.init_process_group(backend='fake')` hasn't been called first.
This prevents silent failures and ensures proper dispatch system integration for all distributed operations.
Added a test case `test_fake_process_group_direct_usage_error()` that validates the error is raised for `all_reduce` and `all_to_all_single` operations.
Please let me know if additional distributed operators should be tested or if any other updates are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163665
Approved by: https://github.com/ezyang
**Summary**
Raise error when the `local_tensor` argument passed to `DTensor.from_local` is
a DTensor, this prevents users from accidentally calling `from_local` over a DTensor
object.
The error message is organized in this way:
```
the local_tensor argument only accepts torch.Tensor but got <class 'torch.distributed.tensor.DTensor'> value.
```
**Test**
`pytest test/distributed/tensor/test_dtensor.py -k test_from_local`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164496
Approved by: https://github.com/ezyang
It was previously quite misleading, since it looked like the inputs to the dynamo graph were plain tensors when in reality they are tensor subclasses.
before
```
class GraphModule(torch.nn.Module):
def forward(self, L_input_batch_inputs_: "i64[2, 512][512, 1]cuda:0", L_self_parameters_weight_: "f32[202048, 256][256, 1]cuda:0"):
```
after
```
class GraphModule(torch.nn.Module):
def forward(self, L_input_batch_inputs_: "DTensor(i64[2, 512][512, 1]cuda:0)", L_self_parameters_weight_: "DTensor(f32[202048, 256][256, 1]cuda:0)"):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164403
Approved by: https://github.com/ezyang
Shared memory is allocated by creating a file in /dev/shm (by default) that can run out of space. PyTorch reserves the file size by calling ftruncate(), which creates a sparse file, so it succeeds even if sufficient disk space is not available.
This could lead to a situation when a shared memory region is successfully created but a subsequent access to a shared memory page results in SIGBUS due to the disk being full.
Using posix_fallocate() instead of ftruncate() eliminates this problem because the former syscall always allocates space and it returns an error if the disk is full.
Related to https://github.com/pytorch/pytorch/issues/5040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161910
Approved by: https://github.com/mikaylagawarecki
Changelog:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We error under a config, and we disable the config for some internal users. The reason we want to error is that this signals a coverage problem we need to address, but at the same time we don't want to be disruptive to already-working flows.
2. The previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes a silent incorrectness issue when exporting esm and qwen in non-strict and some torchbench models like levit_128 and demucs.
3. The previous logic didn't work in cases where we mutate a container attribute, as the previous approach pytree'd over old and new attributes, resulting in a length mismatch. We gracefully handle this now.
Differential Revision: [D83673054](https://our.internmc.facebook.com/intern/diff/D83673054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164372
Approved by: https://github.com/avikchaudhuri
Add `struct AOTInductorConstantMapEntry` to represent the constant map in AOTI Model. We cannot use `std::unordered_map` for cross-compilation, because it is not ABI stable.
It will be tested when we test `update_user_managed_constant_buffer` for Windows cross-compilation.
Example usage:
```
// Load constants. Create random constants here.
auto* fc1_w = new slim::SlimTensor(slim::empty({16, 10}, c10::kFloat, c10::Device(c10::kCUDA, 0)));
fc1_w->fill_(1.0);
.....
// Build pairs
std::vector<AOTInductorConstantPair> constants{
{"fc1_weight", fc1_w},
{"fc1_bias", fc1_b},
{"fc2_weight", fc2_w},
{"fc2_bias", fc2_b},
};
// Call runtime (pass raw pointer + size)
update_user_managed_constant_buffer_abi(
container_handle,
constants.data(),
constants.size(),
/*use_inactive=*/false,
/*validate_full_update=*/true);
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163819
Approved by: https://github.com/desertfire
This reverts commit b1033789fea2bc82901eafed498a5252985b80e9.
Reverted https://github.com/pytorch/pytorch/pull/164256 on behalf of https://github.com/yangw-dev due to failed internal test: (pytorch.tritonbench.test.test_gpu.main.TestTritonbenchGpu) Error Details: torch._inductor.exc.InductorError: LoweringException: NoValidChoicesError: No choices to select. Provided reason: All choices failed to compile for backend. please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice. ([comment](https://github.com/pytorch/pytorch/pull/164256#issuecomment-3362359624))
Summary:
as title
- Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes.
- Extract `assign_origin_node` function
Differential Revision: D82871244
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164255
Approved by: https://github.com/mlazos
We should surface the CUDA architecture matrix to make things more transparent. I believe this can later become its own page where we will publish the supported matrix for each release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164471
Approved by: https://github.com/Camyll
# Summary
This happens when flex_attention is not tagged with the `CheckpointPolicy.MUST_SAVE` policy, which causes the lse to be unrealized. I think in general this is probably not the best policy, but we shouldn't error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164421
Approved by: https://github.com/Skylion007
Actually we would like to not graph break even in the case of Dynamo. But there is a weird, unsolved bug with Kineto + Dynamo when there are distributed jobs that lead to NCCL timeouts. This bug is a rare edge case, but we have not been able to root cause it yet.
But for export, we do not anticipate JIT tracing in distributed job training and therefore this PR is safe for export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164418
Approved by: https://github.com/StrongerXi, https://github.com/williamwen42
We do some fixes in pinned memory allocation stats collection and better differentiate between active vs allocated bytes.
Reviewed By: bbus, sayitmemory
Differential Revision: D83162346
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164412
Approved by: https://github.com/mradmila
A couple minor things to clean up the structure of `compile_fx` before we hit pre grad passes:
1. After patching config and recursively calling `compile_fx`, we don't need the patches anymore. We make the subsequent logic call a `_maybe_wrap_and_compile_fx_main` (both when a cpp wrapper exists and when it doesn't).
2. There's some recursive wrapping that happens on inputs and outputs before hitting pre-grad passes, which is now also separated out before calling a `_compile_fx_main`, where the actual work finally happens.
These also happen to fix a couple of TODOs in the old code.
Differential Revision: D83500704
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164169
Approved by: https://github.com/zhxchen17
Fixes#156052 and #156444.
This PR sets up the privateuseone key in Python to be used as a Python backend for PyTorch.
Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass with that device to hold arbitrary Python data as "device data" and use `torch.library` to register ops that take that Tensor.
Changes done in this PR:
1. Register a vanilla Device Guard: I extended NoOpDeviceGuard to allow a device index of 0 and to not raise errors when event-related functions are accessed. If I don't do those, I get errors when calling backward. (The CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.)
2. Tensor subclasses are allowed to not have `__torch_dispatch__` if the device is not CUDA or CPU. The comment on the check suggests it was to avoid a segfault when calling into ops that expect a storage. Here we have a different device, so we will not call into those ops.
3. A Python function that invokes the other incantations to set up the privateuseone backend.
This took inspiration of https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859
Approved by: https://github.com/albanD
**Summary:** In order to test numerics for replicate + pp, stage.py needs to be able to call replicate's backward manually as pipeline parallelism doesn't have this feature.
**Test Case**
1. pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164031
Approved by: https://github.com/weifengpy, https://github.com/H-Huang
ghstack dependencies: #163897
```Py
@triton.jit
def foo(dest, src):
nvshmem.get_nbi(dest, src, 100, 0)
# Some independent computation which overlaps with the get operation
...
# Wait for completion of the get operation
nvshmem.quiet()
```
Allows us to overlap comm and compute in the same kernel, instead of two kernels + signals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163540
Approved by: https://github.com/ngimel, https://github.com/fegin
The current behavior is to do "nothing", which means you will corrupt
data. If you're doing something similar to LocalTensor, where you're
overriding the behavior of collectives to do something numerically,
this can be unwelcome behavior. If you can error when this happens
it can help prevent silent numerical incorrectness.
Authored with claude code.
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162841
Approved by: https://github.com/dcci
----
This PR will be part of a series of PR's that aims to remove `.ci/aarch64_linux` folder entirely, such that Aarch64 manylinux build happens as part of `.ci/manywheel/build.sh`, the same as other platforms.
In this PR:
- We prebuild + install the Arm Compute Library in the manylinux docker image (at /acl), instead of building it at build time for every pytorch build. Also updated the jammy install path to be /acl too.
- We can therefore remove build_ArmComputeLibrary functions from the ci build scripts.
- There is also some refactoring of install_openblas.sh and install_acl.sh to align them together ( similar formatting, similar variable names, same place for version number update )
- We had 2 places defining the OpenBLAS version; this has been reduced to 1 now (install_openblas.sh).
- ACL_VERSION and OPENBLAS_VERSION can now be overridden at the build.sh level for developers, but there is only 1 version of each hardcoded for CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159737
Approved by: https://github.com/seemethere, https://github.com/aditew01
In #163455, the `reshape` was not a pure view op.
The `permute` before it created a non-contiguous tensor, which would trigger a data copy during the reshape.
This PR improves the implementation by removing the `urtensor` intermediate tensor completely.
Simply expanding `xtensor` achieves the `repeat` effect.
Before this PR, there were two data copies (in `urtensor.copy_` and `urtensor.reshape`).
Now, there is only one data copy, in the `.copy_()`.
Reshape does not copy data because it operates on a contiguous tensor.
One more note is that we do want at least one copy, because we want to duplicate the elements for the repeats.
Users can modify single elements in place without affecting others (see the sketch below).
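A rough sketch of the expand-then-single-copy idea, under the assumption above that exactly one copy is still wanted; `repeat_like` and its shapes are illustrative, not the actual ATen implementation.
```python
import torch

def repeat_like(x: torch.Tensor, repeats: int) -> torch.Tensor:
    # View x as (1, *x.shape) and expand along the new dim: no data copy yet.
    expanded = x.unsqueeze(0).expand(repeats, *x.shape)
    # One real copy materializes the duplicated elements into contiguous storage,
    # so in-place edits of single elements don't affect the other repeats.
    out = torch.empty_like(expanded, memory_format=torch.contiguous_format)
    out.copy_(expanded)
    # Reshape on the contiguous result is a pure view (no second copy).
    return out.reshape(repeats * x.shape[0], *x.shape[1:])

x = torch.arange(6.0).reshape(2, 3)
print(torch.equal(repeat_like(x, 3), x.repeat(3, 1)))  # True
```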
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163842
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Linux, Windows, and macOS CI workflows should all use `.ci/docker/requirements-ci.txt`
TODOS:
- Investigate why `choco install cmake` is needed to successfully detect MKL
- Move `psutil` installation from specific scripts into requirements-ci.txt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163396
Approved by: https://github.com/Skylion007
Summary: `nn.Parameter()` defaults to `requires_grad=True`, which causes issues when there are non-float parameters (see the sketch below).
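A minimal sketch of the failure mode, showing why non-float parameters need an explicit `requires_grad=False` (illustrative; the export test below covers the real path):
```python
import torch
import torch.nn as nn

int_weight = torch.arange(4)            # integer tensor

try:
    nn.Parameter(int_weight)            # default requires_grad=True -> RuntimeError
except RuntimeError as e:
    print(e)                            # only floating point/complex dtypes can require grad

p = nn.Parameter(int_weight, requires_grad=False)   # explicit opt-out works
```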
Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_non_float_weight
Differential Revision: D83598796
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164290
Approved by: https://github.com/angelayi
Summary:
New test sizes for `test_scaled_mm_vs_emulated_block_wise` all fail with
```
RuntimeError: Invalid scaling configuration
```
Disable these new tests for now (the remaining test is a parametrized
version of the original test case)
Test Plan:
`pytest test/test_scaled_matmul_cuda.py`
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164259
Approved by: https://github.com/jananisriram
ghstack dependencies: #164266
To run this, you need to install `mingw64-gcc-c++` and download windows cuda library toolkit.
See design doc and demo instructions in https://docs.google.com/document/d/1iDaChqA5nNKkBFTzsdkmoomvQlXHbnlb1Z4yEp7xaJA/edit?tab=t.0
If cross_platform_target is windows, we do the following:
- do not link to `sleef`. This can be improved in the future if we need it. Currently I avoid it because that requires extra setup on the linux side
- Use `mingw64-gcc-c++` to compile
- Use `WINDOWS_CUDA_HOME` instead of `CUDA_HOME` when linking to cuda
```
python test/inductor/test_aot_inductor_windows.py -k so
```
Other changes:
- decouples the compile_standalone config and the dynamic link flag
- creates a new aot_inductor_mode config module, which is used to control configs in aot_inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163188
Approved by: https://github.com/desertfire
Summary: This diff adds an API to query the expandable segment size for each stream so that we can use this info to warm up the segment in advance, so we don't incur any performance penalty during steady-state inference for new CUDA memory allocations.
Differential Revision: D76447308
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163771
Approved by: https://github.com/bbus
Previous work #158352 delivered CUDAGraph memory footprint reduction with no replay-time impact, but capture time regressed (up to 20× slower) due to repeated full-graph traversals. See previous benchmark results [here](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3215947565)
This PR removes capture/reply overhead while preserving the memory savings:
1. **Terminals as free markers**
We stop inserting empty nodes and instead record the current stream terminals as free markers. This avoids mutating the user’s graph and keeps semantics unchanged.
2. **Incremental, cached reachability**
We add a **per-graph reuse context** that caches reverse-traversal state:
* `graph_reuse_context[graph].visited[stream]` tracks nodes already seen from that stream’s terminal frontier.
* On each allocation during capture, we resume traversal from the latest terminals and only visit unseen nodes.
* A block is freed when all its recorded markers are in the visited set of its allocation stream—i.e., all markers are proven predecessors of future work.
See [the performance results here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRPvdd9Xa8W87ixbiA0da_qvOhrUAjUpFz0G-_j-MsDnoeRyhEa4_ut_W3rqcg1VVZVFJ-gucwov-3b/pubhtml?gid=1468302443&single=true), we sweep synthetic multi-stream CUDA Graphs built by `capture_benchmark.py` (same as before, we generate random interleaving of alloc/free/join with given probabilities, see [gist here](https://gist.github.com/eee4017/e2092d215b1d4bd46534148939af39e3)), and we compare median capture/replay times and memory. On an NVIDIA H100 PCIe across 24 configs, the optimization preserves reserved memory reduction at ~24–98%, leaves allocated memory unchanged, and brings capture time back to baseline (range 0.96–1.04× vs. baseline) with replay time unchanged (range 0.97–1.11×).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162186
Approved by: https://github.com/eqy, https://github.com/ngimel
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. The first test verifies Replicate works with gradient accumulation properly. The second verifies that replicate works correctly with a One-Forward-One-Backward (1F1B) pipeline parallelism schedule
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_gradient_accumulation
2. pytest test/distributed/_composable/test_replicate_training.py -k test_1f1b_microbatching
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162839
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836
Those were very useful in the past, because:
- CI builder jobs did not generate wheels, but rather ran `python setup.py develop` and shared docker layers; this is no longer the case, as all CI jobs produce wheels
- CD jobs were targeting pre-CXX11 ABI, but this is no longer the case after manylinux2_28 migration
Existing, but acceptable gaps:
- Windows libtorch debug builds sometimes might fail, but IMO it's ok not to be able to produce those for a few days, as the number of libtorch users is somewhat small
- All CD jobs are based on AlmaLinux, while CI are based on Ubuntu, but this could be adjusted if needed, besides AlmaLinux-9 and Ubuntu-22.04 are pretty close in terms of glibc and gcc versions
- CD jobs build for all GPU architectures, while CI only for the one being tested, but there are now periodic H100 and B200 jobs, and not a lot of development happens for Voltas or Pascals
Besides there are better tools to alert about the nightly failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164260
Approved by: https://github.com/seemethere, https://github.com/atalman
Fixes #147521
This modification allows users to pass a var of any size to GaussianNLLLoss as long as it is broadcastable to the input/target size.
Therefore, the demo code in #147521 results in the expected behaviour and correct output.
This allows all input sizes that match:
`input.size = (..., n, ...), var.size = (..., 1, ...)`
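A minimal sketch of a shape combination the relaxed check is meant to accept (broadcasting over a middle dimension); the exact shapes are illustrative.
```python
import torch
import torch.nn as nn

loss_fn = nn.GaussianNLLLoss()
input = torch.randn(3, 5, 10, requires_grad=True)
target = torch.randn(3, 5, 10)
var = torch.rand(3, 1, 10) + 0.1   # size 1 in a middle dim, broadcast over dim 1
loss = loss_fn(input, target, var)
loss.backward()
```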
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147522
Approved by: https://github.com/mikaylagawarecki
Fixes #163702.
This fixes 2 issues:
1. The value may inconsistently be a shape or string. This normalizes to handle both of these.
2. 1D shapes should not transpose data. This fixes the order of operations to prevent this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163966
Approved by: https://github.com/eellison
Summary:
As titled.
sdpa selects a backend based on a hardware check, and this fails when exporting with CUDA under fake mode on a CUDA-less machine.
We guard with an `at::cuda::is_available()` check before `at::cuda::getCurrentDeviceProperties()` and give warnings.
Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r nn_functional_scaled_dot_product_attention
Differential Revision: D83496154
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164162
Approved by: https://github.com/SherlockNoMad
# Problem
Inductor's FX backend receives sympy expressions for Triton launch grids, and passes these to a tracer to generate equivalent FX IR. However, the tracer does not support all possible sympy expressions. In particular, it can't handle ops like `floor` and `Pow` which would be found in an expression like `floor(x / y)`. Instead, it expects `FloorDiv(x, y)`, which has the advantage that all intermediate values are integers, unlike `x / y`.
Inductor's Python backend uses a trick where `ceil(x / y)` is computed in Python as `-(x // -y)`, which is faster when evaluating Python launch grids at runtime. However, this trick generates more complex sympy expressions, so the FX backend introduced a `"python_slow"` mode using a more familiar form of ceil division. However, this mode is slower to evaluate, which increased production CPU usage. (Internal reviewers see T237853632.)
# Solution
To get the best of both worlds, this PR removes `"python_slow"` mode, and generalizes the `replace_floor_div` function to handle the more complex expressions resulting from the `"python"` grid mode. The new algorithm is conceptually similar to the existing one, except instead of analyzing only the first argument to a `sympy.Mul` op, it checks all factors, so it can handle expressions containing both `Rational` and `Pow` ops, among other cases. It also uses `Mul.make_args` to handle the case when the argument to `floor` is not a `Mul`. Finally, it uses `expr.is_positive` to check the sign of symbolic exponents.
This new algorithm is guaranteed to convert all `floor` ops to an equivalent expression using `FloorDiv`. (To see this, consider that `floor(x) == FloorDiv(x, 1)`.) Note it may not remove all `Pow` ops, with a counterexample being `floor(x / (2 + z ** y))`, but it covers everything we've seen in practice for symbolic launch grids. In particular, it covers the typical case where `Pow` is a factor of the argument to `floor`, and the exponent is `-1`. In this situation, we move the `Pow` to the denominator of `FloorDiv` and the exponent becomes `1`, eliminating the `Pow` op.
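A minimal sketch of the kind of rewrite described above, assuming `FloorDiv` from `torch.utils._sympy.functions`; the expressions are illustrative, not the actual `replace_floor_div` code.
```python
import sympy
from torch.utils._sympy.functions import FloorDiv

x, y = sympy.symbols("x y", positive=True, integer=True)

# "python" grid mode emits ceil division as -(x // -y); in sympy this shows up as
# floor() over a Mul containing a Pow (y**-1), which the FX tracer cannot handle.
python_mode_expr = -sympy.floor(x * sympy.Pow(-y, -1))

# The generalized rewrite turns it into an equivalent FloorDiv expression whose
# intermediate values are all integers.
fx_friendly = -FloorDiv(x, -y)

print(python_mode_expr, "->", fx_friendly)
```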
# Test plan
This PR adds an end-to-end test for static padding with dynamic outer dimensions, which creates a difficult sympy expression that the existing algorithm would not be able to handle.
This PR also adds some unit tests for the `replace_floor_div` function. It can be difficult to construct end-to-end tests that expose all the trickiest expressions, as those tests have to pass through a number of other systems handling dynamic shapes. Therefore, it's easier to expose the edge cases with these new unit tests. The tests check that we can replace all `floor` ops in the input expression with `FloorDiv`, then they expand `FloorDiv` back to `floor` and check equality with the original expression.
Note this PR also requires some MTIA changes to pass internal tests. Those will be stacked onto the imported diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163828
Approved by: https://github.com/nandesuka, https://github.com/angelayi, https://github.com/jansel
Summary:
Under these circumstances, it seems reasonable to return a callable directly without a guard check when the user uses aot_compile on a function with a single compilation result.
When there are multiple entries (aot_compile_module), we should start enabling guard checks to tell different compiled functions apart.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163432
Approved by: https://github.com/dolpm, https://github.com/mlazos
# Summary
- Add a note to each `nn.LPPool*d` docstring explaining how `ceil_mode=True` interacts with right padding.
- Mirror the same clarification in the `torch.nn.functional.lp_pool*` docstrings so the rendered functional docs stay in sync.
# Motivation
The current PyTorch spec for **LPPool** does not fully match runtime behavior, which has led to downstream confusion in other specs (e.g., ONNX) and runtimes (e.g., [onnxruntime issue #25848](https://github.com/microsoft/onnxruntime/issues/25848)). A corresponding clarification was also made in the ONNX spec: [onnx/onnx#5741](https://github.com/onnx/onnx/pull/5741).
PyTorch’s **LPPool** implementation calls into **AvgPool**, which enforces the rule that windows starting entirely in the right padded region are ignored when `ceil_mode=True`. As a result, **LPPool** inherits the same behavior.
This is an edge case where the output size formula shown in the LPPool docs/spec is not sufficient on its own. Without the added caveat, the documentation is technically incorrect. This PR brings the LPPool docs in line with actual behavior.
Note that this is a trivial fix to the spec as all major implementers of the spec adhere to this caveat.
For comparison, both **MaxPool** and **AvgPool** already include this clarification in their spec. Their docstrings explicitly state:
> *When `ceil_mode=True`, sliding windows are allowed to go off-bounds if they start within the left padding or the input. Sliding windows that would start in the right padded region are ignored.*
Adding the same note to LPPool ensures consistency across all pooling operators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163186
Approved by: https://github.com/mikaylagawarecki
This is a simple refactor that just moves some logic in `_precompile_config` to two new functions for separation of concerns. This will allow subclasses e.g. out of tree to configure options and metadata for triton.compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162406
Approved by: https://github.com/exclamaforte
I experimented with 3 paths to get joint graph for DTensorized module and input
1. strict_export + aot_export_joint_with_descriptors
2. graph_capture + aot_export_joint_with_descriptors
3. aot_export_joint_with_descriptors alone
Added test to guard them.
Path 1 doesn't work, as the bw graph region is missing from the joint graph.
I am leaning towards making path 2 the recommended one.
If path 2 doesn't work going forward, we can fall back to path 3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163609
Approved by: https://github.com/tugsbayasgalan
Co-authored-by: suo <suo@fb.com>
Summary:
Original commit changeset: 06888d7ebff0
Original Phabricator Diff: D82932788
Restricted the test to SM90 for scaled_grouped_mm
Test Plan: TBD (will share the linux CI results)
Differential Revision: D83283991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163905
Approved by: https://github.com/angelayi
tl;dr performs bucketing while preserving comm-compute overlap.
In comm-compute overlap we will have a graph with:
```
def foo(...):
ag = all_gather(...)
hiding_compute = mm(...)
wait(ag)
```
There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.
Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.
We perform bucketing while augmenting the graph with these relationships. This can be done separably from comm-compute overlap, so long as the hiding compute relationships are passed in.
TODO:
- need to instrument fx graph so inductor respects these relationships.
- the compile time of the bucketing search can be sped up significantly by limiting what portion of the graph we traverse through
- more memory aware handling
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163960
Approved by: https://github.com/ruisizhang123, https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: #163215, #163754, #163959
In comm-compute overlap we will have a graph with:
```
def foo(...):
ag = all_gather(...)
hiding_compute = mm(...)
wait(ag)
```
There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.
Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.
This PR adds `AugmentedGraphHelper`, which adds the APIs and allows querying for dependencies on this augmented graph (a rough sketch of the idea follows).
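A rough sketch of the bookkeeping this enables (not the actual `AugmentedGraphHelper` API): extra hiding-compute edges plus merged start/wait sets whose members share the union of their dependencies.
```python
from collections import defaultdict

class AugmentedDeps:
    """Toy model: real graph edges + implicit 'hiding' deps + merged node sets."""

    def __init__(self):
        self.extra_deps = defaultdict(set)   # node -> implicit predecessors
        self.merge_sets = {}                 # node -> frozenset of merged nodes

    def add_extra_dep(self, node, dep):
        self.extra_deps[node].add(dep)

    def merge(self, nodes):
        group = frozenset(nodes)
        for n in nodes:
            self.merge_sets[n] = group

    def deps(self, node, real_deps):
        # Every member of a merged set sees the union of all members' deps.
        group = self.merge_sets.get(node, frozenset({node}))
        out = set()
        for n in group:
            out |= set(real_deps(n)) | self.extra_deps[n]
        return out - group

# Usage: wait(ag) implicitly depends on hiding_compute, which depends on all_gather.
aug = AugmentedDeps()
aug.add_extra_dep("wait_ag", "hiding_compute")
aug.add_extra_dep("hiding_compute", "all_gather")
print(aug.deps("wait_ag", real_deps=lambda n: {"all_gather"} if n == "wait_ag" else set()))
```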
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163959
Approved by: https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: #163215, #163754
This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing.
Subsequent prs will handle:
- use of exposure analysis to do bucketing
- make sure inductor respects comm/compute overlapping done at fx level
- non-profiling mm estimation/rank broadcasting of profile results
Other misc.:
- Validate accuracy of nccl estimations ( use ruisi's profiling instead ?)
For a llama 2d parallelism test, on forward, we overlap all but 2 of potentially hidden collectives. For backward, we overlap 217/269 of potentially hidden collectives. If you increase `compute_overlap_multipler` (for fudge factor of inaccurate comms estimation), that goes down to all but 16 of potentially hidden collectives.
fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3
bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163215
Approved by: https://github.com/IvanKobzarev
----
- `cmake_dependent_option` condition should be `USE_ROCM OR (USE_CUDA AND NOT MSVC)` (similar to the one for flash attention)
- Default settings should be user overridable, i.e. even if one builds for SM_10, they should be able to pass `USE_FBGEMM_GENAI=0` and skip the build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164165
Approved by: https://github.com/Skylion007
See also #163972, which was intended to be this PR.
Triton (release/3.5.x) by default ships CUDA12.8 ptxas.
This PR tries to bundle a ptxas version for cuda13, so that it can help https://github.com/pytorch/pytorch/issues/163801 when users run on new devices like THOR and Spark.
Fixes https://github.com/pytorch/pytorch/issues/163801
Test Plan:
Check binary size increase against nightly or v2.9RC
Install the binary onto a working THOR and GB200/GH100 machine (reproduce the original issue first on THOR), then install the binary built from this PR; we expect the issue to be gone without any additional user setting. Testing on GB200 is to ensure no regression.
Reference: https://github.com/pytorch/pytorch/pull/119750 and 5c814e2527
Note: with this PR, the pytorch world's torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the cuda13 ptxas binary.
However, as is, the Triton world does not know about the existence of this new cuda13 ptxas. So if a user thinks there is already a pytorch/bin/ptxas and deletes the ptxas from Triton, then c6ad34f7eb/python/triton/knobs.py (L216) would still complain that ptxas is not found (if removed, it won't know this new one is available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163988
Approved by: https://github.com/atalman
Previously we already replaced most uses of `python setup.py develop/install`.
This PR also replaces the use of `setup.py bdist_wheel` with the modern `python -m build --wheel` alternative.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156712
Approved by: https://github.com/atalman
ghstack dependencies: #156711
* Changes some internal logic for grouping so hopefully it's slightly less annoying to write code for
* Changes the invoking file summary to just use file, which I think is correct most of the time
* Adds some fields to the file summary, like skips, errors, etc so I can reuse it for file report regression things
Output should be the same, maybe with slightly more fields since I got rid of some of the pops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164016
Approved by: https://github.com/huydhn
Summary:
Previously, many arvr targets transitively depended on c10, not c10_ovrsource,
because they either explicitly depended on c10 (because they didn't know
better) or they depended on legacy Caffe2, which never got the ovrsource
treatment. So we found all these spots (driven by D82283623) and forced them
to query arvr mode to figure out which one they should use. The goal is you
NEVER have both targets in the same build rule at the same time.
This diff could be reverted if D82224960 works out but I haven't gotten it to work yet.
Test Plan: sandcastle
Reviewed By: EscapeZero
Differential Revision: D82390436
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164128
Approved by: https://github.com/albanD, https://github.com/malfet
# Problem
Inductor sometimes generates unbacked symints to handle things like mismatched branches of `torch.cond`. This code is represented by `pytree.KeyPath`, with special codegen logic to convert it to Python and C++. This was not previously supported by the FX backend.
# Feature
This PR adds support for unbacked symbol declarations to the FX backend. The implementation is fairly straightforward.
1. Instead of raw Python/C++, update the wrapper codegen method to emit a new Wrapper IR line called `UnbackedSymbolDefsLine`. This contains all the information needed to generate the Python and C++ code.
2. Move the existing Python/C++ codegen to a private method, which is invoked by `UnbackedSymbolDefsLine.codegen()`.
3. Implement a method to generate FX IR from unbacked symbol definitions. The implementation is based on recursive descent, consuming some keypath entries, emitting an FX IR node, and recursing to the rest of the keypath. It is conceptually identical to the existing algorithm for Python and C++, except it generates FX nodes.
4. The FX backend currently relies on size hints to generate autotuning arguments, and consequently autotuning does not support unbacked SymInts. At some point, we would like to generalize the autotuning logic to support these. But for now, simply emit a warning and skip autotuning when we see them.
5. The new test case exposed some tricky issues reconciling Triton call args with constants stored in `triton_meta`. This PR rewrites the relevant helper function to do this in a more principled way.
# Test plan
This PR imports an existing control flow test to the FX backend's test suite. The test uses unbacked symbol definitions to handle mismatched dynamic shapes coming from `torch.cond` branches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163729
Approved by: https://github.com/jansel
This is done before returning a command buffer, as subsequent callers are very likely to allocate their own encoder, which results in the following runtime error:
```
tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]:1090: failed assertion `A command encoder is already encoding to this command buffer'
```
Added regression test to `test_mps_extension`
Please note that `torch::mps::get_command_buffer()` should be called with the dispatch_queue held, both before and after this change, but many implementations skip that.
Fixes https://github.com/pytorch/pytorch/issues/163721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164093
Approved by: https://github.com/atalman, https://github.com/Skylion007
Summary: Relax stride check for block-wise scaling (1x128, 128x128) when a dimension of the scaling factor is 1. When the scaling tensor has a dimension of size 1, the stride is effectively "meaningless" to PyTorch, i.e. PyTorch decides to replace its stride with a default of `[1, 1]`. However, the old stride check required the stride to match one of the scaling dimensions. Here, we relax the stride check when the effective stride is 1 in order to allow for cases in which `K <= 128` and `N <= 128`.
Test Plan:
```
pytest -s -v test/test_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_block_wise_float32_lhs_block_1_rhs_block_128_cuda 2>&1 | tee ~/personal/stride_check.log
```
Differential Revision: D83023706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163829
Approved by: https://github.com/lw, https://github.com/eqy
Summary: For H100s and below, add `op_name="scaled_mm"` to the template heuristic for `CUDAScaledTMATemplateConfigHeuristic` such that `scaled_mm` persistent + TMA tests do not default to the "mm" heuristics.
Test Plan: `test_max_autotune.py`
Differential Revision: D83390775
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164019
Approved by: https://github.com/njriasan
Fixes#163597
- Updates the fast SDPA implementations to take in query tensor stride info, similar to key and value, instead of assuming the strides.
- Updated tests with additional transpose/permutation layouts; the new tests catch the regression (see the sketch after this list).
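A minimal sketch of the kind of layout the new tests exercise, with illustrative shapes: a query made noncontiguous by a transpose while key and value stay contiguous.
```python
import torch
import torch.nn.functional as F

device = "mps" if torch.backends.mps.is_available() else "cpu"

# Query made noncontiguous via transpose; key/value remain contiguous.
q = torch.randn(1, 64, 8, 32, device=device).transpose(1, 2)  # (B, H, L_q, E)
k = torch.randn(1, 8, 256, 32, device=device)
v = torch.randn(1, 8, 256, 32, device=device)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape, q.is_contiguous())  # torch.Size([1, 8, 64, 32]) False
```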
### Benchmarking with script found in [implementation PR](https://github.com/pytorch/pytorch/pull/152781#:~:text=19.8%25%20speed%20improvement-,Script%20to%20get%20perf%3A,-import%20torch%0Aimport)
Times are averaged over 100000 iterations. This change should not have any significant performance difference. Tested on an M3 Pro
### Vector Fast Path (q_len=1, k_len=256)
- Before: 0.160 ms
- After: 0.157 ms
### Vector 2-pass (q_len=1, k_len=4096)
- Before: 0.342 ms
- After: 0.339 ms
### Vector Fast Path (q_len=8, k_len=256)
- Before: 0.228 ms
- After: 0.231 ms
### Vector 2-pass (q_len=8, k_len=4096)
- Before: 0.432 ms
- After: 0.436 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163598
Approved by: https://github.com/malfet
Summary:
ran into this when precompiling baidu/ERNIE-4.5-21B-A3B-PT
codegen after fix:
```py
import triton
import triton.language as tl
from torch._inductor.runtime.triton_heuristics import start_graph, end_graph
from torch._C import _cuda_getCurrentRawStream as get_raw_stream
with torch.cuda._DeviceGuard(0):
stream0 = get_raw_stream(0)
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163707
Approved by: https://github.com/jamesjwu
Summary: Partly importing and adapting https://github.com/pytorch/pytorch/pull/138388, adding SVE128 as an ISA.
The intention is to add SVE128 translation layers for the Vectorized data types.
The idea is to have 1 PR per file, aside from the current one, plus a last one modifying the cmake files to enable the new ISA selectively.
Tested current changes on a nightly run, to verify no regressions occur on systems leveraging SVE256.
No regressions spotted when running test_ops.py, a set of 34k unit tests. A machine leveraging SVE128 was used towards this testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158932
Approved by: https://github.com/malfet
For https://github.com/pytorch/pytorch/issues/114850, we will port aten unit tests to Intel GPU. This PR works on some test cases of test/test_ops.py. We enable Intel GPU with the following methods and try our best to keep the original code style:
1. Extended XPUTestBase.get_all_devices to support multiple devices
2. Added skipXPU decorator
3. Extended onlyOn to support device list
4. Enabled 'xpu' for some test paths
5. Added allow_xpu=True for supported test classes.
6. Replaced onlyCUDA with onlyOn(['cuda', 'xpu']) for supported tests
7. Use skipIfXpu and skipXPU to disable unsupported test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159944
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
It's unclear why we had the disable in the first place. With install_free_tensors, we are tracing into this hook. A better way would be to place the tracer without any hook. For now, disable the checking while dynamo is tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164084
Approved by: https://github.com/tugsbayasgalan
This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus more on different cuda hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.
**Benchmark Configuration and Coverage:**
* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI.
* The CI benchmarking is running on various hardwares: H100, A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530
Approved by: https://github.com/huydhn
Co-authored-by: Huy Do <huydhn@gmail.com>
This adds basic support for subclass inputs in export (specifically for non-strict). I had to make fakify a little more complicated, which risks further divergence from dynamo fakification. But the dynamo one is so complex that I feel it is better to do it this way. Also improved the fake mode detection logic to recursively look into subclass inner tensors.
Differential Revision: [D83156489](https://our.internmc.facebook.com/intern/diff/D83156489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163770
Approved by: https://github.com/avikchaudhuri
Replace the **runtime_error** of vanilla C++ exceptions with **TORCH_CHECK** in **torch/nativert/***
Vanilla C++ exceptions should not exist in the core part of PyTorch because of its cross-language nature. Compared with vanilla C++ exceptions, TORCH_CHECK has richer error context and a unified error-handling mechanism. This commit replaces runtime_error with TORCH_CHECK in the files under
torch/nativert/*.
Fixes part of #148114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163308
Approved by: https://github.com/dolpm
Currently, the Cholesky factorization and least squares operations default to MAGMA when PyTorch is compiled for ROCm. This shows suboptimal performance.
This change allows PyTorch to rely on hipSOLVER instead of MAGMA.
@jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163977
Approved by: https://github.com/Skylion007
Summary:
Running a toy example through `torch.compile(fullgraph=True, backend="inductor")` with default inductor config, I tried to see what passes are run in each of pre-grad, joint-graph, and post-grad phases by printing out the subsystem in `GraphTransformObserver`. However the subsystem showed up as None in a bunch of transforms that were run in each of those phases, so this PR adds some additional annotations.
Note that these annotations are probably not a complete set, since other transforms may run based on changes to the config that are not covered here.
Hopefully this doesn't change behavior. However, I did notice that bisecting relies on disabling various phases, which means that while before some passes would *not* be disabled (because their subsystem was `None`), now they would.
Test Plan: existing tests + manual test described in summary
Differential Revision: D83306676
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163922
Approved by: https://github.com/jansel
A DTensor that contains partial placement shouldn't be checkpointed (DCP.save) -- the result is not correct and DCP doesn't know how to handle it.
There are several APIs that are only used by checkpointing, e.g., `__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has Partial placement.
Ideally, we want to add the following test:
```
with self.assertRaisesRegex(
    RuntimeError, "Any checkpointing related operations are not supported for"
):
    dcp.save({"dtensor": dtensor}, checkpoint_id=tempfile.gettempdir())
```
While we do see the RuntimeError being raised, it is raised in another thread because the DTensor checkpoint APIs are called by DCP in a separate thread, which assertRaisesRegex cannot capture.
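For illustration, a minimal sketch of the kind of guard described above, assuming the public `Partial` placement type from `torch.distributed.tensor`; the helper name and the exact error message are assumptions, not the actual DTensor source:
```python
# Hypothetical guard that the checkpoint-only APIs could run on `self`.
from torch.distributed.tensor import Partial


def _assert_no_partial_placement(dtensor):
    # A Partial placement holds un-reduced per-rank values, so saving it would
    # persist data that does not represent the logical tensor.
    if any(isinstance(p, Partial) for p in dtensor.placements):
        raise RuntimeError(
            "Any checkpointing related operations are not supported for "
            "DTensor with Partial placement(s); redistribute to Replicate/Shard first."
        )
```
`__create_write_items__` and the other checkpoint-only APIs would call such a check before producing write items.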
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163941
Approved by: https://github.com/tianyu-l
As the title states.
**Changes:**
- torch.cuda.amp.autocast
- torch.cpu.amp.autocast
- add explicit `__new__` and `__init_subclass__` to the classes above so that inspect.signature retrieves the correct signature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163654
Approved by: https://github.com/Skylion007
This adds a PR time benchmark that checks for runtime overhead on a very small graph. This will help track regressions in runtime overhead.
Example Results:
```
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163866
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
We were seeing instances of stdlib files in clang-tidy output, so this
just removes them from the things that lintrunner will report. The
longer-term fix here would be to modify the clang-tidy configuration to
do the correct thing, but that requires a bit more investigation into
why this only happens in CI and is not reproducible locally.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164008
Approved by: https://github.com/ZainRizvi
While refactoring the bookkeeping for DeviceMesh while leveraging the CuTe layout, we found that we need two more util functions. One is to check whether a layout has overlap inside it or not. For example, (2,2):(2,1) has no overlap while (2,2):(2,2) has overlap.
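As an illustration of the semantics (not the actual DeviceMesh utility), a `(shape):(stride)` layout has overlap exactly when two distinct coordinates map to the same linear offset:
```python
from itertools import product


def has_overlap(shape, stride):
    # Enumerate the linear offset of every coordinate; a duplicate offset means
    # two elements of the layout alias the same address.
    offsets = [
        sum(i * s for i, s in zip(coord, stride))
        for coord in product(*(range(d) for d in shape))
    ]
    return len(offsets) != len(set(offsets))


assert not has_overlap((2, 2), (2, 1))  # offsets 0, 1, 2, 3
assert has_overlap((2, 2), (2, 2))      # offsets 0, 2, 2, 4
```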
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163367
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288, #163928, #163930
Fixes #160598 Fixes #160551 Fixes #160507
This PR fixes a bug in the `test_garbage_collect_expandable` unit test where the finally block incorrectly re-read the current per-process memory fraction instead of restoring the original value. Without the fix, the other tests in the `test/test_cuda.py` test suite were impacted and failed with an OOM error on ROCm.
This ensures proper cleanup and isolation of test state, maintaining test correctness and avoiding side effects like the OOM error shown below.
For example, `test_autocast_checkpointing` failed with the below error https://github.com/pytorch/pytorch/actions/runs/17982223758/job/51153974194 on ROCm
`torch.OutOfMemoryError: HIP out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 255.69 GiB of which 252.97 GiB is free. 1.20 GiB allowed; Of the allocated memory 1.14 GiB is allocated by PyTorch, with 17.00 MiB allocated in private pools (e.g., HIP Graphs), and 18.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164000
Approved by: https://github.com/jeffdaily
Fixes #163640
This PR avoids a mask left align check in the case that we're operating under torch.compile / torch.export. Originally, I planned to make a more invasive change to auto-disable the fast path entirely underneath torch.compile / torch.export, but I realized during testing that the fast path wasn't actually causing compile issues outside of the narrow issue identified here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163773
Approved by: https://github.com/mikaylagawarecki
Summary:
As titled.
Without the diff, we got P1963055009
With the diff, passing in the environment, we can do correct sym_int deduction:
https://fburl.com/mlhub/p5zy7o28
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints -- test_sdfpa_unbacked_strides --print-passing-details --env TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 --env TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(u0, 0)"
```
Without the fix: P1964887260
With the fix: P1964888579
Differential Revision: D83211018
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163925
Approved by: https://github.com/ColinPeppler
## Issue
From an internal use case, we found that if we have an equality rule like:
```
Max(15, u0) == s0 * Max(15, u0)
```
This would lead to a wrong substitution rule being generated in the substitution table; as a result, the process gets stuck in the substitution loop as if it hangs indefinitely, doing the following substitutions:
```
Max(15, u0)
--> s0 * Max(15, u0)
--> s0 ** 2 * Max(15, u0)
--> s0 ** 3 * Max(15, u0)
--> s0 ** 4 * Max(15, u0)
...
```
The root cause is with SymPy expression comparison: as `Max` is [not inside the op class table](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L50-L86), it'll take the [UNKNOWN](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L120) order, and considered bigger than any other types of expressions.
## Fix
1. Added a break-out from the substitution while-loop to warn about any excessive substitutions; what threshold should be used here and how to pass it are open to suggestion, using a hard-coded static value to keep things simple for now (see the sketch after this list)
2. Enhanced the sympy expression comparison logic, so that we first check whether one expr "has" the other one, to help work around the issue with `Max` here
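A minimal sketch of the bounded substitution loop from fix (1); the limit value, warning text, and function name are assumptions rather than the actual Inductor code:
```python
import sympy


def apply_substitutions(expr: sympy.Expr, replacements: dict, limit: int = 30) -> sympy.Expr:
    # Repeatedly apply the substitution table until a fixed point is reached,
    # but bail out (with a warning) if the expression keeps growing.
    for _ in range(limit):
        new_expr = expr.xreplace(replacements)
        if new_expr == expr:
            return new_expr
        expr = new_expr
    print(f"Substitution limit ({limit}) reached w/ {expr}")
    return expr
```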
## Testing
- with the unittest alone --> unittest stuck
- with the unittest and the while-loop breakout, we could see tests finish with the warning "**Substitution limit reached**":
```
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] | !!! WARNING !!! |
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6947s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda W0923 13:00:39.633000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ u1**30*Max(15, u0)
W0923 13:00:39.679000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ 64*u1**30*Max(15, u0)
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6278s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]
============================ 2 passed, 1 skipped, 870 deselected in 19.66s ============================
```
- with the unittest + comparison logic enhanced, we don't see the warning any more:
```
Running 3 items in this shard
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:15:39.560000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] | !!! WARNING !!! |
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:15:39.562000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.6093s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.0502s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]
============================ 2 passed, 1 skipped, 870 deselected in 21.99s ============================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163685
Approved by: https://github.com/jansel
Differential Revision: [D82603767](https://our.internmc.facebook.com/intern/diff/D82603767)
Previously, I forgot to handle the call_module case, whose nodes now have export_root prepended to their names. Basically I want to clean up something like:
```
graph():
%l_self_export_root_sub_mod = call_module[target=l_self_export_root_sub_mod](%x, %y)
%l_self_export_root_sub_mod_1 = call_module[target=l_self_export_root_sub_mod](%x, %y)
```
The Dynamo graph can have call_module nodes with messed-up names due to our wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163136
Approved by: https://github.com/avikchaudhuri
Adds dtypeIfMPS so that if an op is supported we get the proper error, e.g. an unexpected success. Before, we would never get an unexpected success because tests were run with the torch.double dtype, which always fails on MPS since the backend does not support that dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163951
Approved by: https://github.com/malfet
Summary:
When unpickling a fake tensor in the FX graph pickler, it only sets the fake mode of the current tensor's metadata to the one that is consistent with the pickler's `unpickle_state`. However, it doesn't set the fake mode of a tensor's base tensor when that tensor is a view.
This will cause an issue when dumping and loading the following graph
```
class GraphModule(torch.nn.Module):
    def forward(self, s77: "Sym(s77)", L_x_: "f32[s77, 8]"):
        l_x_ = L_x_
        chunk = l_x_.chunk(2, dim = -1); l_x_ = None
        y: "f32[s77, 4]" = chunk[0]; chunk = None
        y_repeat: "f32[s77, 8]" = y.repeat_interleave(2, dim = -1); y = None
        return (y_repeat,)
```
because `repeat_interleave` will create an intermediate fake tensor of size `[s77, 2, 4]` and it will become the base of the node `y_repeat`'s `meta['val']`.
This causes issues during the deserialization phase when applying AOT precompile to DeepSeek in vLLM.
Test Plan:
This has been tested in vLLM with DeepSeek.
As for unittest, ideally it should be `test_aot_compile_repeat_interleave` with mark_dynamic turned on. However, that's leading to some other pickle issues.
```
python test/dynamo/test_aot_compile.py -k test_aot_compile_repeat_interleave
```
I have yet to figure out a more appropriate unittest. But a proof-of-concept demo would be the following:
```
import inspect
import sympy
import torch
from torch.fx._graph_pickler import GraphPickler
from torch.fx.experimental.symbolic_shapes import ShapeEnv
from torch._subclasses import FakeTensorMode
from torch.fx._graph_pickler import GraphPickler, Options
from unittest.mock import patch
class M(torch.nn.Module):
    def forward(self, x):
        chunk = x.chunk(2, dim=-1)
        y = chunk[0]
        y_repeat = y.repeat_interleave(2, dim=-1)
        return y_repeat

def my_custom_backend(gm, example_inputs):
    global gm_global
    gm_global = gm
    return gm.forward

m = M()
m_opt = torch.compile(m, backend=my_custom_backend, fullgraph=True)
sample_inputs = (torch.randn(2, 8),)
torch._dynamo.mark_dynamic(sample_inputs[0], [0])
opt_out = m_opt(*sample_inputs)

graph_reducer_override = GraphPickler.reducer_override

def _graph_reducer_override(self, obj):
    if (inspect.isclass(obj) and issubclass(obj, sympy.Function)
            and hasattr(obj, "_torch_unpickler")):
        return obj._torch_unpickler, (obj._torch_handler_name, )
    if isinstance(obj, FakeTensorMode):
        return type(None), ()
    return graph_reducer_override(self, obj)

with patch.object(GraphPickler, "reducer_override", _graph_reducer_override):
    pickled_gm = GraphPickler.dumps(gm_global, Options(ops_filter=None))
    fake_mode = FakeTensorMode(shape_env=ShapeEnv())
    loaded_gm = GraphPickler.loads(pickled_gm, fake_mode)
```
Differential Revision: D83112599
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163738
Approved by: https://github.com/zhxchen17
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@229e8b](229e8ba104), includes:
- Revert tracking of Work status for FlightRecorder in ProcessGroupXCCL to fix memory leak
- Enable SYCL warnings on Linux
- Fix accuracy issues with CTC loss
- Enable aten::nonzero_static on XPU backend
- Stop recursive calculations in polynomial kernels if tensor has NaNs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163758
Approved by: https://github.com/EikanWang
The call to `record_function` adds overhead even if profiling is disabled, which can as much as double the total runtime overhead of a compiled function. #163566 aims to make `record_function` more efficient, but doesn't fully eliminate the overhead. This change adds a check for whether profiling is active before using `record_function`, which avoids the issue altogether.
`TestExecutionTrace.test_execution_trace_with_pt2` in https://github.com/pytorch/pytorch/blob/main/test/profiler/test_execution_trace.py#L372 already checks that the `record_function` region is tracked during profiling.
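A minimal sketch of the guarded pattern, assuming the module-level `_is_profiler_enabled` flag in `torch.autograd.profiler`; the exact check and region name used in the compiled-function wrapper may differ:
```python
import torch
from torch.autograd.profiler import record_function


def call_compiled(fn, *args):
    # Only pay the record_function overhead when a profiler is actually active.
    if torch.autograd.profiler._is_profiler_enabled:  # assumed flag name
        with record_function("compiled_function"):
            return fn(*args)
    return fn(*args)
```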
Comparison of the `benchmarks/dynamo/microbenchmarks/overheads.py ` results:
Before Change:
```
requires_grad=False
compiled 56.9us (warmup=10.7s)
requires_grad=True
compiled 99.4us (warmup=0.2s)
inference_mode()
compiled 55.7us (warmup=0.1s)
```
After Change:
```
requires_grad=False
eager 6.9us (warmup=0.0s)
compiled 23.9us (warmup=22.3s)
requires_grad=True
eager 8.7us (warmup=0.0s)
compiled 56.8us (warmup=0.1s)
inference_mode()
eager 6.3us (warmup=0.0s)
compiled 22.2us (warmup=0.1s)
```
Additionally, #163866 introduces an instruction count benchmark. Because that is not merged and activated yet, here is a comparison:
Before Change:
```
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
```
After Change:
```
runtime_overhead_inductor,instruction_count,149997
runtime_overhead_inductor_inference_mode,instruction_count,163397
runtime_overhead_inductor_requires_grad,instruction_count,220722
runtime_overhead_inductor_requires_grad_backward,instruction_count,78276
runtime_overhead_inductor_dynamic,instruction_count,161177
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,175495
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,235674
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77475
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163747
Approved by: https://github.com/mlazos, https://github.com/anijain2305
This partially solves the issue https://github.com/pytorch/pytorch/issues/163641. We do not need to ban unbacked-to-unbacked replacements if all RHS symbols are inputs, since we know those symbols are seen by the whole program.
This issue was found as I was tracing some vLLM models with unbacked symbols, namely Qwen/Qwen2-1.5B-Instruct; doing those replacements makes the reasoning logic easier.
As for the similar data-dependent pattern, I am thinking of creating a set of replacements that we apply only during static eval
instead of none, to make reasoning better.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163652
Approved by: https://github.com/bobrenjc93
Summary:
Improve op coverage of exporting a CUDA model on a CPU-only machine under fake tensor mode.
For `torch.nn.functional.conv2d`, it will `_select_conv_backend` based on input and weight shapes.
When calling into `supportsDepthwiseConvolutionWithCuDNN()`, it calls `at::cuda::getCurrentDeviceProperties()` and fails on a CPU-only machine.
So we check if CUDA is actually enabled first.
Test Plan: TORCH_SHOW_CPP_STACKTRACES=1 buck2 run fbcode//caffe2/test:test_export -- --r nn_functional_conv2d
Reviewed By: angelayi, henryoier
Differential Revision: D80562984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163912
Approved by: https://github.com/SherlockNoMad
Including:
- `torch/csrc/instruction_counter`
- `torch/csrc/lazy`
- `torch/csrc/monitor`
- `torch/csrc/profiler`
- `torch/csrc/dynamo`
Fixes part of #148114
Due to a personal mistake with PR #163317, this PR does the same thing, **and PR #163317 has already been approved by @albanD.**
This was a personal mistake on my part, and I'm so sorry about that. Hope you won't mind @albanD. 🥹
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163610
Approved by: https://github.com/albanD, https://github.com/Skylion007
Summary:
Triton templates tend to perform very poorly on large K, hence the introduction of decompose_k. As a result, when decompose_k is selected we disable exploring the Triton templates. We may want to consider an override in the future.
Note: Based on the timing results it may be desirable to better refine/prune the decompose k decisions.
Testing:
Tested by looking at the autotune/compilation time using a single shape in TritonBench.
`TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 python run --op gemm --rep 1000 --sleep 1.0 --m 512 --n 512 --k 300000 --only pt2_matmul_maxautotune`
Before this change:
`SingleProcess AUTOTUNE benchmarking takes 13.5368 seconds and 0.1595 seconds precompiling for 38 choices`
With this change:
`SingleProcess AUTOTUNE benchmarking takes 9.9626 seconds and 0.0020 seconds precompiling for 11 choices`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163781
Approved by: https://github.com/eellison, https://github.com/PaulZhang12
Old signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor(a!) in_out_splits, str group_name)`
New signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name)`
i.e. split `in_out_splits` into IN tensor and OUT tensor so that we can define the TORCH_LIBRARY signature better.
Also to be in line with the 2D version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163837
Approved by: https://github.com/fduwjj
ghstack dependencies: #163886
Summary:
LSTM was not exportable with non-strict export as it failed at `_detect_attribute_assignment`
This is because the `_flat_weights` attribute in LSTM is a list of registered parameters and will be updated by the `_update_flat_weights` method in `forward`.
However, in `_detect_attribute_assignment`, we manually restore the state of the module by `mod.__dict__.update(snapshot)`. Therefore, it should be fine to turn the `ValueError` into a warning so that RNN models are exportable with non-strict export.
Added test to verify that there is no lifted tensor constant and no fake tensor leakage.
Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_export_rnn_variants_with_warning
Differential Revision: D83196971
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163809
Approved by: https://github.com/tugsbayasgalan
What started as simple fix for `mps_convolution_backward_input` resulted in a pretty significant refactor/fixes:
- Updated `mps_conv_use_channels_last` to return channels last output if either input or weights are channels last
- Use the same primitive throughout `Convolution.mm` to determine whether the output should be allocated in channels-last format or not
But doing only those two resulted in a crash in `test_memory_format_nn_Conv2d_mps_float32` when the weights are channels last and bias is present:
```
% python -c "import torch;print(torch.nn.functional.conv2d(torch.rand(2, 4, 3, 4,device='mps'), torch.rand(5, 4, 3, 3,device='mps').to(memory_format=torch.channels_last), torch.rand(5,device='mps')))"
/AppleInternal/Library/BuildRoots/4~B5E4ugDCh2RsPWAjMEoPu8LC5w1yXEwd7XweDhg/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:3619: failed assertion `Error: MLIR pass manager failed'
zsh: abort python -c
```
Which requires a more thorough redesign/cleanup, namely:
- Do not alter the layout based on the MacOS version, but rather do additional copies on MacOS-14 if the inputs/output or weight are in channels-last format (done by defining `std::optional<Tensor> output_c;` that contains a contiguous copy of the output tensor)
- Introduced `input_suggested_layout` which is set to ChannelsLast if and only if input is channels last and is running on MacOS-15+
- Delete unused `memory_layout` and `group` arguments from `fill_depthwise_conv_desc`
- Fix bias broadcasting logic for channels last
As a result, in addition to adding one more regression test, this change removes `expectedFailures` from:
- `TestModule.test_memory_format` for `Conv2d`, `ConvTranspose2d`, `LazyConv1d`, `LazyConvTranspose1d`
- `test_require_stride_expanded_dynamic_shapes`
- `test_mutable_custom_op_fixed_layout2` for MacOS-14
Fixes https://github.com/pytorch/pytorch/issues/161905
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162776
Approved by: https://github.com/Skylion007
**Summary:** In order to ensure that replicate acts as intended (a specialized version of HSDP), we need to make sure that it can pass the same tests that fully_shard can for training. This tests that the replicate function works correctly when combined with activation checkpointing.
**Test Case**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_with_activation_checkpointing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162830
Approved by: https://github.com/mori360
Summary:
- Move the `provenance_level` flag check to inside the `set_kernel_post_grad_provenance_tracing` call to simplify the code
- Move the `set_kernel_post_grad_provenance_tracing` call and `write_provenance_debug_handle` call to `codegen_comment`.
- If some `call_kernel` call sites don't have a preceding `codegen_comment` call, add one. Now all `call_kernel` call sites are accompanied by a `codegen_comment` call.
- Add a `codegen_comment` method to BaseScheduling and remove the noop `codegen_comment` method in Scheduling
- Remove `debug_handle` from `call_kernel`.
Test Plan:
CI
```
buck run @//mode/opt-split-dwarf fbcode//caffe2/test/inductor:provenance_tracing
```
Differential Revision: D82839271
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163378
Approved by: https://github.com/angelayi
The previous uneven `_StridedShard` in https://github.com/pytorch/pytorch/pull/150490 seems to fail on cases like sharding `tensor = torch.arange(6)` with FSDP 2, TP 2.
This PR attempts to reinvent `_StridedShard`.
I didn't test nested `_StridedShard`, because there shouldn't be any use cases. I think it will become quite messy when it comes to **nested uneven** `_StridedShard`. We are probably going to deprecate it anyway after @zpcore 's work https://github.com/pytorch/pytorch/pull/160266 on ordered sharding, so IMO not worth it to make it too general.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163843
Approved by: https://github.com/ezyang
Summary:
The current implementation of the `histc` function on CPU doesn't take into account the nature of the floating point precision representation when two numbers have very different magnitudes.
In the code of `histc` there is the following logic, which tries to fix an issue when the automatically calculated `min` and `max` are identical:
```
if (leftmost_edge == rightmost_edge) {
  leftmost_edge -= 1;
  rightmost_edge += 1;
}
...
TORCH_CHECK(leftmost_edge < rightmost_edge, "torch.histc: max must be larger than min");
```
But not for all floating point values does expanding the range by exactly 1 give a representable result that is different from the original value.
The test code:
```
info = th.finfo(th.float32)
f_min = info.min
test_tensor = th.ones((224, 224), dtype=th.float64) * f_min
res = th.histc(test_tensor, bins=10)
```
Actual result:
```
RuntimeError: torch.histc: max must be larger than min
```
Expected result:
Everything should work fine.
NOTICE: If we set `f_min` to just a small enough number, the code works, which demonstrates the correct purpose of the possible range correction.
In short, `f_min + 1 == f_min` evaluates to true, since we reach the limit of the floating point representation.
Please notice, this is not a limitation of the float32 data type, since all computations happen in float64 (C++ data type `double`). The magnitudes are just different enough that we reach the representable precision with the simple approach of `+/-1`.
Interesting is that the `histogram` function doesn't throw an exception, because its edge-range selection is implemented differently.
The fix we propose is to use `std::nextafter`, which returns the next representable floating point value starting from the current one in the direction of the lowest or max numbers. In theory, the mathematically correct approach is to use this function without constraints, but to maintain backward compatibility in case there is code that relies on the current logic of the `+/-1` offset, we call `std::min` and `std::max` to pick the right representable value (i.e., for small floating point values the next representable value differs by a step smaller than 1, while for large values the step is larger than 1).
We could stick to the `histogram` implementation, but again, to avoid possible backward compatibility breaks, we decided to use the fix presented in this change.
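The same edge-expansion logic expressed in Python for illustration only (the actual fix is in C++ using `std::nextafter`; the function name here is illustrative):
```python
import math
import sys


def expand_degenerate_range(leftmost: float, rightmost: float) -> tuple[float, float]:
    if leftmost == rightmost:
        # Keep the historical +/-1 expansion when it still produces a different
        # representable value, otherwise step to the adjacent representable double.
        leftmost = min(leftmost - 1, math.nextafter(leftmost, -sys.float_info.max))
        rightmost = max(rightmost + 1, math.nextafter(rightmost, sys.float_info.max))
    assert leftmost < rightmost, "torch.histc: max must be larger than min"
    return leftmost, rightmost
```
For the float32-min magnitude from the repro, `leftmost - 1` equals `leftmost`, so the `nextafter` value is picked; for small magnitudes the `+/-1` expansion wins, preserving the old behavior.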
*The real use case scenario:*
In our project we use the well-known transformer version from HuggingFace which fills up the buffer with the float32 min (please note this is not the smallest value close to 0, it's the lowest representable value, which is effectively `-max`).
The code where it sits is here:
https://github.com/huggingface/transformers/blob/v4.51.1/src/transformers/models/mimi/modeling_mimi.py#L1159
Switching to other version of the transformer will lead to other issues in our project and the bug which we fix here may appear in other projects and scenarios.
The real world problem appears when the CPU version of `histc` is called on such a tensor. In our use case, it happens because this tensor is an input to the softmax activation function and, as part of the quantisation, the input should go through the observer as well. In our case the default Histogram observer is selected, which calls `histc`.
Test Plan:
The simple test code snippet doesn't produce failure:
```
f_min = th.finfo(th.float32).min
test_tensor = th.ones((224, 224), dtype=th.float32) * f_min
th.histc(test_tensor, bins=10)
```
**Testing update:**
The `test_histc` has been updated accordingly.
Now when we have +INF as all values of the tensor, the previous representation of the floating number should be <max_float>, hence the assert message is changed from `[inf, inf]` to `[<max_float>|inf, inf]`.
The test also extended to check the assert message when tensor is filled with values -INF and with combination of (-INF, +INF).
The new regexp assert includes possible output as `inf` and any floating point number in scientific representation for one of the bin edges. We left `inf` as possible value due to possible difference in implementation between CPU and CUDA.
Differential Revision: D82955597
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163506
Approved by: https://github.com/jermenkoo, https://github.com/malfet
This reverts commit 5494b2a8d38c3ddbeb2d96a5ac990e20ec4c48fd.
Need to skip `test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_zero_sized_cuda_*` again. Tests are failing now with "core dumped" error
```
python test_sparse_csr.py -v -k test_sampled_addmm_zero_sized_cuda_float64
test_sampled_addmm_zero_sized_cuda_float64 (__main__.TestSparseCSRCUDA) ... /tmp/pytorch/test/test_sparse_csr.py:2503: c = torch.empty(m, n, dtype=dtype, device=device, layout=torch.sparse_csr)
GPU core dump created: gpucore.186789
:0:rocdevice.cpp :2992: 4701819131755 us: Callback: Queue 0x760cdcd00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
Aborted (core dumped)
```
These failures are linked to `test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_*` due to incorrect test log parsing. We will be able to close these issues also:
- Fixes https://github.com/pytorch/pytorch/issues/163663
- Fixes https://github.com/pytorch/pytorch/issues/160786
- Fixes https://github.com/pytorch/pytorch/issues/160785
- Fixes https://github.com/pytorch/pytorch/issues/160784
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163848
Approved by: https://github.com/jeffdaily
As the title states, suffixes like `.dylib` and `.lib` can be replaced by `CMAKE_SHARED_LIBRARY_SUFFIX`, and prefixes like `lib` can be replaced by `CMAKE_SHARED_LIBRARY_PREFIX` on Unix or `CMAKE_IMPORT_LIBRARY_PREFIX` on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163850
Approved by: https://github.com/albanD
As the title says. In practice we found that sometimes the output dtype of gather does not match across all ranks, which is undefined behavior. The same applies to broadcast and scatter. Since they are all completed, we should not treat them as errors and can skip them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163839
Approved by: https://github.com/VieEeEw
```
import torch
import torch.fx.traceback as fx_traceback
import torch.export
class M(torch.nn.Module):
    def forward(self, x):
        with fx_traceback.annotate({"pp_stage": 0}):
            with fx_traceback.annotate({"fdsp_bucket": 0}):
                x = x + 1
            x = x - 2
            with fx_traceback.annotate({"cuda_stream": 2, "fsdp_bucket": 1}):
                x = x * 2
        x = x / 3
        return x

m = M()
with fx_traceback.preserve_node_meta():
    ep = torch.export.export(m, (torch.randn(10),))

for node in ep.graph.nodes:
    if node.op == "call_function":
        print(f"{node.target}, {node.meta.get('custom', {})}")
```
prints
```
aten.add.Tensor, {'pp_stage': 0, 'fdsp_bucket': 0}
aten.sub.Tensor, {'pp_stage': 0}
aten.mul.Tensor, {'pp_stage': 0, 'cuda_stream': 2, 'fsdp_bucket': 1}
aten.div.Tensor, {}
```
TODOs:
- run_decomposition is failing
- Need to test with the new full graph capture + aot_export_joint apis
- Need to make the annotation propagate through autograd engine to reach the bw nodes. Sample impl here: https://github.com/pytorch/pytorch/pull/83558
- Edward want to restrict the key in custom field to be top-level singleton objects only
- also need to take care of metadata merging when passes are fusing nodes
Thanks @angelayi for contributing the dynamo fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163673
Approved by: https://github.com/albanD, https://github.com/angelayi
Previously, an eval() call before a training step() would not correctly initialize the backward pass of the pipeline stages, leading to errors during the subsequent training step. This PR ensures that the backward stages can still be initialized after an eval() call.
Fixes #162822
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162823
Approved by: https://github.com/dcci, https://github.com/H-Huang
## Overview
This PR allows the profiler users to access `Kineto` and `TorchOp` metadata in JSON string format through a new `metadata_json` attribute in `FunctionEvent` objects, which is triggered through a new `expose_kineto_event_metadata` flag in `ExperimentalConfig`.
## Testing
A unit test was added to validate functionality.
## Documentation
Added/updated function doc strings where appropriate.
## Example output
```python
import torch
from torch.profiler import profile
with profile(experimental_config=torch._C._profiler._ExperimentalConfig(expose_kineto_event_metadata=True)) as prof:
    res = torch.mm(torch.rand(1024, 1024), torch.rand(1024, 1024))

for event in prof.events():
    print(f'name: {event.key}, metadata: {event.metadata_json}')
```
```
name: aten::rand, metadata: "Ev Idx": 0
name: aten::empty, metadata: "Ev Idx": 1
name: aten::uniform_, metadata: "Ev Idx": 2
name: aten::rand, metadata: "Ev Idx": 3
name: aten::empty, metadata: "Ev Idx": 4
name: aten::uniform_, metadata: "Ev Idx": 5
name: aten::mm, metadata: "Ev Idx": 6
name: aten::resolve_conj, metadata: "Ev Idx": 7
name: aten::resolve_conj, metadata: "Ev Idx": 8
name: aten::resolve_conj, metadata: "Ev Idx": 9
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161624
Approved by: https://github.com/sraikund16
Current state: a shape-mismatch failure when mm+rs (reduce-scatter) is fused on the last mm scatter dim.
This adds a separate path to handle the last dim for aten.mm; scaled_mm should be handled similarly but needs an additional PR,
so the scaled_mm case is disabled via the filter-matmul function.
Also adds an inductor.config option for this change that is True by default, for fast debuggability of the new path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162794
Approved by: https://github.com/fegin
The version finding logic triggered from `setup.py` generally tries to take the git information into account.
This is fine for most situations where we are building from a checkout, but it creates a problem in the case of sdists, as here the version is determined at the time of sdist creation, taking the git information into account, but then later recalculated when building wheels or installing from the sdist, now with the git information missing.
The solution is to take the version information directly from the sdist, which this PR adds by means of parsing the `PKG-INFO` which marks an unpacked sdist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160315
Approved by: https://github.com/atalman
ghstack dependencies: #157814
1. Prevents unintended aliasing of `self._last_lr`/`get_last_lr(...)` with `group["lr"]` when `group["lr"]` is a tensor.
2. Prevents unintended aliasing of `LRScheduler.base_lrs` with the `group["initial_lr"]`s.
3. Updates `test/optim/test_lrscheduler.py` to test tensor LRs.
4. Changes type annotations for `_last_lr`, `get_last_lr()`, `base_lrs`, `get_lr()`, and `_get_closed_form_lr()` from `list[float]` to `list[float | Tensor]`; adds documentation.
Fixes #163103
LR schedulers can behave in unexpected ways when using a tensor LR due to patterns like this:
```python
self._last_lr: list[float] = [group["lr"] for group in self.optimizer.param_groups]
```
This PR adds a helper to address this:
```python
def _param_groups_val_list(optimizer: Optimizer, key: str) -> list[Any]:
    """Create a list containing group[key] for each optimizer param_group.

    Prevents aliasing when group[key] could be a Tensor.
    Raises a KeyError when group[key] does not exist.
    """
    return [
        group[key].clone() if isinstance(group[key], Tensor) else group[key]
        for group in optimizer.param_groups
    ]
```
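For illustration, the kind of aliasing the helper prevents, assuming a PyTorch version where optimizers accept a tensor LR:
```python
import torch
from torch import Tensor

params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=torch.tensor(0.1))

aliased = [group["lr"] for group in opt.param_groups]     # old pattern: aliases the tensor LR
cloned = [g["lr"].clone() if isinstance(g["lr"], Tensor) else g["lr"]
          for g in opt.param_groups]                      # what the helper above does

opt.param_groups[0]["lr"].fill_(0.01)
print(aliased[0].item())  # 0.01 -- silently tracked the in-place update
print(cloned[0].item())   # 0.1  -- unaffected
```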
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163120
Approved by: https://github.com/janeyx99
Fixes misleading warning messages when running on sm12x devices using binaries built with sm120.
PyTorch binary built with sm120 is compatible with e.g. sm121, so no need for the warning of incompatibility.
Also allow the 'matched_cuda_warn' message to show when e.g. the user is running a binary built with only sm90 on sm12x, so that the user would be prompted to get a build which supports e.g. sm120.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161299
Approved by: https://github.com/eqy, https://github.com/atalman
Fixes#160520
Summary:
When running Inductor with cpp_wrapper under a DeviceContext, non-tensor arguments were being wrapped with torch.tensor(arg) without specifying the device,
creating the tensor on the currently active device (like CUDA) and later fetching it back to the CPU via .item(), causing unnecessary host-device-host memory transfers.
This PR fixes the issue by explicitly creating scalar tensors on the CPU:
```
input_tensors = [
    arg if isinstance(arg, torch.Tensor) else torch.tensor(arg, device='cpu')
    for arg in args
]
```
impact: inductor, codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160584
Approved by: https://github.com/benjaminglass1, https://github.com/desertfire, https://github.com/mlazos, https://github.com/jeffdaily
Avoid `at::alias` in the `repeat` op implementation
## Summary
This PR removes the usage of `at::alias` in the implementation and just `permute`s + `reshape`s the tensor to fit the specs of the result.
This is a less hacky and more readable way of implementing the op.
All the new ops we are using are view-only ops, which do not introduce the overhead of changing the storage.
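A generic sketch of expressing `repeat` without `at::alias`, using only reshape/expand-style ops (an illustration of the idea rather than the exact ATen decomposition; the trailing reshape materializes the result much like `repeat` itself does):
```python
import torch


def repeat_via_views(x: torch.Tensor, repeats):
    repeats = tuple(repeats)
    assert len(repeats) >= x.dim()
    # Left-pad the shape with 1s so it lines up with `repeats`.
    x = x.reshape((1,) * (len(repeats) - x.dim()) + tuple(x.shape))
    # Insert a size-1 axis before every dim, broadcast it to the repeat count...
    interleaved = x.reshape([d for s in x.shape for d in (1, s)])
    expanded = interleaved.expand([d for r, s in zip(repeats, x.shape) for d in (r, s)])
    # ...then merge each (repeat, size) pair back into a single dim.
    return expanded.reshape([r * s for r, s in zip(repeats, x.shape)])


assert torch.equal(repeat_via_views(torch.arange(6).reshape(2, 3), (2, 2)),
                   torch.arange(6).reshape(2, 3).repeat(2, 2))
```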
## Who wants this
We are using `PrivateUse1` and an accelerator, but this request to avoid `at::alias` in any op should be general enough for any backend that uses XLA or that does not have explicit control over memory allocation on the devices.
## Why we/they need this
As we support TPU, we are overriding some ATen ops by binding them to PrivateUse1.
However, it is not recommended to override the `repeat` op directly as we saw the following in `RegistrationDeclaration.h`.
```
at::Tensor repeat(const at::Tensor & self, c10::SymIntArrayRef repeats); // {"schema": "aten::repeat(Tensor self, SymInt[] repeats) -> Tensor", "dispatch": "True", "default": "True"}
```
We had to reuse the existing implementation of `repeat` to decompose it into other ops.
However, we are unable to support the current implementation, which uses `at::alias`.
It has two tensors share the same storage, modifies one of them, and returns the other assuming it has changed too,
which doesn't work for us, as we do not have explicit control over the memory allocation of the tensors when using XLA/PJRT.
## Alternatives
We are open to alternative solutions that work for us if this PR is not in favor of the PyTorch community.
For example, we may just bind our version of `repeat` op implementation to both `PrivateUse` and `AutogradPrivateUse1`.
However, to my understanding, this would not work well with torch dynamo and `torch.compile`.
Would you mind guiding us on how to solve this?
Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163455
Approved by: https://github.com/Skylion007
Many extensions (including pybind helpers) call `Tensor.__dlpack__()` without a stream argument. Before #150217, `stream=None` behaved like “no cross-stream sync” and was safe inside CUDA Graph capture. After #150217, `stream=None` maps to the legacy default stream, adding a cross-stream wait that invalidates capture when running on a non-default stream.
See this example
```
import torch

s = torch.cuda.Stream()
x = torch.randn(8, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.stream(s):
    with torch.cuda.graph(g):
        _ = x + 1
        cap = x.__dlpack__()
        _ = torch.utils.dlpack.from_dlpack(cap)
```
This PR partially reverts #150217 so that stream=None defaults to no sync.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163242
Approved by: https://github.com/ngimel
Potential issues
* gpt-oss-20b is probably too big (I can't run on my devserver)
* Mistral requires HF authentication
* Mistral also takes a while to run the performance checks (need to wait for CI)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163565
Approved by: https://github.com/huydhn
# Problems
This PR fixes a few edge cases that the FX converter missed related to dynamic shapes.
1. Inductor graphs can sometimes take `sympy.Symbol` inputs. We have logic to convert these to FX placeholder nodes. However, this logic did not update the `self.expr_to_proxy` table mapping symbols to proxy nodes. (There was existing logic to do this for `ir.TensorBox` inputs, but not `sympy.Symbol`.) This caused sympy tracing to fail when these symbol inputs were used in other expressions.
2. We lacked codegen for `ShapeAsConstantBuffer`. This IR node is seen when the graph input or output is a scalar computed from dynamic shapes.
# Fixes
a. Update `self.expr_to_proxy` when generating placeholders for `sympy.Symbol` inputs. Change `SymbolBuffer.get_example` to convert the symbol to a `torch.SymInt`, so we can populate `meta["val"]` correctly and use the value in other computations.
b. Support `ShapeAsConstantBuffer` by tracing the sympy expression.
c. Move output generation inside the metadata hook, allowing us to populate `meta["val"]` for the nodes computing `ShapeAsConstantBuffer`.
# Test plan
Added several new CI tests:
1. `torch.cond` with dynamic shapes. This exposes both issues, as the predicate is a `ShapeAsConstantBuffer` and one of the subgraphs uses a symbol input, due to the closure. Also tests when the parent and subgraphs have different input shapes.
2. Output dynamic shape scalar. This tests `ShapeAsConstantBuffer` as an output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163596
Approved by: https://github.com/angelayi, https://github.com/jansel
Summary: Enables support for epilogue subtiling in the blackwell ws template. This requires the ability to call `store_output` twice in the same kernel and reuse the same tensor descriptor across allocations.
Test Plan:
Tested with test_max_autotune.py on a Blackwell server.
Rollback Plan:
Differential Revision: D82610077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163145
Approved by: https://github.com/eellison
There is only one substantive change: the branch on
`global_offset[shard_dim] <= local_offset[shard_dim]`
is removed because it is unnecessary: you can always treat the
first shard uniformly with the rest of the shards, because your
global offset is guaranteed to be zero in this case anyway.
I also switch the shard_size case to sym_ite, to make it possible
for LocalTensor to deal with the MPMD-ness here, but it's equivalent
to the old if-then-else.
I tried to rewrite the comments to be more clear what is going on
algorithmically here.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163344
Approved by: https://github.com/albanD, https://github.com/zpcore, https://github.com/tianyu-l
Fixes https://github.com/pytorch/pytorch/issues/162228
# Summary
The majority of our tests only compile flex-attention in isolation. This means that for fake tensor propagation the input primals and all captured buffers don't do any intermediate computation below autograd. As a result, they happen by chance to match the `requires_grad`-ness of the eager implementation and this check passes. However, if score_mod is the result of some other intermediate fake tensor prop, then it is not guaranteed to have accurate requires_grad-ness, which was happening here.
TL;DR: this was a belt-and-suspenders check that was actually harmful, and we should just let the joint graph handle creating the correct joint graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163677
Approved by: https://github.com/ydwu4
Summary: When generating Triton kernels in the compile-time autotune blocks, it is useful to generate source information as code comments. Previously we ignored these comments for autotune code blocks because the generated main output code contains the same information, but that doesn't work if the generated autotune code crashes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163600
Approved by: https://github.com/yushangdi
Summary: Restricts subprocess benchmarking to only `TritonTemplateCaller`, which is expected by the underlying `target` method. This triggered a bug with large K shapes because decompose_k produces a `SubgraphChoiceCaller`.
Test Plan:
mm autotuning with a large k and `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`
Rollback Plan:
Differential Revision: D82181924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162688
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
Summary:
We add parsing for lists of strings. This is needed for AOTInductor
profiling of the input information of Triton kernels.
Test Plan:
Included in commit.
test_profiler_op_event_kwargs_list_of_strings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163593
Approved by: https://github.com/sraikund16
As per comment in source code:
```
# If we are coalescing on xblock (not ReductionHint.INNER) and this is not a tiny kernel
# (not ReductionHint.OUTER_TINY), do not use persistent reduction if it induces tile
# quantization. Persistent reduction forces rblock == rnumel; if the bounds between lower
# and upper are large, for the lower values we will be masking off a large % of read/writes,
# when we could expand the coalescing xblock instead.
```
For the test case in question, this PR improves perf from 0.8573521325143717 -> 0.043151492193814305 because we were egregiously masking out rblock values (58/64 values).
Differential Revision: [D82853279](https://our.internmc.facebook.com/intern/diff/D82853279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163365
Approved by: https://github.com/shunting314, https://github.com/PaulZhang12, https://github.com/jansel, https://github.com/v0i0
Landing this instead of https://github.com/pytorch/pytorch/pull/162994.
Here is how I think the whole dynamo + frame construction logic works:
1) There is no way to create a frame object in Python land, as this is created at runtime by CPython. That's why aot_compile creates FrameInfo this way (kind of like simulating the runtime). I guess you could write your own very simple eval_frame.c where you can intercept the frame construction, but we probably don't want that.
2) When there is no wrapper (the old export or aot_compile), we first assign sources by iterating over f_locals, which contains both local args and closure variables (this is an implementation detail of CPython frame construction). That's why closure variables end up getting LocalSource names, as shown in this test case (f6ea41ead2/test/export/test_export.py (L1369)). Note that L["self"] here means we are referring to the local object self. The important thing to keep in mind is that this self is not actually the model self, but the outer self.
3) When we switch to the wrapper case, we end up trying to inline the original inner module. When doing so, we need to track all locals and closures for this inner module, as can be seen here (f6ea41ead2/torch/_dynamo/variables/functions.py (L463)). Here we are not looking into the inner frame's f_locals but just directly at the closures. I guess this is because we are one more frame up, so there is no access to the frame's f_locals at this point, and it is probably not a good idea to change dynamo's logic here. As a result, I get the following error message that is different from the old export:
"While exporting, we found certain side effects happened in the model.forward. Here are the list of potential sources you can double check: ["L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank", "L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank_dict", "L['self']._export_root.forward.__func__.__closure__[0].cell_contents"]"
My initial attempt at solving this was taking the inner closures and putting them into f_locals for the frame I am constructing, which turned out to be too complicated because we would need to muck around with bytecode instructions as well. So I am thinking we should just update the test to reflect the new names and follow up with a better post-processing step to produce better names.
Differential Revision: [D82582029](https://our.internmc.facebook.com/intern/diff/D82582029)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163107
Approved by: https://github.com/avikchaudhuri
Introduces a variant of size-hint multi-kernel, where for novel runtime shapes, instead of performing full benchmarking to determine the optimal kernel, selects one of many kernels pre-generated from multi-kernel hints, based off similarity b/w hint / runtime input & output shapes (L1 distance in log2 space).
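A minimal sketch of the selection heuristic, with the input/output shapes flattened into one tuple of dims (the real implementation's bookkeeping differs):
```python
import math


def pick_precompiled_kernel(runtime_dims, kernels_by_hint_dims):
    # Choose the pre-generated kernel whose tuning dims are closest to the
    # runtime dims under L1 distance in log2 space.
    def distance(hint_dims):
        return sum(abs(math.log2(h) - math.log2(r))
                   for h, r in zip(hint_dims, runtime_dims))

    best = min(kernels_by_hint_dims, key=distance)
    return kernels_by_hint_dims[best]
```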
Some caveats/changes:
- Size-hint multi-kernel now only kicks in if the kernel has dynamic shapes
- Pre-generation still only does a 1-d search over the specified hints, e.g. `matmul([s0, s1], [s1, s2])` with size-hints `[64, 256]` only generates 2 kernels - based on tuning shapes ([64, 64], [64, 64]) and ([256, 256], [256, 256]). Extending this to a reasonable n-d search (via a user API?) is left as future work.
Benchmarking results, compared to multi-kernel w/ full benchmarking (hints 64, 4096), and compiling with the ground truth hint:
<img width="1902" height="1222" alt="550541081_1088709150049684_6528797079439730237_n" src="https://github.com/user-attachments/assets/056cca48-c16a-4451-9b4a-fa13a7a058a9" />
Full benchmarking doing worse is extremely weird, but we did see similar spikes in #156628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163090
Approved by: https://github.com/bobrenjc93
Fixes #160547
### Summary:
Bug repro:
```
def test_namedtuple(self):
    from collections import namedtuple
    Point = namedtuple('Point', 'x y')

    class M(torch.nn.Module):
        def forward(self, x, y):
            return x + y

    inp = Point(torch.ones(3), torch.ones(3))
    print(M()(*inp))

    # errors
    ep = torch.export.export(M(), inp, strict=False)
    print(ep)

    # succeeds
    ep = torch.export.export(M(), inp, strict=True)
    print(ep)

    # workaround could be to convert namedtuple to a kwarg
    inp_kwargs = {field: getattr(inp, field) for field in inp._fields}
    ep = torch.export.export(M(), (), inp_kwargs)
    print(ep)
```
Fix:
namedtuple is a subclass of tuple,
but a namedtuple was not expected here.
So, this change handles the namedtuple case (a sketch of the kind of check involved is shown below).
I have added a 🧪 test case for this as well.
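A common way to detect a namedtuple instance, as an illustration of the kind of check involved (the actual export pytree handling may differ):
```python
def is_namedtuple_instance(obj) -> bool:
    # namedtuples are tuples that additionally carry _fields/_asdict.
    return (
        isinstance(obj, tuple)
        and hasattr(obj, "_fields")
        and hasattr(obj, "_asdict")
    )
```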
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162959
Approved by: https://github.com/angelayi
Co-authored-by: Angela Yi <angelayi@meta.com>
Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading.
Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing
Differential Revision: D83016250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587
Approved by: https://github.com/angelayi
Fixes part of #163314
In particular, this addresses **Bug 1: H=None Broadcasting Produces Incorrect Results**.
This fixes a shape bug when slicing BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Because they lose their shape, even though the mask_mod remains "interpretable", the kernel's stride math reads wrong offsets, and we get silent numerical mismatches compared to regular SDPA, especially with single-position decoding/H broadcasting.
The B=None, H=None works case is accidental: with singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1` and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn’t move the pointer and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with wrong strides which causes silent error
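A hypothetical repro sketch of the indexing pattern described above (shapes and device are illustrative, not taken from the PR's test):
```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

causal = lambda b, h, q_idx, kv_idx: q_idx >= kv_idx
mask = create_block_mask(causal, B=1, H=4, Q_LEN=256, KV_LEN=1024, device="cpu")

# Indexing the Q-tile axis with an int collapses that dimension, so
# kv_num_blocks / kv_indices lose their [B, H, Q_tiles, ...] layout:
collapsed = mask[:, :, 0]
print(collapsed.kv_num_blocks.shape)  # the Q_tiles axis is gone
```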
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426
Approved by: https://github.com/drisspg
Differential Revision: D82933509
over the weekend I realized that some of the cache implementation was a bit silly, and too constrained to be actually generic. for example, InMemoryCache[str, bytes] was odd since we'd probably want to be able to store more than just str keys with bytes values. so tldr; everything is now generic, with the one constraint being that Key and Value must both be pickle-able types. this makes things a lot simpler for us, since all caches can now be str -> bytes caches under the hood if we'd like, and Key/Value just get pickled on the way in and out.
with this change, there were also some improvements made to the testing; mainly better coverage, but now we also test each cache across every combination of Key/Value types to ensure that they will work with the types we might specify later
I also hardened some things here and there, for example we now use literal_eval (forgot who mentioned this on the first PR, but thank you for the suggestion!), and all errors coming from the caching will be wrapped in CacheError from now on (although we still raise from the original error context where possible)
putting this PR up now for feedback, in the process of generalizing the code I did remove the documentation since it was becoming outdated but I will add that back in after the PR is green
I have the next PR ready as well (implements a fresh cache context manager), will export once this lands
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163488
Approved by: https://github.com/aorenste, https://github.com/masnesral
## Why this PR?
I've tried to follow the guidance of the `OpenReg` [usage example](https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg) and found that the command for compiling `example.cpp` (`g++ -o out example/example.cpp -L ./build -lopenreg`) is not compatible with my `gcc` (v11.4).
Since I installed my `gcc` through `apt install build-essential`, which I think is a common way for developers to install `gcc`, I believe it's necessary to slightly modify the command by adding `-I ./` to explicitly indicate the header file search path.
## What I've changed?
- I added `-I ./` to correctly search for `./include/openreg.h`.
- I also added a `pwd` comment for better readability and removed unused imports in `example/example.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163235
Approved by: https://github.com/FFFrog, https://github.com/albanD
Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
This PR removes import tricks of `SHARDING_PRIORITIES` and `ShardingFilterIterDataPipe` from `torch.utils.data.datapipes.iter.grouping`. They are declared to be removed in PyTorch 2.1 but not.
Before change:
```
import torch.utils.data.datapipes.iter.grouping.SHARDING_PRIORITIES
import torch.utils.data.datapipes.iter.grouping.ShardingFilterIterDataPipe
```
works
After change:
there is an import error exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163438
Approved by: https://github.com/janeyx99
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.
Testing:
```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)
# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel, https://github.com/albanD
## Context
An example from Qwen2-7B
- This comes from running torch.compile with a sequence length that is
divisible by 8 (no padding needed). Call this `Run1`.
- We then run the compiled model with a different length that isn't
divisible by 8 (requires padding). Call this `Run2`.
- Then we'll see this error.
```
File "/var/tmp/torchinductor_nobody/2w/c2wby7ilxbna45xrtrrfjqpeutwouruviu2742ockunnd2bleeiz.py", line 1963, in call
buf24 = torch.ops.aten._scaled_dot_product_efficient_attention_backward.default(reinterpret_tensor(buf18, (s85, 3584 // s19, s48, 512 // (512 // s19)), (s48*(512 // (512 // s19))*(3584 // s19), 512 // (512 // s19), (512 // (512 // s19))*(3584 // s19), 1), 0), buf20, buf21, buf22, buf23, getitem, getitem_1, getitem_2, getitem_3, 0.0, [True, True, True, False], scale=0.08838834764831845)
File "torch/_ops.py", line 841, in __call__
return self._op(*args, **kwargs)
RuntimeError: attn_bias is not correctly aligned (strideM). attn_bias.stride(2) = 6102, and should be a multiple of 4.
```
- We only see the error because we did not recompile on `Run2`. Instead we ran the inputs on the same graph as `Run1`.
### A bit more on why.
Here we check whether to realize the unpadded buffer (unwrapped slice) which we want for `Run1` but not for `Run2`.
0897affcd5/torch/_inductor/lowering.py (L2687-L2694)
## Fix
`size_hint` doesn't install a guard, so the fix is to use the `guard_or*` helpers, which do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163083
Approved by: https://github.com/eellison
Summary:
Under certain circumstances it seems reasonable to return a callable directly, without a guard check, when the user uses aot_compile on a function with a single compilation result.
When there are multiple entries (aot_compile_module), we should start enabling the guard check to differentiate the different compiled functions.
Test Plan: CI
Differential Revision: D82904540
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163432
Approved by: https://github.com/dolpm
We are seeing crashes of the form
```
Traceback (most recent call last):
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 1487, in run
while self.step():
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 1348, in step
self.dispatch_table[inst.opcode](self, inst)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 2437, in LOAD_ATTR
self._load_attr(inst)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 2425, in _load_attr
result = BuiltinVariable(getattr).call_function(
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 1347, in call_function
return handler(tx, args, kwargs)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 967, in <lambda>
tx, [v.realize() for v in args], kwargs
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 967, in <listcomp>
tx, [v.realize() for v in args], kwargs
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/lazy.py", line 72, in realize
self._cache.realize()
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/lazy.py", line 33, in realize
self.vt = builder.VariableBuilder(tx, self.source)(self.value)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builder.py", line 445, in __call__
vt = self._wrap(value)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builder.py", line 1043, in _wrap
torch._dynamo.utils.store_user_object_weakref(value)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/utils.py", line 4694, in store_user_object_weakref
user_obj_id_to_weakref[obj_id] = weakref.ref(obj)
torch._dynamo.exc.InternalTorchDynamoError: TypeError: cannot create weak reference to 'torch.Event' object
```
This pull request makes us gracefully graph break instead of explicitly crashing.
I've added a test which reproduces the issue. There is a side discussion re: how torch.Event support ever worked here, since it appears you cannot take a weakref to a torch.Event.
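A minimal standalone repro of the underlying limitation, based on the traceback above (assumes a build where `torch.Event` can be constructed):
```python
import weakref
import torch

ev = torch.Event()
try:
    weakref.ref(ev)
except TypeError as e:
    # matches the failure in store_user_object_weakref above
    print(f"cannot create weakref: {e}")
```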
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163168
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
Summary:
otherwise, may hit
```
Exception: Expected all tensors to be on the same device, but got other is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__equal)
```
Test Plan: UTs
Reviewed By: yushangdi
Differential Revision: D82974062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163529
Approved by: https://github.com/yushangdi, https://github.com/Skylion007
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode.
Users commonly need to move sample inputs to the cuda device with a .to("cuda:0") or .to("cuda") call.
This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Before:
Moving module.to("cuda:0") under fake tensor mode would leave parameters on the `meta` device.
After:
Parameters are on "cuda:0".
Test Plan: buck2 run fbcode//caffe2/test:fake_tensor -- --r test_move_module
Reviewed By: mikaylagawarecki
Differential Revision: D80102876
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163433
Approved by: https://github.com/albanD
This change restricts the DLPack stride normalization to apply only to 1D tensors of shape (1,).
### Rationale
The previous implementation normalized the strides for any multi-dimensional tensor containing a dimension of size 1. While well-intentioned, this "over-normalization" discards critical memory layout information, causing issues for downstream consumers who rely on strides to infer alignment and contiguity.
For example:
* A row-major tensor with `shape=(1, 128)` and `stride=(128, 1)` would be incorrectly normalized to `stride=(1, 1)`.
* A column-major tensor with `shape=(1024, 1)` and `stride=(1, 1024)` would also be normalized to `stride=(1, 1)`.
This loss of stride information makes it impossible for consumers to detect the original memory layout (e.g., row-major vs. column-major) and breaks assumptions about memory alignment needed for optimized indexing or specialized hardware APIs like GPU TMA.
The original intent of the normalization was to handle the simple case of a 1D tensor with shape=(1,) and a non-standard stride. This fix reverts to that specific, non-problematic behavior, ensuring that multi-dimensional tensors retain their precise stride information during DLPack export.
### Related Issues
#163274
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163282
Approved by: https://github.com/eqy
A big pain point ppl have with custom ops is that they do not accept arbitrary input/outputs. In this PR we create the concept of an "OpaqueObject" which allows users to pass arbitrary python objects into custom operators.
Some still slightly annoying parts with this implementation:
- The schema of the operator is `__torch__.torch.classes.aten.OpaqueObject` instead of whatever python type
- `@torch.library.custom_op` doesn't work.. yet?
UX:
```python
from torch._library.opaque_object import make_opaque, get_payload

# your custom python class
class OpaqueQueue:
    def __init__(self, queue: list[torch.Tensor], init_tensor_: torch.Tensor) -> None:
        super().__init__()
        self.queue = queue
        self.init_tensor_ = init_tensor_

    def push(self, tensor: torch.Tensor) -> None:
        self.queue.append(tensor)

    def pop(self) -> torch.Tensor:
        if len(self.queue) > 0:
            return self.queue.pop(0)
        return self.init_tensor_

    def size(self) -> int:
        return len(self.queue)

queue = OpaqueQueue([], torch.zeros(3))
obj: torch._C.ScriptObject = make_opaque(queue)
# obj.payload stores a direct reference to this python queue object
self.assertEqual(get_payload(obj), queue)

# This is able to be passed through the dispatcher
torch.ops._TestOpaqueObject.queue_push(obj, torch.ones(3))
self.assertTrue(queue.size(), 1)
```
Authoring a custom op:
```python
lib = torch.library.Library("_TestOpaqueObject", "FRAGMENT")

torch.library.define(
    "_TestOpaqueObject::queue_push",
    "(__torch__.torch.classes.aten.OpaqueObject a, Tensor b) -> ()",
    tags=torch.Tag.pt2_compliant_tag,
    lib=lib,
)

@torch.library.impl("_TestOpaqueObject::queue_push", "CompositeExplicitAutograd", lib=lib)
def push_impl(q: torch._C.ScriptObject, b: torch.Tensor) -> None:
    # We can get the payload directly by get_payload(q)
    queue = get_payload(q)
    assert isinstance(queue, OpaqueQueue)
    queue.push(b)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162660
Approved by: https://github.com/zou3519
Summary:
Currently, we assume that refcount_ and weakcount_ are always stored in an 8-byte aligned address right next to each other. Based on this assumption, we load 8 bytes in intrusive_ptr::reset_ to check the values of both counts. However, that assumption is not part of the C++ language standard, so it's essentially undefined behavior.
This change eliminates that assumption by combining refcount_ and weakcount_ in a single 64-bit count and we use the lower 32 bits for refcount_ and upper 32 bits for the weakcount_.
In addition to eliminating the undefined behavior, the change also eliminates the read of weakcount_ after decrementing refcount_ in intrusive_ptr::reset_. This claws back lost performance introduced in https://github.com/pytorch/pytorch/pull/162784 for non-final refcount_ decrementing.
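As a small illustration of the packing scheme (a Python stand-in for the C++ atomic; the helper names are ours):
```python
REF_MASK = (1 << 32) - 1

def pack(refcount: int, weakcount: int) -> int:
    # weakcount_ in the upper 32 bits, refcount_ in the lower 32 bits
    return (weakcount << 32) | (refcount & REF_MASK)

def refcount(combined: int) -> int:
    return combined & REF_MASK

def weakcount(combined: int) -> int:
    return combined >> 32

combined = pack(1, 1)
# decrementing refcount_ is a single subtraction on the combined word, and the
# result already tells us the weakcount_ without a second load
combined -= 1
assert refcount(combined) == 0 and weakcount(combined) == 1
```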
Reviewed By: yfeldblum
Differential Revision: D82869192
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163394
Approved by: https://github.com/Skylion007
all details are in readme.md
Note: one thing i want to do soonest is to switch to graph representation instead of stack representation
for the fuzzed ops should make things easier as things get more complicated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163417
Approved by: https://github.com/bobrenjc93
Fixes #159855. This was not triggered in other tests since it took
more than one round of fusion to get to the problematic code
which prunes WeakDeps. The WeakDeps are important to inhibit
fusion of kernels that read/write data into mutated buffers
with different indexing.
We modify the code to a) always prune before fusion, rather
than after, which improves its coverage and makes our basic
vertical fusion tests surface this issue as well and b)
check whether the weak dep is fusable before eliminating it
(which basically means checking that the producing code and
the consuming code are sufficiently compatible).
The test that triggers this with change (a) is
test_fusing_write_into_disjoint_read, introduced in #118210.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162316
Approved by: https://github.com/eellison, https://github.com/mlazos, https://github.com/shunting314
### Issue
The previous `enable_triton` UI required the user-defined Triton kernel to have "nvshmem" in its name.
If users did not do so, the kernel would miss the NVSHMEM init and silently hit CUDA IMA.
The `@requires_nvshmem` decorator eliminates the above name requirement (and the `enable_triton` call).
### Usage:
```
@requires_nvshmem
@triton.jit
def foo(...):
    ...
foo[(1, 1)](...)
```
It also removes the need to pass `extern_lib` to `foo` (handled by the decorator now).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163423
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152, #163194
No need for unnecessary copy of std::vectors. This Tensor list is copied throughout the foreach paths and this code is on a hot path for torch optimizers. Auto move elision will not happen on the return statement since it's a subelement of a vector that needs to be copied out before the std::vector is dtor'd. This should reduce quite a few list copies along this path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163416
Approved by: https://github.com/ezyang
The big semantic change (and the reason for this port) is that we no longer monkeypatch Tensor with torchdim's special methods. The new algorithm for handling dispatch is that we first land in `__torch_function__` and we see if a special FCD implementation needs to be dispatch to first, and if there is nothing we fallback to the standard level strategy.
Because there is no longer C binding equivalent of classes, we've condensed _C.Dim and Dim together, and similar for Tensor. This resulted in some bugs as the Python API is sometimes different from the C API. I've attempted to disambiguate these but there may still be mistakes (many early bugs were due to this problem). Dim and DimEntry are especially painful as Dim must abide by Tensor equality semantics, but is pointer equality in C (DimEntry doesn't have this problem). Another difference between C/Python that is subtle is we no longer get implicit conversions from Dim to DimEntry, this also caused some bugs.
Much of the mechanical porting work was done by claude code. I have a separate PR that deletes functorch._C, but it was useful having dim.cpp to point claude at, so I haven't done it in this PR. From a reviewing perspective, I need to re-check that I didn't forget to port anything; one noticeably missing "small" thing is patched_dim_method. I am still in the process of carefully doing a side-by-side review of the ports; "simplifications" from claude code were also a major source of bugs.
There are two major feature gaps in the implementation:
- DelayedTensor and dot handling are not implemented yet. This should be reasonably easy, just need to do it. However, for the purposes of sharded propagation it is actually better not to reconstruct matmuls.
- Splitting dimensions with an index like `[x, y]` doesn't work. The problem is that `__getitem__` interprets this as advanced indexing and sends the list to torch.tensor to turn into a tensor, instead of being eligible for `__torch_function__`. I think I might need to hard code a special case for this or something?
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160236
Approved by: https://github.com/zdevito, https://github.com/albanD
Summary:
Today `fullgraph_capture` takes a frame, but clients usually take a callable (`nn.Module`, function, or method) and example inputs (args and kwargs) and then explicitly set up the frame to pass. This is boilerplate—and potentially tricky to get right—that can be hidden inside the API.
The original `fullgraph_capture` now becomes `_fullgraph_capture_frame`.
Test Plan:
existing tests
Rollback Plan:
Differential Revision: D82339400
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162849
Approved by: https://github.com/zhxchen17
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode.
Users commonly need to move sample inputs to the cuda device with a .to("cuda:0") call.
This diff supports this.
Notice that .to("cuda") doesn't work yet, as it queries the current device index by calling the CUDA API.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Test Plan:
buck2 run fbcode//caffe2/test:fake_tensor -- --r test_fake_gpu_no_init
Rollback Plan:
Differential Revision: D80101283
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160431
Approved by: https://github.com/henryoier, https://github.com/ezyang
Currently OutputGraphGuardsState is separated out as a serializable interface for OutputGraph, but some of the typing around it is incorrect in dynamo's guards.py and output_graph.py: more fields are used by code than claimed by OutputGraphGuardsState, and it works because either the full OutputGraph is passed in or the parts that use those fields are dead when OutputGraphGuardsState is passed in.
In this PR we try to further separate the necessary fields of OutputGraph that should be retained by a full graph capture mechanism, not just limited to dynamo (as it is currently) but also something like make_fx (in the future). Since these fields do not need to be serialized, the result is an intermediate "common" data structure that is between OutputGraphGuardsState and OutputGraph in the inheritance hierarchy.
Differential Revision: D81718791
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162211
Approved by: https://github.com/zhxchen17
Summary: We observe a case where the fwd graph has duplicated return nodes, which leads to errors due to fx renaming the node, so we add poi info into the node name.
Test Plan:
### unit test
```
CUDA_VISIBLE_DEVICES=3 buck2 test mode/opt -m ovr_config//triton:beta -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 //caffe2/test/functorch:test_aotdispatch -- test_quantize_activation_duplicate_nodes
```
Buck UI: https://www.internalfb.com/buck2/de5eccc6-4064-4214-843d-70b8e3829afe
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4503599937670844
Network: Up: 217KiB Down: 72KiB (reSessionID-73e5c269-4f4d-4a54-896a-79c077eea326)
Executing actions. Remaining 0/2 0.1s exec time total
Command: test. Finished 1 local
Time elapsed: 45.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
### E2E
before
f798417700
after
Differential Revision: D82844100
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163364
Approved by: https://github.com/Yuzhen11
# why
- extra kwargs are input/op dependent and not config dependent. We don't
plan to serialize/deserialize them, and so they need to be fed in
later before making the KTC, rather than when getting the config values
directly
# what
- move extra_kwargs into the KTC and get_ktc interface directly
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "_addmm"
```
Differential Revision: [D82871310](https://our.internmc.facebook.com/intern/diff/D82871310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163209
Approved by: https://github.com/nmacchioni
ghstack dependencies: #163305
# why
- this is not directly controlled by the config arg but rather by the
input and by the inductor wide setting
- it's always the same for every choice
- we want the config kwargs to be *programable* and this is not
programable in that sense but rather needs to use inductor config
# what
- move generating the ALLOW_TF32 kwarg in Triton templates into
get_extra_kwargs
# testing
with some annotations, this is now the kwargs and extra_kwargs on addmm
```
{'EVEN_K': True, 'USE_FAST_ACCUM': False, 'ACC_TYPE': 'tl.float32', 'num_stages': 1, 'num_warps': 2, 'BLOCK_M': 32, 'BLOCK_N': 32, 'BLOCK_K': 16, 'hint_override': None, 'GROUP_M': 8} # choice/config kwargs
{'ALLOW_TF32': True, 'epilogue_fn': <function addmm_epilogue.<locals>.epilogue at 0x7f64d54ff600>, 'epilogue_fn_hash': "['addmm_epilogue', torch.float32, 1, 1]", 'prefix_args': 1} # extra kwargs
```
they're both passed onto the template
Differential Revision: [D82871312](https://our.internmc.facebook.com/intern/diff/D82871312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163305
Approved by: https://github.com/nmacchioni
Fixes #158631
The docstring said data_source was a Dataset, but RandomSampler only needs something that implements __len__. This updates the docstring to use Sized instead, which matches the actual type used in the constructor.
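A small sanity check of that claim (the toy `NotADataset` class is ours, purely for illustration):
```python
from torch.utils.data import RandomSampler

class NotADataset:
    """Not a Dataset: it only implements __len__."""
    def __len__(self) -> int:
        return 10

sampler = RandomSampler(NotADataset())
print(sorted(sampler))  # [0, 1, ..., 9], drawn in random order by the sampler
```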
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158857
Approved by: https://github.com/divyanshk
# Feature
Support `torch.cond` in the FX converter. The generated FX IR is conceptually identical to what would come from `torch.export`:
- Submodules are stored as attributes and accessed via `getattr`.
- The conditional is represented as `torch.ops.higher_order.cond`, which takes in the subgraphs, a predicate and submodule inputs.
# Implementation overview
The FX backend generates code for subgraphs using the following steps:
1. When `codegen_conditional` is called in `WrapperFxCodegen`, we emit a `ConditionalLine`.
a. We also codegen the true/false subgraphs at this time, storing their subgms for later.
2. At the beginning of FX conversion, generate `get_attr` nodes accessing each subgraph. It's important to do this at the start, before registering the node metadata hook. This also matches the convention followed by torch.export.
3. When we see the `ConditionalLine` in the FX converter, we generate a corresponding `torch.ops.higher_order.cond`.
# Implementation details
This ended up being a substantial change, as wrapper codegen has some special logic for subgraphs.
Certain methods of `PythonWrapperCodegen` are overridden by `SubgraphPythonWrapperCodegen`. To apply these overrides, we use multiple inheritance with the registered subclass of `WrapperFxCodegen`.
Unlike most other wrapper codegen methods, which map 1:1 to Wrapper IR lines, subgraph codegen generates a number of wrapper lines including `EnterSubgraphLine` and `ExitSubgraphLine`, along with Python or C++ code calling the subgraph as a function. These lines are used for some backends' memory planning.
In contrast, FX IR typically represents a subgraph call as a single HOP node, or a `call_module` op. To account for this difference, this PR introduces a new wrapper IR line called `ConditionalLine`, which is only used by the FX backend. We override the `codegen_conditional` method to emit this line. This sidesteps having to port the existing subgraph codegen and associated memory planning to Wrapper IR. (In principle, it seems possible to adapt the existing backends to `ConditionalLine`, but it could be a larger refactor, since we'd also have to update the memory planning.)
Some of the lower-level subgraph codegen methods are still shared between the FX and Python backends, such as `generate_subgraph_common`. Those were easier to port to Wrapper IR.
This also required generalizing the way the FX converter handles graph inputs and outputs. Previously, it assumed the IO signature was the same as `V.graph.module`, but this is only true for the parent graph, and not subgraphs. Instead, we need to call `get_graph_inputs` and `get_graph_outputs` to populate the inputs and outputs for subgraphs.
# Test plan
This PR adds a couple of tests using torch.cond. Here's an example graph generated by one of them:
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%true_graph_0 : [num_users=1] = get_attr[target=true_graph_0]
%false_graph_0 : [num_users=1] = get_attr[target=false_graph_0]
%cond : [num_users=1] = call_function[target=torch.ops.higher_order.cond](args = (%arg0_1, %true_graph_0, %false_graph_0, (%arg1_1,)), kwargs = {})
%buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%cond, 0), kwargs = {})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 6, constant_args_idx: 6, grid: [(1, 1, 1)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 6, XBLOCK: 8}})
return buf1
```
It also removes an existing negative test which checked that a certain error was raised when subgraphs were encountered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163234
Approved by: https://github.com/angelayi, https://github.com/jansel
Summary:
This diff does a big refactor of PrecompileContext to make it considerably simpler: instead of being a CacheArtifactManager and managing a bunch of bytes, it simply stores two things: dynamo cache entries and backend cache entries. When asked, it stitches them together into PrecompileCacheEntries, which are stored by DynamoCache.
This structure then allows us to register DynamoCache to the regular Megacache API, instead of having two separate APIs that are confusing. It also lets us remove the autotune cache integration, since MegaCache API will automatically store autotune cache entries.
The intent here is that users who want to use caching precompile will simply be able to use torch.compiler.save_cache_artifacts as before, just with `torch.dynamo.config.caching_precompile` set to True. They can also directly interact with PrecompileContext if they wish to specifically only load Precompile entries, using PrecompileContext.create_cache_entries().
Saving single entries and such with DynamoCache still works normally.
Test Plan:
All existing unit tests pass.
Rollback Plan:
Differential Revision: D82380307
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162886
Approved by: https://github.com/zhxchen17
What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible, i.e. it takes a lot of arguments but later on ignores them, namely:
- `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
- With this change, `setup-win` action will create an environment called `py_tmp` with specific python version + intel-openmp (that is hard runtime requirement, but for some reason not packaged into the wheel nor marked as such)
- Copied test type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with 3.10 runtime (scipy version update) and just make rerun-tests compatible with the rest of the deps
I think in the long run, one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1 that currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
ghstack dependencies: #163339, #163341
Fixes #161014
This PR introduces a fix that is consistent with the existing exception handling. As outlined in issue #161014, there is an edge case where negative padding does not make the tensor size negative but still triggers the "size is negative" exception. The fix is simply changing the check to `new_dim >= 0` to allow the zero-size dim and let the operator return an empty tensor.
In this PR I have also added a test sample covering the edge case where negative padding reduces a dimension to zero. The sample only covers the `constant` padding type; I would like some feedback on whether the same sample is needed for the `reduce` type as well.
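A minimal illustration of the edge case, shown with `constant` padding (the concrete shapes are just for the sketch):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 2, 4)
# Negative padding that consumes the last dimension exactly: 4 - 2 - 2 == 0.
# With the fix this returns an empty tensor instead of raising.
y = F.pad(x, (-2, -2), mode="constant")
print(y.shape)  # torch.Size([1, 2, 0])
```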
This is my first PR to contribute to PyTorch and any help/feedback will be welcome! Thank you!
@malfet @manuelcandales @janeyx99 @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161639
Approved by: https://github.com/manuelcandales
The previous LOAF-after-fusion algorithm is not guaranteed to create more fusion opportunities even if loop reordering happens. I cannot find an example where LOAF reduces the amount of fusion, but here is an example where reordering loops does not add more fusions:
a1f7639922/test/inductor/test_loop_ordering.py (L612-L641)
Move LOAF to a separate final round of fusion so that we are guaranteed not to reduce the amount of fusion. Hopefully this also helps compilation time since LOAF kicks in when there are fewer nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162101, #162126
Previously in merge_loops, we had to construct LoopBody twice to make sure we could use the same symbol prefix as before. This PR changes it to create LoopBody only once by allowing the new LoopBody to use the same symbol prefix.
It looks like it's OK to have duplicate symbols in a sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```
UPDATE: add the same optimization for LoopBody.reorder_iter_loops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162101
Approved by: https://github.com/jansel, https://github.com/eellison
The issue cannot be reproduced using the original repro code provided in the issue description.
However, the underlying issue mentioned by the maintainer (missing functions in `builder.py` and `trace_rules.py`) was never addressed and can still be reproduced with this test case:
```python
import torch
from torch.nn.attention import _cur_sdpa_kernel_backends

@torch.compile(fullgraph=True)
def test_function_that_triggers_error():
    return _cur_sdpa_kernel_backends()

print("Calling torch.compile function...")
try:
    result = test_function_that_triggers_error()
    print(f"Success: {result}")
except Exception as e:
    print(f"ERROR: {e}")
    print(f"Error type: {type(e)}")
```
The original repro likely no longer triggers the issue due to code path changes in the SDPA implementation, while the direct call to `_cur_sdpa_kernel_backends()` exposes the underlying problem where certain torch._C functions returning non-Tensor values aren't properly handled by dynamo tracing.
I have implemented the changes by adding the missing functions to both `builder.py` and `trace_rules.py` to properly handle these cases during compilation.
@guilhermeleobas
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161169
Approved by: https://github.com/guilhermeleobas, https://github.com/StrongerXi
Problem:
Without MemPool it looks like nvshmem backend never deallocates memory.
Cause:
Handles in `symm_mems_` (a map) keeps reference to memory allocations.
Solution:
- Remove reference to allocation from handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162680
Approved by: https://github.com/ezyang
ghstack dependencies: #163298
As titled. Avoiding a potential hang when running dispatch and combine in subgroups.
The rest is just re-arrange of the tests to create a sub-group test class. (no substantial change)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163298
Approved by: https://github.com/fegin
Fixes#161010 by making `clone_meta` match the semantics of strides for eager mode.
This is:
* Case 1: Tensor is_non_overlapping_and_dense; in this case, stride should match input tensor stride
* Case 2: Otherwise, stride should be contiguous computed from input tensor using `compute_elementwise_output_strides`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163017
Approved by: https://github.com/williamwen42, https://github.com/xmfan
Co-authored-by: morrison-turnansky <mturnans@redhat.com>
Summary:
X-link: https://github.com/meta-pytorch/tritonbench/pull/432
Add a Blackwell-specific scaled persistent + TMA Triton template to Inductor. This diff builds on D82515450 by adding a new set of mixins which inherit the scaling epilogue and add scaled persistent + TMA kwargs to the template.
This diff also adds a benchmark for the scaled Blackwell persistent + TMA template to TritonBench `fp8_gemm`.
Note that this diff is a minimal extension to the above diff; rather than adding a new kernel for the scaled version, we opted to simply extend the epilogue to account for scaling. This template is accurate for per-tensor and per-row scaling but may require modifications for other scaling modes, such as deepseek-style scaling, which apply scaling prior to the GEMM computation.
In addition, note that epilogue subtiling is currently unsupported for both the scaled and non-scaled Blackwell templates, and functionality will be added in a subsequent diff.
Test Plan:
Verified that the scaled Blackwell template adds the scaling epilogue to the generated Triton kernel by inspecting the Inductor-generated Triton kernel.
Benchmarking command:
```
TRITON_PRINT_AUTOTUNING=1 TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor TRITON_CACHE_DIR=~/personal/cache_dir_triton TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op fp8_gemm --only torch_fp8_gemm,blackwell_pt2_fp8_gemm --metrics tflops,accuracy --input-loader=/home/jananisriram/personal/fp8_shapes_testing.json --scaling_rowwise --output="/home/jananisriram/personal/fp8_shapes_testing_results.csv" --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/fp8_shapes_testing.log
```
Rollback Plan:
Differential Revision: D82597111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163147
Approved by: https://github.com/njriasan
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1 # <--- if buf1 is not accessed in the future
```
If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay the `del` until both buf3 and buf4 are no longer used. Note that buf3 and buf4 hold references to the data, such that `del buf1` does not prevent their usage.
However, when there are mutating args, we don't see `del buf1` immediately.
```python
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```
<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />
Why? Because `buf3` is a MultiOutput with `buf1` as input and believes `buf1` (an output of FallbackKernel op1) has inputs that alias output.
72fedf0575/torch/_inductor/ir.py (L7976-L7982)
According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's outputs should NOT alias any of the inputs. This PR improves get_inputs_that_alias_output of FallbackKernel.
Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163227
Approved by: https://github.com/zou3519
PR #151360 added mx fp8 and fp4 support on ROCm.
1. However, on recent upstream, the scaling function in Blas.cpp along with test_matmul_cuda changes triggered failures.
This patch corrects the is_blockwise_1x32_scaling function.
2. Fixes the m, n, k dimensions for ROCm mx case.
3. Modify FP4E2M1FN_LARGEST_POW2 (largest power of 2 representable in `torch.float4_e2m1fn_x2`) to 2.
This resulted in a higher SQNR value for the mx fp4 test.
Testing result on gfx950 w/ ROCm 7.0:
`PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v`
Ran 452 tests in 22.698s, OK (111 passed).
This is the same as before (when PR #151360 was merged).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163127
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
It seems `TEST_CUDA` is set to true even for ROCm (MI200) jobs. This changes the `if TEST_CUDA` check to an else condition to avoid running symmetric memory UTs on MI200. For other non-ROCm archs, it should return true and the tests can be skipped using other skip decorators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163205
Approved by: https://github.com/ezyang
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary: Platform args was a buck1 concept that we decided to port over to buck2 in order to make the migration easier. However, platform args existing in the repo block some buck modernization, like modefile-free efforts, so we're trying to get rid of the usage.
Test Plan:
CI
Rollback Plan:
Differential Revision: D82470032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163086
Approved by: https://github.com/malfet, https://github.com/8Keep
Update the torch-xpu-ops commit to 24fab67b6e, includes:
- Clean up getDeviceIndexOfCurrentQueue
- Fix hardswish gradients corner case
- Fix xccl contiguous check
- Move checks from nonzero kernel to operator
- support high priority stream for xccl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163244
Approved by: https://github.com/EikanWang
Summary:
This diff does a few things:
- It refactors PrecompileContext to store DynamoCacheEntries directly on the context. This allows us at serialization time to check if the dynamo cache entry has all its backends ready for serialization, and if not, skip unnecessarily serializing it
- It also gives us the ability to print out a `debug` JSON, which contains a mapping for everything being serialized and deserialized.
Here's an example of what that JSON looks like:
```
{
  "artifacts": {
    "precompile_aot_autograd": [
      "__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
    ],
    "precompile_dynamo": [
      {
        "backend_ids": [
          "__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
        ],
        "fn_name": "TorchBenchmarkRunner.forward_and_backward_pass",
        "num_codes": "10",
        "python_version": "3.12.11+meta",
        "torch_version": "2.10.0a0+fb"
      }
    ]
  },
  "num_entries": 1
}
```
Test Plan:
Existing tests pass.
NanoGPT tlparse showing the new debug:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpeIsL5G/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Note that there aren't compile ids since we're logging this in PrecompileContext.serialize() for now, where there isn't a compile yet. I think this is fine for now, as no compile ID makes sense here. If anything, these kind of belong in a "Global" compile ID, which I will not implement in this PR.
Rollback Plan:
Differential Revision: D82232574
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162740
Approved by: https://github.com/zhxchen17
What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible, i.e. it takes a lot of arguments but later on ignores them, namely:
- `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
- With this change, `setup-win` action will create an environment called `py_tmp` with specific python version + intel-openmp (that is hard runtime requirement, but for some reason not packaged into the wheel nor marked as such)
- Introduced `CONDA_ROOT_DIR` env variable in `activate_miniconda3.bat` to avoid `%CONDA_PARENT_DIR%\Miniconda3` invocations throughout the codebase
- Copied test type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with 3.10 runtime (scipy version update) and just make rerun-tests compatible with the rest of the deps
I think in the long run, one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1 that currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
Summary:
Point people lowering to lite interpreter to the existence of ExecuTorch.
Added the typing deprecation and a warnings deprecation.
Test Plan: Try using it, see deprecation warning
Reviewed By: lucylq
Differential Revision: D82759566
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163289
Approved by: https://github.com/larryliu0820
Initial prototype for dynamic int inputs, allows users to run with `torch.compile(f)(DynamicInt(4))`, compiling dynamically and using the underlying hint at runtime.
Current behavior:
- Also works in eager (mostly by subclassing int), as scalar input to torch functions, or numpy/math/etc. For example, `x = DynamicInt(3); torch.randn(x); torch.add(y, z, alpha=x); np.arange(x)` all act as if x = 3.
- Behavior for arithmetic ops is to return new DynamicInts rather than static ints; `DynamicInt(3) * 2 = DynamicInt(6)`. This is via SymNode magic methods, but coverage might not be 100% - for example, I had to explicitly override floordiv to avoid int casting. This is not necessarily the case for non-magic method ops (e.g. `math.cos(x)`). The alternative here is to int cast on all operations, but I opted for this for dynamism propagation in non-compiled regions.
- Doesn't ban fullgraph=False; DynamicInt objects might be leaked back to the user, but I guess this is fine, because they can be casted to ints when needed?
- Dynamo only allocates one symbol per DynamicInt; specifying the same DynamicInt for multiple inputs leads to input deduplication, and a guard installed.
- We don't raise on int specialization (in allowlist/maybe_mark_dynamic style) - but an easy change if needed.
- DynamicInts as nn.Module attributes are handled.
- We don't guard on the DynamicInt id, e.g. users can do the following without recompiling (maybe we should guard?)
```python
x = DynamicInt(4)
f(x)
f(1)
f(DynamicInt(3)) # same as f(3)
```
Follow-up work:
- Specifying shape constraints, either at the int-level, e.g.
```python
DynamicInt(64, name="s0", constraints=["s0 % 32 == 0", "s0 <= 1024"])
```
or at the compilation level, e.g. something like
```python
s0 = DynamicInt(64, name="s0")
s1 = DynamicInt(128, name="s1")
with some_compiler_config.dynamic_int_constraints(["s1 == 2*s0", "s0 % 32 == 0"]):
    f(s0, s1)
```
This should subsume the need for specifying derived SymInts?
- SymFloat support - currently it seems backed floats are specialized by the tensorify float pass, and there's no handling in inductor.
- Propagating dynamism in tensor constructors, e.g. `x = DynamicInt(4); torch.randn(x)` could annotate `_dynamo_dynamic_indices`.
Differential Revision: D81698719
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162194
Approved by: https://github.com/bobrenjc93
The test used the wrong pointers to refer to remote addresses:
```
dst_ptr = out_hdl.buffer_ptrs[peer]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer`, because NVSHMEM APIs accept a local address as input and perform the translation internally. Without the correct signal address, the peer would keep waiting, causing a hang.
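For contrast, the corrected indexing:
```
dst_ptr = out_hdl.buffer_ptrs[rank]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[rank]
```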
Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163194
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152
Summary:
What: Enables CUDA support for concat linear int8_mm woq optimization pattern by:
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA
Why: Extend WOQ to more device types
Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```
Rollback Plan:
Differential Revision: D80884518
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161848
Approved by: https://github.com/jerryzh168
Note. This is a replica PR of #155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.
### Summary
🚀 This PR moves the prioritized text linker optimization from setup.py to cmake (and enables it by default on Linux aarch64 systems)
This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.
### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.
Note:
Due to ninja/cmake graph generation issues we cannot apply the linker file globally to all targets, so the targets must be manually defined. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, and torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.
Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160078
Approved by: https://github.com/seemethere
These decompositions take precedence over CIA decomps in fake tensor prop; as a result, we would hit this implementation for all where overloads, which is wrong in some cases. For the overloads that can't be implemented by this decomp, we just run the default CIA impl. Previously this didn't matter because in post-dispatch IR aten.where would have decomposed, but when a user tries to preserve aten.where this issue surfaces because fake tensor starts seeing aten.where.
Differential Revision: [D82604702](https://our.internmc.facebook.com/intern/diff/D82604702)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163138
Approved by: https://github.com/henryoier, https://github.com/ezyang
In the .dist-info/METADATA file, the version was not being written with the new sha.
On python <3.11 (I think), the glob `**` will only match directories, so change this to `*`, which I checked will match both files and directories on py3.9 and py3.13.
There's probably also a bunch of mismatches in RECORD, but that's a problem for later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163214
Approved by: https://github.com/huydhn
resolves https://github.com/pytorch/torchtitan/issues/1136
torchtitan uses a cached state dict for ft. reset_sharded_param should be idempotent if model.parameters() are already padded.
```
# pad DTensor._local_tensor
fsdp_model = fully_shard(model)
sd = fsdp_model.state_dict()
# reset_sharded_param should be a no-op in lazy_init
loss = fsdp_model(inp).sum()
```
This PR makes `reset_sharded_param` idempotent by checking the storage data ptr and returning early.
unit test
```
pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_cached_state_dict
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163130
Approved by: https://github.com/tianyu-l
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.
Testing:
```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)
# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. Verify replicate correctly handles post-optimizer events.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_post_optim_event
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162785
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656, #162658
Reland of #160532
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode. Users commonly need to move sample inputs to the cuda device with a .to("cuda:0") or .to("cuda") call. This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163187
Approved by: https://github.com/angelayi
Summary: Updates the remaining tests to ensure val_shapes is always passed a tuple and not a list. This enables adding an assert consistent with the other function arguments.
Test Plan:
Depends on CI.
Rollback Plan:
Differential Revision: D82383319
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162887
Approved by: https://github.com/NikhilAPatel
# why
- enable a clear interface for kernel templates to declare all their
instantiation parameters and any potential defaults
- simplify KernelTemplateChoice to just have a single params, and not kwargs and extra_kwargs
# what
- KernelTemplateParams interface
- placeholder implementation where we just pass through a dict
# testing
- existing ci tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162781
Approved by: https://github.com/jansel
Summary:
In edge cases, tracer_output can be left unset if a double exception is raised, which causes the following issue:
```
UnboundLocalError: local variable 'tracer_output' referenced before assignment
```
Default initialize this variable so that it's always present.
Test Plan:
CI
Rollback Plan:
Differential Revision: D82652815
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163169
Approved by: https://github.com/tugsbayasgalan
**Summary:** Prefetching tests validate that distributed training systems can correctly overlap communication and computation by pre-loading parameters or data before they're needed. This test ensures the prefetching mechanism doesn't break training correctness while potentially improving performance by reducing idle time where computation waits for communication to complete.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_explicit_prefetching
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162658
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656
Summary:
I am really skeptical about inductor sizevars creating an empty shape env when not provided with one.
I think we should fail there if the graph has dynamic shapes and no shape env is provided.
However, I wonder if there are actually use cases that depend on the shape env not being there?
Reasoning APIs depend on facts in the shape env and assume some stuff exists for specific symbols.
Test Plan:
Fixes the bug reported in https://www.internalfb.com/diff/D82337184.
Creating a simple e2e unit test is not trivial.
Rollback Plan:
Differential Revision: D82412384
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162927
Approved by: https://github.com/ezyang, https://github.com/eellison, https://github.com/jansel
1. The dispatch signatures defined in `core.extern_elementwise` call must match the C signature of the NVSHMEM functions, in particular the dtypes. Otherwise, there would be weird errors, such as IMA or hang. When matched, most of time the NVSHMEM device function will be inlined into the generated PTX. When not matched, it is represented as a function call in the PTX (not sure if it is the function call that goes wrong).
2. When calling the `core.extern` wrappers from the `triton.jit` kernels, the input must be cast to match the signatures defined in 1, e.g. via `nbytes.to(tl.int64)`. Otherwise, Triton will report a key error when searching for such kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163152
Approved by: https://github.com/ngimel
ghstack dependencies: #163025
**Summary:** Verifies that Replicate correctly handles the scenario where forward and backward passes are run through both the root module and a non-root module.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_non_root_forward_backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162654
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650
**Summary:** The parity tests train two identical models with the same inputs - one using a reference approach and one using the test approach (replicate) - then check that both models produce identical losses. This ensures the distributed training methods don't change the mathematical results compared to standard training.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group
2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group
3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162650
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636
Summary:
1. Generalized testing by auto-detecting Cache types and splitting testing by abstract base class
- Now checks that all Cache types are thread-safe
- Will fail tests if any new Cache is added and is untested (for example, any cache with non-str key or non-bytes value)
2. All Caches are thread-safe
- InMemoryCache was the only one not thread-safe, so added a lock for access
- Realized that to implement MultiCache we should just have this requirement.
- Also, OnDiskCache is now a functioning AsyncCache with a default base_dir using Python's tempfile.gettempdir, i.e. OnDiskCache is no longer an abstract cache class (see the sketch below)
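A hedged sketch of that on-disk default (method names and file layout are illustrative, not the exact pcache API):
```python
import os
import tempfile
from typing import Optional

class OnDiskCacheSketch:
    """str -> bytes cache persisted as files under a temp directory."""

    def __init__(self, base_dir: Optional[str] = None) -> None:
        self.base_dir = base_dir or os.path.join(tempfile.gettempdir(), "pcache")
        os.makedirs(self.base_dir, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.base_dir, key)

    def get(self, key: str) -> Optional[bytes]:
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def insert(self, key: str, value: bytes) -> None:
        with open(self._path(key), "wb") as f:
            f.write(value)
```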
Test Plan:
```
[nmacchioni@*** / ()]$ buck test fbcode//mode/opt caffe2/test/inductor:pcache
Tests finished: Pass 28. Fail 0. Fatal 0. Skip 0. Build failure 0
[nmacchioni@*** / ()|remote/fbcode/warm_gpu_od_stable...)]$
```
Rollback Plan:
Differential Revision: D82660240
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163173
Approved by: https://github.com/masnesral
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. This test is important as it verifies we can cast a replicated module to a different type after initialization, an important feature for enabling mixed precision.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_to_float64_after_init
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162636
Approved by: https://github.com/mori360
ghstack dependencies: #162631
We previously asked users to separate these because we didn't have any way of adding extern C declarations. Now we do, and we don't need this confusing flag anymore.
This is BC-breaking but fine for this API since it doesn't have major users yet. Please just put all your code in `kernel_source` moving forward.
## BC note
The header_code parameter has been removed from torch.cuda._compile_kernel. Previously, users could pass separate header code that would be prepended to the kernel source. Now, header code must be included directly in the kernel_source parameter.
Note this only affects torch.cuda._compile_kernel, which is a private API.
Example:
Before
```python
kernel = _compile_kernel(
    kernel_source="__global__ void my_kernel() { ... }",
    kernel_name="my_kernel",
    header_code="#define SCALE 2.0f\n__device__ float scale(float x) { return x * SCALE; }",
)
```
After
```python
kernel_source = """
#define SCALE 2.0f
__device__ float scale(float x) { return x * SCALE; }
__global__ void my_kernel() { ... }
"""
kernel = _compile_kernel(kernel_source, "my_kernel")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163165
Approved by: https://github.com/janeyx99, https://github.com/albanD
In Unified Runtime, we cannot have any fallback ops (for now). Not all conv1d ops can avoid fallbacks now, so we write a decomposition for it.
It's not registered in the default decomposition table, as currently only ExecuTorch/Unified Runtime needs it. But it might benefit Inductor as well, because conv2d can generate Triton kernels while there's no Triton codegen for conv1d. Since it's unclear whether the conv2d Triton kernel has better perf than aten::conv1d, it's not registered by default yet.
To register it, one just needs to do `import torch._decomp as decomp;decomp.register_decomposition(torch.ops.aten.conv1d.default, conv1d_to_conv2d)`
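For reference, a rough sketch of what such a decomposition can look like (an illustration, not the exact registered decomposition): run conv1d as a conv2d by inserting a unit spatial dimension.
```python
import torch
import torch.nn.functional as F


def conv1d_to_conv2d_sketch(x, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
    # x: (N, C_in, L) -> (N, C_in, 1, L); weight: (C_out, C_in/groups, K) -> (C_out, C_in/groups, 1, K)
    out = F.conv2d(
        x.unsqueeze(2),
        weight.unsqueeze(2),
        bias,
        stride=(1, stride),
        padding=(0, padding),
        dilation=(1, dilation),
        groups=groups,
    )
    return out.squeeze(2)  # (N, C_out, 1, L_out) -> (N, C_out, L_out)


x, w = torch.randn(2, 4, 16), torch.randn(8, 4, 3)
assert torch.allclose(conv1d_to_conv2d_sketch(x, w), F.conv1d(x, w), atol=1e-5)
```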
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163080
Approved by: https://github.com/angelayi
Fixes #163105
Note that the new `SWALR.load_state_dict` is **not backwards compatible**:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
    """Load the scheduler's state.

    Args:
        state_dict (dict): scheduler state. Should be an object returned
            from a call to :meth:`state_dict`.
    """
    self.__dict__.update(state_dict)
    self._set_anneal_func(self._anneal_strategy)
```
If we'd like to maintain compatibility with old state_dicts (loaded with `weights_only=False`), we could use something along these lines:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
    """Load the scheduler's state.

    Args:
        state_dict (dict): scheduler state. Should be an object returned
            from a call to :meth:`state_dict`.
    """
    anneal_func = state_dict.pop("anneal_func", None)
    strategy = state_dict.get("_anneal_strategy")
    self.__dict__.update(state_dict)
    if anneal_func is not None:
        state_dict["anneal_func"] = anneal_func
    if strategy is None:
        if anneal_func == self._linear_anneal:
            strategy = "linear"
        elif anneal_func == self._cosine_anneal:
            strategy = "cos"
        if strategy is None:
            strategy = getattr(self, "_anneal_strategy", "cos")
    self._set_anneal_func(strategy)
```
But given the fact that loading an `SWALR` state_dict before this PR would have caused an error, this seems okay. A GitHub/Google search for `SWALR.load_state_dict` had no results. Happy to change if not, or add a warning just in case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163122
Approved by: https://github.com/janeyx99
# why
- KTC might regenerate a ChoiceCaller, e.g. through FlexibleLayout
optimization. This in turn would delete any annotations
# what
- provide an annotations dict inside KTC
- forward that dict towards the ChoiceCaller's annotations
- ChoiceCaller users, e.g. in select_algorithm, now have access to the KTC
and can register handlers to record/make decisions based on the KTC
# testing
n/a
Differential Revision: [D82587631](https://our.internmc.facebook.com/intern/diff/D82587631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163117
Approved by: https://github.com/nmacchioni
Prevents edge cases in SequentialLR and ReduceLROnPlateau which could corrupt learning rates or trigger recompilation.
Supersedes #162360. Fixes #162359. Fixes #163093.
While putting #162360 together, I noticed the class of issue I was fixing (i.e. unintended aliasing in lr_schedulers when using Tensor lrs) appeared in several other places. @janeyx99 suggested I put together a follow-up PR.
There are several bugs resembling the one fixed in #162360. I added a helper to fix these:
```python
def _update_param_group_val(param_group: dict[str, Any], key: str, val: float | Tensor):
    """Set param_group[key] to val without aliasing or assignment when they're both tensors.

    Raises a KeyError if param_group[key] does not exist.
    """
    if isinstance(param_group[key], Tensor):
        param_group[key].fill_(_to_scalar(val))
    else:
        param_group[key] = val
```
And applied it to fix bugs in `SequentialLR.__init__` and `LRScheduler._update_lr`. I also added it to `CyclicLR.__init__` which was using an equivalent pattern, and `CosineAnnealingWarmRestarts.step` which *should* have had a similar issue:
```python
for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
param_group["lr"] = lr
```
But did not, because `get_lr()` actually returns tensors when using a tensor lr (despite its `list[float]` return type annotation). Relying on this propagation seems fragile, so I conservatively added the method here as well. I'll be fixing the type annotations and several related issues in followup PRs built off of this one.
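A small self-contained illustration of why filling in place matters; this re-declares a toy version of the helper above and stubs `_to_scalar` purely for the example:
```python
import torch
from torch import Tensor


def _to_scalar(v):
    return v.item() if isinstance(v, Tensor) else v


def _update_param_group_val(param_group, key, val):
    if isinstance(param_group[key], Tensor):
        param_group[key].fill_(_to_scalar(val))  # mutate in place, keep the same storage
    else:
        param_group[key] = val


# A param_group whose lr is a Tensor: filling in place keeps any existing
# references (e.g. captured by a compiled graph) pointing at the same tensor.
param_group = {"lr": torch.tensor(0.5)}
lr_ref = param_group["lr"]
_update_param_group_val(param_group, "lr", 0.25)
assert lr_ref.item() == 0.25  # the old reference sees the new value; no aliasing break
```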
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163098
Approved by: https://github.com/janeyx99
Summary:
What: Enables CUDA support for int8_mm woq optimization pattern by:
- Fixing dtype conversion in weight_int8pack_mm_kernel to match CPU
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA
Why: Extend WOQ to more device types
Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```
Rollback Plan:
Reviewed By: jerryzh168
Differential Revision: D80882442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161680
Approved by: https://github.com/jerryzh168
These specs are cached by reference, so by reusing and mutating them we overwrite the cached specs of another op. I'm just fixing these two; there are more instances, and we'll need to do a separate audit.
This fixes a few opinfo tests, but as a side note, `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_multi_head_attention_forward_cpu_float32` fails for me locally even on the base commit, and it is not marked as xfail
NOTE: I am renaming `_wrap_output_spec_tensor_meta` so that external libraries will loudly fail. You should migrate to the functional `_create_output_spec_with_new_tensor_meta` or create your own mutation wrapper and take responsibility for the cache! This should be improved in https://github.com/pytorch/pytorch/issues/162731
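A generic illustration of the cache-by-reference hazard described above (names are made up; this is not the DTensor code):
```python
_spec_cache: dict[str, dict] = {}


def get_output_spec(op_name: str) -> dict:
    # Specs are cached and handed back by reference, not copied.
    return _spec_cache.setdefault(op_name, {"tensor_meta": None})


spec = get_output_spec("op_a")
spec["tensor_meta"] = "meta_computed_while_handling_op_b"  # in-place mutation

# Every later lookup of "op_a" now silently sees op_b's metadata:
assert get_output_spec("op_a")["tensor_meta"] == "meta_computed_while_handling_op_b"
```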
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162702
Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #161458
When generating the triton template for convolution, we check `V.graph.sizevars.statically_known_equals(in_chan * groups, x.get_size()[1]) `. Note that in this check, we should consider the groups.
This check verifies, at compile time, that the total number of input channels expected by the convolution weights (in_chan * groups) exactly matches the number of channels in the input tensor (x.get_size()[1]).
This fix is good in general, as it allows the conv Triton template to be generated when `groups > 1`. It's also required for unified runtime to use AOTI as a backend delegate, because unified runtime is libtorch-free, so we cannot use the ATen fallback of conv2d.
```
python test/inductor/test_select_algorithm.py -k test_convolution2_group
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163094
Approved by: https://github.com/SherlockNoMad
Summary:
This diff fixes two things which come up when testing a tgif-published pt2 model remote net:
1) Updates isSameDevice to handle meta device to avoid this error:
```
what(): Unsupported device typemeta and meta
Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20
```
2) Updates the xl weight v2 loading logic in Weights.cpp to handle non-TBE xl weights. Today, we enforce that the device is the same for an old weight and a new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on the "meta" device and replace them with the correct weight (on a real device) from the xl_weights folder. Therefore, the new weight and old weight will always have different devices, and the device check is invalid. I don't think we've run into this so far because non-TBE xl weights have not been thoroughly tested until now.
Test Plan:
Run MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get error:
```
Unsupported device typemeta and meta
```
Then after change #1 and before change #2 we get:
```
what(): Mismatched device for merge.user_tower.linear.weight: meta vs cpu
Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374
```
After change run is successful
Command:
```
MODEL_ENTITY_ID=921242082
SNAPSHOT_ID=1269
module_name=merge
SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0" --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt
```
Rollback Plan:
Differential Revision: D80713052
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162842
Approved by: https://github.com/henryoier
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR works on some test files under test/distributed. We enable Intel GPU with the following methods, trying our best to keep the original code style:
- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl test
- enable XPU for some test paths
- Change the hardcoded world_size according to device_count.
- Unify some common code under torch/testing/_internal for multiple backends, for example:
Added xpu for Backend.backend_capability and dist.Backend.register_backend()
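A minimal sketch of the accelerator-driven selection described in the list above (the backend mapping here is illustrative, not the exact test code):
```python
import torch

acc = torch.accelerator.current_accelerator()
device_type = acc.type if acc is not None else "cpu"
# nccl for CUDA, xccl for XPU, gloo as the CPU fallback (illustrative mapping)
backend = {"cuda": "nccl", "xpu": "xccl"}.get(device_type, "gloo")
world_size = torch.accelerator.device_count() if acc is not None else 1
print(device_type, backend, world_size)
```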
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Reland of #160532
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode.
User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call.
This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn
# Problem
When dynamic shapes are passed to AOTInductor, they usually have a very basic form like `(s0, 5, 27)`. In these cases it's straightforward to generate code defining the symbol `s0` as a specific dimension of the input tensor. However, AOTI can handle slightly more generic expressions than this, such as `(2 * s0, 5, 27)`. In these cases, we don't immediately know the value of `s0`, but we need to solve for it, since it may be referenced in other parts of the program such as kernel call arguments, launch grids, etc.
# Feature
This PR adds support for more generic dynamic input expressions in the FX backend, following the implementation already present in AOTI's C++ backend:
1. Check if the expression contains *one* undefined symbol, as multiple variables would make the equation underdetermined. Let's call this `s0`. (We could potentially generalize this, but this PR focuses on cases AOTI can already handle.)
2. Generate a new symbol for the relevant size or stride of the input tensor. Let's call this `size`. This is computed with FX nodes just as a normal symbol would be.
3. Use sympy to solve for `s0` in terms of `size`. Let's call the resulting expression `solution`.
4. Since we know `s0` is an integer, `solution == floor(solution)`. Take the floor and then convert division to `FloorDiv`. This is required to trace through the expression, since the return value of regular division is not guaranteed to be an integer. (A minimal sketch of steps 3-4 follows this list.)
5. Generate FX for the modified `solution`, which defines the value `s0`.
6. Override the relevant method of `PythonWrapperCodegen` to a no-op, since the FX converter handles the above on its own.
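A minimal sketch of steps 3-4 in plain sympy (symbol names are illustrative; the converter additionally rewrites the floor of a division into Inductor's `FloorDiv` so it can be traced):
```python
import sympy

s0 = sympy.Symbol("s0", integer=True, positive=True)      # undefined input symbol
size = sympy.Symbol("size", integer=True, positive=True)  # measured input dimension

expr = 2 * s0 + 1                                     # dynamic input expression
solution = sympy.solve(sympy.Eq(expr, size), s0)[0]   # -> size/2 - 1/2
solution = sympy.floor(solution)                      # s0 is an integer, so take the floor
print(solution)                                       # floor(size/2 - 1/2), i.e. (size - 1) // 2
```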
# Test plan
In addition to the existing dynamic shapes tests, this PR adds new test cases where the input shape contains a non-trivial expression. This dynamic input dimension is then multiplied by other dimensions to form the argument to a `reshape`.
Here's an example graph from one of the CI tests. In this case, the input expression was `2*x + 1`, and the solution is `x = (sym_size_int - 1) / 2`:
```
graph():
%arg0_1 : [num_users=2] = placeholder[target=arg0_1]
%sym_size_int : [num_users=1] = call_function[target=torch.ops.aten.sym_size.int](args = (%arg0_1, 0), kwargs = {})
%sym_sum : [num_users=1] = call_function[target=torch.sym_sum](args = ([-1, %sym_size_int],), kwargs = {})
%floordiv : [num_users=1] = call_function[target=operator.floordiv](args = (%sym_sum, 2), kwargs = {})
%mul : [num_users=2] = call_function[target=operator.mul](args = (8, %floordiv), kwargs = {})
%sym_sum_1 : [num_users=2] = call_function[target=torch.sym_sum](args = ([4, %mul],), kwargs = {})
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([%sym_sum_1], [1]), kwargs = {dtype: torch.float32, device: cuda:0})
%sym_sum_2 : [num_users=1] = call_function[target=torch.sym_sum](args = ([35, %mul],), kwargs = {})
%floordiv_1 : [num_users=1] = call_function[target=operator.floordiv](args = (%sym_sum_2, 32), kwargs = {})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(%floordiv_1, 1, 1)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, out_ptr0: %buf0, xnumel: %sym_sum_1, XBLOCK: 32}})
return buf0
```
The `sym_size_int` node returns the first dimension of the input tensor. Next, `floordiv` computes the input symbol in terms of the input size. Then, the launch grid is computed by `floordiv_1`, the kernel argument by `sym_sum_1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163044
Approved by: https://github.com/jansel
Note that this works only in a limited case, where you *don't* change the seed, but change only the offset of the philox generator. This captures the main use case we're interested in: Rewinding the RNG to a previous state. This is done by torch.utils.checkpoint.checkpoint in particular.
Calls to increase() change only the offset, not the seed. Thus, we allow for "no-op" calls to set_seed where the new seed is the same as the old seed. If a user does happen to try to change the seed during stream capture, they will receive an error.
Fixes #162504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162505
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/eellison, https://github.com/eee4017, https://github.com/cyyever
Summary:
Internally, we are building PyTorch on the compat layer.
We need to avoid compiling SVE's box-cox, as SVE is not marked as a build target.
Rollback Plan:
Reviewed By: rraometa, YifanYuan3
Differential Revision: D82544412
Privacy Context Container: L1208939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163078
Approved by: https://github.com/Skylion007, https://github.com/malfet
**Summary**
Tests parameter state management after forward and backward passes for single and multiple replicate groups
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_param_registration_after_forward
2. pytest test/distributed/_composable/test_replicate_training.py -k test_param_registration_after_backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162631
Approved by: https://github.com/mori360
# why
- enable ChoiceCaller generation to provide extra information that
feedback_saver_fns (functions registered to run at the end of
benchmarking) can use afterwards
- users that extend ChoiceCaller creation e.g. by creating their own
InductorChoices can use this to shuttle through information
# what
- add an annotations dictionary to ChoiceCaller class
# testing
n/a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162672
Approved by: https://github.com/nmacchioni
PyTorch tensor iteration (.view, contiguous, broadcasting) and NumPy array indexing all follow lexicographic (row-major) order. In lexicographic (lex) order on (i0, i1, …, i{k-1}), the leftmost index (largest stride) changes slowest and the rightmost index changes fastest; usually the last dim is contiguous.
However, the original pycute is entirely based on co-lex. After porting their code into PyTorch, with some cosmetic changes, we now make it lex so that we can use it for use cases like device mesh internal bookkeeping and other things as well.
Changes included in this PR:
1. We change all the ported APIs, including prefix_product (stride inference; renamed to suffix_product), idx2crd, crd2idx, coalesce, composition, complement, right_inverse and left_inverse, to make sure they work in the lex way (a small lex-order sketch follows this list).
2. Added more unit test cases for some of the APIs mentioned above, since the existing unit tests do not have full coverage.
3. One bug fix inside composition, which could lead to an infinite recursive call.
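A small standalone sketch of the lex (row-major) convention for two of the ported helpers (toy re-implementations, not the actual pycute port):
```python
def suffix_product(shape):
    # row-major strides: last dim contiguous, leftmost dim has the largest stride
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)


def crd2idx(coord, shape):
    return sum(c * s for c, s in zip(coord, suffix_product(shape)))


assert suffix_product((2, 3, 4)) == (12, 4, 1)
assert crd2idx((1, 2, 3), (2, 3, 4)) == 23
```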
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162690
Approved by: https://github.com/ezyang
ghstack dependencies: #162413, #162534, #162414
Entering a device context takes 30 us and exiting a device context takes 11 us. If all graph partitions and cudagraph-unsafe ops happen on the same device, we can share the device context.
## Trace
Use vLLM as an example. The first trace shows dynamo graph partition.
<img width="1338" height="453" alt="image" src="https://github.com/user-attachments/assets/b81815fd-cdcb-4024-846a-5b64164f8bac" />
The second trace shows inductor graph partition prior to this PR.
<img width="1331" height="270" alt="image" src="https://github.com/user-attachments/assets/8d98b127-2053-4eae-9a31-5491661f14d8" />
Comparing with fx graph partition, we can see inductor graph partition shows extra overhead from enter/exit device contexts (13+6 us -> 30+11 us), but smaller runtime overhead (13 us -> 7 us). This motivates the PR to share default device context.
The third trace shows Inductor graph partition after this PR. We observe that the extra overhead from enter/exit device contexts have been fixed. At the same time, we observe the smaller runtime overhead.
<img width="1336" height="276" alt="image" src="https://github.com/user-attachments/assets/77be2237-34dd-4bac-ad9c-d9af3be36417" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162873
Approved by: https://github.com/shunting314
Summary:
The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update.
This test is only meant to know whether the AveragedModel copy has been initialized or not.
This diff introduces a CPU-based variable for that purpose.
When loading from checkpoint we also make sure the parameter is refreshed.
After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction).
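A toy sketch of the idea (hypothetical names, not the actual AveragedModel code): replace the device-synchronizing `bool(n_averaged == 0)` check with a CPU-side flag.
```python
import torch


def make_swa_state(params, device="cpu"):
    return {
        "avg": [p.detach().clone() for p in params],
        "n_averaged": torch.zeros((), dtype=torch.long, device=device),  # checkpointed counter
        "initialized": False,  # CPU-side flag: reading it never syncs with the device
    }


def update_parameters_sketch(state, params):
    if not state["initialized"]:                      # cheap CPU check
        for avg, p in zip(state["avg"], params):
            avg.copy_(p.detach())
        state["initialized"] = True
    else:
        n = state["n_averaged"]                       # stays on its device
        for avg, p in zip(state["avg"], params):
            avg.add_((p.detach() - avg) / (n + 1))    # running average
    state["n_averaged"] += 1                          # never read back on the CPU
```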
Test Plan:
contbuild & OSS CI
Test plan from GitHub:
CI
Rollback Plan:
Differential Revision: D78074709
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158017
Approved by: https://github.com/janeyx99
The test `test_binary_ufuncs.py::TestBinaryUfuncsCUDA::test_cuda_tensor_pow_scalar_tensor_cuda` fails with a mismatched `dtype`:
```Python
AssertionError: The values for attribute 'dtype' do not match: torch.float32 != torch.float64.
```
This PR forces both arguments to use the same `dtype` to fix the test failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163070
Approved by: https://github.com/eqy
Summary:
Add some metadata to CompileArtifacts so that it contains source code information about the original code while it is being traced.
For now, we will not provide a verification method to the end user; instead, we just provide which files are inlined. It's up to the user to verify that the content of these files has not changed (because it's optional for many users to validate source code changes anyway in aot precompile).
Test Plan:
buck run @mode/opt test/dynamo:test_dynamo -- -k test_file_change
buck run @mode/opt test/dynamo:test_dynamo -- -k test_aot_compile_source_info
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162983
Approved by: https://github.com/yushangdi
This test sets "NCCL_ALGO=NVLS" in NcclUserBufferRegistrationTest which affects tests run in the same process such as `test_on_completion_hook_*` that fail with
> invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.26.2
> ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
> Last error:
> Error : no algorithm/protocol available for function Broadcast with datatype ncclInt8. NCCL_ALGO was set to NVLS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163063
Approved by: https://github.com/ezyang
**Summary:** This test verifies that the replicate function automatically moves forward pass inputs to the correct device.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_root_move_forward_input_to_device
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162629
Approved by: https://github.com/mori360
**Summary:** In its current state, FSDP collectives use cuda synchronizations and communication ops regardless of the world size. However, now that replicate will use FSDP, there will be instances where group size = 1 and these synchronizations and ops will be used needlessly. I have updated fsdp_collectives to skip reduce_scatter in the foreach_reduce API when world_size = 1. I have edited a test that uses CommDebugMode to verify that the reduce_scatter has been removed. I also edited an affected test which used 1-way FSDP, verifying and changing its assert statements for CommDebugMode. I have also added a test command.
**Test Cases**
1. pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_single_worldsize1
2. pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_tp_with_fsdp_offloading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162021
Approved by: https://github.com/mori360
After the UT suite moved to `MultiProcContinuousTest`, `skipIfRocm` decorator started failing rather than skipping UTs because now we spawn multiple threads before the skip decorator is taken into account and the skip decorator was raising an exception to exit the process. But, the parent process treated the child process exiting as a crash rather than a skip. Additionally, in `MultiProcContinuousTest`, if one UT fails all subsequent ones are also skipped which makes sense since there's one setup for the entire suite. However, this showed up as many failing/skipped UTs in the parity.
I added multiprocess versions of the skip decorators for ROCm, including `skip_if_rocm_arch_multiprocess` and
`skip_if_rocm_ver_lessthan_multiprocess`. These are needed as the symmetric memory feature is only supported on MI300 onwards, so we need to skip the tests for other archs, and some UTs only work after ROCm 7.0.
Fixes #161249 Fixes #161187 Fixes #161078 Fixes #160989 Fixes #160881 Fixes #160768 Fixes #160716 Fixes #160665 Fixes #160621 Fixes #160549 Fixes #160506 Fixes #160445 Fixes #160347 Fixes #160203 Fixes #160177 Fixes #160049 Fixes #159921 Fixes #159764 Fixes #159643 Fixes #159499 Fixes #159397 Fixes #159396 Fixes #159347 Fixes #159067 Fixes #159066 Fixes #158916 Fixes #158760 Fixes #158759 Fixes #158422 Fixes #158138 Fixes #158136 Fixes #158135
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162811
Approved by: https://github.com/jeffdaily
This PR refactors AOTAutograd slightly:
- It adds `simple_wraps` to various wrappers so that the reference to inner functions is stored in the output of AOTAutograd.
- It saves a `serialize()` method on the result of `aot_stage2`, in the event of an eager backward compile.
I discussed the lazy backward case with @bdhirsh, and we agreed that serialization in that case would probably use a different, more AOT API anyway, so we do not implement a serialize function for the lazy backward case. AOT precompile, at least initially, will always eagerly compile the backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162527
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162171
# Feature
This PR supports lowering `IndexPutFallback` through Inductor's FX converter. The approach is very similar to the one taken in https://github.com/pytorch/pytorch/pull/162686.
Compared to `ScatterFallback`, this required one additional change: the value of `self.op_overload` for `IndexPutFallback` was inaccurate. Previously, it used `aten.index_put`, which would result in unsound FX IR. The existing Python/C++ codegen use `aten.index_put_`, since the fallback mutates its input. This PR changes `self.op_overload` to match that.
# Test plan
Added a CI test lowering deterministic index put via the FX converter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162863
Approved by: https://github.com/angelayi
Reported by compute-sanitizer; otherwise it looks like `block_y_reduce` and `block_x_reduce` both use `shared_memory` for temporaries without synchronization between them.
Reproduces in, e.g.,
`compute-sanitizer --tool=racecheck python test/test_matmul_cuda.py -k test_scaled_mm_vs_emulated_block_wise_float32_lhs_block_128_rhs_block_1_cuda` (note that this test requires an H100 to run unless the non-emulated (cuBLAS) implementation is commented out)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162995
Approved by: https://github.com/msaroufim
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@d8c3ee](d8c3eefc29), includes:
- Optimize adaptive average pool for channel-last memory format
- Add unregister wait_tensor
- Replace deprecated `[[intel::reqd_sub_group_size(SgSize)]]` with `[[sycl::reqd_sub_group_size(SIMD)]]` and remove unnecessary attributes
- Revert "Roll back to original usage of sycl::get_kernel_bundle"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162804
Approved by: https://github.com/EikanWang
Summary:
This change adds a new environment variable (`TORCHINDUCTOR_TRITON_DISABLE_DEVICE_DETECTION`) and a configuration in `torch._inductor.config` which can be set to `"1"` to allow a user to disable Triton's device detection logic in [torch/utils/_triton.py:has_triton()](c9e57d7e9f/torch/utils/_triton.py (L128)). This function is used at import scope in several places, but it has a side effect of initializing the mtia device if it is available, which is causing some of our autotuning workflows to crash.
Worth noting that, when enabled, this configuration disables all device detection, not just mtia, because the logic in has_triton will initialize the mtia device as a side effect even when checking for a cuda or other device via the [get_interface_for_device()](c9e57d7e9f/torch/_dynamo/device_interface.py (L570)) function.
I've tagged it `topic: not user facing` since I don't anticipate any users outside of Meta making use of this; however, this is my first PR here, so please indicate if it should be handled differently.
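A minimal usage sketch: the environment variable (value `"1"`, as described above) should be set before the relevant torch/inductor imports run.
```python
import os

os.environ["TORCHINDUCTOR_TRITON_DISABLE_DEVICE_DETECTION"] = "1"

import torch  # noqa: E402  (imported after the env var so has_triton() skips device detection)
```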
Test Plan: This has been tested in the context of internal workflows.
Differential Revision: D82347853
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162974
Approved by: https://github.com/xmfan
Summary:
Mismatch tail is used as a fixed variable, and in cases where there are more than 10 mismatches, FR gives up producing results (e.g. https://fburl.com/ai_infra/7gjl5ucb). This diff adds the mismatch tail to the parsed args to make it configurable.
Also, though the variable name is `mismatch_tail` (the last 10), it is used as `mismatch_head` (the first 10). Updated it to `num_mismatch_to_print`.
Test Plan:
`buck2 run @//mode/opt //caffe2/fb/flight_recorder:fr_trace -- --mast_job_id aps-ctx_fm_pipeline_change-1c8ea38a94 --mast_job_version 0 --mast_job_attempt 2 --bucket tlcm_log_blob --world_size 128 --dump_file_name_offset 0 --allow-incomplete-ranks --num_mismatch_to_print 20 1>out 2>err`
Confirm no error and output 20 mismatches.
Rollback Plan:
Differential Revision: D82335995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162991
Approved by: https://github.com/fduwjj
Summary:
Implemented caching abstractions: `Cache` and `AsyncCache`.
`Cache` provides an abstraction for defining simple key -> value stores with get and put functionality. We propose using `Cache` for implementations with very low (microseconds) overhead, for example an in-memory cache.
`AsyncCache` provides an abstraction for defining simple key -> value stores with asynchronous get and put functionality. We propose using `AsyncCache` for implementations with medium to high (> millisecond) overhead, for example an on-disk cache.
We provide an initial extension of `Cache` in the form of `InMemoryCache`. `InMemoryCache` provides fast, in-memory caching that can be later used to memoize more expensive cache accesses. `InMemoryCache` also provides a custom constructor `InMemoryCache.from_env_var` that can be used to pre-populate the in-memory cache, which will be helpful for enabling determinism in the future.
We also provides extensions of `AsyncCache`. `OnDiskCache` subclasses `AsyncCache` and serves as a generic on-disk caching implementation with atomic, write-once guarantees. `OnDiskCache` is semi-generic, allowing subclassing to alter the output directory. `InductorOnDiskCache` subclasses `OnDiskCache` to create an Inductor-specific on-disk cache that outputs to Inductor's default caching directory.
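A rough sketch of the two interfaces as described (hypothetical signatures, not the exact ones in the codebase):
```python
from abc import ABC, abstractmethod
from concurrent.futures import Executor, Future
from typing import Optional


class Cache(ABC):
    """Synchronous str -> bytes store; meant for microsecond-level overhead."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def put(self, key: str, value: bytes) -> bool:
        """Write-once: returns True only for the first successful put of a key."""


class AsyncCache(ABC):
    """str -> bytes store whose get/put run asynchronously (>= millisecond overhead)."""

    @abstractmethod
    def get(self, key: str, executor: Executor) -> Future: ...

    @abstractmethod
    def put(self, key: str, value: bytes, executor: Executor) -> Future: ...
```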
Test Plan:
`Cache` Tests:
1. Get -> Set -> Get
- Checks that `get(key)` returns `None` when `key` is not cached, and that after calling `put(key, value)` subsequent `get(key)` calls return `value`
2. Set -> Set
- Checks that with duplicated `set(key, value)` calls only the initial call is successful
3. From env var
- Checks that constructing an `InMemoryCache` from an environment variable works.
`AsyncCache` Tests:
1. Get -> Set -> Get
- Same as `Cache` test, but checks both with synchronous and asynchronous execution
2. Set -> Set
- Same as `Cache` test, but checks both with synchronous and asynchronous execution
3. Set -> Set Concurrent
- Checks that of two concurrent `set(key, value)` operations, only one passes
```
cd ~/fbsource/fbcode && buck test mode/opt //caffe2/test/inductor:pcache
```
Rollback Plan:
Differential Revision: D82269762
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162777
Approved by: https://github.com/masnesral, https://github.com/aorenste
#162978 identified an issue where distributed test failures were wrongly muted.
Per discussion with @malfet, one solution is to disable rerun of distributed tests in `run_test.py`.
The PR makes use of the `is_distributed_test` flag to identify those tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163025
Approved by: https://github.com/malfet
Summary:
Cleaning up the checkpoint background process can currently block the trainer thread indefinitely if the process is hanging (notably due to a Gloo pg init timeout).
This diff adds a 5s grace period for normal termination and sends SIGTERM if unable to shut down in that period.
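A minimal sketch of the shutdown behavior described above (hypothetical helper name; the real code manages the checkpointing process elsewhere):
```python
import multiprocessing as mp


def shutdown_checkpoint_process(proc: mp.Process, grace_period_s: float = 5.0) -> None:
    # Give the background process a grace period to exit normally, then
    # terminate it so a hung Gloo pg init cannot block the trainer thread.
    proc.join(timeout=grace_period_s)
    if proc.is_alive():
        proc.terminate()  # sends SIGTERM on POSIX
        proc.join()
```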
Rollback Plan:
Differential Revision: D82268979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162828
Approved by: https://github.com/meetv18
Note that this could change `suggest_memory_format` behaviour for unbacked symbols:
we used to sometimes return True from `are_strides_like_channels_last` even when the result was undecided;
now, when it is not decided, we return False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162354
Approved by: https://github.com/aorenste
This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows.
We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message.
I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).
# Alternatives
* We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity.
* If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to https://github.com/pytorch/pytorch/pull/157996. But the main downside here is the performance hit, so let's have an organized way of doing it first.
# Risks/Problems
* We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU.
* Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below).
# Benchmarks
* I ran the following benchmarks several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f
* Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431
# Change in generated PTX
This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu):
```
buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda
# then use the printed .so file like this:
~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so
```
## with printf
This is the version of the code that appears in this diff:
https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a
## without printf
I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with:
```
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
```
https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d
(Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)
Test Plan:
Running this minimal test case:
```
import torch
def main():
    x = torch.ones(10, dtype=torch.int64, device="cuda:0")
    torch.repeat_interleave(x, x, output_size=0)
```
Now we see the new message (from printf) alongside the assert failure:
```
$ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors
[...]
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10).
fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed.
[...]
```
Rollback Plan:
Reviewed By: mradmila
Differential Revision: D79310684
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160129
Approved by: https://github.com/ngimel
Summary:
This adds the Triton Tutorial Matmul persistent matmul with device side TMA for Blackwell and adds it as a template option for blackwell. This uses newer Triton features such as automatic warp specialization and loop flattening, which while still containing flaws can improve performance on blackwell. This does not include the Epilogue subtiling section, as that will be a followup PR.
This PR doesn't include any tuning. I am doing a larger benchmarking run to determine the best initial configs for tuning and will open a followup PR with better defaults soon.
Test Plan:
Tested on a Blackwell machine with test_max_autotune.py and confirmed the new tests pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162916
Approved by: https://github.com/NikhilAPatel
pytest summarizes test failures by printing a truncated first line of the text of the OUTERMOST wrapped exception.
Prior to this PR, it looked like this:
```
FAILED [0.0454s] test/distributed/tensor/test_dtensor_ops.py::TestLocalDTensorOpsCPU::test_dtensor_op_db_H_cpu_float32 - Exception: Caused by sample input at index 0: SampleInput(input=Tensor[size=(12, 12), device="cpu", dtype=torch.float32], args=(), kwargs={}, ...
```
I argue this is not so useful. If I have a lot of test failures, I look to the test summary to understand what /kind/ of errors I have, so I can assess which ones I should look at first. In other words, this is better:
```
FAILED [0.1387s] test/distributed/tensor/test_dtensor_ops.py::TestLocalDTensorOpsCPU::test_dtensor_op_db__softmax_backward_data_cpu_float32 - Exception: Tensor-likes are not close!
```
Now I know specifically this is a numerics problem!
This PR does it by prepending the old exception text to the wrapped exception. This is slightly redundant, as we are exception chaining, but it does the job. Open to bikeshedding.
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162961
Approved by: https://github.com/malfet
This PR fixes a bug in the implementation of `apply_triu_tril_single` where using extremely large values for the diagonal argument (e.g. `diagonal=9223372036854775807`) could result in integer overflow and incorrect results. The masking logic is re-written to avoid this issue by always iterating over all columns, ensuring correctness even for large or extreme diagonal values.
Example of the original incorrect behavior:
```python
a = torch.ones(5,5)
torch.triu(a, 9223372036854775807)
# Before:
# tensor([[0., 0., 0., 0., 0.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.]])
```
The new implementation guards against overflow and produces correct results for all valid input values.
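An illustrative sketch only (the real fix is in the C++ kernel): testing each element with `col - row >= diagonal` avoids computing `row + diagonal`, which is where an int64 overflow can occur for extreme diagonal values; treat the exact form of the rewrite as an assumption.
```python
def triu_mask(rows: int, cols: int, diagonal: int):
    # keep (i, j) iff j - i >= diagonal; j - i is always small, so no overflow
    return [[1.0 if (j - i) >= diagonal else 0.0 for j in range(cols)]
            for i in range(rows)]


# With an extreme diagonal the whole matrix is masked out, matching torch.triu:
assert all(v == 0.0 for row in triu_mask(5, 5, 2**63 - 1) for v in row)
```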
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153240
Approved by: https://github.com/albanD
Summary:
Currently the C10_CUDA_CHECK only shows source location in CUDAException like below:
```
Exception raised from c10_cuda_check_implementation at fbcode/caffe2/c10/cuda/CUDAException.cpp:44
```
which is not terribly useful.
By checking the original diff D39619861 that introduced c10_cuda_check_implementation, it seems the original macro would show the source location correctly but c10_cuda_check_implementation broke it.
This diff will propagate caller source location to c10_cuda_check_implementation to fix the issue.
Test Plan:
CI
Observed desired error message after the change:
```
CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Device-side assertion tracking was not enabled by user.
Exception raised from operator() at fbcode/sigrid/predictor/aed/AedContainer.cpp:659 (most recent call first):
```
Note the last line reports actual caller location.
Rollback Plan:
Reviewed By: Raymo111
Differential Revision: D81880552
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162808
Approved by: https://github.com/janeyx99
This PR adds a new interface _aot_compile to `OptimizedModule`, so that the following is possible:
```
mod = SimpleLinearModule()
inputs = [
    ModelInput(
        args=(torch.randn(3, 3),),
        kwargs={},
        contexts=[torch.no_grad(), eval_mode(model)],
    ),
    ModelInput(
        args=(torch.randn(3, 3),), kwargs={}, contexts=[train_mode(model)]
    ),
]
assert isinstance(model, torch._dynamo.eval_frame.OptimizedModule)
model._aot_compile(
    inputs,
)
```
After this PR, you can AOT precompile NanoGPT and use it to train directly. I'll share my fork of the repo to make this work.
## ModelInput
The `ModelInput` API is a work in progress; for now it represents a set of inputs and contexts to instruct the compiler to compile. Most commonly, this is "compile an eval mode with no grad, and a training mode with grad", but also contains things like autocasting contexts, etc.
## Dispatch
Dispatching is super simple here: we just iterate through all the precompiled fullgraphs and check guards for each one until there's one that passes. I'm a bit worried that having this in python code is going to be too expensive. The guard checks are happening in C++ anyway, though, so the only python-bottlenecked step here is just the for loop, so perhaps the overhead will not be high. I'll work on measuring this, though.
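A hypothetical sketch of that dispatch loop (names invented for illustration):
```python
def dispatch(precompiled_entries, *args, **kwargs):
    # Walk the precompiled fullgraphs; the first one whose guards accept the
    # inputs wins. The guard evaluation itself happens in C++.
    for entry in precompiled_entries:
        if entry.guard_check(*args, **kwargs):
            return entry.compiled_fn(*args, **kwargs)
    raise RuntimeError("no precompiled graph matched these inputs")
```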
## TODOs
This PR does not support `mod.compile()`, only `torch.compile(mod)`. In order to support `mod.compile()`, we'll need to update torch.nn.Module with an updated implementation — I can add that frontend later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162171
Approved by: https://github.com/zhxchen17
Summary:
The check is introduced in D82262053
- `scalar_value` could be a numpy object
- Move the check of `device.type` into `make_np` method where it happens only when it's a `torch.Tensor`.
Test Plan:
```
vizard launch -j 1x8 --launch=flow --config-path=pkg://vizard_projects.image_classification.configs --config-name=resnet50 ++flow.secure_group=ml_sensors ++flow.entitlement=ai_frameworks_pnb ++max_train_steps_per_epoch=10 ++max_epochs=5 ++log_every_n_steps=10 ++profiler=null ++max_eval_steps_per_epoch=10
```
Rollback Plan:
Differential Revision: D82383428
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162888
Approved by: https://github.com/xush6528
Per NVRTC doc - https://docs.nvidia.com/cuda/nvrtc/index.html#accessing-lowered-names, we can compile a templated kernel (e.g. `kernel<float>`) with the following steps
NVRTC side
- (new) `nvrtcAddNameExpression` -> C++ template e.g. `f<float>`
- `nvrtcCompileProgram`
- (new) `nvrtcGetLoweredName` -> get mangled name. need to do a copy since later this string is freed after NVRTC program is destroyed
- `nvrtcDestroyProgram`
CUDA side
- use mangled name instead of normal name -> profit
- `extern "C"` is not even needed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162875
Approved by: https://github.com/msaroufim
Summary:
Adds support for TMA store in all TMA matmul templates (notably persistent_tma, including addmm and scaled_mm). This works by requiring a template to be registered with `tma_store=True` and, when that is set, constructing indices/range_trees to hook into the existing code base's TMA store support.
This also includes a couple notable changes:
- Adds support in the TMA template support for checking the output layout.
- Adds support for "hoisting" the tensor descriptor to the top of the kernel. This will currently only be used by template code right now, but in principle it can be generalized to other implementation.
- Supports considering multiple indices as the "contiguous" index. This is handled with support for transposing the input data when the alignment is no longer consistent. In general, since the TMA support is derived from the index, it doesn't seem reasonable that the 1D index math forces a certain alignment depending on index ordering, so long as the layout matches.
Test Plan:
Tested with test_max_autotune.py unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160480
Approved by: https://github.com/NikhilAPatel
Summary:
When rewriting sympy expressions in the compiler codebase, we want to generate
FloorDiv(a, b) / CleanDiv(a, b) directly and not a//b, since the latter becomes floor(a*pow(b, -1)).
For symnodes we automatically handle that conversion in the symnode op dispatch.
I will follow up with an issue to track all other usages of //.
This blocks an internal model.
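A small sympy illustration of the difference; the `FloorDiv` import path is an assumption about where the helper lives:
```python
import sympy
from torch.utils._sympy.functions import FloorDiv  # assumed import path

a, b = sympy.symbols("a b", integer=True, positive=True)

# Plain // on sympy expressions lowers to floor(a * b**-1) ...
assert (a // b) == sympy.floor(a / b)

# ... whereas the structured form keeps an explicit FloorDiv(a, b) node:
expr = FloorDiv(a, b)
print(expr)
```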
Test Plan:
add test
run existing tests.
dakechen1993 testing on the model.
Rollback Plan:
Differential Revision: D82362241
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162869
Approved by: https://github.com/ezyang
Summary:
Please see D21021645 for details about the optimization and why it's beneficial.
A similar change has been added to libstdc++ as well, see dbf8bd3c2f
Rollback Plan:
Reviewed By: yfeldblum
Differential Revision: D81960754
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162784
Approved by: https://github.com/swolchok
Summary:
One of our builds fails because the return value of fread is discarded. Explicit cast to void fixes the build.
```log
In file included from fbcode/caffe2/torch/csrc/jit/mobile/import.cpp:15:
fbcode/caffe2/torch/csrc/jit/mobile/file_format.h:156:3: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result]
156 | fread(data.get(), size, 1, f);
| ^~~~~ ~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
...
BUILD FAILED
Failed to build 'fbcode//caffe2:libtorch (cfg:opt-linux-x86_64-clang19-no-san-opt-by-default#fef256f7ee896871)'
```
Test Plan:
No runtime behavior change. CI.
Rollback Plan:
Differential Revision: D82265002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162767
Approved by: https://github.com/Skylion007
Summary:
The SIMD path is using the SLEEF version of `pow`, which is slightly different from `std::pow`. The fix is to use the same vectorized code (with partial load and store) for the trailing data as well, to ensure consistency between results.
Rollback Plan:
Differential Revision: D82265247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162772
Approved by: https://github.com/swolchok
Investigated together with @pyemma and @taotaohuang001
## Problem
When calling the exported module with a dict nested in the args tuple, it fails with the following complaint:
```
Traceback (most recent call last):
File "/home/chzhu/infinitrain/test_torch_export.py", line 32, in <module>
print(exported_model({"a2": torch.randn(10), "a1": torch.randn(10)}))
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 848, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 424, in __call__
raise e
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 411, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1879, in _call_impl
return inner()
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1806, in inner
args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc]
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
return fn(*args, **kwargs)
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_unlift.py", line 81, in _check_input_constraints_pre_hook
flat_args_with_path = _check_inputs_match(args, kwargs, self._in_spec)
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_unlift.py", line 64, in _check_inputs_match
raise ValueError( # noqa: B904
ValueError: Trying to flatten user inputs with exported input tree spec:
TreeSpec(tuple, None, [TreeSpec(tuple, None, [TreeSpec(dict, ['a1', 'a2'], [*,
*])]),
TreeSpec(dict, [], [])])
but actually got inputs with tree spec of:
TreeSpec(tuple, None, [TreeSpec(tuple, None, [TreeSpec(dict, ['a2', 'a1'], [*,
*])]),
TreeSpec(dict, [], [])]).
Please check that the inputs have the same number and type of args and kwargs as the ones you used when tracing.
```
## How to reproduce the issue
```python
import torch
# create a nn.Module with data_batch as input and output as output
class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.linear = torch.nn.Linear(10, 1)

    def forward(self, data_batch):
        h1 = self.linear(data_batch["a1"])
        h2 = self.linear(data_batch["a2"])
        return h1 + h2

# torch export this module
model = MyModel()
example_args_forward = (
    {
        "a1": torch.randn(10),
        "a2": torch.randn(10),
    },
)
exported_model = torch.export.export(model, example_args_forward, strict=True)
# save the exported model
torch.export.save(exported_model, "exported_model.pt2")
# load the exported model
exported_model = torch.export.load("exported_model.pt2").module()
# run the exported model
print(exported_model({"a2": torch.randn(10), "a1": torch.randn(10)}))
```
## Root Cause
The input spec is encoded as a [TreeSpec](582d278983/torch/utils/_pytree.py (L1059)) in torch export, with (args, kwargs) at the top level. When we call the exported model, a pre-execution [hook](582d278983/torch/export/_unlift.py (L66)) checks that the input TreeSpec matches the received TreeSpec, and in a TreeSpec the dict key order is preserved. Something like
TreeSpec(dict, ['a2', 'a1'], [*,*])
To work around this, the input check reorders [kwargs](582d278983/torch/export/_unlift.py (L67)), which is why kwargs can be out of order. But a dict nested in the args is not re-ordered, so any re-ordering of its keys will throw errors.
## Solution
Update eq_spec to handle the dict case, where we only require that the key sets are the same, without ordering constraints (a rough sketch is shown below).
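A rough sketch of the relaxed dict comparison; the attribute names follow `torch.utils._pytree.TreeSpec`, but treat the exact shape of the check as an assumption:
```python
def dict_specs_match(spec_a, spec_b) -> bool:
    # For dict nodes, compare the key sets rather than the key order.
    # (A full implementation would additionally match children specs up by key.)
    return (
        spec_a.type is dict
        and spec_b.type is dict
        and set(spec_a.context) == set(spec_b.context)
        and len(spec_a.children_specs) == len(spec_b.children_specs)
    )
```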
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162618
Approved by: https://github.com/angelayi
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode.
User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call.
This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Test Plan:
CI
Rollback Plan:
Differential Revision: D80181887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160532
Approved by: https://github.com/henryoier, https://github.com/ezyang
Summary:
Sometimes checkpoint background process creation times out during gloo pg init.
Attempting to destroy the process during that time can block the trainer thread until the timeout completes.
This diff reduces the pg init timeout from 30m -> 10m to reduce the cleanup time.
Test Plan:
CI
Rollback Plan:
Differential Revision: D81724668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162760
Approved by: https://github.com/meetv18
Summary: Updates the NoValidChoicesError logic to include some additional context on whether no choices exist or no choices compiled.
Test Plan:
NFC. Depending on CI.
Rollback Plan:
Differential Revision: D82312035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162814
Approved by: https://github.com/mlazos
Currently, OpenReg supports Linux, Windows, and OS X, ensuring stability and ease of integration with third-party devices across all three platforms. It also doesn't rely on any other accelerators (such as CUDA or MPS).
Therefore, to minimize computational resource usage, `test_openreg` can be added to certain BLOCKLISTS to prevent its execution, limiting OpenReg's execution to only necessary scenarios.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161918
Approved by: https://github.com/albanD
ghstack dependencies: #161917
**Background:**
Almost all the tests in `test/test_openreg.py` are designed for `torch_openreg`, so placing these testcases in the test directory is not a good idea. Instead, they should be moved to the `tests` directory under `torch_openreg`, coordinating these tests with their corresponding functional logic.
**How to do:**
So how do we verify the quality of the third-party device integration mechanism?
We will maintain a `test_openreg` entrypoint in `test/run_test.py`.
This entrypoint will install `torch_openreg` and run all the testcases located in `torch_openreg`. As long as all testcases pass, we can guarantee that the out-of-tree backend integration mechanism is available.
**Next:**
We will also improve `torch_openreg's` test coverage in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161917
Approved by: https://github.com/albanD
Minor bug fix for the `nll_loss` test.
Before this PR, it ran `torch.randint(high=0)`, which fails because it tries to generate a number x with low <= x < high, i.e. x >= 0 and x < 0.
The test did not fail before because that line is not reached when testing on CPU; it fails earlier due to an unsupported dtype.
However, as we support TPUs at Google, this line is reached before the dtype check, which triggers the bug.
To my understanding, these OpInfos should be general enough to support different hardware.
Fixing this obvious bug makes them more general across different hardware.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162763
Approved by: https://github.com/soulitzer
# why
- now everything is in place to just gather templates and run
the V.choices.get_mm_configs once per op
- enables any overrides inside V.choices.get_mm_configs to
have a full view of the options for an op, not just for
one template
# what
- replace multiple calls to V.choices.get_mm_configs with
calls to gather the active templates, and then using those
in a single call
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520571](https://our.internmc.facebook.com/intern/diff/D81520571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161350
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161351
# why
- if we only use ExternKernelChoice we're not doing any codegen
- if we're not doing any codegen, we can use a FlexibleLayout
here, and provide deeper passes more chances to change it
# what
- if all the kernel template choices (KTC) are with a ExternKernelChoice
template, we switch to a FlexibleLayout before generating the choice
- add a test to make sure that works as intended (FlexibleLayout for
only extern, and FixedLayout if Triton is involved)
- caveats:
- because CPP, CUTLASS, and CK are not using
V.choices.get_mm_configs yet, we turn off the optimization
if either of those backends are in use. This will be relaxed
once they support this too
- because Triton templates are still using their own calls
(not a single call) to get_mm_configs, it's also turned
off there. The next diff unifies Triton + ATEN to a single
call to get_mm_configs and that in turn allows the optimization
there too
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520584](https://our.internmc.facebook.com/intern/diff/D81520584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161351
Approved by: https://github.com/eellison, https://github.com/jansel
The tests were comparing raw exported strings for protobuf comparison, which is not backward/forward compatible with different versions of protobuf.
This PR parses the strings into protobuf and compares the protobufs directly, similar to what we did in assertImageProto.
Our test failed because we used a different version of protobuf, which output 44100.0 instead of 44100, resulting in a string mismatch; the values are equal and differ only in the exported strings.
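A minimal sketch of the comparison pattern, assuming text-format `Summary` protos from the tensorboard compat package (the helper name is illustrative, not the exact test utility):
```python
from google.protobuf import text_format
from tensorboard.compat.proto.summary_pb2 import Summary

def assert_summary_proto_equal(test, actual_str, expected_str):
    actual, expected = Summary(), Summary()
    text_format.Parse(actual_str, actual)
    text_format.Parse(expected_str, expected)
    # proto message equality is field-wise, so 44100 and 44100.0 compare equal for a float field
    test.assertEqual(actual, expected)
```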
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162644
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
Fixes the case described below which occurs when:
- A user `torch.compile`s a function that uses a triton kernel.
- `TORCHINDUCTOR_DUMP_LAUNCH_PARAMS=1` is set.
Problem:
If the user-defined Triton kernel is not autotuned (a minimal runnable sketch, with an illustrative kernel body filled in):
```python
import os
os.environ["TORCHINDUCTOR_DUMP_LAUNCH_PARAMS"] = "1"
import torch
import triton
import triton.language as tl

# minimal illustrative kernel: BLOCK_SIZE is a constexpr and there is no triton.autotune
@triton.jit
def kernel(x_ptr, n, BLOCK_SIZE: tl.constexpr):
    offs = tl.program_id(0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    tl.store(x_ptr + offs, tl.load(x_ptr + offs, mask=offs < n) + 1, mask=offs < n)

@torch.compile
def fn(x):
    kernel[(triton.cdiv(x.numel(), 128),)](x, x.numel(), 128)  # BLOCK_SIZE passed as 128
    return x

fn(torch.randn(1024, device="cuda"))
```
Then in `triton_heuristics._interpret_args_grid`, the `filtered_signature` function:
```python
def filtered_signature() -> list[str]:
    # constexprs are not passed in as args
    return [
        x
        for x in self.triton_meta["signature"].keys()
        if x not in cfg.kwargs.keys()
    ]
```
because `triton.autotune` is not used on the `triton.jit` function, `cfg` above will be empty, and so `BLOCK_SIZE` is not removed from the signature even though it is a constexpr and is removed from the arguments that are passed to `_interpret_args_grid`. This results in a mismatch between the number of parameters in the signature and the number of arguments, which leads to the error `NameError: name '_grid_2' is not defined`.
Fix:
Use the Triton JIT kernel's `constexprs` to determine which args to remove. Not sure if this is a good fix, so suggestions are welcome.
Test plan:
Added a parameter to an existing triton kernel to test for this edge case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161924
Approved by: https://github.com/davidberard98
# Problem
Inductor has a `ScatterFallback` op with custom Python and C++ wrapper codegen macros. This is used in certain situations where the default Triton codegen doesn't apply, and especially for reductions which need to be deterministic. Since this op used direct Python/C++ codegen, it wasn't compatible with the FX backend.
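For context, a minimal sketch of the kind of op that routes through the scatter fallback; deterministic mode is one trigger for the fallback (the exact conditions inside Inductor may vary):
```python
import torch

# deterministic scatter-type reductions avoid the default (non-deterministic) codegen path
torch.use_deterministic_algorithms(True)

@torch.compile
def fn(out, index, src):
    return out.scatter_add(0, index, src)

out = torch.zeros(8)
print(fn(out, torch.tensor([0, 1, 1, 3]), torch.ones(4)))
```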
# Feature
This PR refactors the associated wrapper codegen to support `ScatterFallback`. This follows the same basic steps that were used for other fallback ops including `MultiOutput` and `ExternKernel`:
1. Create a new wrapper IR op called `ScatterFallbackLine`. Move the logic in `ScatterFallback.codegen` to `ScatterFallbackLine.codegen`, to prevent it from affecting the FX backend. This logic is unsafe for FX because it may generate Python or C++ strings with methods like `codegen_reference()`.
2. To eliminate the dependence on `V.graph`, move language-specific logic to the respective wrapper codegen subclasses. In this case, C++ codegen has some special logic, which is moved to `CppWrapperCpu`.
3. Create a new method in `FXWrapperCodegen` to handle `ScatterFallbackLine`.
# Test plan
Added a couple of CI tests for the FX backend with scatter fallbacks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162686
Approved by: https://github.com/jansel
Supports `torch.utils.cpp_extension.load_inline` on Windows with ROCm.
Tested on Windows with gfx1201.
Note that it currently only works when CC and CXX are set to `clang-cl`. This is also needed when building extensions via `setuptools`, due to linker errors when using `cl` directly.
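A minimal sketch of the kind of usage this enables (the extension body is illustrative; assumes CC/CXX point at `clang-cl` as noted above):
```python
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = """
torch::Tensor add_one(torch::Tensor x) { return x + 1; }
"""

mod = load_inline(name="add_one_ext", cpp_sources=cpp_source, functions=["add_one"])
print(mod.add_one(torch.ones(3)))
```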
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162577
Approved by: https://github.com/ezyang
We create a wrapper class acting as a layout for device mesh so that we can add new methods more specific to DeviceMesh and keep the core logic of CuTe manipulation inside pycute module. This PR create the main body of the code and then next PR will come with actual implementation and unit test for device mesh layout. (Actual implementation can be found in https://github.com/pytorch/pytorch/pull/161016)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162414
Approved by: https://github.com/ezyang
ghstack dependencies: #162413, #162534
Users can specify the following to get a libtorch_free `.so`.
"aot_inductor.use_libtorch": False,
The following config is only used for torchnative (see https://github.com/meta-pytorch/torchnative/pull/110). It's not intended to be used by executorch. The reason we need it for torchnative is that a lot of the symbol definitions in the torchnative repo are only in header files.
"aot_inductor.libtorch_free_header": "/data/users/shangdiy/torchnative/standalone,/data/users/shangdiy/torchnative/" (or their custom headers)
The main motivating use case is for executorch to produce a libtorch free `.so`.
TODO for follow-up PR: this flag should be consolidated with the `compile_standalone` flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162655
Approved by: https://github.com/angelayi
Excessive CUDAGraph re-recording leads to bad performance. We previously warned only once when this happens.
However, the limit (=50) is too high and users may just observe bad performance before actually seeing the warning message. Even worse, users may not see the warning message when there are many other logs. @anijain2305 reported that he never saw this warning message when using the transformers library, but he DOES observe slowdowns due to cudagraph re-recording and needs to turn cudagraphs off.
#162663 attempts to hard-error when re-recording too many times due to dynamic shapes, but it is a bc-breaking change; the hf-t5-generate model in torchbench failed due to 256 re-recordings.
This PR a) reduces the limit to a smaller value (=8); and b) makes the warning spammier, i.e., warns once for every distinct set of shapes once the limit is reached.
Fixes #162299
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162696
Approved by: https://github.com/mlazos
With the new change we only log the warning if we're running non-distributed code or if we're on rank 0. Unit-testing that certain messages get printed only on certain ranks feels kind of janky, so the test plan is below instead.
Test plan
```python
# torchrun --nproc_per_node=2 demo_fix.py
import os
import logging
logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)
import torch
if 'RANK' in os.environ:
    torch.distributed.init_process_group('nccl')
from torch.utils.cpp_extension import _get_cuda_arch_flags
_get_cuda_arch_flags()
print(f"Rank {os.environ.get('RANK', '0')} done")
```
Logs showing how the `TORCH_CUDA_ARCH_LIST` message only shows up once if we explicitly set the logging level to `logging.DEBUG`. The change also improves the debug message to explain what the actual behavior will be.
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
[rank0]:V0911 18:30:18.921000 1316753 pytorch/torch/utils/cpp_extension.py:2444] TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='10.0+PTX' for visible GPU architectures. Set os.environ['TORCH_CUDA_ARCH_LIST'] to override.
Rank 0 done
Rank 1 done
```
But if we just use the default and comment out `logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)`
Then we get
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
Rank 0 done
Rank 1 done
(source) [marksaroufim@devgpu005]~%
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162764
Approved by: https://github.com/ezyang, https://github.com/zou3519
# Implement OpenReg device autoload mechanism
## Overview
The **Autoload** mechanism in PyTorch simplifies the integration of third-party device backends by enabling automatic discovery and initialization at runtime. Traditionally, integrating a new backend required explicit imports or manual initialization, which could be cumbersome and error-prone. With Autoload, PyTorch dynamically detects and initializes device backends, providing a seamless user experience.
This mechanism leverages Python entry points (e.g., `torch.backends`) and dynamic module loading. When PyTorch starts, it scans for registered entry points and invokes their initialization hooks, ensuring that all available backends are ready for use without requiring explicit imports.
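A minimal sketch of how a backend package registers itself, assuming the `torch.backends` entry-point group mentioned above (the package and hook names are illustrative):
```python
# setup.py of a hypothetical out-of-tree backend
from setuptools import setup

setup(
    name="torch_openreg",
    entry_points={
        # PyTorch scans this group on `import torch` and calls the referenced hook,
        # so no explicit `import torch_openreg` is needed anymore
        "torch.backends": ["torch_openreg = torch_openreg:_autoload"],
    },
)
```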
## Motivation
This PR aims to apply [device autoload mechanism](https://github.com/pytorch/pytorch/issues/122468) to the OpenReg module with some simple changes.
## Change
### Before
```python
import torch
import torch_openreg
x = torch.tensor([1, 2, 3], device="openreg")
print(x)
```
### After
```python
import torch
# No need to import torch_openreg manually!
x = torch.tensor([1, 2, 3], device="openreg")
print(x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158555
Approved by: https://github.com/FFFrog, https://github.com/albanD
Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
If we detect that the compiled model is using CUDA in a meaningful way, we should store information about CUDA + hardware.
Example: `SystemInfo(python_version='3.12.9', torch_version='2.9.0a0+gite02b0e6', cuda_version='12.6', triton_version=(3, 4), gpu_name='NVIDIA PG509-210')`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162438
Approved by: https://github.com/zhxchen17
The biggest difference between the conda and Homebrew CPython builds and the one from python.org is that the latter are universal binaries, which always try to build universal extensions...
Work around the many universal-binary build attempts by explicitly specifying `_PYTHON_PLATFORM` and `--plat-name`, as well as `ARCH_FLAGS`.
Suppressed the actionlint warning on the use of the `freethreaded` flag, which is documented in https://github.com/actions/setup-python/tree/v5
TODO: Remove lots of temporary workarounds when `3.14` is out in October 2025
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162136
Approved by: https://github.com/atalman, https://github.com/huydhn
ghstack dependencies: #162297, #162265
This PR fixes the errors like below:
```
[rank3]: RuntimeError: The following operation failed in the TorchScript interpreter.
[rank3]: Traceback of TorchScript (most recent call last):
[rank3]: RuntimeError: /tmp/comgr-28f951/input/CompileSourceACC062:67:7: error: unknown type name 'uint32_t'; did you mean '__hip_internal::uint32_t'?
[rank3]: 67 | uint32_t int32;
[rank3]: | ^~~~~~~~
[rank3]: | __hip_internal::uint32_t
```
Earlier, `uint32_t` was defined in the `std` namespace in HIP headers. In ROCm 7.0 it was moved to the `__hip_internal` namespace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160587
Approved by: https://github.com/jeffdaily
I saw a failure where the reference error was 0.0, and the compiled error was 0.035. Although the failure still occurs with or without this change, it was confusing to see RMSE of 0.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162088
Approved by: https://github.com/drisspg
Raise a `ValueError` on init of `ParametrizationList` if `original.device != new.device`.
Currently, `_maybe_set` throws the error below in such situations, which is not convenient to debug.
```
[rank1]: RuntimeError: Attempted to set the storage of a tensor on device "cuda:1" to a storage on different device "cpu". This is no longer allowed; the devices must match.
```
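A minimal repro sketch (requires a CUDA device; the `ToCPU` parametrization is purely illustrative):
```python
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class ToCPU(nn.Module):
    def forward(self, w):
        return w.cuda()

    def right_inverse(self, w):
        return w.cpu()  # initial value lands on a different device than the original weight

lin = nn.Linear(2, 2).cuda()
# before: fails deep inside _maybe_set with the storage/device RuntimeError above
# after: raises a ValueError at registration time pointing at the device mismatch
parametrize.register_parametrization(lin, "weight", ToCPU())
```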
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162717
Approved by: https://github.com/lezcano
Summary: Restricts subprocess benchmarking to only `TritonTemplateCaller`, which is what the underlying `target` method expects. This surfaced as a bug with large K shapes because the decompose-k choice is a `SubgraphChoiceCaller`.
Test Plan:
mm autotuning with a large k and `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`
Rollback Plan:
Differential Revision: D82181924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162688
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
Summary:
When exhaustively autotuning a new template you may hit situations that lead to compilation failures. The template will still attempt to autotune because nothing was marking it as failed, and in my experiments this led to a crash/segfault if I didn't set `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`.
To help eliminate this issue this PR marks any template that fails to compile as "failed" and then removes all of the failed templates from the choice candidates. In the case where it would have just failed to compile twice, this should at least reduce compilation time.
Test Plan:
Tested locally when experimenting with the new Blackwell templates and a Triton version that contains a bug related to `num_warps < 4`.
Rollback Plan:
Differential Revision: D82172207
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162673
Approved by: https://github.com/PaulZhang12, https://github.com/mlazos
Summary:
Add new `scaled_mm` and `scaled_persistent_mm` configs to `template_heuristics.py` for Inductor FP8 Triton templates. These configs are a representative subset of the most performant configs generated from exhaustively autotuning FP8 Triton kernels with per-tensor and per-row scaling.
See this [spreadsheet](https://docs.google.com/spreadsheets/d/1Fal1vhFUJIUcLpM2kJect6IkgeUFvCY-nUr3RTupM_4/edit?gid=1732602731#gid=1732602731) for benchmarks and performance metrics.
Test Plan:
Verify that configs do not error, i.e.
```
CUDA_VISIBLE_DEVICES=0 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only pt2_fp8_gemm --metrics tflops,accuracy --input-loader={input_path} --output="{output_csv}" --atol=1e-2 --rtol=0.5 2>&1 | tee {log_file}
```
Rollback Plan:
Reviewed By: NikhilAPatel, PaulZhang12
Differential Revision: D81651226
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162699
Approved by: https://github.com/PaulZhang12
This change addresses confusing error messages users encounter when using the ONNX exporter with default settings. Previously, `fallback=True` was the default, which would attempt to fall back to the TorchScript exporter when the dynamo path failed, leading to mixed error messages that obscured the actual issues.
## Problem
When `fallback=True` by default:
- Users get confusing error messages mixing dynamo and TorchScript export failures
- Error messages tell users to provide the `f` argument unnecessarily
- Dynamo error messages get flushed with TorchScript errors when both paths fail
- Users expecting the dynamo path get unexpected fallback behavior
## Solution
Changed the default from `fallback=True` to `fallback=False` in both:
- `torch.onnx.export()` function
- `torch.onnx._internal.exporter._compat.export_compat()` function
## Impact
**Before:**
```python
# Would fallback to TorchScript on dynamo failure, causing mixed error messages
torch.onnx.export(model, args)
```
**After:**
```python
# Clean dynamo-only errors by default
torch.onnx.export(model, args)
# Advanced users can still opt-in to fallback behavior
torch.onnx.export(model, args, fallback=True)
```
Fixes#162697
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162726
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
This makes gemma3 exportable on transformers=4.55.4
In HF, there is a torch function mode called `TransformGetItemToIndex` which internally calls a custom autograd function. When this custom autograd function is called under vmap, it triggers `CustomFunctionHigherOrderOP`, which errored because there was no pre-dispatch proxy mode implementation.
Since there have been a number of requests lately to add various operators to pre-dispatch IR, I introduce a decorator in export that works similarly to `allow_in_graph`. Basically:
1) We intercept custom_autograd_function.apply at pre-dispatch mode when this decorator is applied
2) We apply `flat_apply` HOP to hide the pytree spec for this autograd function. Note that this adds restriction that this custom autograd function needs to take in fx-able types.
3) The subclass constructor decorator is implemented similarly, so we just refactor it to use a similar implementation to this new decorator; eventually we should delete the subclass constructor decorator.
4) Move some code in subclass constructor decorator to exit early in non-export environment which should shave off some inefficiency (around 1% according to @swolchok 's benchmark)
Fixes: https://github.com/pytorch/pytorch/issues/161563#issuecomment-3246309758
Differential Revision: [D82141316](https://our.internmc.facebook.com/intern/diff/D82141316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162240
Approved by: https://github.com/ydwu4
Summary:
Sometimes `ShapeEnv.create_symbol` can return a `sympy.Integer`. This messes up our phantom symbol infra for derived dims.
Fixes #161902
Test Plan:
added test based on repro
Rollback Plan:
Differential Revision: D81960709
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162416
Approved by: https://github.com/tugsbayasgalan
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.
However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.
This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead.
Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.
As a quick fix, since user-provided options really only exist for NCCL, just warn and fall-back to defaulted options for Gloo if non-gloo options are detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
fbgemm adds tbb as a dependency only for ROCm, to avoid missing tbb symbols at import. But the way it was done was to add the linker flag to CMAKE_CXX_FLAGS in setup.py, and it wasn't working for reasons unknown to me. What did work was adding tbb as a dependency in the cmake file. [We have a PR against upstream fbgemm](https://github.com/pytorch/FBGEMM/pull/4859) for that. Meanwhile, a much smaller patch is applied here in this PR until the fbgemm rocm ci commit hash is moved forward to include the tbb patch from upstream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162649
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Use `torch.accelerator` and `_get_device_module` instead of CUDA-specific APIs to make DataParallel more device-agnostic.
Fixes#162152
Recently, I've done some work to support my own privateuse1 backend in the DataParallel module, but I found that some CUDA-related APIs in parallel_apply.py forced me to monkey-patch DataParallel to support DP on my own backend.
So I made some small changes to replace `cuda.xxx` with `accelerator.xxx` and to acquire the device module via `_get_device_module`.
This is my first time contributing to PyTorch; please let me know if there is any problem with the change.
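A minimal sketch of the device-agnostic pattern (using the public `torch.get_device_module` in place of the internal helper; assumes an accelerator is present):
```python
import torch

acc = torch.accelerator.current_accelerator()  # e.g. cuda, xpu, or a privateuse1 device
device_mod = torch.get_device_module(acc)      # replaces a hardcoded torch.cuda

device_mod.set_device(0)                       # previously torch.cuda.set_device(0)
stream = device_mod.Stream()
with device_mod.stream(stream):
    x = torch.randn(4, device=acc)
```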
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162573
Approved by: https://github.com/ezyang, https://github.com/guangyey
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Summary: D79674759 tried to fix the expensive prepare and convert steps, as `assert_and_get_unique_device` was called multiple times. This change fixes that issue by using the `functools.cache` decorator.
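A sketch of the caching pattern, assuming a helper of roughly this shape (the real function lives in the AO quantization utilities and its signature may differ):
```python
import functools
import torch.nn as nn

@functools.cache  # nn.Module is hashable by identity, so repeated calls for the same module are free
def assert_and_get_unique_device(module: nn.Module):
    devices = {p.device for p in module.parameters()} | {b.device for b in module.buffers()}
    assert len(devices) <= 1, f"expected at most one device, got {devices}"
    return next(iter(devices)) if devices else None
```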
Test Plan:
Verified on llm export to QNN.
LLM Quantization prepare time of ~20min reduced to ~3min.
Rollback Plan:
Differential Revision: D82073679
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162550
Approved by: https://github.com/andrewor14
Summary:
Add `_package_executorch_files` to the archive APIs, allowing us to package a PTE file into the archive.
I don't think there's a use-case to have more than one PTE file at the moment, but left it as `EXECUTORCH_FILES` just in case.
Test Plan:
Tested in D81992612
Rollback Plan:
Differential Revision: D81977483
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162520
Approved by: https://github.com/angelayi
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR works on some test files under test/distributed. We enable Intel GPU with the following methods, trying our best to keep the original code style:
- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl test
- enabled XPU for some test path
- Change the hardcoded world_size according to device_count.
- Unify some common code under torch/testing/_internal for multiple backends, for example: added xpu to Backend.backend_capability and dist.Backend.register_backend().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Summary: Relax fences for intrusive ptr's refcnt dec op for performance testing.
`lock` needs an acquire fence when the op succeeds and relaxed ordering when it does not. In addition, the expire call and the following refcnt reads were merged to remove one extra read.
incref does not need any fences because the caller should already have a valid reference. use_count follows the same reasoning.
decref only needs a release fence to make sure every write op prior to it has finished. When the refcnt goes to zero, there should be an acquire fence to make sure no read op reads stale data before the object is destructed. However, microbenchmarks showed that the optimal fence for decref does not perform noticeably better than the current decref with acq-rel, so we keep decref as-is.
This change should have no material impact on x86, but for Arm64 (and other CPUs with weak memory models), it should boost performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162072
Approved by: https://github.com/swolchok, https://github.com/yfeldblum
## Summary
- pytorch is not built for *a variants of SM architectures, due to non-portability. However, we need fbgemm_gpu kernels built for sm100a (see #162209)
## Changes
- **Setting USE_FBGEMM_GENAI for CUDA builds**: fbgemm_gpu builds for sm100a if using CUDA 12.8 or 12.9 ([source](2033a0a08f/.github/scripts/nova_dir.bash (L29-L32))), so I follow the same rule here.
- **Extra nvcc flags**: if USE_FBGEMM_GENAI and USE_CUDA are set, we add extra nvcc flags for sm100a
## Test plan
Test build:
```
echo $CUDA_HOME
/usr/local/cuda-12.9
export TORCH_CUDA_ARCH_LIST=10.0
python -m pip install --no-build-isolation -v -e .
```
Check build logs:
```
CMake Warning at CMakeLists.txt:901 (message):
Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a
```
Run unit tests:
- `pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162544
Approved by: https://github.com/drisspg
Summary: Fix the edge case by allowing `call_function` nodes with no deps as graph entry (starter_nodes) in the splitter.
Test Plan:
The test shall pass in the current diff (after fix), and fail in the parent diff (before fix)
```
buck test mode/opt //glow/fb/fx/lowering:split_tests -- test_dataclass_as_graph_entry
```
Rollback Plan:
Differential Revision: D81232435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161716
Approved by: https://github.com/ezyang
Previously, DeviceInfo provided theoretical hardware information based on a hardcoded list manually created from various datasheets.
This update:
- Attempts to gather the information from a hardware library like `pynvml` (see the sketch below), improving accuracy and expanding support to devices that don't have entries in the datasheet list.
- Adjusts flops and bw calculations based on these hardware values. For example, if the memory or SMs are underclocked, it adjusts the theoretical max flops/bw accordingly.
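A sketch of the kind of query involved, using the standard `pynvml` bindings (the exact fields DeviceInfo reads may differ):
```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)
sm_clock_mhz = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)
mem_clock_mhz = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_MEM)
bus_width_bits = pynvml.nvmlDeviceGetMemoryBusWidth(handle)
pynvml.nvmlShutdown()

# rough peak-bandwidth estimate (GB/s) from live clocks instead of a datasheet entry
peak_bw_gbs = 2 * mem_clock_mhz * 1e6 * (bus_width_bits / 8) / 1e9
```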
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162245
Approved by: https://github.com/v0i0, https://github.com/shunting314
An internal user tried enabling combo kernels but ran into "Cannot convert symbols to int". This PR enables combo kernels on inputs with data-dependent shapes.
### Example exception
```
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 4997, in benchmark_combo_kernel
kernel_code_list = self.generate_combo_kernel_code(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/simd.py", line 1849, in generate_combo_kernel_code
src_code = kernel.codegen_kernel()
^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 802, in codegen_kernel
code.splice(self.codegen_kernel_benchmark(num_gb=0))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 852, in codegen_kernel_benchmark
var_names.extend(self.kernel_benchmark_extra_args())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 733, in kernel_benchmark_extra_args
extra_args.append(str(V.graph.sizevars.size_hint(tree.numel)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 584, in size_hint
return int(out)
^^^^^^^^
File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/site-packages/sympy/core/expr.py", line 307, in __int__
raise TypeError("Cannot convert symbols to int")
torch._inductor.exc.InductorError: TypeError: Cannot convert symbols to int
```
Differential Revision: [D82042230](https://our.internmc.facebook.com/intern/diff/D82042230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162442
Approved by: https://github.com/jansel
This PR is quite large in that it covers most of the rough edges in the new strict export flow:
1. Handle nn_module_stack correctly now that we are tracing wrapper module
2. module_call_spec needs to get queried from source directly because we are not running the bytecode anymore.
3. Correct input and output handling.
@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
Reopened from #158747 which got reverted since without setuptools-scm in pytorch index URL the wheel cannot be built
We reconsider the original PR idea of introducing CK as a pytorch dependency on ROCm Linux and install the CK python package in CI only -- since (1) rocm-composable-kernel depends on setuptools-scm which depends on tomli and the existing index URLs need to be modified to host the new packages and (2) there also is a packaging [bug](https://github.com/pypa/setuptools/issues/3269#issuecomment-1254507377) in Ubuntu 22.04 which prevents correct dynamic version calculation with default system pip.
Extras:
-> this PR reconsiders how TORCHINDUCTOR_CK_DIR env variable is used; previously, this var was used to point to rocm-composable-kernel package installation path on the filesystem; now, the path is inferred by trying to import ck4inductor
-> the tests are updated to reflect this change
-> since in CI clang points to a bash script which invokes sccache, we cannot patch PATH to not contain sccache, this logic is removed from the testing code
-> scaled_mm test crashes during the benchmarking when the benchmarking happens in the main process, and times out benchmarking when it happens in a subprocess, on gfx942, so it is disabled
TBD: roll back rocm-mi300 workflow before merging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162288
Approved by: https://github.com/jeffdaily
----
This PR will be part of a series of PR's that aims to remove `.ci/aarch64_linux` folder entirely, such that Aarch64 manylinux build happens as part of `.ci/manywheel/build.sh`, the same as other platforms.
In this PR:
- We prebuild + install Arm Compute Library in the manylinux docker image (at /acl), instead of at build time for every pytorch build. Also updated the jammy install path to be /acl too.
- We can therefore remove build_ArmComputeLibrary functions from the ci build scripts.
- There is also some refactoring of install_openblas.sh and install_acl.sh to align them together ( similar formatting, similar variable names, same place for version number update )
- We had 2 places to define openblas version, this has been reduced to 1 now ( install_openblas.sh ).
- ACL_VERSION and OPENBLAS_VERSION can now be overridden at the build.sh level for developers, but there is only 1 version of each hardcoded for CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159737
Approved by: https://github.com/seemethere
ghstack dependencies: #160078
Fixes aarch64 linux packaging, following error:
https://github.com/pytorch/vision/actions/runs/17612462583/job/50037380487#step:15:62
```
Traceback (most recent call last):
File "/__w/vision/vision/pytorch/vision/setup.py", line 13, in <module>
import torch
File "/__w/_temp/conda_environment_17612462583/lib/python3.11/site-packages/torch/__init__.py", line 415, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libarm_compute.so: cannot open shared object file: No such file or directory
```
Due to missing dependencies.
Current Error:
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl renamed as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
Hence the repackaging does not take effect.
This PR does following
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl deleted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
It looks like, after migrating from zipping the wheel to `wheel pack`, renaming the wheel is no longer necessary; hence we remove the renaming and delete the old file.
```
2025-09-10T10:10:05.9652454Z Using nvidia libs from pypi - skipping CUDA library bundling
2025-09-10T10:10:05.9656595Z Copying to /pytorch/dist/tmp/torch/lib/libgomp.so.1
2025-09-10T10:10:05.9873843Z Copying to /pytorch/dist/tmp/torch/lib/libgfortran.so.5
2025-09-10T10:10:06.0410041Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute.so
2025-09-10T10:10:06.2869242Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute_graph.so
2025-09-10T10:10:06.4385740Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_lp64_gomp.so.0
2025-09-10T10:10:06.5461372Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_lp64_gomp.so.0
2025-09-10T10:10:06.5728970Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_core.so.0
2025-09-10T10:10:06.6231872Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_core.so.0
2025-09-10T10:10:14.1503110Z Updated tag from Tag: cp310-cp310-linux_aarch64
2025-09-10T10:10:14.1503482Z to Tag: cp310-cp310-manylinux_2_28_aarch64
2025-09-10T10:10:14.1503682Z
2025-09-10T10:10:41.6498892Z Repacking wheel as /pytorch/dist/torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
2025-09-10T10:10:41.9394460Z Renaming torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl wheel to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
```
Test Plan, Executed on local file:
```
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/WHEEL
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/entry_points.txt
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/top_level.txt
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/RECORD
Bundling CUDA libraries with wheel
Updated tag from Tag: cp310-cp310-manylinux_2_28_aarch64
to Tag: cp310-cp310-manylinux_2_28_aarch64
Repacking wheel as ubuntu/dist/torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
Copying torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl to artifacts
Build Complete. Created torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl..
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162566
Approved by: https://github.com/jeanschmidt, https://github.com/NicolasHug
Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`
In particular, it will not try to start the test (forking subprocesses) just to stop it again (killing the subprocesses), which is what happens in the test setup.
Using `unittest.skip` decorators avoids starting the test in the first place.
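A minimal sketch of the difference (test and condition names are illustrative):
```python
import sys
import unittest
import torch

class MyDistTest(unittest.TestCase):
    @unittest.skipIf(not torch.cuda.is_available(), "requires CUDA")
    def test_new_style(self):
        ...  # the runner skips this before setUp, so no worker processes are forked

    def test_old_style(self):
        if not torch.cuda.is_available():
            sys.exit(0)  # anti-pattern: setUp already forked workers that now get killed
```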
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
Summary: `executorch_call_delegate` should have flattened inputs and outputs so that it can be correctly serialized and the input/output specs are consistent with the runtime.
Test Plan:
CI
Rollback Plan:
Differential Revision: D82064354
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162538
Approved by: https://github.com/dolpm
Note. This is a replica PR of #155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.
### Summary
🚀 This PR moves the prioritized text linker optimization from setup.py to cmake (and enables it by default on Linux aarch64 systems).
This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.
### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.
Note:
Due to ninja/cmake graph generation issues we cannot apply the linker file globally to all targets, so the targets must be manually defined. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, and torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.
Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160078
Approved by: https://github.com/seemethere
While in the middle of a big refactor simplifying the bookkeeping for device mesh, we found an interesting bug inside the DeviceMesh flatten implementation. Here is the finding:
1. In unit tests, we assume users can call `dp_cp_mesh._flatten()` many times but no new backend will be created (it is cached).
2. From the implementation of slicing, we actually throw an exception when `_flatten` is called more than once. There is a bug that was partially fixed in https://github.com/pytorch/pytorch/pull/160709, but that fix does not cover the check for the case when we call `_flatten` twice.
The more important question to ask is: what behavior do we want for `_flatten`? Do we allow calling `_flatten` multiple times (with the same mesh_name)? I think we should. Why?
1. We allow slicing for the same mesh_name or name_list multiple times, and we cache the PGs behind them. Although we return a new device mesh object every time, when we compare them they are all equal (according to `__eq__`).
2. We actually cache the flattened mesh today inside `root_to_flatten_mapping` and do an early return, but that line is never reached if we error out before it.
Also, flattening a 1D mesh into its own mesh_dim_name should be a no-op; I added a unit test for it.
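A minimal sketch of the intended semantics (dim names are illustrative; assumes an 8-rank setup):
```python
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
dp_cp = mesh["dp", "cp"]._flatten("dp_cp")        # creates and caches the flattened mesh
dp_cp_again = mesh["dp", "cp"]._flatten("dp_cp")  # should be a cached no-op, not an error
assert dp_cp == dp_cp_again

tp = mesh["tp"]
assert tp._flatten("tp") == tp                    # flattening a 1D mesh into its own name is a no-op
```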
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161311
Approved by: https://github.com/fegin
I suspected that I would need to repack vLLM wheels from https://github.com/pytorch/pytorch/pull/162000 because I renamed the wheel, and it turns out to be true. The error is as follows:
```
$ uv pip install --pre xformers --index-url https://download.pytorch.org/whl/nightly/cu129
Using Python 3.12.11+meta environment at: venv/py3.12
Resolved 28 packages in 759ms
error: Failed to install: xformers-0.0.33.dev20250901+cu129-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (xformers==0.0.33.dev20250901+cu129)
Caused by: Wheel version does not match filename: 0.0.33+5d4b92a5.d20250907 != 0.0.33.dev20250901+cu129
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162371
Approved by: https://github.com/atalman
Summary: This adds a `_rank` field to DeviceMesh init that allows for instantiating a DeviceMesh without depending on `dist.get_rank()` which requires a global PG to be instantiated.
Test Plan:
```
buck2 test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:device_mesh -- init_backend
```
Rollback Plan:
Differential Revision: D81981777
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162439
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
Fixing issue introduced in https://github.com/pytorch/pytorch/pull/158538
where `aten.copy_.default` is registered as a pointwise op, but without linearity.
In particular, when both `src` and `dst` tensors have same `Partial` placements, direct copy should happen without redistribute, instead of redistributing both to `Replicate` before making the copy.
This was discovered from silent incorrect results e.g. on `torch.einsum` backward.
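A minimal sketch of the scenario, using a fake process group (placements and shapes are illustrative):
```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Partial
from torch.testing._internal.distributed.fake_pg import FakeStore

dist.init_process_group("fake", store=FakeStore(), rank=0, world_size=2)
mesh = init_device_mesh("cpu", (2,))

src = DTensor.from_local(torch.randn(4), mesh, [Partial()], run_check=False)
dst = DTensor.from_local(torch.zeros(4), mesh, [Partial()], run_check=False)
dst.copy_(src)  # with the fix: a direct local copy, no redistribute of both sides to Replicate
dist.destroy_process_group()
```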
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162460
Approved by: https://github.com/zpcore
Summary: Avoid multiple storage writer resets in async save. Currently the reset gets called by the async_save method and then again in the save method. In the async path, async_save should only do the staging and the reset should only happen in the synchronous save path.
Test Plan:
```
buck test 'fbcode//mode/opt' //aiplatform/modelstore/experimental/DCP/tests:checkpoint_dist_client_test
```
https://www.internalfb.com/intern/testinfra/testrun/15199648841705052
Rollback Plan:
Differential Revision: D79230339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159448
Approved by: https://github.com/meetv18
Fixes#154849
This change addresses the request to add support for SIGUSR1 and SIGUSR2 signals in torchrun for SLURM environments. The change supports these signals through the configurable `TORCHELASTIC_SIGNALS_TO_HANDLE` environment variable and the `signals_to_handle` parameter of the launcher API.
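A minimal sketch of opting in via the environment variable (the comma-separated format is an assumption based on the feature description):
```python
import os

os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"] = "SIGTERM,SIGINT,SIGUSR1,SIGUSR2"
```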
Tests:
For validations purpose:
test_signal_handling.py,
simple_test_api_signal_handling.py,
Unit Tests:
for launcher changes:launcher/test_api.py
for api changes: multiprocessing/test_api.py
E2E: test_run.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160690
Approved by: https://github.com/fduwjj
I confirmed that the tracing was correct i.e. NamedTupleVariable had the correct dynamic attribute added to it.
The problem was that NamedTupleVariable was always marked as immutable. This does not reflect the behavior of namedtuple.
Subclasses of namedtuple may be mutable, so when a NamedTupleVariable is derived from a subclass that is mutable, I made NamedTupleVariable mutable as well. Then side_effects correctly updates the returned object.
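A minimal sketch of the behavior being fixed (names are illustrative; backend="eager" keeps the example light):
```python
import collections
import torch

class Point(collections.namedtuple("Point", ["x", "y"])):
    pass  # a namedtuple *subclass*: instances can take new attributes, unlike plain namedtuples

@torch.compile(backend="eager")
def fn(p, t):
    p.extra = t.sin()  # Dynamo must track this as a mutation and replay it via side_effects
    return p.extra + 1

p = Point(1, 2)
fn(p, torch.randn(3))
assert hasattr(p, "extra")  # the mutation is visible on the original object after the call
```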
Fixes #161610
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161645
Approved by: https://github.com/anijain2305, https://github.com/StrongerXi
This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to the existing Eager and JIT modes. It also adds peak memory measurement (fwd/bwd pass), improves the JSON output format used by the dashboard for reporting, and introduces some more CLI options. The new CLI flags introduced are:
- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name
Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
My goal right now is to try to make the "vanilla" AccumulateGrad path for DTensor (that just calls detach) fast. I'm doing this in two steps:
(1) [this PR]: hardcode aten.detach in DTensor to re-use the input tensor's DTensorSpec, instead of running "real" sharding prop.
(2) [assuming success of 1]: move the detach() call into C++, try adding a DTensor dispatch key, and avoid dispatching back to python entirely (except for some code that probably needs to allocate a pyobject for the output DTensor, from C++)
I'm pushing this PR first to confirm that I don't break anything with my detach fastpath. I did some manual local testing to confirm that for normal usages of detach, the input and output DTensor have equal DTensorSpec objects. Technically, we previously would allocate a fresh DTensorSpec, and with this change we are just re-using the input tensor's DTensorSpec. So I'm mostly hoping that DTensorSpecs don't generally get mutated
This by itself does seem to speed up `alias` by quite a bit (roughly 2.5x speedup, from ~336us -> 133us):
**aten.detach(plain_tensor)**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f8da2921790>
_ = x.detach()
4.80 us
1 measurement, 100000 runs , 1 thread
```
**aten.detach(DTensor) [before this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f47cd68e750>
_ = x_dt.detach()
336.40 us
1 measurement, 1000 runs , 1 thread
```
**aten.detach(DTensor) [after this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f0a34c05520>
_ = x_dt.detach()
Median: 133.45 us
2 measurements, 1000 runs per measurement, 1 thread
```
benchmark script:
```
import torch
import torch.distributed as dist
from torch.distributed.tensor import DeviceMesh, DTensor, Partial, Replicate, Shard
from torch.testing._internal.distributed.fake_pg import FakeStore
import torch.utils.benchmark as benchmark
fake_store = FakeStore()
dist.init_process_group("fake", store=fake_store, rank=0, world_size=2)
mesh = torch.distributed.device_mesh.init_device_mesh('cuda', (2,))
x = torch.randn(4, 4, requires_grad=True)
x_dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
t0 = benchmark.Timer(
stmt='_ = x_dt.detach()',
globals={'x_dt': x_dt},
)
print(t0.blocked_autorange())
dist.destroy_process_group()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160580
Approved by: https://github.com/ezyang
# why
- unnecessary as we only ever need to know the dtype and maybe the
device
- we already take in the kernel inputs which have the device
- enable us to specify the layout after finding all the configs
but before generating the ChoiceCallers
# what
- replace all calls in template_heuristics that used to take Layout
with now just taking out_dtype
# testing
ci
Differential Revision: [D81820115](https://our.internmc.facebook.com/intern/diff/D81820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162238
Approved by: https://github.com/eellison
ghstack dependencies: #161347, #161348, #161349
# why
- enable us to override the default configs, or fall back to them
through subclassing InductorChoices
# what
- override (private) function
- default implementation takes the kernel template choice (ktc)
generator for every template and just executes the generator
- future overrides can decide to replace those generators, or filter
out choices
- the 2nd expensive step (maybe_append_choices, choice_or_none) is
handled outside this function, in the main V.choices.get_mm_configs
this means that any overriding benefits from not generating expensive
templates that aren't going to be used
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520570](https://our.internmc.facebook.com/intern/diff/D81520570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161349
Approved by: https://github.com/eellison
ghstack dependencies: #161347, #161348
# why
- every callsite just executes the generator on the spot
- previous pr adds the ability to add an override before expensive
generators are executed, so we don't need this generator anymore
# what
- rather than yielding the ChoiceCaller, just return the list of all
valid ChoiceCallers
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161348
Approved by: https://github.com/eellison
ghstack dependencies: #161347
# why
- gather everything up to make choices, without running
potentially expensive generators
- enables overrides where we toss the entire list of configs
from inductor, without having to enumerate it (expensive)
# what
- add a holding class that just gets all the components necessary
to generate a ChoiceCaller
- use that class to generate ChoiceCallers
- this does not (yet) add the override function, but just prepares
the scene
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347
Approved by: https://github.com/eellison
Our compiler is generating inefficient code for the offsetCalc in certain situations.
The root cause for this needs to be identified. For now, specialized unrolling based on 'dims' notably helps perf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161700
Approved by: https://github.com/jeffdaily