pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-26 00:24:53 +08:00

Author	SHA1	Message	Date
Natalia Gimelshein	2d0cdee394	move thread-local capture mode guard to include work.isStarted (#160398 ) Per title, should fix capture errors that happen because nccl watchdog races with capture start. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160398 Approved by: https://github.com/aorenste	2025-08-12 19:25:04 +00:00
eqy	9903ca4f70	[cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140 ) The native kernel doesn't support batch splitting so the previous check wasn't aggressive enough in dispatching to cuDNN https://github.com/pytorch/pytorch/issues/155225 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140 Approved by: https://github.com/ngimel, https://github.com/atalman	2025-08-12 18:07:41 +00:00
PyTorch MergeBot	f341077ce4	Revert "[ROCm] Support large inputs for coalesceValuesKernel (#158281 )" This reverts commit a7abf57aabec0ce686092e2d66e53ba185dbc56b. Reverted https://github.com/pytorch/pytorch/pull/158281 on behalf of https://github.com/clee2000 due to broke windows cuda build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16915172288/job/47927141460) [HUD commit link](`a7abf57aab`). Not caught b/c PR didn't have ciflow/trunk ([comment](https://github.com/pytorch/pytorch/pull/158281#issuecomment-3180408766))	2025-08-12 17:57:57 +00:00
Edward Z. Yang	3cec82a7e9	Ensure outer aliasing on DTensor matches inner aliasing (#158954 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158954 Approved by: https://github.com/albanD, https://github.com/wconstab	2025-08-12 17:47:48 +00:00
Jerry Mannil	ee9f8ba11d	[ROCm] Use opportunistic fastatomics based on heuristics (#159430 )
* Opportunistic fast atomics work better with small sizes, since there is more chance of lanes doing atomics on the same address
Co-author: @amd-hhashemi
Reproducer:
```
import time
import torch

x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)

for _ in range(20):
    x.index_add_(0, ind, src)

start_time = time.time()
for i in range(100):
    x.index_add_(0, ind, src)
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time)/100
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")
```
Perf numbers:
```
Before: Avg time for index_add_: 25652.16 us
After: Avg time for index_add_: 2675.15 us
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159430 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2025-08-12 17:13:54 +00:00
David Berard	1f4057c11a	[inductor] remove no_x_dim (#159810 ) no_x_dim is used to indicate that a reduction operates on a single row, and data loaded for the reduction is 1-dimensional. no_x_dim was introduced in https://github.com/pytorch/pytorch/pull/102444 - in which there was bad perf in some reductions, and using 1D tensors fixed the perf issue. However, it appears that this perf issue no longer exists in current Triton versions. https://github.com/pytorch/pytorch/pull/118822 checked this, and we can also check this on H100 benchmarks (linked below). And another motivation for removing this behavior is that it enables larger loads, which we observe is necessary for good performance on certain shapes on Blackwell. H100 inference benchmarks: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a H100 training benchmarks: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a Overall, the benchmarks show minimal change in performance. Differential Revision: [D79599286](https://our.internmc.facebook.com/intern/diff/D79599286) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159810 Approved by: https://github.com/ngimel, https://github.com/eellison	2025-08-12 17:10:31 +00:00
Jovian Anthony Jaison	94b91a8763	[redone][pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#160352 ) Summary: Writing torch.compile worker logs to dedicated_log_rank{RANK} if we're running on mast. ref: D79456310 (got reverted because of linter) Testing: Refer differential Revision: D79917440 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160352 Approved by: https://github.com/masnesral	2025-08-12 16:49:08 +00:00
Xinya Zhang	a7abf57aab	[ROCm] Support large inputs for coalesceValuesKernel (#158281 )
# Description
`.coalesce` cannot handle large inputs on ROCm due to the maximum grid size limit. This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for the original `Y` on ROCm to avoid this limitation. Confirmed the new approach can handle large inputs. Correctness needs validation.
# Testing Command
`python torch_spmv.py 22500000 272500000`
## Script `torch_spmv.py`
```python
import torch
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch"
    )
    parser.add_argument("n", type=int, help="Size of the NxN matrix")
    parser.add_argument("nnz", type=int, help="Number of non-zero entries")
    return parser.parse_args()

def main():
    args = parse_args()
    n = args.n
    nnz = args.nnz
    dtype = torch.float32
    device = torch.device('cuda')

    # Generate random indices for the sparse matrix in COO format.
    torch.manual_seed(42)
    rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    indices = torch.stack([rows, cols], dim=0)

    # Generate random values.
    values = torch.randn(nnz, dtype=torch.float32, device=device)

    # Create the sparse COO matrix and move it to the target device.
    sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device)
    sparse_matrix = sparse_matrix.coalesce()

    # Generate a random dense vector.
    dense_vector = torch.randn(n, dtype=torch.float32, device=device)

    # Perform sparse matrix - dense vector multiplication.
    # Using torch.sparse.mm which expects a 2D tensor for the vector.
    result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze()
    # result = torch.mv(sparse_matrix, dense_vector)

    # Print the result.
    print("Result of the multiplication:")
    print(torch.sum(result))

if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2025-08-12 16:42:55 +00:00
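The `X` to (`X`, `Y`) split described in #158281 can be pictured with plain integer arithmetic. The following is a minimal sketch, not the kernel's launch code: `MAX_GRID_DIM`, `split_grid`, and `linear_block_id` are hypothetical names, and the limit value is an assumed placeholder rather than a constant taken from ROCm.

```python
# Hypothetical per-axis grid limit, chosen only for illustration.
MAX_GRID_DIM = 65_535


def split_grid(num_blocks: int, max_dim: int = MAX_GRID_DIM) -> tuple[int, int]:
    """Fold a 1-D block count into an (x, y) grid with each axis <= max_dim."""
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    grid_x = min(num_blocks, max_dim)
    # Ceiling division in pure integer arithmetic.
    grid_y = (num_blocks + grid_x - 1) // grid_x
    if grid_y > max_dim:
        raise ValueError("block count does not fit in a 2-D grid")
    return grid_x, grid_y


def linear_block_id(block_x: int, block_y: int, grid_x: int) -> int:
    """Recover the original 1-D block index from the 2-D coordinates."""
    return block_y * grid_x + block_x
```

Because `grid_x * grid_y` can exceed `num_blocks`, a kernel using this scheme would have to return early for any block whose linear index is `>= num_blocks`.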
PyTorch MergeBot	f7b2f3314c	Revert "[triton_heuristics] Optimize the triton launcher in pt2 (#160000 )" This reverts commit d0e2240f680ea2a553f7ee8188f52482e130bfd0. Reverted https://github.com/pytorch/pytorch/pull/160000 on behalf of https://github.com/davidberard98 due to D80054972 failing with test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1_tdlp_1 ([comment](https://github.com/pytorch/pytorch/pull/160000#issuecomment-3180144676))	2025-08-12 16:33:02 +00:00
Jeff Daily	9d37c960a4	[ROCm][CI] use new benchmark image for dynamo (#160421 ) Follow-up to #160047 that separated the rocm image into default CI and benchmarks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160421 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-12 16:07:19 +00:00
PyTorch MergeBot	b219ca2a00	Revert "Update triton xpu commit to support python 3.14 (#160183 )" This reverts commit 7fbc22855c17741ae016992803b2e147a13aa22d. Reverted https://github.com/pytorch/pytorch/pull/160183 on behalf of https://github.com/clee2000 due to I'm not sure how, but it seems to have broken inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration [GH job link](https://github.com/pytorch/pytorch/actions/runs/16911267995/job/47917091939) [HUD commit link](`7fbc22855c`). Maybe because the docker build changed? Note to self: not bad TD ([comment](https://github.com/pytorch/pytorch/pull/160183#issuecomment-3179840160))	2025-08-12 15:29:19 +00:00
atalman	b7db86600a	Fix Tensor illustration, use permalinks for image embedding in Readme.md (#160416 ) Fixes Tensor illustration being broken on pypi.org. Also uses permalinks instead of links to images for embedding as per this suggestion of Alban: https://github.com/pytorch/pytorch/pull/160187#discussion_r2262978006 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160416 Approved by: https://github.com/malfet	2025-08-12 15:15:12 +00:00
James Wu	9708fcf92d	Account for triton kernel source code hidden in custom ops properly in AOTAutogradCache (#160120 ) This PR fixes a bug where user defined triton kernels hidden behind `triton_op` do not register source code changes. If a user only changes a triton kernel source_code, because triton kernels are hidden under the custom op, dynamo hasn't traced into them yet. This means at AOTAutograd time, we don't know the list of triton kernels that are defined by custom ops. This is an initial fix for the issue by parsing the AST of the custom op looking for triton kernels. This won't catch more degenerate cases if the custom op calls other custom ops/functions that then call triton kernels, and then the toplevel compiled graph doesn't know about it. To handle that, we'd have to trace through the custom op at dynamo time. This should handle 99% of cases, though. I added an expectedFailure test to show the limitation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160120 Approved by: https://github.com/zou3519	2025-08-12 14:11:06 +00:00
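The AST-based lookup described in #160120 can be illustrated with the standard `ast` module. This is a minimal sketch under stated assumptions, not the PR's implementation: `find_launched_kernel_names` and `EXAMPLE_OP` are hypothetical, and the sketch only recognizes the `name[grid](...)` launch shape.

```python
import ast
import textwrap


def find_launched_kernel_names(source: str) -> set[str]:
    """Collect names used in `name[grid](...)` launch expressions."""
    tree = ast.parse(textwrap.dedent(source))
    names = set()
    for node in ast.walk(tree):
        # A launch parses as Call(func=Subscript(value=Name(id=...))).
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Subscript)
            and isinstance(node.func.value, ast.Name)
        ):
            names.add(node.func.value.id)
    return names


# Hypothetical custom-op body used only as input text; it is never executed.
EXAMPLE_OP = '''
def my_add(x, y):
    out = helper(x)
    add_kernel[(4,)](x, y, out)
    return out
'''
```

Calling `find_launched_kernel_names(EXAMPLE_OP)` finds `add_kernel` but nothing launched inside `helper`, which mirrors the limitation the commit message describes for custom ops that call other functions.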
Wang, Chuanqi	a288b15ea9	[CI] Reduce XPU Windows build time (#159763 ) Reduce the time cost from 2.5 hours to about 1.5 hours. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159763 Approved by: https://github.com/EikanWang, https://github.com/atalman	2025-08-12 14:04:29 +00:00
Wang, Chuanqi	7fbc22855c	Update triton xpu commit to support python 3.14 (#160183 ) Follow PR #159725 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160183 Approved by: https://github.com/EikanWang, https://github.com/atalman	2025-08-12 14:02:36 +00:00
IvanKobzarev	f33ce40bc0	[bucketing] Bucket only adjacent collectives to prevent reordering (#159983 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159983 Approved by: https://github.com/wconstab, https://github.com/eellison	2025-08-12 11:57:00 +00:00
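The idea in the title of #159983, bucketing only collectives that sit next to each other so that nothing is moved across an intervening op, can be sketched on a flat list. This is an illustrative assumption about the behavior, not the Inductor pass itself; `bucket_adjacent` and the op names are hypothetical.

```python
from typing import Callable


def bucket_adjacent(ops: list[str], is_collective: Callable[[str], bool]) -> list[list[str]]:
    """Group consecutive collectives; any other op closes the current bucket."""
    buckets: list[list[str]] = []
    current: list[str] = []
    for op in ops:
        if is_collective(op):
            current.append(op)
        elif current:
            # A non-collective op ends the run, so later collectives
            # start a new bucket instead of being hoisted across it.
            buckets.append(current)
            current = []
    if current:
        buckets.append(current)
    return buckets
```

With `ops = ["ag0", "ag1", "mm", "ag2"]`, the first two collectives share a bucket and `ag2` stays behind `mm` in its own bucket.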
Animesh Jain	4d5b3f2d5a	[dynamo][guards] Install dict watchers for recursive dict tag optimization (#159796 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159796 Approved by: https://github.com/jansel	2025-08-12 09:49:11 +00:00
zeshengzong	f990490a23	Add `label_smoothing` param in `nn.BCELoss` and `nn.BCEWithLogitsLoss` (#150282 ) Fixes #91545 ## Changes - Add `label_smoothing` param and docs - Add test case for `label_smoothing` - Remove duplicate description in `nn.BCELoss` and `nn.BCEWithLogitsLoss` ## Test Result ```bash pytest -s test/test_nn.py -k test_bce ``` ![image](https://github.com/user-attachments/assets/30c0b7fe-fe49-4aa0-9b05-4d70403a7b05) ![image](https://github.com/user-attachments/assets/4fe3fd1c-54b8-4012-afd9-133ce9fb4964) ![image](https://github.com/user-attachments/assets/5cad019a-3a4c-475a-9fde-9c1acad5792d) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150282 Approved by: https://github.com/cyyever, https://github.com/mikaylagawarecki	2025-08-12 09:37:03 +00:00
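The commit message for #150282 does not state the smoothing formula. The sketch below assumes one common convention for binary targets, pulling each target toward 0.5, and should be read as an illustration only; `smooth_binary_target` and `bce` are hypothetical helpers written in plain Python, not the `nn.BCELoss` code.

```python
import math


def smooth_binary_target(target: float, label_smoothing: float) -> float:
    """Assumed convention: blend the hard target with the uniform value 0.5."""
    return target * (1.0 - label_smoothing) + 0.5 * label_smoothing


def bce(prob: float, target: float, label_smoothing: float = 0.0) -> float:
    """Binary cross-entropy for one prediction, with optional smoothing."""
    t = smooth_binary_target(target, label_smoothing)
    return -(t * math.log(prob) + (1.0 - t) * math.log(1.0 - prob))
```

Under this assumption, `label_smoothing=0.2` turns targets 1 and 0 into 0.9 and 0.1, so a confident wrong-side prediction is penalized slightly less and a confident correct one slightly more.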
morrison-turnansky	b9003ed3d8	Dynamo Deep Dive Documentation Fix (#158860 ) changed SourceBuilder to VariableBuilder Fixes #158447 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158860 Approved by: https://github.com/mlazos	2025-08-12 08:53:33 +00:00
Laith Sakka	fea7e9dd37	extract shape in _view_has_unbacked_input (#160255 ) Summary: We were still getting DDE on reshape. I looked deeper and found an issue in _view_has_unbacked_input: when the input is [[,,]] it needs to be normalized to [..] Test Plan: existing tests. Rollback Plan: Differential Revision: D79951119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160255 Approved by: https://github.com/bobrenjc93	2025-08-12 08:38:19 +00:00
Jovian Anthony Jaison	9a0f7a3bb0	[retry-land][pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#160348 ) refer: https://github.com/pytorch/pytorch/pull/159655 Earlier pr failed on dynamo/test_utils.py::TestDynamoTimed::test_dynamo_timed. Updated test_dynamo_timed + re-ran locally to test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160348 Approved by: https://github.com/masnesral	2025-08-12 06:24:54 +00:00
Animesh Jain	01bcf9a40d	Bump transformers pin (#159291 ) Trying to update hf pin. Benchmarking run to figure out issues <img width="1356" height="123" alt="image" src="https://github.com/user-attachments/assets/fbc435f3-a7cb-4280-9636-2ea6d15d7b6d" /> Retrying - https://github.com/pytorch/pytorch/pull/156118 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159291 Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-08-12 05:14:17 +00:00
Animesh Jain	8d3d1c8443	[dynamo] fixes to propagate tag safeness (#159807 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159807 Approved by: https://github.com/jansel	2025-08-12 04:50:13 +00:00
PyTorch UpdateBot	0f3b10b8ee	[audio hash update] update the pinned audio hash (#160384 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160384 Approved by: https://github.com/pytorchbot	2025-08-12 04:38:04 +00:00
Boyuan Feng	5f1010fbb3	[Graph Partition] Pass all OSS unit tests (#154667 ) Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315). Run the same diff on two days and both show speedup on average. [first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d) <img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" /> [second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf) <img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667 Approved by: https://github.com/eellison	2025-08-12 04:37:58 +00:00
Nikita Shulga	edaa151d0d	[CI] Move CUDA tests to trunk workflow (#160379 ) The trunk workflow is run before a PR is merged anyway, but 3X less frequently than the pull workflow according to [Flambeau](https://pytorchci.grafana.net/public-dashboards/1c571e79090443eaaa9811db71f8d23b) <img width="796" height="573" alt="image" src="https://github.com/user-attachments/assets/0235e610-4e1c-4be5-88bf-ea8278d1c656" /> I.e. this will probably result in a somewhat longer time to signal, but considering that the frequency of changes to eager PyTorch-on-CUDA has slowed down and Inductor changes are decorated with ciflow/inductor, this looks like an acceptable tradeoff to reduce costs Pull Request resolved: https://github.com/pytorch/pytorch/pull/160379 Approved by: https://github.com/izaitsevfb	2025-08-12 04:23:50 +00:00
rzou	10bc36fe84	Get tensor subclasses and torch.library.triton_op to dispatch correctly (#160341 ) Short-term fix for https://github.com/pytorch/pytorch/issues/160333 The problem is: 1) `triton_op` adds a decomposition for FunctionalTensorMode for this operation 2) Tensor Subclasses rely on FunctionalTensorMode's `__torch_dispatch__` returning NotImplemented. 3) `triton_op`'s FunctionalTensorMode decomposition takes precedence over FunctionalTensorMode's decomposition. The easy fix is to copy-paste the FunctionalTensorMode's NotImplemented return logic into the decomposition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160341 Approved by: https://github.com/drisspg	2025-08-12 04:09:37 +00:00
PyTorch UpdateBot	32e5e2f596	[vllm hash update] update the pinned vllm hash (#160259 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160259 Approved by: https://github.com/pytorchbot	2025-08-12 04:04:53 +00:00
Scott Todd	bfc873d02e	[ROCm][Windows] Revert copying hipblaslt and rocblas dirs. (#159083 ) This reverts the changes from `b367e5f6a6`. This will also close https://github.com/pytorch/pytorch/pull/158922. Since `30387ab2e4`, ROCm is bootstrapped using the 'rocm' Python module which contains these files (see https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md), so they do not need to be bundled into torch/lib. There was also a bug in here - if `ROCM_DIR` is unset, the code crashes:
```
  File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1002, in run_command
    cmd_obj.run()
  File "D:\b\pytorch_main\setup.py", line 853, in run
    rocm_dir_path = Path(os.environ["ROCM_DIR"])
                         ~~~~~~~~~~^^^^^^^^^^^^
  File "<frozen os>", line 714, in __getitem__
KeyError: 'ROCM_DIR'
```
The code could have checked for `ROCM_PATH` too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159083 Approved by: https://github.com/jeffdaily	2025-08-12 02:45:49 +00:00
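The `KeyError` mentioned in #159083 comes from indexing `os.environ` directly. As a minimal sketch of the more defensive lookup the message hints at (the PR itself removes the code rather than patching it), a hypothetical helper `find_rocm_dir` could try `ROCM_DIR` and then `ROCM_PATH` and return `None` when neither is set:

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_rocm_dir(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the ROCm directory from ROCM_DIR or ROCM_PATH, or None if unset."""
    for name in ("ROCM_DIR", "ROCM_PATH"):
        # .get() returns None instead of raising KeyError for a missing variable.
        value = env.get(name)
        if value:
            return Path(value)
    return None
```

Passing the environment as a parameter keeps the helper easy to test; callers then decide whether a missing directory is an error or simply means there is nothing to copy.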
Scott Todd	eed9dbf70f	[ROCm] Add torch/_rocm_init.py to .gitignore. (#159806 ) Follow-up to https://github.com/pytorch/pytorch/pull/155285. Build scripts like https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py generate this file with contents like:
```python
def initialize():
    import rocm_sdk
    rocm_sdk.initialize_process(
        preload_shortnames=['amd_comgr', 'amdhip64', 'hiprtc', 'hipblas', 'hipfft', 'hiprand', 'hipsparse', 'hipsolver', 'hipblaslt', 'miopen'],
        check_version='7.0.0rc20250804')
```
We may also have https://github.com/pytorch/pytorch/blob/main/tools/amd_build/build_amd.py do the same thing as more of that build support moves here into the upstream PyTorch repository itself (see https://github.com/pytorch/pytorch/issues/159520). This file is then loaded if present here: `a7f3bdf550/torch/__init__.py (L145-L157)` Given that the file is generated by build scripts, I think adding it to `.gitignore` makes sense, as that will prevent accidental check-ins and keep local history cleaner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159806 Approved by: https://github.com/jeffdaily	2025-08-12 02:24:21 +00:00
Natalia Gimelshein	be53f609aa	fix retaining multimem in symmetric memory (#160343 ) fixes OOM in #160289 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160343 Approved by: https://github.com/eqy	2025-08-12 02:03:20 +00:00
Zain Rizvi	95210cc409	[BE] Isolate pre-push hook dependencies in dedicated virtual environment (#160048 )
This adds two changes:
- Isolates pre-push hook dependencies into an isolated venv, so they no longer affect your system environment
- Lets you manually run the pre-push lintrunner (including with lintrunner -a) by invoking `python scripts/lintrunner.py [-a]` (it's ugly, but better than nothing...for now)

This is a follow up to:
- https://github.com/pytorch/pytorch/pull/158389

## Problem
The current pre-push hook setup installs lintrunner and related dependencies globally, which makes developers nervous about system pollution and can cause version conflicts with existing installations. Also, if the pre-push lintrunner found errors, you had to hope your normal lintrunner could fix them (which wasn't always the case, e.g. if those errors only manifested in certain python versions)

## Key Changes:
- Isolated Environment: Creates .git/hooks/linter/.venv/ with Python 3.9 (the python used in CI) and an isolated lintrunner installation
- User-Friendly CLI: New python scripts/lintrunner.py wrapper allows developers to run lintrunner (including -a auto-fix) from any environment
- Simplified Architecture: Eliminates pre-commit dependency entirely - uses direct git hooks

File Changes:
- scripts/setup_hooks.py: Rewritten to create isolated uv-managed virtual environment
- scripts/lintrunner.py: New wrapper script with shared hash management logic
- scripts/run_lintrunner.py: Removed (functionality merged into lintrunner.py)
- .pre-commit-config.yaml: Removed (no longer needed)

## Usage:
```
# Setup (run once)
python scripts/setup_hooks.py

# Manual linting (works from any environment)
python scripts/lintrunner.py      # Check mode
python scripts/lintrunner.py -a   # Auto-fix mode

# Git hooks work automatically
git push                          # Runs lintrunner in isolated environment

# Need to skip the pre-push hook?
git push --no-verify
```

## Benefits:
- ✅ Zero global dependency installation
- ✅ Per-repository isolation prevents version conflicts
- ✅ Full lintrunner functionality is now accessible

## Implementation Notes:
- Virtual env is kept in a dedicated dir in .git, to keep per-repo mechanics
- lintrunner.py does not need to be invoked from a specific venv. It'll invoke the right venv itself.

A minor bug: It tends to garble the lintrunner output a bit, like the screenshot below shows, but I haven't found a workaround so far and it remains understandable to users: <img width="241" height="154" alt="image" src="https://github.com/user-attachments/assets/9496f925-8524-4434-8486-dc579442d688" />

## What's next?
Features that could be added:
- Check for lintrunner updates, auto-update if needed
- Depending on dev response, this could be enabled by default for all pytorch/pytorch environments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160048 Approved by: https://github.com/seemethere	2025-08-12 01:58:46 +00:00
Ramya Ramineni	7a974a88f2	[ROCm] Fix resource_strings.h (#159996 ) This PR fixes errors like the one below:
```
[rank7]: RuntimeError: /tmp/comgr-c3c81b/input/CompileSourceejOPx6:34:8: error: unknown type name 'uint64_t'; did you mean '__hip_internal::uint64_t'?
[rank7]: 34 | if(((uint64_t) t0.data) % (4 * sizeof(half)) != 0) flag_vec4 = false;
```
The following data types need to be defined in `torch/csrc/jit/codegen/fuser/cuda/resource_strings.h` for ROCm versions >= 7.0.
```
typedef unsigned char uint8_t;
typedef signed char int8_t;
typedef short int int16_t;
typedef long long int int64_t;
typedef unsigned long long int uint64_t;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159996 Approved by: https://github.com/pruthvistony, https://github.com/Skylion007, https://github.com/jeffdaily	2025-08-12 01:58:02 +00:00
henrylhtsang	f3f159ff8c	[BE][cutlass backend] Reduce severity of log message for no cutlass config found (#160148 ) This is not really a problem. Sometimes we cannot find a cutlass config due to shape, e.g. when k is odd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160148 Approved by: https://github.com/mlazos, https://github.com/Skylion007	2025-08-12 01:41:58 +00:00
henrylhtsang	b90feeac86	[BE][cutlass backend] Fix subproc addmm tests (#160295 ) Differential Revision: [D79977421](https://our.internmc.facebook.com/intern/diff/D79977421/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160295 Approved by: https://github.com/jingsh	2025-08-12 01:41:06 +00:00
Han, Xu	0d40ff3b49	[inductor] fix test_different_file_paths_local_pgo on Windows. (#160382 ) fix test_different_file_paths_local_pgo on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160382 Approved by: https://github.com/angelayi	2025-08-12 01:35:39 +00:00
Scott Todd	cae2b5e3d2	[ROCm][Windows] Enable USE_ROCM, disable USE_RCCL on Windows. (#159079 ) This allows setting `USE_ROCM` on Windows. A few other patches are still required to build (see https://github.com/ROCm/TheRock/issues/589), but we have instructions using open source code and rocm python packages available at https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#build-pytorch-with-rocm-support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159079 Approved by: https://github.com/jeffdaily	2025-08-12 01:28:20 +00:00
Scott Todd	ee89cc7a0a	[ROCm][Windows] Fix LoadHIP handling of environment variable paths on Windows. (#159080 ) See https://cmake.org/cmake/help/latest/command/file.html#path-conversion. Paths stored in environment variables may use `/` or `\` (e.g. on Windows), while cmake-style paths always use `/`. This fixes configure errors like: ``` CMake Error at D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2 (set): Syntax error in cmake code at D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2 when parsing string D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel/cmake/;D:/b/pytorch_main/cmake/Modules Invalid character escape '\p'. CMake Error at D:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/cmake/data/share/cmake-3.31/Modules/Internal/CheckSourceCompiles.cmake:108 (try_compile): Failed to configure test project build system. ``` (note the mixed usage of `\` and `/` in that string) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159080 Approved by: https://github.com/jeffdaily	2025-08-12 00:18:19 +00:00
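The path problem in #159080 is that a backslash inside a CMake string is read as an escape (`\p` in the error above). The PR fixes this in CMake code; the Python sketch below only illustrates the normalization itself, with `to_cmake_path` and `to_cmake_path_list` as hypothetical helper names.

```python
from pathlib import PureWindowsPath


def to_cmake_path(native: str) -> str:
    """Rewrite a Windows-style path so it uses only forward slashes."""
    # PureWindowsPath accepts both separators and never touches the filesystem.
    return PureWindowsPath(native).as_posix()


def to_cmake_path_list(native_list: str, sep: str = ";") -> str:
    """Normalize every entry of a separator-delimited path list."""
    return sep.join(to_cmake_path(p) for p in native_list.split(sep) if p)
```

Note that `PureWindowsPath` also drops a trailing separator, so an entry that mixes `\` and `/` and ends in `/` comes out with forward slashes only and no trailing slash.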
Howard Huang	e63c2b21c1	[PP] Initialize P2P communicators on first step (#160210 ) Was hitting hangs in multi-node settings and initializing the NCCL communicators needed for batch p2p ops ahead of time fixes this. This change adds extra communication since it communicates a dummy tensor to next and previous stage ranks. However, this is only paid on the first step so it is negligible. Debug history: https://docs.google.com/document/d/1EKVJYmW2hj_VsvDvnSggXhZzJyvMu9dA0iDJWOZAtjY/edit?tab=t.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160210 Approved by: https://github.com/wconstab	2025-08-11 23:46:58 +00:00
drisspg	3626ba711b	[FlexAttention] Swap from and to & for new triton (#160227 ) Fixes #158463 On B200 I am getting a bunch of error spew: ```Shell /tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline /tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` Triton compilation failed: triton_tem_fused_zeros_1 def triton_tem_fused_zeros_1(arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0): PRESCALE_QK : tl.constexpr = False ``` ```Shell 74 = arith.subi %170, %166 : i32 %175 = arith.muli %174, %c128_i32 : i32 %176 = arith.subi %175, %c64_i32 : i32 %177 = arith.extui %173 : i1 to i32 %178 = arith.muli %176, %177 : i32 %179 = arith.subi %c1_i32, %177 : i32 %180 = arith.muli %179, %c64_i32 : i32 %181 = arith.addi %178, %180 : i32 %182 = arith.muli %181, %c64_i32 : i32 %183 = tt.splat %182 : i32 -> tensor<64x64xi32> %184 = tt.addptr %arg19, %183 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %185 = tt.addptr %arg20, %183 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %186 = tt.splat %181 : i32 -> tensor<64xi32> %187 = arith.addi %arg21, %186 : tensor<64xi32> scf.yield %163, %184, %185, %187 : tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32> } %114 = tt.expand_dims %113#3 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %115 = arith.cmpi slt, %114, %cst_7 : tensor<1x64xi32> %116 = tt.broadcast %115 : tensor<1x64xi1> -> tensor<64x64xi1> %117 = tt.load %113#1, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>> %118 = tt.dot %46, %117, %cst_9, inputPrecision = tf32 : 
tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %119 = arith.mulf %118, %cst_13 : tensor<64x64xf32> %120 = arith.mulf %119, %cst_3 : tensor<64x64xf32> %121 = arith.select %116, %120, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %122 = arith.select %115, %cst_4, %cst_5 : tensor<1x64xi1>, tensor<1x64xi1> %123 = tt.broadcast %122 : tensor<1x64xi1> -> tensor<64x64xi1> %124 = arith.select %123, %121, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %125 = arith.mulf %124, %cst_2 : tensor<64x64xf32> %126 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32> %127 = arith.subf %125, %126 : tensor<64x64xf32> %128 = math.exp2 %127 : tensor<64x64xf32> %129 = tt.load %113#2, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>> %130 = tt.dot %51, %129, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %131 = tt.expand_dims %55 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> %132 = tt.broadcast %131 : tensor<64x1xf32> -> tensor<64x64xf32> %133 = arith.subf %130, %132 : tensor<64x64xf32> %134 = arith.mulf %128, %133 : tensor<64x64xf32> %135 = arith.mulf %134, %cst_3 : tensor<64x64xf32> %136 = arith.select %116, %135, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %137 = arith.select %115, %122, %cst_5 : tensor<1x64xi1>, tensor<1x64xi1> %138 = tt.broadcast %137 : tensor<1x64xi1> -> tensor<64x64xi1> %139 = arith.select %138, %136, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %140 = arith.truncf %139 : tensor<64x64xf32> to tensor<64x64xf16> %141 = tt.trans %117 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %142 = tt.dot %140, %141, %113#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> scf.yield %142 : tensor<64x64xf32> } else { scf.yield %cst_9 : tensor<64x64xf32> } %84 = tt.addptr %arg13, %22 : !tt.ptr<i32>, i32 %85 = tt.load %84 : !tt.ptr<i32> %86 = arith.muli %85, %c128_i32 : i32 %87 = tt.addptr %arg12, %21 : !tt.ptr<i32>, i32 %88 = tt.load %87 : !tt.ptr<i32> %89 = tt.splat 
%86 : i32 -> tensor<64xi32> %90 = arith.addi %89, %14 : tensor<64xi32> %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %92 = arith.muli %91, %cst_11 : tensor<1x64xi32> %93 = tt.addptr %71, %92 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32> %94 = tt.broadcast %93 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %95 = tt.addptr %94, %74 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %96 = tt.addptr %76, %92 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32> %97 = tt.broadcast %96 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %98 = tt.addptr %97, %74 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %99 = arith.muli %88, %c2_i32 : i32 %100 = arith.minsi %99, %c4_i32 : i32 %101 = arith.cmpi sge, %100, %c1_i32 : i32 %102 = scf.if %101 -> (tensor<64x64xf32>) { %112 = arith.subi %100, %c1_i32 : i32 %113:4 = scf.for %arg17 = %c0_i32 to %112 step %c1_i32 iter_args(%arg18 = %83, %arg19 = %95, %arg20 = %98, %arg21 = %90) -> (tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>) : i32 { %137 = tt.expand_dims %arg21 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %138 = arith.cmpi slt, %137, %cst_7 : tensor<1x64xi32> %139 = tt.broadcast %138 : tensor<1x64xi1> -> tensor<64x64xi1> %140 = tt.load %arg19, %139, %cst_8 : tensor<64x64x!tt.ptr<f16>> %141 = tt.dot %46, %140, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %142 = arith.mulf %141, %cst_13 : tensor<64x64xf32> %143 = arith.mulf %142, %cst_3 : tensor<64x64xf32> %144 = arith.mulf %143, %cst_2 : tensor<64x64xf32> %145 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32> %146 = arith.subf %144, %145 : tensor<64x64xf32> %147 = math.exp2 %146 : tensor<64x64xf32> %148 = tt.load %arg20, %139, %cst_8 : tensor<64x64x!tt.ptr<f16>> %149 = tt.dot %51, %148, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %150 = tt.expand_dims %55 {axis = 1 : i32} : 
tensor<64xf32> -> tensor<64x1xf32> %151 = tt.broadcast %150 : tensor<64x1xf32> -> tensor<64x64xf32> %152 = arith.subf %149, %151 : tensor<64x64xf32> %153 = arith.mulf %147, %152 : tensor<64x64xf32> %154 = arith.mulf %153, %cst_3 : tensor<64x64xf32> %155 = arith.truncf %154 : tensor<64x64xf32> to tensor<64x64xf16> %156 = tt.trans %140 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %157 = tt.dot %155, %156, %arg18, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %158 = arith.divsi %arg17, %c2_i32 : i32 %159 = tt.addptr %84, %158 : !tt.ptr<i32>, i32 %160 = tt.load %159 evictionPolicy = evict_last : !tt.ptr<i32> %161 = arith.addi %158, %c1_i32 : i32 %162 = arith.cmpi slt, %161, %88 : i32 %163 = tt.addptr %159, %c1_i32 : !tt.ptr<i32>, i32 %164 = tt.load %163, %162 evictionPolicy = evict_last : !tt.ptr<i32> %165 = arith.addi %arg17, %c1_i32 : i32 %166 = arith.remsi %165, %c2_i32 : i32 %167 = arith.cmpi eq, %166, %c0_i32 : i32 %168 = arith.subi %164, %160 : i32 %169 = arith.muli %168, %c128_i32 : i32 %170 = arith.subi %169, %c64_i32 : i32 %171 = arith.extui %167 : i1 to i32 %172 = arith.muli %170, %171 : i32 %173 = arith.subi %c1_i32, %171 : i32 %174 = arith.muli %173, %c64_i32 : i32 %175 = arith.addi %172, %174 : i32 %176 = arith.muli %175, %c64_i32 : i32 %177 = tt.splat %176 : i32 -> tensor<64x64xi32> %178 = tt.addptr %arg19, %177 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %179 = tt.addptr %arg20, %177 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %180 = tt.splat %175 : i32 -> tensor<64xi32> %181 = arith.addi %arg21, %180 : tensor<64xi32> scf.yield %157, %178, %179, %181 : tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32> } %114 = tt.expand_dims %113#3 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %115 = arith.cmpi slt, %114, %cst_7 : tensor<1x64xi32> %116 = tt.broadcast %115 : tensor<1x64xi1> -> tensor<64x64xi1> %117 = tt.load %113#1, %116, %cst_8 : 
tensor<64x64x!tt.ptr<f16>> %118 = tt.dot %46, %117, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %119 = arith.mulf %118, %cst_13 : tensor<64x64xf32> %120 = arith.mulf %119, %cst_3 : tensor<64x64xf32> %121 = arith.select %116, %120, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %122 = arith.mulf %121, %cst_2 : tensor<64x64xf32> %123 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32> %124 = arith.subf %122, %123 : tensor<64x64xf32> %125 = math.exp2 %124 : tensor<64x64xf32> %126 = tt.load %113#2, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>> %127 = tt.dot %51, %126, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %128 = tt.expand_dims %55 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> %129 = tt.broadcast %128 : tensor<64x1xf32> -> tensor<64x64xf32> %130 = arith.subf %127, %129 : tensor<64x64xf32> %131 = arith.mulf %125, %130 : tensor<64x64xf32> %132 = arith.mulf %131, %cst_3 : tensor<64x64xf32> %133 = arith.select %116, %132, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %134 = arith.truncf %133 : tensor<64x64xf32> to tensor<64x64xf16> %135 = tt.trans %117 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %136 = tt.dot %134, %135, %113#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> scf.yield %136 : tensor<64x64xf32> } else { scf.yield %83 : tensor<64x64xf32> } %103 = tt.splat %33 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>> %104 = tt.addptr %103, %37 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %105 = tt.broadcast %104 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %106 = tt.addptr %105, %42 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %107 = arith.mulf %102, %cst_13 : tensor<64x64xf32> %108 = arith.cmpi slt, %40, %cst_11 : tensor<1x64xi32> %109 = tt.broadcast %108 : tensor<1x64xi1> -> tensor<64x64xi1> %110 = arith.andi %45, %109 : tensor<64x64xi1> %111 = arith.truncf %107 : tensor<64x64xf32> to 
tensor<64x64xf16> tt.store %106, %111, %110 : tensor<64x64x!tt.ptr<f16>> } else { %16 = arith.divsi %0, %c2_i32 : i32 %17 = arith.muli %0, %c64_i32 : i32 %18 = tt.splat %17 : i32 -> tensor<64xi32> %19 = arith.addi %18, %14 : tensor<64xi32> %20 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %21 = arith.muli %20, %cst_14 : tensor<64x1xi32> %22 = tt.splat %11 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>> %23 = tt.addptr %22, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %24 = tt.expand_dims %14 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %25 = tt.broadcast %23 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %26 = tt.broadcast %24 : tensor<1x64xi32> -> tensor<64x64xi32> %27 = tt.addptr %25, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %28 = arith.cmpi slt, %20, %cst_10 : tensor<64x1xi32> %29 = tt.broadcast %28 : tensor<64x1xi1> -> tensor<64x64xi1> %30 = tt.load %27, %29, %cst_8 : tensor<64x64x!tt.ptr<f16>> %31 = tt.splat %12 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>> %32 = tt.addptr %31, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %33 = tt.broadcast %32 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %34 = tt.addptr %33, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %35 = tt.load %34, %29, %cst_8 : tensor<64x64x!tt.ptr<f16>> %36:2 = scf.for %arg17 = %c0_i32 to %c4_i32 step %c1_i32 iter_args(%arg18 = %cst_9, %arg19 = %cst_9) -> (tensor<64x64xf32>, tensor<64x64xf32>) : i32 { %55 = arith.muli %2, %c4_i32 : i32 %56 = arith.addi %55, %arg17 : i32 %57 = arith.muli %56, %c2048_i32 : i32 %58 = arith.muli %1, %c32768_i32 : i32 %59 = arith.addi %57, %58 : i32 %60 = arith.extsi %59 : i32 to i64 %61 = arith.muli %1, %c16_i32 : i32 %62 = arith.addi %61, %56 : i32 %63 = arith.muli %62, %c32_i32 : i32 %64 = arith.extsi %63 : i32 to i64 %65 = tt.addptr %arg0, %60 : !tt.ptr<f16>, i64 %66 = tt.addptr %arg5, %60 : !tt.ptr<f16>, i64 %67 = tt.addptr %arg3, %64 : !tt.ptr<f32>, i64 %68 = tt.addptr %arg4, %64 : 
!tt.ptr<f32>, i64 %69 = arith.remsi %56, %c16_i32 : i32 %70 = arith.muli %3, %c16_i32 : i32 %71 = arith.addi %70, %69 : i32 %72 = arith.muli %71, %c2_i32 : i32 %73 = arith.addi %72, %16 : i32 %74 = tt.addptr %arg11, %73 : !tt.ptr<i32>, i32 %75 = tt.load %74 : !tt.ptr<i32> %76 = arith.muli %75, %c128_i32 : i32 %77 = tt.addptr %arg10, %73 : !tt.ptr<i32>, i32 %78 = tt.load %77 : !tt.ptr<i32> %79 = tt.splat %76 : i32 -> tensor<64xi32> %80 = arith.addi %79, %14 : tensor<64xi32> %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %82 = arith.muli %81, %cst_11 : tensor<1x64xi32> %83 = tt.splat %65 : !tt.ptr<f16> -> tensor<1x64x!tt.ptr<f16>> %84 = tt.addptr %83, %82 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32> %85 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %86 = tt.broadcast %84 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %87 = tt.broadcast %85 : tensor<64x1xi32> -> tensor<64x64xi32> %88 = tt.addptr %86, %87 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %89 = tt.expand_dims %80 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %90 = arith.muli %89, %cst_14 : tensor<64x1xi32> %91 = tt.splat %66 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>> %92 = tt.addptr %91, %90 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %93 = tt.broadcast %92 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %94 = tt.addptr %93, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %95 = arith.muli %78, %c2_i32 : i32 %96 = arith.minsi %95, %c1_i32 : i32 %97 = arith.cmpi sge, %96, %c1_i32 : i32 %98:2 = scf.if %97 -> (tensor<64x64xf32>, tensor<64x64xf32>) { %120 = arith.subi %96, %c1_i32 : i32 %121:5 = scf.for %arg20 = %c0_i32 to %120 step %c1_i32 iter_args(%arg21 = %arg18, %arg22 = %arg19, %arg23 = %88, %arg24 = %94, %arg25 = %80) -> (tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>) : i32 { %167 = tt.expand_dims %arg25 {axis = 0 : i32} : tensor<64xi32> -> 
tensor<1x64xi32> %168 = arith.cmpi slt, %167, %cst_1 : tensor<1x64xi32> %169 = tt.broadcast %168 : tensor<1x64xi1> -> tensor<64x64xi1> %170 = tt.load %arg23, %169, %cst_8 : tensor<64x64x!tt.ptr<f16>> %171 = arith.cmpi slt, %arg25, %cst_17 : tensor<64xi32> %172 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %173 = tt.addptr %172, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %174 = tt.load %173, %171 : tensor<64x!tt.ptr<f32>> %175 = arith.cmpf oeq, %174, %cst_16 : tensor<64xf32> %176 = arith.select %175, %cst_15, %174 : tensor<64xi1>, tensor<64xf32> %177 = tt.dot %30, %170, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %178 = arith.mulf %177, %cst_13 : tensor<64x64xf32> %179 = arith.mulf %178, %cst_3 : tensor<64x64xf32> %180 = arith.mulf %179, %cst_2 : tensor<64x64xf32> %181 = tt.expand_dims %176 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %182 = tt.broadcast %181 : tensor<1x64xf32> -> tensor<64x64xf32> %183 = arith.subf %180, %182 : tensor<64x64xf32> %184 = math.exp2 %183 : tensor<64x64xf32> %185 = tt.expand_dims %arg25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %186 = arith.cmpi slt, %185, %cst_12 : tensor<64x1xi32> %187 = tt.broadcast %186 : tensor<64x1xi1> -> tensor<64x64xi1> %188 = tt.load %arg24, %187, %cst_8 : tensor<64x64x!tt.ptr<f16>> %189 = arith.truncf %184 : tensor<64x64xf32> to tensor<64x64xf16> %190 = tt.dot %189, %188, %arg22, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %191 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %192 = tt.addptr %191, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %193 = tt.load %192, %171 : tensor<64x!tt.ptr<f32>> %194 = tt.trans %188 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %195 = tt.dot %35, %194, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %196 = tt.expand_dims %193 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> 
%197 = tt.broadcast %196 : tensor<1x64xf32> -> tensor<64x64xf32> %198 = arith.subf %195, %197 : tensor<64x64xf32> %199 = arith.mulf %184, %198 : tensor<64x64xf32> %200 = arith.mulf %199, %cst_3 : tensor<64x64xf32> %201 = arith.truncf %200 : tensor<64x64xf32> to tensor<64x64xf16> %202 = tt.trans %170 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %203 = tt.dot %201, %202, %arg21, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %204 = arith.divsi %arg20, %c2_i32 : i32 %205 = tt.addptr %74, %204 : !tt.ptr<i32>, i32 %206 = tt.load %205 evictionPolicy = evict_last : !tt.ptr<i32> %207 = arith.addi %204, %c1_i32 : i32 %208 = arith.cmpi slt, %207, %78 : i32 %209 = tt.addptr %205, %c1_i32 : !tt.ptr<i32>, i32 %210 = tt.load %209, %208 evictionPolicy = evict_last : !tt.ptr<i32> %211 = arith.addi %arg20, %c1_i32 : i32 %212 = arith.remsi %211, %c2_i32 : i32 %213 = arith.cmpi eq, %212, %c0_i32 : i32 %214 = arith.subi %210, %206 : i32 %215 = arith.muli %214, %c128_i32 : i32 %216 = arith.subi %215, %c64_i32 : i32 %217 = arith.extui %213 : i1 to i32 %218 = arith.muli %216, %217 : i32 %219 = arith.subi %c1_i32, %217 : i32 %220 = arith.muli %219, %c64_i32 : i32 %221 = arith.addi %218, %220 : i32 %222 = arith.muli %221, %c64_i32 : i32 %223 = tt.splat %222 : i32 -> tensor<64x64xi32> %224 = tt.addptr %arg23, %223 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %225 = tt.addptr %arg24, %223 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %226 = tt.splat %221 : i32 -> tensor<64xi32> %227 = arith.addi %arg25, %226 : tensor<64xi32> scf.yield %203, %190, %224, %225, %227 : tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32> } %122 = tt.expand_dims %121#4 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %123 = arith.cmpi slt, %122, %cst_1 : tensor<1x64xi32> %124 = tt.broadcast %123 : tensor<1x64xi1> -> tensor<64x64xi1> %125 = tt.load %121#2, %124, %cst_8 : 
tensor<64x64x!tt.ptr<f16>> %126 = arith.cmpi slt, %121#4, %cst_17 : tensor<64xi32> %127 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %128 = tt.addptr %127, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %129 = tt.load %128, %126 : tensor<64x!tt.ptr<f32>> %130 = arith.cmpf oeq, %129, %cst_16 : tensor<64xf32> %131 = arith.select %130, %cst_15, %129 : tensor<64xi1>, tensor<64xf32> %132 = tt.dot %30, %125, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %133 = arith.mulf %132, %cst_13 : tensor<64x64xf32> %134 = arith.mulf %133, %cst_3 : tensor<64x64xf32> %135 = arith.select %29, %134, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %136 = arith.select %28, %cst, %cst_0 : tensor<64x1xi1>, tensor<64x1xi1> %137 = tt.broadcast %136 : tensor<64x1xi1> -> tensor<64x64xi1> %138 = arith.select %137, %135, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %139 = arith.mulf %138, %cst_2 : tensor<64x64xf32> %140 = tt.expand_dims %131 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %141 = tt.broadcast %140 : tensor<1x64xf32> -> tensor<64x64xf32> %142 = arith.subf %139, %141 : tensor<64x64xf32> %143 = math.exp2 %142 : tensor<64x64xf32> %144 = tt.expand_dims %121#4 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %145 = arith.cmpi slt, %144, %cst_12 : tensor<64x1xi32> %146 = tt.broadcast %145 : tensor<64x1xi1> -> tensor<64x64xi1> %147 = tt.load %121#3, %146, %cst_8 : tensor<64x64x!tt.ptr<f16>> %148 = arith.truncf %143 : tensor<64x64xf32> to tensor<64x64xf16> %149 = tt.dot %148, %147, %121#1, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %150 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %151 = tt.addptr %150, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %152 = tt.load %151, %126 : tensor<64x!tt.ptr<f32>> %153 = tt.trans %147 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %154 = tt.dot %35, %153, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * 
tensor<64x64xf16> -> tensor<64x64xf32> %155 = tt.expand_dims %152 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %156 = tt.broadcast %155 : tensor<1x64xf32> -> tensor<64x64xf32> %157 = arith.subf %154, %156 : tensor<64x64xf32> %158 = arith.mulf %143, %157 : tensor<64x64xf32> %159 = arith.mulf %158, %cst_3 : tensor<64x64xf32> %160 = arith.select %29, %159, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %161 = arith.select %28, %136, %cst_0 : tensor<64x1xi1>, tensor<64x1xi1> %162 = tt.broadcast %161 : tensor<64x1xi1> -> tensor<64x64xi1> %163 = arith.select %162, %160, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %164 = arith.truncf %163 : tensor<64x64xf32> to tensor<64x64xf16> %165 = tt.trans %125 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %166 = tt.dot %164, %165, %121#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> scf.yield %166, %149 : tensor<64x64xf32>, tensor<64x64xf32> } else { scf.yield %arg18, %arg19 : tensor<64x64xf32>, tensor<64x64xf32> } %99 = tt.addptr %arg15, %73 : !tt.ptr<i32>, i32 %100 = tt.load %99 : !tt.ptr<i32> %101 = arith.muli %100, %c128_i32 : i32 %102 = tt.addptr %arg14, %73 : !tt.ptr<i32>, i32 %103 = tt.load %102 : !tt.ptr<i32> %104 = tt.splat %101 : i32 -> tensor<64xi32> %105 = arith.addi %104, %14 : tensor<64xi32> %106 = tt.expand_dims %105 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %107 = arith.muli %106, %cst_11 : tensor<1x64xi32> %108 = tt.addptr %83, %107 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32> %109 = tt.broadcast %108 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %110 = tt.addptr %109, %87 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %111 = tt.expand_dims %105 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %112 = arith.muli %111, %cst_14 : tensor<64x1xi32> %113 = tt.addptr %91, %112 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %114 = tt.broadcast %113 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %115 = tt.addptr %114, 
%26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %116 = arith.muli %103, %c2_i32 : i32 %117 = arith.minsi %116, %c1_i32 : i32 %118 = arith.cmpi sge, %117, %c1_i32 : i32 %119:2 = scf.if %118 -> (tensor<64x64xf32>, tensor<64x64xf32>) { %120 = arith.subi %117, %c1_i32 : i32 %121:5 = scf.for %arg20 = %c0_i32 to %120 step %c1_i32 iter_args(%arg21 = %98#0, %arg22 = %98#1, %arg23 = %110, %arg24 = %115, %arg25 = %105) -> (tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>) : i32 { %161 = tt.expand_dims %arg25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %162 = arith.cmpi slt, %161, %cst_1 : tensor<1x64xi32> %163 = tt.broadcast %162 : tensor<1x64xi1> -> tensor<64x64xi1> %164 = tt.load %arg23, %163, %cst_8 : tensor<64x64x!tt.ptr<f16>> %165 = arith.cmpi slt, %arg25, %cst_17 : tensor<64xi32> %166 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %167 = tt.addptr %166, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %168 = tt.load %167, %165 : tensor<64x!tt.ptr<f32>> %169 = arith.cmpf oeq, %168, %cst_16 : tensor<64xf32> %170 = arith.select %169, %cst_15, %168 : tensor<64xi1>, tensor<64xf32> %171 = tt.dot %30, %164, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %172 = arith.mulf %171, %cst_13 : tensor<64x64xf32> %173 = arith.mulf %172, %cst_3 : tensor<64x64xf32> %174 = arith.mulf %173, %cst_2 : tensor<64x64xf32> %175 = tt.expand_dims %170 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %176 = tt.broadcast %175 : tensor<1x64xf32> -> tensor<64x64xf32> %177 = arith.subf %174, %176 : tensor<64x64xf32> %178 = math.exp2 %177 : tensor<64x64xf32> %179 = tt.expand_dims %arg25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %180 = arith.cmpi slt, %179, %cst_12 : tensor<64x1xi32> %181 = tt.broadcast %180 : tensor<64x1xi1> -> tensor<64x64xi1> %182 = tt.load %arg24, %181, %cst_8 : tensor<64x64x!tt.ptr<f16>> %183 = arith.truncf %178 : tensor<64x64xf32> to 
tensor<64x64xf16> %184 = tt.dot %183, %182, %arg22, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %185 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %186 = tt.addptr %185, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %187 = tt.load %186, %165 : tensor<64x!tt.ptr<f32>> %188 = tt.trans %182 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %189 = tt.dot %35, %188, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %190 = tt.expand_dims %187 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %191 = tt.broadcast %190 : tensor<1x64xf32> -> tensor<64x64xf32> %192 = arith.subf %189, %191 : tensor<64x64xf32> %193 = arith.mulf %178, %192 : tensor<64x64xf32> %194 = arith.mulf %193, %cst_3 : tensor<64x64xf32> %195 = arith.truncf %194 : tensor<64x64xf32> to tensor<64x64xf16> %196 = tt.trans %164 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %197 = tt.dot %195, %196, %arg21, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %198 = arith.divsi %arg20, %c2_i32 : i32 %199 = tt.addptr %99, %198 : !tt.ptr<i32>, i32 %200 = tt.load %199 evictionPolicy = evict_last : !tt.ptr<i32> %201 = arith.addi %198, %c1_i32 : i32 %202 = arith.cmpi slt, %201, %103 : i32 %203 = tt.addptr %199, %c1_i32 : !tt.ptr<i32>, i32 %204 = tt.load %203, %202 evictionPolicy = evict_last : !tt.ptr<i32> %205 = arith.addi %arg20, %c1_i32 : i32 %206 = arith.remsi %205, %c2_i32 : i32 %207 = arith.cmpi eq, %206, %c0_i32 : i32 %208 = arith.subi %204, %200 : i32 %209 = arith.muli %208, %c128_i32 : i32 %210 = arith.subi %209, %c64_i32 : i32 %211 = arith.extui %207 : i1 to i32 %212 = arith.muli %210, %211 : i32 %213 = arith.subi %c1_i32, %211 : i32 %214 = arith.muli %213, %c64_i32 : i32 %215 = arith.addi %212, %214 : i32 %216 = arith.muli %215, %c64_i32 : i32 %217 = tt.splat %216 : i32 -> tensor<64x64xi32> %218 = tt.addptr %arg23, %217 : 
tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %219 = tt.addptr %arg24, %217 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %220 = tt.splat %215 : i32 -> tensor<64xi32> %221 = arith.addi %arg25, %220 : tensor<64xi32> scf.yield %197, %184, %218, %219, %221 : tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32> } %122 = tt.expand_dims %121#4 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %123 = arith.cmpi slt, %122, %cst_1 : tensor<1x64xi32> %124 = tt.broadcast %123 : tensor<1x64xi1> -> tensor<64x64xi1> %125 = tt.load %121#2, %124, %cst_8 : tensor<64x64x!tt.ptr<f16>> %126 = arith.cmpi slt, %121#4, %cst_17 : tensor<64xi32> %127 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %128 = tt.addptr %127, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %129 = tt.load %128, %126 : tensor<64x!tt.ptr<f32>> %130 = arith.cmpf oeq, %129, %cst_16 : tensor<64xf32> %131 = arith.select %130, %cst_15, %129 : tensor<64xi1>, tensor<64xf32> %132 = tt.dot %30, %125, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %133 = arith.mulf %132, %cst_13 : tensor<64x64xf32> %134 = arith.mulf %133, %cst_3 : tensor<64x64xf32> %135 = arith.select %29, %134, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %136 = arith.mulf %135, %cst_2 : tensor<64x64xf32> %137 = tt.expand_dims %131 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %138 = tt.broadcast %137 : tensor<1x64xf32> -> tensor<64x64xf32> %139 = arith.subf %136, %138 : tensor<64x64xf32> %140 = math.exp2 %139 : tensor<64x64xf32> %141 = tt.expand_dims %121#4 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %142 = arith.cmpi slt, %141, %cst_12 : tensor<64x1xi32> %143 = tt.broadcast %142 : tensor<64x1xi1> -> tensor<64x64xi1> %144 = tt.load %121#3, %143, %cst_8 : tensor<64x64x!tt.ptr<f16>> %145 = arith.truncf %140 : tensor<64x64xf32> to tensor<64x64xf16> %146 = tt.dot %145, %144, %121#1, inputPrecision = tf32 : tensor<64x64xf16> * 
tensor<64x64xf16> -> tensor<64x64xf32> %147 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %148 = tt.addptr %147, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %149 = tt.load %148, %126 : tensor<64x!tt.ptr<f32>> %150 = tt.trans %144 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %151 = tt.dot %35, %150, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %152 = tt.expand_dims %149 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %153 = tt.broadcast %152 : tensor<1x64xf32> -> tensor<64x64xf32> %154 = arith.subf %151, %153 : tensor<64x64xf32> %155 = arith.mulf %140, %154 : tensor<64x64xf32> %156 = arith.mulf %155, %cst_3 : tensor<64x64xf32> %157 = arith.select %29, %156, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %158 = arith.truncf %157 : tensor<64x64xf32> to tensor<64x64xf16> %159 = tt.trans %125 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %160 = tt.dot %158, %159, %121#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> scf.yield %160, %146 : tensor<64x64xf32>, tensor<64x64xf32> } else { scf.yield %98#0, %98#1 : tensor<64x64xf32>, tensor<64x64xf32> } scf.yield %119#0, %119#1 : tensor<64x64xf32>, tensor<64x64xf32> } %37 = tt.splat %13 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>> %38 = tt.addptr %37, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %39 = tt.broadcast %38 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %40 = tt.addptr %39, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %41 = arith.cmpi slt, %24, %cst_11 : tensor<1x64xi32> %42 = tt.broadcast %41 : tensor<1x64xi1> -> tensor<64x64xi1> %43 = arith.andi %29, %42 : tensor<64x64xi1> %44 = arith.truncf %36#1 : tensor<64x64xf32> to tensor<64x64xf16> tt.store %40, %44, %43 : tensor<64x64x!tt.ptr<f16>> %45 = arith.mulf %36#0, %cst_13 : tensor<64x64xf32> %46 = tt.broadcast %21 : tensor<64x1xi32> -> tensor<64x64xi32> %47 = arith.addi %26, %46 : tensor<64x64xi32> 
%48 = tt.splat %4 : i32 -> tensor<64x64xi32> %49 = arith.addi %47, %48 : tensor<64x64xi32> %50 = tt.splat %8 : i32 -> tensor<64x64xi32> %51 = arith.addi %49, %50 : tensor<64x64xi32> %52 = tt.splat %arg16 : !tt.ptr<f16> -> tensor<64x64x!tt.ptr<f16>> %53 = tt.addptr %52, %51 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %54 = arith.truncf %45 : tensor<64x64xf32> to tensor<64x64xf16> tt.store %53, %54, %29 : tensor<64x64x!tt.ptr<f16>> } tt.return } } {-# external_resources: { mlir_reproducer: { pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, 
triton-nvidia-gpu-fence-insertion{compute-capability=90}, sccp, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", disable_threading: false, verify_each: true } } #-} /tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline /tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` Triton compilation failed: triton_tem_fused_zeros_1 def triton_tem_fused_zeros_1(arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0): PRESCALE_QK : tl.constexpr = False ROWS_GUARANTEED_SAFE : tl.constexpr = False BLOCKS_ARE_CONTIGUOUS : tl.constexpr = False WRITE_DQ : tl.constexpr = True OUTPUT_LOGSUMEXP : tl.constexpr = True FLOAT32_PRECISION : tl.constexpr = 'tf32' IS_DIVISIBLE : tl.constexpr = False SM_SCALE : tl.constexpr = 0.125 GQA_SHARED_HEADS : tl.constexpr = 4 HAS_FULL_BLOCKS : tl.constexpr = True QK_HEAD_DIM : tl.constexpr = 64 QK_HEAD_DIM_ROUNDED : tl.constexpr = 64 V_HEAD_DIM : tl.constexpr = 64 V_HEAD_DIM_ROUNDED : tl.constexpr = 64 SAFE_HEAD_DIM : tl.constexpr = True BLOCK_M1 : tl.constexpr = 64 BLOCK_N1 : tl.constexpr = 64 BLOCK_M2 : tl.constexpr = 64 BLOCK_N2 : tl.constexpr = 64 SPARSE_Q_BLOCK_SIZE : tl.constexpr = 128 SPARSE_KV_BLOCK_SIZE : tl.constexpr = 128 Q = arg_Q K = arg_K V = arg_V LSE = arg_LSE DELTA = arg_DELTA DO = arg_DO DQ = arg_DQ DV = arg_DV KV_NUM_BLKS = arg_KV_NUM_BLKS KV_IDX = arg_KV_IDX Q_NUM_BLKS = arg_Q_NUM_BLKS Q_IDX = arg_Q_IDX FULL_KV_NUM_BLKS = arg_FULL_KV_NUM_BLKS FULL_KV_IDX = arg_FULL_KV_IDX FULL_Q_NUM_BLKS = 
arg_FULL_Q_NUM_BLKS FULL_Q_IDX = arg_FULL_Q_IDX # Sub notation for this kernel: # # Q: Query, K: Key, V: Value # LSE: logsumexp (logsumexp is always stored in fp32 regardless of the input dtype) # DELTA: Precomputed sum(OUT*DO, axis=-1) # DO: Derivative of Output, DQ: Derivative of Query, DV: Derivative of Value # DK: Derivative of Key, is the written to via the store_output call due to some limitations with # inductor codegen # M: Number of queries, N: Number of keys/values # QK_HEAD_DIM: The dimension of the query and key embeddings # V_HEAD_DIM: The dimension of the value embeddings # z: Batch size, h: Number of heads, m: Number of queries or keys/values, d: Head dim # GQA_SHARED_HEADS: number of query heads sharing one kv head in GQA setups. # (Modifiable) Performance tuning options # BLOCK_M1: when calculating DK & DV, iterate over BLOCK_M1 across the seqlen dim of Q in each thread block. # BLOCK_N1: when calculating DK & DV, the thread block size across the seqlen dim of K/V. # BLOCK_M2: when calculating DQ, the thread block size across the seqlen dim of Q. # BLOCK_N2: when calculating DQ, iterate over BLOCK_N2 across the seqlen dim of K/V in each thread block. # # The following FULL_* and PARTIAL_* is defined in the block sparse mask grid, rather than the thread block grid. # KV_NUM_BLKS: The number of KV blocks (that may or may not require masking) for each query. # KV_IDX: The indices of KV blocks (that may or may not require masking) for each query. # Q_NUM_BLKS: The number of Q blocks (that may or may not require masking) for each query. # Q_IDX: The indices of Q blocks (that may or may not require masking) for each query. # FULL_KV_NUM_BLKS: The number of fully unmasked KV blocks (so we don't need masking) for each query. # FULL_KV_IDX: The indices of fully unmasked KV blocks (so we don't need masking) for each query. # FULL_Q_NUM_BLKS: The number of fully unmasked Q blocks (so we don't need masking) for each query. 
# FULL_Q_IDX: The indices of fully unmasked Q blocks (so we don't need masking) for each query. # The below are kernel options that can be applied for certain score_mods, # or involve a numerics vs. perf tradeoff # PRESCALE_QK: Whether to pre-scale QK by 1/sqrt(d) and change of base. Has # about 20% more numerical error, but slightly faster. # Define strides of inputs stride_qz, stride_qh, stride_qm, stride_qd = 32768, 2048, 64, 1 stride_kz, stride_kh, stride_kn, stride_kd = 65536, 16384, 64, 1 stride_vz, stride_vh, stride_vn, stride_vd = 65536, 16384, 64, 1 stride_doz, stride_doh, stride_dom, stride_dod = 32768, 2048, 64, 1 stride_dqz, stride_dqh, stride_dqm, stride_dqd = 32768, 2048, 64, 1 stride_dvz, stride_dvh, stride_dvm, stride_dvd = 65536, 16384, 64, 1 ZQ = 2 HQ = 16 HKV = 4 Q_LEN = 32 ZKV = 2 KV_LEN = 256 MATMUL_PRECISION = Q.dtype.element_ty pid = tl.program_id(0) NUM_KV_BLOCKS = tl.cdiv(KV_LEN, BLOCK_N1) NUM_Q_BLOCKS = tl.cdiv(Q_LEN, BLOCK_M2) off_zq = tl.program_id(1) # q batch idx off_hkv = tl.program_id(2) # kv head idx off_zkv = off_zq % ZKV # kv batch idx SPARSE_Z = 2 SPARSE_HQ = 16 sparse_idx_z = off_zq % SPARSE_Z k_adj = (stride_kh * off_hkv + stride_kz * off_zkv).to(tl.int64) v_adj = (stride_vh * off_hkv + stride_vz * off_zkv).to(tl.int64) # first compute broadcasted dv of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM] # then reduce to dv of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM] dv_adj = (stride_dvh * off_hkv + stride_dvz * off_zq).to(tl.int64) # offset K, V, DV pointers for batch/kv-head K += k_adj V += v_adj DV += dv_adj RCP_LN2 = 1.44269504 offs_k = tl.arange(0, QK_HEAD_DIM_ROUNDED) offs_v = tl.arange(0, V_HEAD_DIM_ROUNDED) if pid >= NUM_KV_BLOCKS: off_pid = pid - NUM_KV_BLOCKS # THIS BLOCK DOES DQ SPARSE_Q_MULTIPLE = (SPARSE_Q_BLOCK_SIZE // BLOCK_M2) SPARSE_KV_MULTIPLE = (SPARSE_KV_BLOCK_SIZE // BLOCK_N2) off_hq2 = off_pid // NUM_Q_BLOCKS + off_hkv * GQA_SHARED_HEADS start_m2_block = off_pid % NUM_Q_BLOCKS off_pid_mask = start_m2_block // 
SPARSE_Q_MULTIPLE stride_kv_num_blks_h = 1 stride_kv_idx_h = 2 stride_kv_idx_m = 2 sparse_idx_hq2 = off_hq2 % SPARSE_HQ sparse_hz_offset = sparse_idx_z * SPARSE_HQ + sparse_idx_hq2 sparse_kv_num_blks_offset = sparse_hz_offset * stride_kv_num_blks_h + off_pid_mask sparse_kv_idx_offset = sparse_hz_offset * stride_kv_idx_h + off_pid_mask * stride_kv_idx_m # noqa: B950 # Offset Q, DQ, DO, DELTA & LSE. These inputs are offsetted by query heads. q_adj2 = (stride_qh * off_hq2 + stride_qz * off_zq).to(tl.int64) do_adj2 = (stride_doh * off_hq2 + stride_doz * off_zq).to(tl.int64) dq_adj2 = (stride_dqh * off_hq2 + stride_dqz * off_zq).to(tl.int64) off_chz2 = ((off_zq * HQ + off_hq2) * Q_LEN).to(tl.int64) Q2 = Q + q_adj2 DO2 = DO + do_adj2 # TODO: This does not work if DQ is not the same layout as Q (for example, # if Q is broadcasted) DQ2 = DQ + dq_adj2 LSE2 = LSE + off_chz2 DELTA2 = DELTA + off_chz2 # dq = tl.zeros([BLOCK_M2, QK_HEAD_DIM], dtype=tl.float32) dq = tl.zeros([BLOCK_M2, QK_HEAD_DIM_ROUNDED], dtype=tl.float32) start_m2 = start_m2_block * BLOCK_M2 offs_m2 = start_m2 + tl.arange(0, BLOCK_M2) # load Q and do: they stay in SRAM throughout the inner loop. q = load_checked_2d(Q2, offs_m2, offs_k, stride_qm, stride_qd, IS_DIVISIBLE, SAFE_HEAD_DIM, Q_LEN, QK_HEAD_DIM) do = load_checked_2d(DO2, offs_m2, offs_v, stride_dom, stride_dod, IS_DIVISIBLE, SAFE_HEAD_DIM, Q_LEN, V_HEAD_DIM) if PRESCALE_QK: q = (q * SM_SCALE * RCP_LN2).to(MATMUL_PRECISION) if IS_DIVISIBLE: Di = tl.load(DELTA2 + offs_m2) lse = tl.load(LSE2 + offs_m2) else: Di = tl.load(DELTA2 + offs_m2, mask=offs_m2 < Q_LEN) lse = tl.load(LSE2 + offs_m2, mask=offs_m2 < Q_LEN) lse = tl.where(lse == -float("inf"), 0.0, lse) lse = lse[:, None] # ~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # KV_IDX and KV_NUM_BLKS are always contiguous. 
kv_indices = KV_IDX + sparse_kv_idx_offset kv_start = tl.load(kv_indices) * SPARSE_KV_BLOCK_SIZE # first kv block we're loading sparse_kv_num_blocks = tl.load(KV_NUM_BLKS + sparse_kv_num_blks_offset) offs_n2 = kv_start + tl.arange(0, BLOCK_N2) dq = bwd_dq_inner( arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0, K, V, dq, q, do, Di, lse, off_zq, off_hq2, offs_m2, offs_n2, stride_kn, stride_kd, stride_vn, stride_vd, kv_indices, sparse_kv_num_blocks, MATMUL_PRECISION, IS_FULL_BLOCKS=False, ) if HAS_FULL_BLOCKS: # ~~~~~~~~~~~ partial unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # FULL_KV_IDX and FULL_KV_NUM_BLKS are always contiguous. kv_indices = FULL_KV_IDX + sparse_kv_idx_offset kv_start = tl.load(kv_indices) * SPARSE_KV_BLOCK_SIZE # first kv block we're loading sparse_kv_num_blocks = tl.load(FULL_KV_NUM_BLKS + sparse_kv_num_blks_offset) offs_n2 = kv_start + tl.arange(0, BLOCK_N2) dq = bwd_dq_inner( arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0, K, V, dq, q, do, Di, lse, off_zq, off_hq2, offs_m2, offs_n2, stride_kn, stride_kd, stride_vn, stride_vd, kv_indices, sparse_kv_num_blocks, MATMUL_PRECISION, IS_FULL_BLOCKS=True, ) # Write back dQ. 
dq_ptrs = DQ2 + offs_m2[:, None] * stride_dqm + offs_k[None, :] * stride_dqd dq *= SM_SCALE if IS_DIVISIBLE and SAFE_HEAD_DIM: tl.store(dq_ptrs, dq) else: tl.store(dq_ptrs, dq, mask=(offs_m2[:, None] < Q_LEN) & (offs_k[None, :] < QK_HEAD_DIM)) else: # THIS BLOCK DOES DK & DV SPARSE_Q_MULTIPLE = (SPARSE_Q_BLOCK_SIZE // BLOCK_M1) SPARSE_KV_MULTIPLE = (SPARSE_KV_BLOCK_SIZE // BLOCK_N1) pid_mask = pid // SPARSE_KV_MULTIPLE stride_q_num_blks_h = 2 stride_q_idx_h = 2 stride_q_idx_n = 1 dv = tl.zeros([BLOCK_N1, V_HEAD_DIM_ROUNDED], dtype=tl.float32) dk = tl.zeros([BLOCK_N1, QK_HEAD_DIM_ROUNDED], dtype=tl.float32) start_n1 = pid * BLOCK_N1 offs_n1 = start_n1 + tl.arange(0, BLOCK_N1) # load K and V: they stay in SRAM throughout the inner loop. k = load_checked_2d(K, offs_n1, offs_k, stride_kn, stride_kd, IS_DIVISIBLE, SAFE_HEAD_DIM, KV_LEN, QK_HEAD_DIM) v = load_checked_2d(V, offs_n1, offs_v, stride_vn, stride_vd, IS_DIVISIBLE, SAFE_HEAD_DIM, KV_LEN, V_HEAD_DIM) if PRESCALE_QK: k = (k * SM_SCALE * RCP_LN2).to(MATMUL_PRECISION) for off_g in range(0, GQA_SHARED_HEADS): off_hq1 = off_hkv * GQA_SHARED_HEADS + off_g # Offset Q, DQ, DO, DELTA & LSE. These inputs are offsetted by query heads. 
q_adj1 = (stride_qh * off_hq1 + stride_qz * off_zq).to(tl.int64) do_adj1 = (stride_doh * off_hq1 + stride_doz * off_zq).to(tl.int64) dq_adj1 = (stride_dqh * off_hq1 + stride_dqz * off_zq).to(tl.int64) off_chz1 = ((off_zq * HQ + off_hq1) * Q_LEN).to(tl.int64) Q1 = Q + q_adj1 DO1 = DO + do_adj1 # TODO: This does not work if DQ is not the same layout as Q (for example, # if Q is broadcasted) LSE1 = LSE + off_chz1 DELTA1 = DELTA + off_chz1 sparse_idx_hq1 = off_hq1 % SPARSE_HQ sparse_hz_offset = sparse_idx_z * SPARSE_HQ + sparse_idx_hq1 sparse_q_num_blks_offset = sparse_hz_offset * stride_q_num_blks_h + pid_mask sparse_q_idx_offset = sparse_hz_offset * stride_q_idx_h + pid_mask * stride_q_idx_n # noqa: B950 # ~~~~~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Q_IDX and Q_NUM_BLKS are always contiguous. q_indices = Q_IDX + sparse_q_idx_offset q_start = tl.load(q_indices) * SPARSE_Q_BLOCK_SIZE # first q block we're loading sparse_q_num_blocks = tl.load(Q_NUM_BLKS + sparse_q_num_blks_offset) offs_m1 = q_start + tl.arange(0, BLOCK_M1) dk, dv = bwd_dkdv_inner( arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0, Q1, DO1, DELTA1, LSE1, dk, dv, k, v, off_zq, off_hq1, offs_n1, offs_m1, stride_qm, stride_qd, stride_dom, stride_dod, q_indices, sparse_q_num_blocks, MATMUL_PRECISION, IS_FULL_BLOCKS=False, ) if HAS_FULL_BLOCKS: # ~~~~~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # FULL_Q_IDX and FULL_Q_NUM_BLKS are always contiguous. 
q_indices = FULL_Q_IDX + sparse_q_idx_offset q_start = tl.load(q_indices) * SPARSE_Q_BLOCK_SIZE # first q block we're loading sparse_q_num_blocks = tl.load(FULL_Q_NUM_BLKS + sparse_q_num_blks_offset) offs_m1 = q_start + tl.arange(0, BLOCK_M1) dk, dv = bwd_dkdv_inner( arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0, Q1, DO1, DELTA1, LSE1, dk, dv, k, v, off_zq, off_hq1, offs_n1, offs_m1, stride_qm, stride_qd, stride_dom, stride_dod, q_indices, sparse_q_num_blocks, MATMUL_PRECISION, IS_FULL_BLOCKS=True, ) # Write back dV and dK. dv_ptrs = DV + offs_n1[:, None] * stride_dvm + offs_v[None, :] * stride_dvd index_n = offs_n1[:, None] index_k = offs_k[None, :] index_v = offs_v[None, :] if IS_DIVISIBLE and SAFE_HEAD_DIM: tl.store(dv_ptrs, dv) else: tl.store(dv_ptrs, dv, mask=(index_n < KV_LEN) & (index_v < V_HEAD_DIM)) dk = SM_SCALE if SAFE_HEAD_DIM: mask = index_n < KV_LEN else: mask = (index_n < KV_LEN) & (index_k < QK_HEAD_DIM) # first compute broadcasted dk of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM] # then reduce to dk of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM] xindex = index_k + 64index_n + 16384off_hkv + 65536off_zq tl.store(out_ptr0 + (tl.broadcast_to(xindex, dk.shape)), dk, mask) metadata: {'signature': {'arg_Q': 'fp16', 'arg_K': 'fp16', 'arg_V': 'fp16', 'arg_LSE': 'fp32', 'arg_DELTA': 'fp32', 'arg_DO': 'fp16', 'arg_DQ': 'fp16', 'arg_DV': 'fp16', 'arg_KV_NUM_BLKS': 'i32', 'arg_KV_IDX': 'i32', 'arg_Q_NUM_BLKS': 'i32', 'arg_Q_IDX': 'i32', 'arg_FULL_KV_NUM_BLKS': 'i32', 'arg_FULL_KV_IDX': 'i32', 'arg_FULL_Q_NUM_BLKS': 'i32', 'arg_FULL_Q_IDX': 'i32', 'out_ptr0': 'fp16'}, 'device': 0, 'constants': {}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (4,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): 
[['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (9,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (14,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]]}], 'device_type': 'cuda', 'num_warps': 4, 'num_stages': 3, 'debug': True, 'cc': 100} Traceback (most recent call last): File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 748, in _precompile_config binary = triton.compile(compile_args, *compile_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/compiler/compiler.py", line 359, in compile next_module = compile_ir(module, metadata) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 456, in <lambda> stages["ttgir"] = lambda src, metadata: self.make_ttgir(src, metadata, options, capability) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 298, in make_ttgir pm.run(mod) RuntimeError: PassManager::run failed frames [('total', 3), ('ok', 3)] inline_call [] stats [('calls_captured', 8), ('unique_graphs', 3)] aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('ok', 1)] inductor [('triton_bundler_save_kernel', 8), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1), ('fxgraph_cache_bypass', 1)] graph_break [] F ==================================================== FAILURES ===================================================== _____________________________ TestFlexAttentionCUDA.test_GQA_score_mod1_cuda_float16 ______________________________ Traceback (most recent call last): File 
"/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 58, in testPartExecutor yield File "/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 634, in run self._callTestMethod(testMethod) File "/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 589, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper method(args, *kwargs) File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper method(args, kwargs) File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 446, in instantiated_test raise rte File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test result = test(self, param_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 1349, in dep_fn return fn(self, args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 1215, in dep_fn return fn(slf, args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/test/inductor/test_flex_attention.py", line 1430, in test_GQA self.run_test(inputs) File "/home/drisspg/meta/pytorch/test/inductor/test_flex_attention.py", line 566, in run_test compiled_out.backward(backward_grad) File "/home/drisspg/meta/pytorch/torch/_tensor.py", line 625, in backward torch.autograd.backward( File "/home/drisspg/meta/pytorch/torch/autograd/__init__.py", line 354, in backward _engine_run_backward( File "/home/drisspg/meta/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/home/drisspg/meta/pytorch/torch/autograd/function.py", line 315, in apply return user_fn(self, args) ^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2303, in backward return impl_fn() ^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2289, in impl_fn out = CompiledFunction._backward_impl(ctx, all_args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2394, in _backward_impl CompiledFunction.compiled_bw = aot_config.bw_compiler( ^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/schemas.py", line 1256, in __call__ return self.compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_dynamo/backends/common.py", line 76, in _wrapped_bw_compiler disable( File "/home/drisspg/meta/pytorch/torch/_dynamo/eval_frame.py", line 1005, in _fn return fn(args, *kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_utils_internal.py", line 92, in wrapper_function return function(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 2428, in bw_compiler return inner_compile( ^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 773, in compile_fx_inner return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_dynamo/repro/after_aot.py", line 124, in debug_wrapper inner_compiled_fn = compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 952, in _compile_fx_inner mb_compiled_graph = fx_codegen_and_compile( ^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 
1652, in fx_codegen_and_compile return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 1506, in codegen_and_compile compiled_module = graph.compile_to_module() ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2318, in compile_to_module return self._compile_to_module() ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2328, in _compile_to_module mod = self._compile_to_module_lines(wrapper_code) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2396, in _compile_to_module_lines mod = PyCodeCache.load_by_key_path( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/codecache.py", line 3466, in load_by_key_path mod = _reload_python_module(key, path, set_sys_modules=in_toplevel) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/compile_tasks.py", line 33, in _reload_python_module exec(code, mod.__dict__, mod.__dict__) File "/tmp/tmp0yiz3c94/az/caza2gzmsagyuusmf2ka3oat3na4xv6zudssk244xmlzsbv2knze.py", line 117, in <module> File "/home/drisspg/meta/pytorch/torch/_inductor/async_compile.py", line 489, in triton kernel.precompile( File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 437, in precompile self._precompile_worker() File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 459, in _precompile_worker compile_results.append(self._precompile_config(c)) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 748, in _precompile_config binary = triton.compile(compile_args, **compile_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/compiler/compiler.py", line 359, in compile next_module = compile_ir(module, metadata) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 456, in <lambda> stages["ttgir"] = lambda src, metadata: self.make_ttgir(src, metadata, options, capability) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 298, in make_ttgir pm.run(mod) RuntimeError: PassManager::run failed To execute this test, run the following from the base repo dir: python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_score_mod1_cuda_float16 This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ============================================= short test summary info ============================================= FAILED [5.1441s] test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_GQA_score_mod1_cuda_float16 - RuntimeError: PassManager::run failed ================================== 1 failed, 1 passed, 1404 deselected in 18.10s ================================== ~/meta/pytorch flex-warning !1 ❯ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160227 Approved by: https://github.com/Skylion007, https://github.com/Chillee	2025-08-11 23:30:20 +00:00
Sherlock Huang	99bc2f94c1	Update export/schema.py (#160220 ) Summary: A model could have multiple ExportedPrograms - for different methods. They can have different weights. - for different delegates. They can also have different weights. For this reason, we make weights per-ExportedProgram. Also, we clean up Model and Program. IIUC, Model and Program are not used anywhere, so it's OK to make a BC-breaking change. Test Plan: CI Rollback Plan: Differential Revision: D79917395 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160220 Approved by: https://github.com/angelayi, https://github.com/dolpm, https://github.com/jingsh	2025-08-11 23:14:08 +00:00
Yidi Wu	fc25c68f20	[hop][exc] make UncapturedHigherOrderOpError print user code and avoid re-raise (#159296 ) After the change, the error stack trace has the user code stack attached and is condensed into a single traceback (without the scroll-up message). For example: ```python class Test(torch.nn.Module): def forward(self, c, x): def cond_fn(c, x): return c > 0 and x.size(0) < 20 def body_fn(c, x): return c - 1, x.sin() return torch._higher_order_ops.while_loop(cond_fn, body_fn, (c, x)) ``` Now gives the following error message: ```python Traceback (most recent call last): File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1705, in test_while_loop_size_mismatch_tensor_expansion self._run_test( ~~~~~~~~~~~~~~^ model=WhileLoopModels.SizeMismatchTensorExpansion(), ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ...<2 lines>... dynamic=dynamic, ^^^^^^^^^^^^^^^^ ) ^ File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1417, in _run_test result = model(inputs_with_counters) File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 176, in while_loop return torch.compile( ~~~~~~~~~~~~~~ _while_loop_op_wrapper, backend=backend, fullgraph=True ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ )(flat_cond_fn, flat_body_fn, tuple(flat_inputs), tuple()) ~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper return fn(*args, **kwargs) 
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1595, in __call__ result = self._torchdynamo_orig_backend( frame, cache_entry, self.hooks, frame_state, skip=1 ) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1353, in __call__ result = self._inner_convert( frame, cache_entry, hooks, frame_state, skip=skip + 1 ) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 682, in __call__ result = _compile( frame.f_code, ...<16 lines>... convert_frame_box=self._box, ) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1172, in _compile guarded_code = compile_inner(code, one_graph, hooks, transform) File "/home/yidi/local/pytorch/torch/_utils_internal.py", line 98, in wrapper_function return function(args, *kwargs) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 858, in compile_inner return _compile_inner(code, one_graph, hooks, transform) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 897, in _compile_inner out_code = transform_code_object(code, transform) File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1461, in transform_code_object transformations(instructions, code_options) ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 300, in _fn return fn(args, *kwargs) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 818, in transform tracer.run() ~~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3528, in run super().run() ~~~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run while self.step(): ~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step self.dispatch_table[inst.opcode](self, inst) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 852, in wrapper return 
inner_fn(self, inst) File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2240, in CALL_FUNCTION_EX self.call_function(fn, argsvars.items, kwargsvars) ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1200, in call_function self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type] ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward return getattr(self.realize(), name)(args, *kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 91, in graph_break_as_hard_error raise exc.with_traceback(sys.exc_info()[2]) from None File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 77, in graph_break_as_hard_error return fn(args, *kwargs) File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1287, in call_function ) = speculate_subgraph( ~~~~~~~~~~~~~~~~~~^ tx, ^^^ ...<33 lines>... 
supports_aliasing=self.supports_aliasing, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 877, in speculate_subgraph raise ex File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 718, in speculate_subgraph output = f.call_function(tx, args, sub_kwargs) File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 580, in call_function return super().call_function(tx, args, kwargs) ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 334, in call_function return tx.inline_user_function_return(self, [self.self_args(), args], kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1217, in inline_user_function_return return InliningInstructionTranslator.inline_call(self, fn, args, kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3733, in inline_call return tracer.inline_call_() ~~~~~~~~~~~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3936, in inline_call_ self.run() ~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run while self.step(): ~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step self.dispatch_table[inst.opcode](self, inst) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 852, in wrapper return inner_fn(self, inst) File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2240, in CALL_FUNCTION_EX self.call_function(fn, argsvars.items, kwargsvars) ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1200, in 
call_function self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type] ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward return getattr(self.realize(), name)(args, *kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 580, in call_function return super().call_function(tx, args, kwargs) ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 334, in call_function return tx.inline_user_function_return(self, [self.self_args(), args], kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1217, in inline_user_function_return return InliningInstructionTranslator.inline_call(self, fn, args, kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3733, in inline_call return tracer.inline_call_() ~~~~~~~~~~~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3936, in inline_call_ self.run() ~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run while self.step(): ~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step self.dispatch_table[inst.opcode](self, inst) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 830, in inner unimplemented_v2( ~~~~~~~~~~~~~~~~^ gb_type="Data-dependent branching", ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ...<5 lines>... 
], ^^ ) ^ File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 580, in unimplemented_v2 raise Unsupported(msg) torch._dynamo.exc.UncapturedHigherOrderOpError: while_loop doesn't work unless it is captured completely with torch.compile. Got Data-dependent branching Explanation: Detected data-dependent branching (e.g. `if my_tensor.sum() > 0:`). Dynamo does not support tracing dynamic control flow. Hint: This graph break is fundamental - it is unlikely that Dynamo will ever be able to trace through your code. Consider finding a workaround. Hint: Use `torch.cond` to express dynamic control flow. Developer debug context: attempted to jump with TensorVariable() For more details about this graph break, please visit: https://pytorch-labs.github.io/compile-graph-break-site/gb/gb0170.html from user code: File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 167, in _while_loop_op_wrapper return while_loop_op(args, *kwargs) File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 137, in flat_cond_fn return cond_fn(carried, *additional) File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1047, in cond_fn return c > 0 and x.size(0) < 20 Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo" To execute this test, run the following from the base repo dir: python test/inductor/test_control_flow.py WhileLoopTests.test_while_loop_size_mismatch_tensor_expansion_device_cpu_dynamic_False This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159296 Approved by: https://github.com/zou3519	2025-08-11 22:48:10 +00:00
Pat Vignola	5a40c57844	[MTIA] Implement isAvailable() for MTIA hooks (#160304 ) Summary: MTIA is missing the `isAvailable()` override, which is necessary for some of the device agnostic methods. Test Plan: `torch._C._get_accelerator()` Rollback Plan: Differential Revision: D79981115 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160304 Approved by: https://github.com/nautsimon	2025-08-11 21:45:11 +00:00
Nikita Shulga	7d2ec704e4	Fix MPS autocast for ConvTranspose3d (#160345 ) ## Summary - ensure ConvTranspose3d uses fp32 under MPS autocast - add MPS autocast test for ConvTranspose3d Generated by Codex, see https://chatgpt.com/codex/tasks/task_e_689a360388288327a2cac6f55bbfc42c Fixes https://github.com/pytorch/pytorch/issues/160332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160345 Approved by: https://github.com/dcci	2025-08-11 21:01:52 +00:00
Sandeep Narendranath Karjala	fc80f6859e	Fix collective schedule logging and runtime tests (#160260 ) Summary: - Fix collective schedule logging so that it only logs when collectives are present - Fix runtime estimate test to check that each op has a numeric value Pull Request resolved: https://github.com/pytorch/pytorch/pull/160260 Approved by: https://github.com/Skylion007	2025-08-11 20:58:52 +00:00
PaulZhang12	cf0a0dcb0a	Make user defined Triton kernels serializable for fx_graph_runnable (#160002 ) Resolves issue https://github.com/pytorch/pytorch/issues/153475 where `fx_graph_runnable` didn't work with user-defined Triton kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160002 Approved by: https://github.com/eellison	2025-08-11 20:54:33 +00:00
PyTorch MergeBot	b149c7204c	Revert "port distributed pipeline test files for Intel GPU (#159033 )" This reverts commit 76a0609b6bddb2bc40f1eb4ade12885023653d59. Reverted https://github.com/pytorch/pytorch/pull/159033 on behalf of https://github.com/clee2000 due to broke test_cpp_extensions_stream_and_event.py::TestCppExtensionStreamAndEvent::test_stream_event [GH job link](https://github.com/pytorch/pytorch/actions/runs/16890370216/job/47849586456) [HUD commit link](`76a0609b6b`) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/159033#issuecomment-3176833314))	2025-08-11 20:44:45 +00:00
PyTorch MergeBot	09381f5dac	Revert "[Graph Partition] Pass all OSS unit tests (#154667 )" This reverts commit ca7315c17162ea21b1ca5ba23f4bf6168766c7b9. Reverted https://github.com/pytorch/pytorch/pull/154667 on behalf of https://github.com/clee2000 due to broke inductor/test_memory.py::TestOperatorReorderForPeakMemory::test_reorder_peak_memory_lpmf [GH job link](https://github.com/pytorch/pytorch/actions/runs/16885961204/job/47836769279) [HUD commit link](`ca7315c171`) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/154667#issuecomment-3176805477))	2025-08-11 20:34:27 +00:00
Pian Pawakapan	9eedd2a20b	[PGO] no counterfactual suggestions for dynamic allowlist (#160231 ) Being more conservative with whitelist suggestions as we roll out suggestions; now we only suggest sources that were dynamic in previous runs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160231 Approved by: https://github.com/bobrenjc93	2025-08-11 20:13:25 +00:00
Edward Yang	c3dc8dc412	159965 is merged, no need to patch it in (#160275 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160275 Approved by: https://github.com/albanD, https://github.com/ZainRizvi	2025-08-11 19:55:04 +00:00
Liao, Wei	76a0609b6b	port distributed pipeline test files for Intel GPU (#159033 ) In this PR we port all distributed pipeline test files. We enable Intel GPU with the following methods and try our best to keep the original code style: 1. instantiate_device_type_tests() 2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend 3. use "requires_accelerator_dist_backend()" to replace requires_nccl() 4. use "get_default_backend_for_device()" to get the backend 5. enable XPU for some test paths 6. add TEST_MULTIACCELERATOR in common_utils for all backends. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033 Approved by: https://github.com/guangyey, https://github.com/d4l3k Co-authored-by: Daisy Deng <daisy.deng@intel.com>	2025-08-11 19:43:15 +00:00
Simon Fan	c8205cb354	[autograd] match 0-dim gradients device type regardless of subclassness (#160165 ) Not sure if there are some subclasses where outer.dim() == 0 but you wouldn't want to move it? FIXES https://github.com/pytorch/pytorch/issues/160084 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160165 Approved by: https://github.com/ezyang, https://github.com/albanD	2025-08-11 17:57:32 +00:00
Nikita Shulga	d25c4f954d	[MPS] Type-promote tensor-iterator common dtype (#160334 ) Otherwise, `torch.add(FloatTensor, IntTensor, alpha=2)` and `torch.add(FloatTensor, IntTensor, alpha=2)` were dispatched to different kernels Fixes https://github.com/pytorch/pytorch/issues/160208 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160334 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-08-11 17:53:56 +00:00
David Berard	d0e2240f68	[triton_heuristics] Optimize the triton launcher in pt2 (#160000 ) Summary: (Original author: Xu Zhao. Commandeered by David to land this since it is relatively urgent) We observed ~10us PT2-Triton launch overhead regression after pin update. Before Triton pin-update: {F1980557238} After Triton pin-update: {F1980557240} The root cause is because https://github.com/pytorch/pytorch/pull/145051 adds `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path. The motivation for `_get_args_with_constexprs` was that between triton 3.2 and triton 3.3, the convention for calling Triton kernels (at the level that non-static-cuda-launcher inductor integrates) changed. Previously, the callable did not take constexpr arguments as parameters; after 3.3, it does. With pointwise/reduction kernels, we don't know the constexpr values until after autotuning occurs; so `_get_args_with_constexprs` would inject constexprs into the arguments list before calling the Triton kernel. The fix (in this PR) is to instead inject the constexpr args into the launcher string - this avoids the cost of sorting/reordering arguments which previously occurred upon execution of each kernel. Note that the static_cuda_launcher.py does not require constants to be passed to the cubin launcher (`e96c7c4bb0/torch/_inductor/runtime/static_cuda_launcher.py (L220)`), there is no need to pass in constexprs to the generated launcher code. 
The new launcher code needs to work on three cases: - StaticallyLaunchedCudaKernel - triton.compile.CompiledKernel - AOTInductor Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0 Test Plan: Before: ``` $ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs 1.893x ``` ``` $ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency x_val nop_python_function-walltime nop_triton_kernel-walltime nop_triton_compiled_kernel_run-walltime nop_inductor_kernel-walltime nop_inductor_kernel_cudagraph-walltime ------- ------------------------------ ---------------------------- ----------------------------------------- ------------------------------ ---------------------------------------- 0 0.00760921 1.80298 0.623282 5.25024 0.203722 19 0.00799885 4.78223 1.00226 5.8213 0.239084 average 0.00780403 3.29261 0.812769 5.53577 0.221403 ``` After: ``` buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency x_val nop_python_function-walltime nop_triton_kernel-walltime nop_triton_compiled_kernel_run-walltime nop_inductor_kernel-walltime nop_inductor_kernel_cudagraph-walltime ------- ------------------------------ ---------------------------- ----------------------------------------- ------------------------------ ---------------------------------------- 0 0.00747067 1.92589 0.726509 4.35459 0.204205 19 0.00747823 7.36852 1.26241 6.28208 0.239278 average 0.00747445 4.6472 0.994459 5.31834 0.221741 ``` ``` $ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs 1.985x ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160000 Approved by: https://github.com/jansel Co-authored-by: Xu Zhao <xzhao9@meta.com>	2025-08-11 17:22:40 +00:00
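The difference between injecting constexprs on every call and baking them into the launcher string can be sketched in plain Python. This is only an illustration under stated assumptions: `kernel`, `launch_with_reorder`, and `make_launcher` are hypothetical names, not inductor or Triton APIs, and the real launcher code generation is more involved.

```python
# Hypothetical illustration; none of these names are inductor or Triton APIs.

def kernel(x_ptr, n, BLOCK):
    # Stand-in for a compiled kernel that expects constexprs as parameters.
    return (x_ptr, n, BLOCK)


def launch_with_reorder(runtime_args, arg_names, constexprs):
    # Per-call approach: merge constexprs into the argument list on every launch.
    runtime_names = [name for name in arg_names if name not in constexprs]
    merged = dict(zip(runtime_names, runtime_args))
    merged.update(constexprs)
    return kernel(*(merged[name] for name in arg_names))


def make_launcher(arg_names, constexprs):
    # One-time approach: generate launcher source with constexpr values inlined,
    # so each later call only forwards the runtime arguments.
    params = [name for name in arg_names if name not in constexprs]
    call_args = [repr(constexprs[n]) if n in constexprs else n for n in arg_names]
    src = (
        f"def launcher({', '.join(params)}):\n"
        f"    return kernel({', '.join(call_args)})\n"
    )
    scope = {"kernel": kernel}
    exec(src, scope)
    return scope["launcher"]


launcher = make_launcher(["x_ptr", "n", "BLOCK"], {"BLOCK": 128})
result = launcher(1, 2)
```

Both paths produce the same call; the generated launcher simply moves the merging work from every launch to a single code-generation step.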
Shangdi Yu	9ccd0f5e31	Fix unbacked symint and memory leak in inductor memory planning (#159839 ) Summary: In memory planning, some allocation sizes involve unbacked symints. These unbacked symints are not known until they are computed at run time, so allocation pools that involve unbacked symints cannot be allocated until we have the values of the unbacked symints. So we add a notion of `earliest_available` to Allocation nodes. If an allocation node has an unbacked symint, it is available only when its live range begins. Then in AllocationPool, if a pool involves an Allocation node that has an earliest available time, we restrict its live range. If a block's earliest available time is later than the start of a pool's live range, we cannot allocate it from the pool. We also fix a memory leak that's caused by allocating a tensor without wrapping it with RAIIAtenTensor. In the python wrapper for JIT inductor, `codegen_alloc_from_pool` doesn't actually write the alloc lines to the wrapper, it just returns the string to alloc. However, in cpp_wrapper, `codegen_alloc_from_pool` actually writes to the wrapper. Specifically, it writes the following and returns the string `RAIIAtenTensorHandle`. ``` AtenTensorHandle handle_name; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__alloc_from_pool(....); ``` This is bug prone. If you write aoti_torch__alloc_from_pool lines, you must write the RAIIAtenTensorHandle as well, otherwise you get memory leaks. We remove the alloc_from_pool call from codegen_create, because this doesn't work for AOTI. In the python wrapper, we can generate the same alloc_from_pool variable name for the same block, but cpp_wrapper will generate a different variable name for each call to alloc_from_pool. Test Plan: ``` python test/inductor/test_memory_planning.py ``` Rollback Plan: Differential Revision: D79603119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159839 Approved by: https://github.com/jansel	2025-08-11 17:16:15 +00:00
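The `earliest_available` rule described above can be sketched as a small check. This is a minimal sketch, assuming integer time steps for live ranges; the `Block` and `Pool` classes and the `make_block` and `can_allocate_from` helpers are hypothetical stand-ins, not the actual inductor memory-planning classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    start: int  # step at which the block's live range begins
    end: int    # step at which the block's live range ends
    # Set only when the size depends on an unbacked symint (assumed encoding).
    earliest_available: Optional[int] = None


@dataclass
class Pool:
    start: int  # step at which the pool's live range begins
    end: int


def make_block(start: int, end: int, has_unbacked_symint: bool) -> Block:
    # A block with an unbacked symint is available only when its live range begins.
    return Block(start, end, start if has_unbacked_symint else None)


def can_allocate_from(pool: Pool, block: Block) -> bool:
    # A block whose earliest available time is later than the start of the
    # pool's live range cannot be allocated from that pool.
    if block.earliest_available is None:
        return True
    return block.earliest_available <= pool.start


late_block = make_block(3, 5, has_unbacked_symint=True)
fits_early_pool = can_allocate_from(Pool(0, 10), late_block)
fits_late_pool = can_allocate_from(Pool(3, 10), late_block)
```

Here the block becomes available at step 3, so a pool starting at step 0 is rejected while a pool starting at step 3 is accepted.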
Boyuan Feng	ca7315c171	[Graph Partition] Pass all OSS unit tests (#154667 ) Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315). Ran the same diff on two days; both runs show a speedup on average. [first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d) [second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667 Approved by: https://github.com/eellison	2025-08-11 16:25:12 +00:00
Richard Barnes	68a4b4b2e3	[codemod] Fix unreachable-break issue in caffe2/c10/cuda/CUDAFunctions.cpp +2 (#160257 ) Summary: LLVM has a warning `-Wunreachable-code-break` which identifies `break` statements that cannot be reached. These compromise readability, are misleading, and may identify bugs. This diff removes such statements. For questions/comments, contact r-barnes. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Rollback Plan: Differential Revision: D79835614 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160257 Approved by: https://github.com/Skylion007	2025-08-11 16:09:24 +00:00
Xu Han	80cca83079	[inductor] Skip some AOTI UTs on Windows. (#160287 ) Skip some AOTI UTs on Windows, since support there is not fully ready. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160287 Approved by: https://github.com/ezyang	2025-08-11 13:50:43 +00:00
Xu Han	515cb70367	[inductor] normalize_path_separator for test_different_file_paths_local_pgo (#160286 ) `normalize_path_separator` for test_different_file_paths_local_pgo Pull Request resolved: https://github.com/pytorch/pytorch/pull/160286 Approved by: https://github.com/ezyang	2025-08-11 13:50:18 +00:00
cyy	c184cb3852	[submodule] Bump fbgemm to latest (#158210 ) Merge the recent commits of FBGEMM and remove unnecessary CMake code. Specifically, we 1. enable `fbgemm_autovec` since the target is now correctly handled. 2. remove option `USE_FAKELOWP` which is not used. 3. remove `CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS` check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158210 Approved by: https://github.com/q10	2025-08-11 13:48:02 +00:00
PyTorch UpdateBot	2259dbed4e	Update slow tests (#158222 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158222 Approved by: https://github.com/pytorchbot	2025-08-11 12:00:13 +00:00
PyTorch UpdateBot	05029ad1c3	[xla hash update] update the pinned xla hash (#160306 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160306 Approved by: https://github.com/pytorchbot	2025-08-11 11:28:49 +00:00
cyy	cf4964be68	Remove unnecessary CMake checks for glog (#158185 ) With the update to CMake 3.27, some old scripts can be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158185 Approved by: https://github.com/malfet, https://github.com/Skylion007	2025-08-11 10:14:47 +00:00
Tanmay Sinha	ecea81117b	Fix clang builds by adding headers (#160252 ) Clang compiler from llvm-14 fails to build full torch from source with the message ``` no template named 'unordered_map' in namespace 'std' std::unordered_map<std::string, HandlerFunc> handlers_{}; ~~~~~^ ``` A similar issue here https://github.com/intel/llvm/issues/5264 Fix is to add the correct headers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160252 Approved by: https://github.com/Skylion007, https://github.com/cyyever	2025-08-11 09:03:14 +00:00
fduwjj	1c2cba17ea	[FR] Add stack_id and an optional print of stack_id to stack_trace mapping (#160119 ) To better help users debug with FR, we want to add stack_id and print a map between stack_id and stack_trace (optional). Pull Request resolved: https://github.com/pytorch/pytorch/pull/160119 Approved by: https://github.com/H-Huang, https://github.com/wconstab	2025-08-11 07:27:10 +00:00
Nick Riasanovsky	ff0d56d035	[Inductor] [Triton] Enable Configuration warmup/rep iterations when benchmarking in inductor (#159982 ) Summary: When benchmarking on B200 Max Autotune, I discovered that the estimations from the autotune logs consistently produced a better ATEN result by > 20% on an example shape. Here is an example of the output: ``` Autotune Choices Stats: {"num_choices": 20, "num_triton_choices": 19, "best_kernel": "mm", "best_time": 0.3081120103597641, "best_triton_pos": 1, "best_triton_time": 0.6589759886264801, "best_triton_kernel": "triton_mm_16", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"} AUTOTUNE mm(3840x1152, 1152x49136) strides: [1, 3840], [49152, 1] dtypes: torch.bfloat16, torch.bfloat16 mm 0.3081 ms 100.0% triton_mm_16 0.6590 ms 46.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_17 0.6830 ms 45.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_13 0.7015 ms 43.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_9 0.8487 ms 36.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_11 0.8695 ms 35.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_10 0.8797 ms 35.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_18 0.9089 ms 33.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_14 0.9718 ms 31.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_15 1.0169 ms 30.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 SingleProcess AUTOTUNE benchmarking takes 2.8574 seconds and 0.1032 seconds precompiling for 20 choices Removed 3483 outliers from 28645 samples 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:20<00:00, 20.00s/it] (M, N, K) pt2_matmul_maxautotune-latency pt2_matmul_maxautotune-speedup pt2_matmul_maxautotune-tflops ------------------- -------------------------------- -------------------------------- ------------------------------- (3840, 49136, 1152) 0.359392 (±8.27%) 1209.61 average 1209.61 ``` Based on my reading about B200 power usage, I believe this is due to the new for power aware benchmarking as a kernel may perform better in short bursts. This adds environment variables to expand autotuning iterations so we can get more consistent results between the estimation and the actual runtime. 
I did not update the default yet, even for B200 because I'm not sure how this is used in practice. This is the new output: ``` Autotune Choices Stats: {"num_choices": 20, "num_triton_choices": 19, "best_kernel": "mm", "best_time": 0.3848319947719574, "best_triton_pos": 1, "best_triton_time": 0.6287680268287659, "best_triton_kernel": "triton_mm_16", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"} AUTOTUNE mm(3840x1152, 1152x49136) strides: [1, 3840], [49152, 1] dtypes: torch.bfloat16, torch.bfloat16 mm 0.3848 ms 100.0% triton_mm_16 0.6288 ms 61.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_13 0.6299 ms 61.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_17 0.6728 ms 57.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_9 0.7189 ms 53.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_18 0.8566 ms 44.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_11 0.8693 ms 44.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_14 0.9298 ms 41.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_10 0.9524 ms 40.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_15 1.0216 ms 37.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 SingleProcess AUTOTUNE benchmarking takes 3.9245 seconds and 0.0965 seconds precompiling for 20 choices Removed 3537 outliers from 29530 samples 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:23<00:00, 23.70s/it] (M, N, K) pt2_matmul_maxautotune-latency pt2_matmul_maxautotune-speedup pt2_matmul_maxautotune-tflops ------------------- -------------------------------- -------------------------------- ------------------------------- (3840, 49136, 1152) 0.359328 (±9.71%) 1209.82 average 1209.82 ``` Test Plan: `TORCH_AUTOTUNE_REP=1000 CUDA_VISIBLE_DEVICES=2 ENABLE_MMA_V5_ATT_PIPELINE=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run mode/opt //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op gemm --iter $NUM_ITERS --input-loader /home/njriasan/parsed_shapes.json --only pt2_matmul_maxautotune` Rollback Plan: Reviewed By: NikhilAPatel Differential Revision: D79737929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159982 Approved by: 
https://github.com/NikhilAPatel	2025-08-11 05:27:51 +00:00
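As a rough illustration of how an environment variable such as `TORCH_AUTOTUNE_REP` (the name used in the test plan above) could feed a benchmarking loop, here is a minimal sketch. The helper names, the fallback values, and the iteration-count interpretation are assumptions for illustration only; the actual inductor benchmarking code and its defaults are not shown in the commit message.

```python
import os
import time


def read_int_env(name: str, default: int) -> int:
    # Fall back to the default when the variable is unset or empty.
    raw = os.environ.get(name)
    return int(raw) if raw else default


def benchmark(fn, warmup: int, rep: int) -> float:
    # Run untimed warmup calls, then return the mean seconds per timed call.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(rep):
        fn()
    return (time.perf_counter() - start) / rep


# The fallback of 100 is a placeholder, not the inductor default.
rep = read_int_env("TORCH_AUTOTUNE_REP", 100)
mean_seconds = benchmark(lambda: sum(range(1000)), warmup=5, rep=rep)
```

A larger repetition count trades longer autotuning time for estimates that are less sensitive to short bursts of performance.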
Jiaxi WANG	334b38ccc4	Fix typo in README.md (#160160 ) The "Get the PyTorch Source" section is now located before the "Install Dependencies/Common" section, so "... using the “Get the PyTorch Source“ section below" should be "... using the “Get the PyTorch Source“ section above". Pull Request resolved: https://github.com/pytorch/pytorch/pull/160160 Approved by: https://github.com/BoyuanFeng	2025-08-11 05:09:59 +00:00
FFFrog	dc0d18e023	[CUDA] Remove the unnecessary CUDA_GUARD (#160249 ) `CUDA_GUARD` is unnecessary in `initDeviceStreamState`, because the `initSingleStream` has already done it. `29712314dd/c10/cuda/CUDAStream.cpp (L202-L203)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160249 Approved by: https://github.com/Skylion007	2025-08-11 05:08:05 +00:00
cyy	8ae4d2652f	Tidy torch/csrc/jit/passes/onnx code (#160262 ) Apply clang-tidy fixes to torch/csrc/jit/passes/onnx Pull Request resolved: https://github.com/pytorch/pytorch/pull/160262 Approved by: https://github.com/justinchuby	2025-08-11 04:50:38 +00:00
Edward Z. Yang	8088cfa592	Add type assert for tensor_meta, based on real bug in autoparallel. (#157927 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157927 Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/wconstab	2025-08-11 04:22:02 +00:00
Nikita Shulga	d8cb3db533	Add unsigned support to `IValue` (#160102 ) - Moved repeated logic of saving int64/uint64 into a polymorphic container into `THPUtils_unpackInteger` - Added `TestPythonDispatch.test_dispatch_uint64` regression test Fixes https://github.com/pytorch/pytorch/issues/159168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160102 Approved by: https://github.com/ezyang	2025-08-11 03:57:18 +00:00
Han, Xu	e7152ff8a6	[inductor] fix some windows inductor UTs (#160292 ) This PR is the UT part of https://github.com/pytorch/pytorch/pull/160161. Per @malfet's comment (https://github.com/pytorch/pytorch/pull/160161#pullrequestreview-3103812178), this PR does not land the turn-on change; it lands only the UT part. Changes: 1. Fixed `test_invalid_artifact_flag_error_msg`. 2. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`. 3. Skipped the whole UT `test_cpu_select_algorithm.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160292 Approved by: https://github.com/malfet	2025-08-11 02:55:37 +00:00
Nikita Shulga	842cc77ab9	[MPS] Extend addmm to integral types (#160270 ) By adding an `addmm` kernel, which is a logical continuation of the `mm` one. The only tricky part is how the alpha and beta constants are handled: they are passed as `optmath_t`, i.e. they can be int64, int32 or float. Unified all MM flavor instantiations through `INSTANTIATE_MM_OPS` and tested that the `addmm` metal kernel works as expected for floating types as well by testing it via ``` PYTORCH_MPS_PREFER_METAL=1 python test/test_mps.py -v -k test_output_match_addmm_mps_ ``` Fixes https://github.com/pytorch/pytorch/issues/154901 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160270 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #160228, #160234	2025-08-11 00:54:17 +00:00
PyTorch MergeBot	b602ea9cab	Revert "[inductor] turn on windows inductor UTs (#160161 )" This reverts commit 4416433c7c625127b7f975c92f8ec98ea4c67fd3. Reverted https://github.com/pytorch/pytorch/pull/160161 on behalf of https://github.com/xuhancn due to auto merged with two related issue ([comment](https://github.com/pytorch/pytorch/pull/160161#issuecomment-3172982125))	2025-08-11 00:04:25 +00:00
Xu Han	4416433c7c	[inductor] turn on windows inductor UTs (#160161 ) With this PR, we can turn on the inductor UTs on Windows CPU. Changes: 1. Turn on inductor UTs on Windows CPU. 2. Add a shard to balance the added UTs, otherwise the run would time out. 3. Fixed `test_invalid_artifact_flag_error_msg`. 4. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`. 5. Skipped the whole UT `test_cpu_select_algorithm.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160161 Approved by: https://github.com/jansel	2025-08-10 23:18:35 +00:00
Andy (An) Wang	05c19d1ace	[Inductor] Add back the revert part (#160054 ) Add back the reverted code (https://github.com/pytorch/pytorch/pull/159809) as we've figured out the actual root cause of the internal test failures. More details in the internal diff. Rollback Plan: Differential Revision: D79776691 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160054 Approved by: https://github.com/blaine-rister	2025-08-10 19:20:30 +00:00
Xu Han	d6786741a7	[inductor] slow test some Windows UTs. (#160267 ) After the Windows inductor UTs were enabled in https://github.com/pytorch/pytorch/pull/160161/, main-branch CI started timing out. Move some UTs to the slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160267 Approved by: https://github.com/ezyang	2025-08-10 18:35:42 +00:00
PyTorch MergeBot	7ae0629d64	Revert "[inductor] turn on windows inductor UTs (#160161 )" This reverts commit f0980fc0bbd656d6c02d23ad97e945353b314f35. Reverted https://github.com/pytorch/pytorch/pull/160161 on behalf of https://github.com/clee2000 due to broke some inductor tests on windows inductor\test_codecache.py::TestStandaloneCompile::test_different_process [GH job link](https://github.com/pytorch/pytorch/actions/runs/16853706010/job/47748778757) [HUD commit link](`f0980fc0bb`). note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/160161#issuecomment-3172784292))	2025-08-10 17:33:19 +00:00
Xu Han	0e3e377bd5	[inductor] fix CompiledArtifact.load path on Windows. (#160268 ) fix CompiledArtifact.load path on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160268 Approved by: https://github.com/ezyang	2025-08-10 14:22:52 +00:00
Isalia20	a84b60c0c4	[MPS] Sparse coalesce more dtypes to match cpu (#160254 ) More dtypes to match the cpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/160254 Approved by: https://github.com/malfet	2025-08-10 12:25:18 +00:00
atalman	3ac86e728d	Add Alban and Piotr to list of maintainers (#160187 ) Add Alban and Piotr to list of maintainers Pull Request resolved: https://github.com/pytorch/pytorch/pull/160187 Approved by: https://github.com/albanD	2025-08-10 12:00:16 +00:00
Edward Yang	c9671dc865	Delete Python reference implementation from torchdim, as it is untested (#160115 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160115 Approved by: https://github.com/albanD	2025-08-10 11:21:33 +00:00
ghostspiders	af10f1f86c	Fix requires_cuda to requires_cuda_and_triton (#160222 ) Fixes #159399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160222 Approved by: https://github.com/janeyx99	2025-08-10 07:05:52 +00:00
Edward Yang	5dddcd5b07	Correctly copy self.module_stack in ModuleStackTracer (#159956 ) There is a bigger cluster of issues which this does not completely fix, but I think this is a matter of good hygiene, especially because we immediately mutate the dict after assigning it. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159956 Approved by: https://github.com/pianpwk	2025-08-10 03:33:59 +00:00
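The hazard behind this commit is ordinary dict aliasing: assigning a dict stores a reference, so a later mutation shows up in every place that holds it. A minimal sketch, using a hypothetical `Tracer` class rather than the real `ModuleStackTracer`:

```python
import copy


class Tracer:
    # Hypothetical stand-in; not the real ModuleStackTracer.
    def __init__(self):
        self.module_stack = {}

    def snapshot_alias(self):
        # Returns the same dict object, so later mutations leak into the snapshot.
        return self.module_stack

    def snapshot_copy(self):
        # Shallow copy: later insertions or deletions do not affect the snapshot.
        return copy.copy(self.module_stack)


tracer = Tracer()
tracer.module_stack["layer1"] = "Linear"
aliased = tracer.snapshot_alias()
copied = tracer.snapshot_copy()
tracer.module_stack["layer2"] = "ReLU"  # mutate right after taking the snapshots
```

After the mutation, `aliased` contains both entries while `copied` still holds only `layer1`.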
PyTorch MergeBot	d3d359dbaf	Revert "Fix get_free_symbol_uses for several nodes. (#160134 )" This reverts commit db78943a1ca13a32a3d6045eb15e2b719ee13a2f. Reverted https://github.com/pytorch/pytorch/pull/160134 on behalf of https://github.com/malfet due to No, those are not pre-existing, see `df55ec7d4b/1` ([comment](https://github.com/pytorch/pytorch/pull/160134#issuecomment-3172314322))	2025-08-10 02:37:40 +00:00
Nikita Shulga	df55ec7d4b	[OpInfo][BE] Better inputs for addmm (#160234 ) Right now alpha and beta are both less than zero, which makes them useless for all addmm samples for integral types Pull Request resolved: https://github.com/pytorch/pytorch/pull/160234 Approved by: https://github.com/Skylion007 ghstack dependencies: #160228	2025-08-10 01:26:48 +00:00
Xu Han	f0980fc0bb	[inductor] turn on windows inductor UTs (#160161 ) With this PR, we can turn on the inductor UTs on Windows CPU. Changes: 1. Turn on inductor UTs on Windows CPU. 2. Add a shard to balance the added UTs, otherwise the run would time out. 3. Fixed `test_invalid_artifact_flag_error_msg`. 4. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`. 5. Skipped the whole UT `test_cpu_select_algorithm.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160161 Approved by: https://github.com/jansel	2025-08-09 21:06:00 +00:00
Laith Sakka	db78943a1c	Fix get_free_symbol_uses for several nodes. (#160134 ) get_free_symbol_uses reports which unbacked symbols a given node uses. An incorrect get_free_symbol_uses leads to: 1. Elimination of some nodes because no users are detected (see the added unit test). 2. Incorrect topological sort. Fix get_free_symbol_uses for NopKernel, ConcatKernel, InputsKernel, and external kernels. ComputedBuffer with NonOwningLayout is an interesting case: when the layout is NonOwningLayout, we need to access the base layout of the actual view op and detect symbols in it, because codegen for the ComputedBuffer uses those symbols. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160134 Approved by: https://github.com/bobrenjc93	2025-08-09 18:15:46 +00:00
thenumberouscode	29712314dd	[fx][pass] Support converting a float32 tensor to a scalar in FX trace. (#158216 ) Fixes https://github.com/pytorch/pytorch/issues/158083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158216 Approved by: https://github.com/laithsakka	2025-08-09 15:13:13 +00:00
cyy	01f66d08d9	Remove outdated CMAKE_CUDA_COMPILER_VERSION branch (#160075 ) Remove the condition `if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.0)` in cmake/Codegen.cmake, because we now default to CUDA >= 12.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160075 Approved by: https://github.com/Skylion007	2025-08-09 14:23:17 +00:00
PyTorch MergeBot	2f4c222617	Revert "Make user defined Triton kernels serializable for fx_graph_runnable (#160002 )" This reverts commit 4183d4ff3dcc1d87400326a9a7998c3f9e966f60. Reverted https://github.com/pytorch/pytorch/pull/160002 on behalf of https://github.com/albanD due to Breaks inductor tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/160002#issuecomment-3170855866))	2025-08-09 14:01:58 +00:00
xinan.lin	8047421fbb	[Linter] Expanding the scope of detecting device-bias code. (#159949 ) Currently, the device-bias linter only targets functions decorated with @requires_gpu. This PR adds support for two new detection scenarios: 1. Detect device-bias code in functions decorated with @requires_triton. 2. Detect device-bias code for entire test suites that are defined as shared across GPUs. For example: ``` if __name__ == "__main__": if HAS_GPU: run_tests() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159949 Approved by: https://github.com/EikanWang, https://github.com/jansel	2025-08-09 09:41:16 +00:00
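A decorator-scoped check of the kind described above can be sketched with the standard `ast` module. This is a simplified illustration, not the actual PyTorch linter: treating a hard-coded `"cuda"` string literal as "device-bias code" is an assumption made for the example, and `find_device_bias` and `decorator_name` are hypothetical helpers.

```python
import ast

TARGET_DECORATORS = {"requires_gpu", "requires_triton"}


def decorator_name(node):
    # Handle @name, @name(...), and @module.name forms.
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return None


def find_device_bias(source):
    # Return (function name, line number) for each hard-coded "cuda" literal
    # found inside a function carrying one of the target decorators.
    hits = []
    for fn in ast.walk(ast.parse(source)):
        if not isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if not any(decorator_name(d) in TARGET_DECORATORS for d in fn.decorator_list):
            continue
        for node in ast.walk(fn):
            if isinstance(node, ast.Constant) and node.value == "cuda":
                hits.append((fn.name, node.lineno))
    return hits


SAMPLE = '''
@requires_triton()
def test_a():
    x = make(device="cuda")

@requires_gpu
def test_b():
    x = make(device=GPU_TYPE)

def test_c():
    x = make(device="cuda")
'''

hits = find_device_bias(SAMPLE)
```

Only `test_a` is reported: `test_b` uses a device-agnostic name and `test_c` is not decorated, so it is outside the scope of this sketch.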
PaulZhang12	4183d4ff3d	Make user defined Triton kernels serializable for fx_graph_runnable (#160002 ) Resolves issue https://github.com/pytorch/pytorch/issues/153475 where `fx_graph_runnable` didn't work with user defined triton kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160002 Approved by: https://github.com/eellison	2025-08-09 09:26:05 +00:00
Sherlock Huang	fb887c3bb5	Add Sherlock and Zhengxu as codeowner for schema.py (#160233 ) Test Plan: CI Rollback Plan: Differential Revision: D79933462 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160233 Approved by: https://github.com/zhxchen17	2025-08-09 04:44:12 +00:00
PyTorch UpdateBot	bcf23ecc47	[vllm hash update] update the pinned vllm hash (#160235 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160235 Approved by: https://github.com/pytorchbot	2025-08-09 04:17:32 +00:00
Animesh Jain	303c614f3d	[dynamo] Be consistent with UserMethodVariable source (#160155 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160155 Approved by: https://github.com/StrongerXi	2025-08-09 04:16:14 +00:00
PyTorch UpdateBot	0d88593dd8	[audio hash update] update the pinned audio hash (#160153 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160153 Approved by: https://github.com/pytorchbot	2025-08-09 04:01:31 +00:00
Rob Timpe	5ed4f91779	[dynamo] support itertools.permutations (#159694 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159694 Approved by: https://github.com/guilhermeleobas ghstack dependencies: #159693	2025-08-09 03:01:58 +00:00
Rob Timpe	e07c52b2c0	[dynamo] Improve support for itertools.product (#159693 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159693 Approved by: https://github.com/guilhermeleobas, https://github.com/mlazos	2025-08-09 03:01:58 +00:00
cyy	10e3514c96	Remove tensorexpr tests (#158928 ) The tests are not maintained. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928 Approved by: https://github.com/albanD, https://github.com/malfet	2025-08-09 02:21:22 +00:00
Shangdi Yu	11a3565f18	[Torch Native] Add test for packaging weight (#158750 ) Add a test that requires weights to be packaged for torch native. For now, we need `package_weights_in_so=True` for compile standalone. The constants are in a `.o` file and will be added as a source to the CMakeLists.txt of the model. Once weight deduping is added, we should be able to set this config to False. ``` python test/inductor/test_aot_inductor_package.py -k test_compile_with_exporter_weights ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158750 Approved by: https://github.com/desertfire	2025-08-09 01:04:21 +00:00
Ankita George	e96c7c4bb0	[dcp][hf] Improve HF consolidation algorithm (#158648 ) Previously, we had a bunch of if-else cases based on sharding strategy to decide how to save the tensor, with different logic for each strategy. This can be consolidated into one function that handles all cases by finding the maximum number of contiguous bytes that can be written. Differential Revision: [D78489438](https://our.internmc.facebook.com/intern/diff/D78489438/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158648 Approved by: https://github.com/saumishr	2025-08-09 00:11:22 +00:00
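To illustrate the idea of "finding the max possible contiguous bytes that can be written" from the commit above, here is a minimal sketch. It assumes a row-major (C-contiguous) full tensor and a rectangular shard; `max_contiguous_elements`, `max_contiguous_bytes`, and their parameters are hypothetical names chosen for illustration and are not taken from the PR.

```python
def max_contiguous_elements(full_shape, shard_shape):
    """Length (in elements) of the longest run of a rectangular shard that is
    also contiguous inside the row-major full tensor.

    Hypothetical helper for illustration only.
    """
    if len(full_shape) != len(shard_shape):
        raise ValueError("full_shape and shard_shape must have the same rank")

    run = 1
    # Walk dimensions from innermost to outermost. As long as the shard covers
    # a dimension completely, the run keeps growing into the next outer one.
    for full, part in zip(reversed(full_shape), reversed(shard_shape)):
        run *= part
        if part != full:
            # The shard only covers part of this dimension, so the run cannot
            # extend across the next outer dimension.
            break
    return run


def max_contiguous_bytes(full_shape, shard_shape, itemsize):
    """Same as above, expressed in bytes for a given element size."""
    return max_contiguous_elements(full_shape, shard_shape) * itemsize


if __name__ == "__main__":
    # Row-wise shard: two full rows of a (4, 6) tensor are one contiguous run.
    print(max_contiguous_elements((4, 6), (2, 6)))
    # Column-wise shard: only one partial row at a time is contiguous.
    print(max_contiguous_elements((4, 6), (4, 3)))
```

With a single rule like this, row-wise, column-wise, and unsharded cases no longer need separate branches: the writer copies one run at a time and only the run length differs.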
Jane Xu	9b803cdbe2	[BE] Remove more optim entries from docs coverage ignore list (#160194 ) This PR privatizes ReduceLROnPlateau.is_better -> ReduceLROnPlateau._is_better because that API was never meant to be public. A GitHub search for it also shows that the API is not commonly used. https://github.com/search?q=.is_better%28&type=code&p=2 If you do use this API and you rely on it for some reason, please file an issue. In the meantime, you can access it through `_is_better(...)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160194 Approved by: https://github.com/albanD, https://github.com/Skylion007	2025-08-09 00:09:45 +00:00
Nikita Shulga	8c41cb800a	[MPS][BE] Combine all pre-MacOS14 xfail lists (#160228 ) It does not matter whether it started to fail after 13.1 or 13.3; the fact is that it still fails on the latest MacOS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160228 Approved by: https://github.com/dcci	2025-08-09 00:00:46 +00:00
Yanan Cao (PyTorch)	731ee31f7b	[TorchScript, PT2] Add torch._check compatibility support (#159988 ) Summary: Add support for torch._check() in TorchScript jit.script frontend. * It will be special cased to behave like torch._assert, turned into an if + raise exception. Test Plan: Unit tests Rollback Plan: Differential Revision: D79744604 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159988 Approved by: https://github.com/davidberard98	2025-08-08 23:14:13 +00:00
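The "if + raise exception" lowering described in the commit above can be pictured with a plain-Python sketch. This is only an illustration of the control flow, not the TorchScript frontend code; `check_like` is a hypothetical name, and accepting either a string or a zero-argument callable as the message is an assumption made for the example.

```python
def check_like(cond, message=None):
    """Plain-Python picture of a check that lowers to `if not cond: raise`.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    if not cond:
        # Build the error text lazily so that passing checks cost nothing.
        if callable(message):
            text = message()
        elif message is not None:
            text = str(message)
        else:
            text = "Expected cond to be True"
        raise RuntimeError(text)


if __name__ == "__main__":
    check_like(2 + 2 == 4)  # passes silently
    try:
        check_like(False, lambda: "shape mismatch")
    except RuntimeError as err:
        print(f"raised: {err}")
```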
Ti-Tai Wang	566c6d52ef	[ONNX] Fix the export of the model having none as output (#160200 ) Fixes #160150 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160200 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2025-08-08 23:09:34 +00:00
Aidyn-A	4e2ddb5db6	[Inductor][CUTLASS] Copy cutlass_mock_imports directory (#159724 ) Pip wheels of PyTorch nightly and 2.8 release candidates do not contain `cutlass_mock_imports`. This is the path to the source code: ``` root@8120d02fd9c5:$ tree ./torch/_inductor/codegen/cuda/cutlass_lib_extensions/ ./torch/_inductor/codegen/cuda/cutlass_lib_extensions/ ├── cutlass_mock_imports │ ├── cuda │ │ ├── __init__.py │ │ ├── cuda.py │ │ └── cudart.py │ ├── pydot │ │ └── __init__.py │ └── scipy │ ├── __init__.py │ └── special.py ├── evt_extensions.py └── gemm_operation_extensions.py 5 directories, 8 files ``` And this is what the installed wheel has: ``` root@8120d02fd9c5:$ tree /usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_lib_extensions/ /usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_lib_extensions/ ├── __init__.py ├── evt_extensions.py └── gemm_operation_extensions.py 1 directory, 3 files ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159724 Approved by: https://github.com/henrylhtsang	2025-08-08 22:56:05 +00:00
Kanya-Mo	9e07673deb	Fix test_fsdp_ep.py due to _MeshEnv API change (#158695 ) #132339 changed parent/child mesh related APIs from _MeshEnv. UT TestFSDPWithEP.test_e2e still uses old APIs and will fail: ``` File "/home/kanya/pytorch/test/distributed/checkpoint/e2e/test_fsdp_ep.py", line 77, in test_e2e mesh_fsdp_ep = _mesh_resources.create_child_mesh(mesh_fsdp_tp, ("dp",)) AttributeError: '_MeshEnv' object has no attribute 'create_child_mesh' To execute this test, run the following from the base repo dir: python test/distributed/checkpoint/e2e/test_fsdp_ep.py TestFSDPWithEP.test_e2e This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0. Did you mean: 'create_sub_mesh'? ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158695 Approved by: https://github.com/Skylion007, https://github.com/nWEIdia	2025-08-08 22:36:47 +00:00
Eddie Yan	1128f4c2a8	[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for `sm90`, `sm100` (#149282 ) cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282 Approved by: https://github.com/drisspg Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-08-08 22:22:48 +00:00
Robert Hardwick	334ecbd4ff	Add torchao to install_inductor_benchmark_deps cleanup stage (#160191 ) It looks like `torchao` was missed from the cleanup during torchbench setup. Fixes #160188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160191 Approved by: https://github.com/huydhn	2025-08-08 22:18:41 +00:00
PyTorch MergeBot	206c1eef65	Revert "[pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#159655 )" This reverts commit 2ee22e435131369a7e4f8cc4732579acc29a941b. Reverted https://github.com/pytorch/pytorch/pull/159655 on behalf of https://github.com/clee2000 due to broke dynamo/test_utils.py::TestDynamoTimed::test_dynamo_timed [GH job link](https://github.com/pytorch/pytorch/actions/runs/16839294394/job/47711078667) [HUD commit link](`2ee22e4351`). Probably a landrace since it did run on the PR ([comment](https://github.com/pytorch/pytorch/pull/159655#issuecomment-3169400889))	2025-08-08 22:04:22 +00:00
Nikita Shulga	28ccc9e724	[MPS] Extend `index_put` to complex types (#160159 ) And delete confusing supported types check. Move all pseudo atomic (but eventually consistent) ops to `c10/metal/atomic.h` header Fixes https://github.com/pytorch/pytorch/issues/160034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160159 Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/Skylion007	2025-08-08 21:54:30 +00:00
Syed Tousif Ahmed	2247aa6d1d	Documents tuning NVLink performance on H100/H200 (#159792 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159792 Approved by: https://github.com/ngimel	2025-08-08 20:28:24 +00:00
Sheng Fu	1febab2a89	Do not treat ReinterpretView as a realized node (#159920 ) Summary: Do not treat ReinterpretView as a realized node. Function [gather_origins](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L888) calls is_realized_node to decide if an FX node should be included in the origins of an IR node. ReinterpretView is considered a realized node, so it is not included in the origins. This leads to an incomplete graph. For example: ``` @torchdynamo.optimize("inductor") def fn(input_data, weight): normalized_input = input_data * weight.unsqueeze(0) return normalized_input input_data = torch.randn(4272, 192, requires_grad=True).to(device) weight = torch.randn(192, requires_grad=True).to(device) fn(input_data, weight) ``` The original FX graph returned in [get_kernel_metadata](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L723) is the following: %primals_2 : Tensor "f32[4272, 192][192, 1]cuda:0" = PlaceHolder[target=primals_2] %primals_1 : Tensor "f32[192][1]cuda:0" = PlaceHolder[target=primals_1] %mul : Tensor "f32[4272, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %unsqueeze), kwargs = {}) return %mul The unsqueeze op is missing. 
With this DIFF, the new FX graph is the following: %primals_2 : Tensor "f32[4272, 192][192, 1]cuda:0" = PlaceHolder[target=primals_2] %primals_1 : Tensor "f32[192][1]cuda:0" = PlaceHolder[target=primals_1] %unsqueeze : Tensor "f32[1, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.unsqueeze.default](args = (%primals_1, 0), kwargs = {}) %mul : Tensor "f32[4272, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %unsqueeze), kwargs = {}) return %mul Pull Request resolved: https://github.com/pytorch/pytorch/pull/159920 Approved by: https://github.com/mlazos	2025-08-08 20:13:35 +00:00
Jovian Anthony Jaison	2ee22e4351	[pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#159655 ) This change logs the stack trace of the code being compiled by Dynamo, improving visibility into what is compiled. It adds a stack_trace field to compilation metrics. This helps with debugging and analysis of Dynamo compilation behavior. Ref [D79287964](https://www.internalfb.com/diff/D79287964) Test Plan: $ python -m test_utils Internal: ref [D79372519](https://www.internalfb.com/diff/D79372519) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159655 Approved by: https://github.com/c00w	2025-08-08 19:53:47 +00:00
James Dong	c86040a8e6	[torch.export] Fix test_export_api_with_dynamic_shapes (#160164 ) Summary: Update test KJT's dynamic_shapes to match the newly exported fields. Test Plan: ``` buck test 'fbcode//mode/opt' fbcode//caffe2/test:test_export -- --exact 'caffe2/test:test_export - test_export_api_with_dynamic_shapes_cpp_runtime_nonstrict (caffe2.test.export.test_nativert.NativeRTTestExport)' File changed: fbcode//caffe2/test/export/test_export.py Buck UI: https://www.internalfb.com/buck2/8247eaf8-eaf9-4876-95cb-7b4263d15ef2 Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275093345198 Network: Up: 100KiB Down: 0B (reSessionID-72a2579f-df3f-4262-9aa3-de0db9687 Executing actions. Remaining 0/2 Command: test. Time elapsed: 2:20.5s Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Rollback Plan: Reviewed By: malaybag Differential Revision: D79862872 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160164 Approved by: https://github.com/angelayi, https://github.com/ezyang	2025-08-08 19:45:30 +00:00
Anshul Sinha	72009ec6be	[replicate][be] improved readability and cleaned up remaining DDP code (#160133 ) Summary: As much of the ReplicateState functionality is copied from FSDPState, I fixed the remaining comments that incorrectly referred to FSDP instead of Replicate. In addition, instead of labeling modules FSDPModule or FSDPLinear, I have changed it so that it now uses Replicate____. Finally, I have removed some leftover code from the DDP implementation. I have included test cases to verify correctness. Test Case 1. pytest test/distributed/_composable/test_replicate_with_fsdp.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/160133 Approved by: https://github.com/mori360 ghstack dependencies: #160128	2025-08-08 19:42:23 +00:00
Andres Lugo	5f5f508aa8	[ROCm] Ck backend UX refactor (#152951 ) Refactors how the enablement/disablement of CK Gemms and SDPA works. - Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms. - USE_ROCM_CK_GEMM is set to True by default on Linux - Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA. - USE_ROCM_CK_SDPA is set to False by default - (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release) - Prevents these CK libraries from being used unless pytorch has been built specifically with the functionality AND is running on a system architecture that supports it. - the getters for these library backends will also do some validity checking in case the user used an environment variable to change the backend. If invalid, (i.e. one of the cases mentioned above is false) the backend will be set as the current non-CK default Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951 Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-08-08 18:40:17 +00:00
Yu, Guangye	da1f608ca3	Add UT for torch.accelerator memory-related API (#155200 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155200 Approved by: https://github.com/albanD ghstack dependencies: #138222, #152932	2025-08-08 17:41:22 +00:00
Yu, Guangye	84f7e88aef	Add unified memory APIs for torch.accelerator (#152932 ) # Motivation The following API will be put under torch.accelerator - empty_cache - max_memory_allocated - max_memory_reserved - memory_allocated - memory_reserved - memory_stats - reset_accumulated_memory_stats - reset_peak_memory_stats Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932 Approved by: https://github.com/albanD ghstack dependencies: #138222	2025-08-08 17:41:22 +00:00
Yu, Guangye	d7114f05b1	Add DeviceAllocator as the base device allocator (#138222 ) # Motivation In line with [RFC] [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories, such as HuggingFace [so many if-else conditional code](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases. Device-specific memory APIs (`torch.xxx.foo`) and their device-agnostic counterparts (`torch.accelerator.foo`): `torch.xxx.empty_cache` -> `torch.accelerator.empty_cache`; `torch.xxx.reset_peak_memory_stats` -> `torch.accelerator.reset_peak_memory_stats`; `torch.xxx.reset_accumulated_memory_stats` -> `torch.accelerator.reset_accumulated_memory_stats`; `torch.xxx.memory_stats` -> `torch.accelerator.memory_stats`; `torch.xxx.memory_allocated` -> `torch.accelerator.memory_allocated`; `torch.xxx.max_memory_allocated` -> `torch.accelerator.max_memory_allocated`; `torch.xxx.memory_reserved` -> `torch.accelerator.memory_reserved`; `torch.xxx.max_memory_reserved` -> `torch.accelerator.max_memory_reserved`. # Solution This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. 
This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222 Approved by: https://github.com/albanD, https://github.com/Camyll	2025-08-08 17:41:10 +00:00
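The unified call path described in the commit above (a device-agnostic entry point that looks up the allocator for a device and forwards the call) can be pictured with a small pure-Python sketch. The real change is implemented in PyTorch's C++ allocator layer; everything below, including `FakeAllocator`, `register_allocator`, `get_device_allocator`, and the byte counts, is a hypothetical stand-in for illustration.

```python
from abc import ABC, abstractmethod


class DeviceAllocator(ABC):
    """Hypothetical Python mirror of a base device allocator interface."""

    @abstractmethod
    def empty_cache(self) -> None: ...

    @abstractmethod
    def memory_allocated(self) -> int: ...

    @abstractmethod
    def memory_reserved(self) -> int: ...


class FakeAllocator(DeviceAllocator):
    """Toy backend: 'reserved' is live memory plus cached, unused blocks."""

    def __init__(self, allocated: int, cached: int) -> None:
        self._allocated = allocated
        self._cached = cached

    def empty_cache(self) -> None:
        # Release cached blocks; live allocations are untouched.
        self._cached = 0

    def memory_allocated(self) -> int:
        return self._allocated

    def memory_reserved(self) -> int:
        return self._allocated + self._cached


# Registry keyed by device type, standing in for the per-device lookup.
_ALLOCATORS: dict[str, DeviceAllocator] = {}


def register_allocator(device_type: str, allocator: DeviceAllocator) -> None:
    _ALLOCATORS[device_type] = allocator


def get_device_allocator(device_type: str) -> DeviceAllocator:
    try:
        return _ALLOCATORS[device_type]
    except KeyError:
        raise RuntimeError(f"no allocator registered for {device_type!r}") from None


def empty_cache(device_type: str) -> None:
    """Device-agnostic entry point: look up the backend, then forward."""
    get_device_allocator(device_type).empty_cache()


if __name__ == "__main__":
    register_allocator("fake", FakeAllocator(allocated=4096, cached=1024))
    empty_cache("fake")
    print(get_device_allocator("fake").memory_reserved())
```

Callers depend only on the base interface, so a new backend is added by registering another subclass rather than adding another branch at each call site.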
albanD	c5ec5458a5	Don't build nccl when distributed is disabled (#160086 ) Because distributed doesn't build on recent compilers, I have to disable distributed, but the build still fails because nccl is still built. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160086 Approved by: https://github.com/Skylion007, https://github.com/janeyx99	2025-08-08 17:19:16 +00:00
Kurt Mohler	86eb65f7f0	[MPS] Move max_pool2d to Metal for `stride != 1` (#157876 ) This PR updates `max_pool2d` to use a Metal kernel instead of the old MPS graph impl. However, when the `stride` argument is 1 in all dimensions, the old implementation gives significantly better performance, so we fall back to it in that case. Below is a performance comparison of `max_pool2d` before and after this PR, obtained from this script: `2f02f2bf7a/max_pool_mps/perf.py` <details><summary>Click to expand</summary> case \| before PR \| after PR \| speedup \| \| case info -- \| -- \| -- \| -- \| -- \| -- 0 \| 0.014264 \| 0.004473 \| 3.188911245 \| \| (3, 2, 2), {'kernel_size': 2, 'return_indices': True} 1 \| 0.010752 \| 0.00421 \| 2.55391924 \| \| (3, 2, 2), {'kernel_size': 2, 'return_indices': False} 2 \| 0.020777 \| 0.006123 \| 3.393271272 \| \| (3, 10, 10), {'kernel_size': 5, 'return_indices': True} 3 \| 0.011065 \| 0.005759 \| 1.921340511 \| \| (3, 10, 10), {'kernel_size': 5, 'return_indices': False} 4 \| 0.01452 \| 0.007829 \| 1.854642994 \| \| (3, 100, 100), {'kernel_size': 5, 'return_indices': True} 5 \| 0.009258 \| 0.007075 \| 1.308551237 \| \| (3, 100, 100), {'kernel_size': 5, 'return_indices': False} 6 \| 0.188137 \| 0.168688 \| 1.115295694 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': True} 7 \| 0.161362 \| 0.154746 \| 1.042753932 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': False} 8 \| 0.182883 \| 0.16945 \| 1.079274122 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': True} 9 \| 0.156875 \| 0.163346 \| 0.9603847049 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': False} 10 \| 0.193433 \| 0.167396 \| 1.155541351 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': True} 11 \| 0.158967 \| 
0.151246 \| 1.051049284 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': False} 12 \| 0.931071 \| 0.932883 \| 0.9980576342 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': True} 13 \| 0.324496 \| 0.3252 \| 0.9978351784 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': False} 14 \| 0.944071 \| 0.936246 \| 1.008357846 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': True} 15 \| 0.322171 \| 0.314854 \| 1.023239343 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': False} 16 \| 0.894158 \| 0.886408 \| 1.008743152 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': True} 17 \| 0.309338 \| 0.304146 \| 1.017070749 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': False} 18 \| 0.606 \| 0.260546 \| 2.325884873 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': True} 19 \| 0.30445 \| 0.231054 \| 1.317657344 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': False} 20 \| 0.474708 \| 0.261925 \| 1.812381407 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': True} 21 \| 0.23175 \| 0.231883 \| 0.9994264349 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': False} 22 \| 0.434475 \| 0.266246 \| 1.631855502 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': True} 23 \| 0.236942 \| 0.231792 \| 1.022218196 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': False} 24 \| 0.202396 \| 0.174888 \| 1.157289237 \| \| (3, 1000, 
1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': True} 25 \| 0.160679 \| 0.158246 \| 1.015374796 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': False} 26 \| 0.200354 \| 0.184133 \| 1.088093932 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': True} 27 \| 0.160779 \| 0.160679 \| 1.000622359 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': False} 28 \| 0.199175 \| 0.178625 \| 1.115045486 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': True} 29 \| 0.159458 \| 0.160883 \| 0.9911426316 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': False} 30 \| 0.199021 \| 0.165329 \| 1.203787599 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': True} 31 \| 0.156337 \| 0.158213 \| 0.9881425673 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': False} 32 \| 0.180146 \| 0.174483 \| 1.032455884 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': True} 33 \| 0.156988 \| 0.158167 \| 0.9925458534 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': False} 34 \| 0.182133 \| 0.176521 \| 1.031792251 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': True} 35 \| 0.169042 \| 0.156483 \| 1.080257919 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': False} 36 \| 1.767821 \| 1.766254 \| 1.000887188 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': True} 37 \| 1.059346 \| 1.058775 \| 1.000539302 \| \| (3, 1000, 1000), {'kernel_size': 
5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': False} 38 \| 1.85755 \| 1.859429 \| 0.9989894747 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': True} 39 \| 1.100417 \| 1.097683 \| 1.002490701 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': False} 40 \| 1.843167 \| 1.847558 \| 0.9976233493 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': True} 41 \| 1.090142 \| 1.093163 \| 0.9972364597 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': False} 42 \| 0.480867 \| 0.251733 \| 1.910226311 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': True} 43 \| 0.319246 \| 0.236479 \| 1.349997251 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': False} 44 \| 0.49315 \| 0.256408 \| 1.923301925 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': True} 45 \| 0.316746 \| 0.227854 \| 1.390127011 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': False} 46 \| 0.4912 \| 0.257762 \| 1.905633879 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': True} 47 \| 0.324771 \| 0.229371 \| 1.41592006 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': False} 48 \| 0.152904 \| 0.095079 \| 1.608178462 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': True} 49 \| 0.102963 \| 0.089217 \| 1.154073775 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': False} 50 \| 0.155158 \| 0.095429 \| 1.625899884 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 
'return_indices': True} 51 \| 0.104338 \| 0.089979 \| 1.15958168 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': False} 52 \| 0.153121 \| 0.096429 \| 1.587914424 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': True} 53 \| 0.103642 \| 0.090254 \| 1.148336916 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': False} 54 \| 0.191071 \| 0.165125 \| 1.157129447 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': True} 55 \| 0.153971 \| 0.149021 \| 1.033216795 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': False} 56 \| 0.193192 \| 0.166892 \| 1.157586942 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': True} 57 \| 0.156617 \| 0.15215 \| 1.029359185 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': False} 58 \| 0.178033 \| 0.167308 \| 1.06410333 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': True} 59 \| 0.157425 \| 0.164404 \| 0.9575496947 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': False} 60 \| 1.757638 \| 1.750896 \| 1.0038506 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': True} 61 \| 1.048471 \| 1.047967 \| 1.000480931 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': False} 62 \| 1.790708 \| 1.789767 \| 1.000525767 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': True} 63 \| 1.054575 \| 1.054796 \| 0.9997904808 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': False} 64 
\| 1.785837 \| 1.784192 \| 1.000921986 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': True} 65 \| 1.054713 \| 1.054492 \| 1.00020958 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': False} 66 \| 0.478267 \| 0.261017 \| 1.832321266 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': True} 67 \| 0.32005 \| 0.226654 \| 1.412064204 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': False} 68 \| 0.484008 \| 0.254721 \| 1.900149575 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': True} 69 \| 0.321 \| 0.218842 \| 1.466811672 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': False} 70 \| 0.482087 \| 0.248771 \| 1.937874591 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': True} 71 \| 0.316558 \| 0.230533 \| 1.373156988 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': False} 72 \| 0.137842 \| 0.085088 \| 1.619993419 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': True} 73 \| 0.100671 \| 0.0769 \| 1.309115735 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': False} 74 \| 0.148321 \| 0.086967 \| 1.705485989 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': True} 75 \| 0.101392 \| 0.075454 \| 1.343759112 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': False} 76 \| 0.150208 \| 0.083742 \| 1.793699697 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': True} 77 \| 0.099587 \| 0.075825 \| 1.313379492 \| \| (3, 
1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': False} 78 \| 0.622546 \| 0.602729 \| 1.03287879 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': True} 79 \| 0.531696 \| 0.5067 \| 1.049330965 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': False} 80 \| 0.626646 \| 0.617038 \| 1.015571164 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': True} 81 \| 0.530354 \| 0.525367 \| 1.009492412 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': False} 82 \| 0.633933 \| 0.577775 \| 1.097197006 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': True} 83 \| 0.533067 \| 0.526954 \| 1.011600633 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': False} 84 \| 3.372867 \| 3.386412 \| 0.9960001914 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': True} 85 \| 1.155975 \| 1.156604 \| 0.9994561665 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': False} 86 \| 3.401921 \| 3.39755 \| 1.001286515 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': True} 87 \| 1.202829 \| 1.192538 \| 1.008629494 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': False} 88 \| 3.23675 \| 3.220238 \| 1.005127571 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': True} 89 \| 1.077067 \| 1.085613 \| 0.9921279498 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': False} 90 \| 1.572925 \| 0.925625 \| 1.699311276 \| \| (3, 2000, 2000), 
{'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': True} 91 \| 0.791204 \| 0.793454 \| 0.9971642969 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': False} 92 \| 1.572742 \| 0.922729 \| 1.704446268 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': True} 93 \| 0.784292 \| 0.788871 \| 0.9941955022 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': False} 94 \| 1.526546 \| 0.925708 \| 1.649057802 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': True} 95 \| 0.769321 \| 0.787675 \| 0.9766985114 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': False} 96 \| 0.736033 \| 0.612808 \| 1.201082558 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': True} 97 \| 0.574625 \| 0.530925 \| 1.082309177 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': False} 98 \| 0.722021 \| 0.614488 \| 1.174996094 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': True} 99 \| 0.563171 \| 0.533721 \| 1.055178642 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': False} 100 \| 0.735725 \| 0.613992 \| 1.198264798 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': True} 101 \| 0.583487 \| 0.532513 \| 1.095723485 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': False} 102 \| 0.656383 \| 0.575313 \| 1.140914598 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': True} 103 \| 0.559796 \| 0.509079 \| 1.099625009 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 
'stride': None, 'padding': 0, 'return_indices': False} 104 \| 0.662046 \| 0.572362 \| 1.156691045 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': True} 105 \| 0.552633 \| 0.508671 \| 1.086425214 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': False} 106 \| 0.634108 \| 0.574629 \| 1.103508525 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': True} 107 \| 0.534013 \| 0.510996 \| 1.045043405 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': False} 108 \| 7.056642 \| 7.066717 \| 0.9985743026 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': True} 109 \| 4.144275 \| 4.142658 \| 1.000390329 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': False} 110 \| 7.172683 \| 7.189867 \| 0.9976099697 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': True} 111 \| 4.162538 \| 4.158875 \| 1.000880767 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': False} 112 \| 7.194233 \| 7.181837 \| 1.001726021 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': True} 113 \| 4.294083 \| 4.196062 \| 1.023360236 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': False} 114 \| 1.875692 \| 0.891071 \| 2.104986022 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': True} 115 \| 1.097479 \| 0.781175 \| 1.404907991 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': False} 116 \| 1.8883 \| 0.89015 \| 2.121327866 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 
'padding': 1, 'return_indices': True} 117 \| 1.101329 \| 0.778542 \| 1.414604479 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': False} 118 \| 1.872833 \| 0.893654 \| 2.095702587 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': True} 119 \| 1.096712 \| 0.784579 \| 1.397835017 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': False} 120 \| 0.513029 \| 0.374417 \| 1.370207549 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': True} 121 \| 0.349546 \| 0.305763 \| 1.143192603 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': False} 122 \| 0.518929 \| 0.377487 \| 1.374693698 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': True} 123 \| 0.364662 \| 0.3145 \| 1.159497615 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': False} 124 \| 0.521275 \| 0.375242 \| 1.389170189 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': True} 125 \| 0.367488 \| 0.308354 \| 1.191773092 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': False} 126 \| 0.652342 \| 0.569308 \| 1.145850752 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': True} 127 \| 0.555696 \| 0.506892 \| 1.096280865 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': False} 128 \| 0.654333 \| 0.570367 \| 1.147213987 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': True} 129 \| 0.548925 \| 0.505825 \| 1.085207335 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 
'return_indices': False} 130 \| 0.655908 \| 0.571904 \| 1.146884792 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': True} 131 \| 0.560808 \| 0.508238 \| 1.103435792 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': False} 132 \| 6.949462 \| 6.949112 \| 1.000050366 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': True} 133 \| 4.072913 \| 4.065013 \| 1.001943413 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': False} 134 \| 7.200896 \| 7.197792 \| 1.000431243 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': True} 135 \| 4.291367 \| 4.218538 \| 1.017264038 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': False} 136 \| 7.1823 \| 7.306933 \| 0.9829431856 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': True} 137 \| 4.151175 \| 4.149592 \| 1.000381483 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': False} 138 \| 1.781279 \| 0.884288 \| 2.014365229 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': True} 139 \| 1.050804 \| 0.774362 \| 1.356993241 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': False} 140 \| 1.860758 \| 0.884637 \| 2.103414169 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': True} 141 \| 1.099908 \| 0.775887 \| 1.417613647 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': False} 142 \| 1.857387 \| 0.885738 \| 2.096993693 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': True} 
143 \| 1.105279 \| 0.77365 \| 1.428655077 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': False} 144 \| 0.489408 \| 0.269583 \| 1.815426047 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': True} 145 \| 0.322525 \| 0.236979 \| 1.360985573 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': False} 146 \| 0.515475 \| 0.265813 \| 1.93923924 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': True} 147 \| 0.315525 \| 0.228146 \| 1.382995976 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': False} 148 \| 0.503438 \| 0.277204 \| 1.816128194 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': True} 149 \| 0.335421 \| 0.228275 \| 1.469372467 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': False} 150 \| 5.72495 \| 4.909554 \| 1.166083518 \| \| (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': None, 'return_indices': True} 151 \| 4.45215 \| 4.251333 \| 1.047236243 \| \| (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': None, 'return_indices': False} 152 \| 29.953021 \| 29.879879 \| 1.002447868 \| \| (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': True} 153 \| 9.854683 \| 9.839517 \| 1.001541336 \| \| (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': False} 154 \| 6.178033 \| 5.697375 \| 1.084364817 \| \| (10, 10, 1000, 1000), {'kernel_size': 100, 'padding': 50, 'return_indices': True} 155 \| 6.280317 \| 5.712525 \| 1.099394226 \| \| (10, 10, 1000, 1000), {'kernel_size': 100, 'padding': 50, 'return_indices': False} 156 \| 10.256062 \| 11.336527 \| 0.9046917103 \| \| (10, 10, 1000, 1000), {'kernel_size': 250, 'padding': 
50, 'return_indices': True} 157 \| 9.469546 \| 11.33705 \| 0.8352742556 \| \| (10, 10, 1000, 1000), {'kernel_size': 250, 'padding': 50, 'return_indices': False} 158 \| 0.119087 \| 0.0797 \| 1.494190715 \| \| (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': True} 159 \| 0.098713 \| 0.047173 \| 2.092574142 \| \| (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': False} 160 \| 0.960812 \| 0.675762 \| 1.421820108 \| \| (10, 10, 300, 300), {'kernel_size': 2, 'return_indices': True} 161 \| 0.536546 \| 0.485958 \| 1.104099531 \| \| (10, 10, 300, 300), {'kernel_size': 2, 'return_indices': False} 162 \| 2.555225 \| 1.791567 \| 1.426251432 \| \| (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': True} 163 \| 1.419087 \| 1.305137 \| 1.087308842 \| \| (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': False} 164 \| 5.182008 \| 3.48085 \| 1.488719135 \| \| (10, 10, 700, 700), {'kernel_size': 2, 'return_indices': True} 165 \| 2.831779 \| 2.498537 \| 1.133374851 \| \| (10, 10, 700, 700), {'kernel_size': 2, 'return_indices': False} 166 \| 8.546038 \| 5.7783 \| 1.478988284 \| \| (10, 10, 900, 900), {'kernel_size': 2, 'return_indices': True} 167 \| 4.731004 \| 4.161975 \| 1.136720908 \| \| (10, 10, 900, 900), {'kernel_size': 2, 'return_indices': False} 168 \| 0.084754 \| 0.07435 \| 1.139932751 \| \| (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': True} 169 \| 0.057933 \| 0.043096 \| 1.344277891 \| \| (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': False} 170 \| 2.568592 \| 1.802117 \| 1.425319222 \| \| (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': True} 171 \| 1.433054 \| 1.307342 \| 1.096158465 \| \| (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': False} 172 \| 10.3213 \| 7.111604 \| 1.451332217 \| \| (10, 10, 1000, 1000), {'kernel_size': 2, 'return_indices': True} 173 \| 5.680525 \| 5.168129 \| 1.099145358 \| \| (10, 10, 1000, 1000), {'kernel_size': 2, 'return_indices': False} 174 \| 1.02255 \| 1.01375 \| 1.008680641 
\| \| (10, 1000, 1000), {'kernel_size': 2, 'padding': 1, 'stride': 1, 'return_indices': False} 175 \| 3.074233 \| 3.094383 \| 0.993488201 \| \| (10, 1000, 1000), {'kernel_size': 2, 'padding': 1, 'stride': 1, 'return_indices': True} 176 \| 1.016812 \| 1.030575 \| 0.9866453194 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': False} 177 \| 3.053658 \| 3.089504 \| 0.9883974903 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': True} 178 \| 1.025863 \| 1.032088 \| 0.9939685376 \| \| (10, 1000, 1000), {'kernel_size': 8, 'padding': 1, 'stride': 1, 'return_indices': False} 179 \| 3.798942 \| 3.799213 \| 0.9999286694 \| \| (10, 1000, 1000), {'kernel_size': 8, 'padding': 1, 'stride': 1, 'return_indices': True} 180 \| 4.492979 \| 4.493421 \| 0.999901634 \| \| (10, 1000, 1000), {'kernel_size': 16, 'padding': 1, 'stride': 1, 'return_indices': False} 181 \| 51.543363 \| 51.266204 \| 1.005406271 \| \| (10, 1000, 1000), {'kernel_size': 16, 'padding': 1, 'stride': 1, 'return_indices': True} 182 \| 1.018008 \| 1.001587 \| 1.016394981 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 1), 'return_indices': False} 183 \| 3.035404 \| 3.003113 \| 1.010752509 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 1), 'return_indices': True} 184 \| 0.610421 \| 0.56 \| 1.0900375 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 4), 'return_indices': False} 185 \| 1.138983 \| 0.757296 \| 1.504012962 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 4), 'return_indices': True} 186 \| 0.641558 \| 0.557808 \| 1.150141267 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (4, 1), 'return_indices': False} 187 \| 1.181475 \| 0.754725 \| 1.565437742 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (4, 1), 'return_indices': True} 188 \| 1.03045 \| 1.026904 \| 1.003453098 \| \| (10, 1000, 1000), {'kernel_size': 4, 
'padding': 1, 'stride': (1, 1), 'return_indices': False} 189 \| 3.041421 \| 3.0263 \| 1.00499653 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 1), 'return_indices': True} 190 \| 0.609929 \| 0.572304 \| 1.065743032 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 4), 'return_indices': False} 191 \| 1.146875 \| 0.756446 \| 1.516135983 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 4), 'return_indices': True} 192 \| 0.645187 \| 0.561708 \| 1.148616363 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (4, 1), 'return_indices': False} 193 \| 1.181721 \| 0.758054 \| 1.558887625 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (4, 1), 'return_indices': True} 194 \| 0.927654 \| 0.925946 \| 1.0018446 \| \| (10, 1000, 1000), {'kernel_size': 1, 'return_indices': False} 195 \| 2.749983 \| 2.740354 \| 1.00351378 \| \| (10, 1000, 1000), {'kernel_size': 1, 'return_indices': True} </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157876 Approved by: https://github.com/malfet	2025-08-08 16:40:10 +00:00
Animesh Jain	a4f69a5da0	[dynamo][guards] Remove guards on stdlib modules (#159913 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159913 Approved by: https://github.com/StrongerXi	2025-08-08 16:26:04 +00:00
Adam J. Stewart	231c72240d	CMake build: preserve PYTHONPATH (#160144 ) Fixes #160092 I'm very new to CMake, so let me know if there's a fancier way to do this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160144 Approved by: https://github.com/malfet Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>	2025-08-08 16:03:49 +00:00
gaoyvfeng	50f23ff6f8	rename-HAS_CUDA-to-HAS_CUDA_AND_TRITON (#159883 ) Fixes #159399 "Modified torch.testing._internal.inductor_utils and test/inductor" Pull Request resolved: https://github.com/pytorch/pytorch/pull/159883 Approved by: https://github.com/janeyx99	2025-08-08 15:44:52 +00:00
zpcore	8a37f0c903	improve gather and scatter_add strategy (#160140 ) As title. This PR made a small fix on top of https://github.com/meta-pytorch/autoparallel/pull/81. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160140 Approved by: https://github.com/fmassa	2025-08-08 15:06:24 +00:00
Edward Z. Yang	b5fd7223b1	Improve pin_memory error message on CPU-only systems (#159994 ) ## Summary - clarify pin_memory error message when no accelerator backend is available ## Testing - `python repro_pin_memory.py` (fails: Need to provide pin_memory allocator to use pin memory) - `lintrunner -a` ------ https://chatgpt.com/codex/tasks/task_e_6893ba92c93483238a9bdfdd6c52812b Pull Request resolved: https://github.com/pytorch/pytorch/pull/159994 Approved by: https://github.com/albanD	2025-08-08 14:36:45 +00:00
Edward Yang	9fa8ce26cf	Working setup with runnable PyTorch on Codex. (#159968 ) Sample transcript: https://chatgpt.com/s/cd_68938effc1a88191ae78bc82a8cefe94 This makes use of https://github.com/pytorch/pytorch/pull/159965 to bypass doing an actual build and use nightly. Things to improve: - Once USE_NIGHTLY is in main can remove the patching - We should just keep using the latest nightly, instead of a hard coded one Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159968 Approved by: https://github.com/wdvr	2025-08-08 14:34:15 +00:00
David Berard	62bac07981	[inductor][triton] support profile_scratch launcher arg (#159772 ) This adds support for Triton after https://github.com/triton-lang/triton/pull/7258 landed. https://github.com/triton-lang/triton/pull/7258 adds a new argument to all the Triton kernels - a profile_scratch argument, similar to global_scratch. This PR updates the static cuda launcher and the AOTI kernel callers to pass in these arguments when calling the Triton kernel. Tests: https://github.com/pytorch/pytorch/pull/159158. I also verified these test locally with triton 3.2, 3.3, and 3.4. Fixes: * static_cuda_launcher (test/repro: `python tools/dynamo/verify_dynamo.py`) * AOTI calling logic (test/repro: `TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_linalg_vander_cuda_float32`) Differential Revision: [D79825121](https://our.internmc.facebook.com/intern/diff/D79825121) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159772 Approved by: https://github.com/NikhilAPatel, https://github.com/eellison	2025-08-08 14:27:38 +00:00
Isalia20	7f4cb4a3e0	[MPS] coalesce for sparse tensors (#159729 ) MPS coalesce function for sparse tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/159729 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-08 13:49:55 +00:00
Aidyn-A	556e2a73f4	[Test][Easy] Use float16 dtype in test_sort_large (#159939 ) The test fails with: >RuntimeError: var_mean only support floating point and complex dtypes Pull Request resolved: https://github.com/pytorch/pytorch/pull/159939 Approved by: https://github.com/eqy	2025-08-08 09:56:44 +00:00
Xuehai Pan	178515d0ff	[BE][PYFMT] remove `black`: finish `black -> ruff format` migration (#144557 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144557 Approved by: https://github.com/ezyang	2025-08-08 07:46:10 +00:00
codingwithsurya	3a56237440	[SymmMem] Send tensors with unerased type information to NVSHMEM Triton kernels (#159788 ) This PR introduces a small `@triton.jit` wrapper function over our core NVSHMEM extern functions so that users can pass tensors (rather than pointers) as inputs to their NVSHMEM Triton kernels. The goal is to abstract away tedious details from the developer, like manual byte-size calculations and handling of raw `int64` pointers. This lets developers work directly with typed Triton tensors and element counts, which is also useful if you want to do, for instance, some local math on the data. ----- TODO: This is almost complete. One pending item is a tensor-aware implementation of `nvshmem.putmem_signal_block` and `nvshmem.signal_wait_until`. From my investigation, the root cause is that this specific tensor API uses local addresses instead of remote addresses for the peer: ``` Pointer-Based Version: Rank 0 → Rank 1: Local buffer: 0x430300a00 (src) Remote buffer: 0x2430300c00 (dst) ← Rank 1's memory Remote signal: 0x2430301600 (sig) ← Rank 1's signal Rank 1 (waiting): Local signal: 0x430301600 (waits here) Tensor-Based Version: Rank 0 → Rank 1: Local buffer: 0x430300a00 (src) Local buffer: 0x430300c00 (dst) ← this is wrong Local signal: 0x430300e00 (sig) ← this is wrong Rank 1 (waiting): Local signal: 0x430300e00 (waits here) ``` Next Steps: Need a mechanism to resolve a local tensor → remote PE address, equivalent to the handle.buffer_ptrs[peer] lookup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159788 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755, #159756	2025-08-08 05:20:42 +00:00
codingwithsurya	e0d8a315c5	[SymmMem] Add helpful docstrings for all NVSHMEM APIs (#159756 ) Fed Claude Code NVSHMEM Documentation and asked it to generate helpful docstrings. Verified for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159756 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755	2025-08-08 05:20:42 +00:00
codingwithsurya	bfff2e3592	[SymmMem] Refactor NVSHMEM Reduction API to be more ergonomic with automatic dtype‐based dispatch (#159755 ) This change introduces a single, generic Triton‐extern wrapper for NVSHMEM team‐based reductions. We now expose one function, `nvshmem.reduce(team, dest, source, nreduce, operation, dtype_id)`, that covers all supported ops (sum, max, min, prod) and dtypes (int8…int64, uint8…uint64, float16, bfloat16, float32, float64). It accepts real dtype objects (torch.dtype or tl.dtype) directly in the Triton kernel launch. Internally, we normalize dtype_id (handling tl.dtype, torch.dtype, str, or constexpr) into the canonical NVSHMEM typename and assemble the proper function name, e.g. nvshmem_float_sum_reduce or nvshmem_bfloat16_prod_reduce Pull Request resolved: https://github.com/pytorch/pytorch/pull/159755 Approved by: https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734	2025-08-08 05:20:36 +00:00
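The name assembly described in this commit can be sketched in plain Python. This is only an illustration, not the code from the PR: the helper `reduce_function_name` and the lookup table are hypothetical, and the `float16` → `half` and `float64` → `double` entries are assumptions about NVSHMEM typenames. Only the two resulting names quoted in the commit message (`nvshmem_float_sum_reduce`, `nvshmem_bfloat16_prod_reduce`) come from the source.

```python
# Hypothetical sketch of dtype-based dispatch; not the PyTorch implementation.
# Assumed mapping from dtype names to NVSHMEM typenames.
_NVSHMEM_TYPENAMES = {
    "int8": "int8", "int16": "int16", "int32": "int32", "int64": "int64",
    "uint8": "uint8", "uint16": "uint16", "uint32": "uint32", "uint64": "uint64",
    "float16": "half",        # assumption
    "bfloat16": "bfloat16",   # matches the example in the commit message
    "float32": "float",       # matches the example in the commit message
    "float64": "double",      # assumption
}
_REDUCE_OPS = ("sum", "max", "min", "prod")


def reduce_function_name(dtype_id, operation):
    """Normalize a dtype-like value and build the extern function name."""
    # Accept "torch.float32", "tl.float32" or "float32" by keeping the last component.
    key = str(dtype_id).rsplit(".", maxsplit=1)[-1]
    if key not in _NVSHMEM_TYPENAMES:
        raise ValueError(f"unsupported dtype: {dtype_id!r}")
    if operation not in _REDUCE_OPS:
        raise ValueError(f"unsupported operation: {operation!r}")
    return f"nvshmem_{_NVSHMEM_TYPENAMES[key]}_{operation}_reduce"


print(reduce_function_name("torch.float32", "sum"))
print(reduce_function_name("bfloat16", "prod"))
```

Keeping the normalization in one place means a caller never has to spell out a per-dtype function name, which appears to be the ergonomic point of the refactor.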
codingwithsurya	1c881440f4	[SymmMem] Initialize NVSHMEM module only for kernels that have nvshmem in their name (#159734 ) Previously, a global post-compile hook initialized the NVSHMEM module for all Triton kernels, which was inefficient. This change conditionally initializes `_nvshmemx_cumodule_init(kernel.module)` only for Triton kernels containing "nvshmem" in their name. Also updated the names for all of our nvshmem kernels to align with this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159734 Approved by: https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701	2025-08-08 05:20:29 +00:00
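A minimal sketch of the name-based filter described above, assuming a hook that receives a kernel name and its module. The `make_post_compile_hook` helper, its signature, and the stand-in `init_module` callback are hypothetical; the real hook calls `_nvshmemx_cumodule_init(kernel.module)` and its exact arguments are not shown in the commit message.

```python
from typing import Callable, List


def make_post_compile_hook(init_module: Callable[[object], None]):
    """Return a hook that initializes only kernels whose name contains 'nvshmem'."""
    def hook(kernel_name: str, module: object) -> bool:
        if "nvshmem" not in kernel_name:
            return False  # unrelated Triton kernel: skip initialization
        init_module(module)
        return True
    return hook


# Stand-in for the real initializer: just record which modules were touched.
initialized: List[object] = []
hook = make_post_compile_hook(initialized.append)
hook("nvshmem_put_kernel", "module_a")
hook("triton_mm", "module_b")
print(initialized)
```

The trade-off of such a convention is that a kernel using NVSHMEM without "nvshmem" in its name would be skipped silently, which is presumably why the commit also renames the existing kernels.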
codingwithsurya	7c4f7b9340	[SymmMem] Add Triton 3.4 support to NVSHMEM Triton and fix CI tests (make device library discoverable + fix peer calculation bug) (#159701 ) This PR introduces support for Triton 3.4 and resolves several CI and test-related issues. Triton 3.4 Compatibility - The JIT post-compile hook has been updated from the legacy JITFunction.compiled_hook to the new API path at triton.knobs.runtime.jit_post_compile_hook. - The internal parameter for kernel semantics in extern function definitions has been updated from _semantic to _builder to align with API changes. Fix CI Errors - The new logic inspects the RPATH of libtorch_nvshmem.so to find the NVSHMEM device library, preventing CI tests from being skipped. - Added a decorator to run NVSHMEM tests only on H100s (compatible hardware) Peer Rank Calculation Fix - The peer calculation in test_nvshmem_triton.py was changed from peer = (world_size - 1) - rank to peer = 1 - rank. Reasoning: The previous logic was only valid for a 2-rank setup. In the 8-rank CI environment, it incorrectly mapped peers (e.g., rank 0 to 7), breaking tests that assume a 0↔1 communication pattern. This was reproduced and validated on an 8-rank dev setup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159701 Approved by: https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215	2025-08-08 05:20:22 +00:00
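The peer rank fix is easy to check by hand. The sketch below evaluates both formulas from the commit message; the function names `old_peer` and `new_peer` are illustrative, not identifiers from the test file.

```python
def old_peer(rank: int, world_size: int) -> int:
    """Previous formula: pairs rank r with (world_size - 1) - r."""
    return (world_size - 1) - rank


def new_peer(rank: int) -> int:
    """Fixed formula: ranks 0 and 1 always talk to each other."""
    return 1 - rank


for world_size in (2, 8):
    pairs = [(rank, old_peer(rank, world_size), new_peer(rank)) for rank in (0, 1)]
    print(world_size, pairs)
```

With two ranks the formulas agree, but with eight ranks the old one maps rank 0 to rank 7, which breaks tests that assume a 0↔1 exchange.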
codingwithsurya	1783d6e966	[SymmMem] Fix flaky wait_until test (#159215 ) When playing around with it, I noticed some flakiness in this test across sessions. After debugging, it turned out that the heavy sync primitives I was calling from inside Triton kernels (like `nvshmem_quiet()` or `nvshmem_fence()`) were causing deadlocks. The original test tried to guarantee ordering: `put(data) -> fence/quiet -> put(flag)`. But the GPU thread got stuck in `quiet()` waiting for network confirmation while holding the SM, creating a deadlock. The fix was realizing that `wait_until` already provides all the sync you need. Just do: - PE A: `nvshmem_wait_until(&ivar, ...)` - PE B: `nvshmem_put(&ivar_on_PE_A, ...)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159215 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136	2025-08-08 05:20:16 +00:00
codingwithsurya	ea7fe0ecf6	[SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + improve code clarity (#159136 ) Quick refactor for consistency and clarity. 1. We now standardize all NVSHMEM data-moving collectives (put, get, alltoall, broadcast) to use their byte-based *_mem_block variants. This makes the API behavior more predictable and avoids mixing paradigms. 2. Previously, some functions operated on element counts (nelems), while others expected byte sizes but still used `nelems` as the param name. That inconsistency was easy to miss and could lead to bugs, especially for devs not familiar with the NVSHMEM internals. To clean this up: • All byte-based APIs now use nbytes or nbytes_per_pe to make the units explicit. • Typed APIs consistently use nelems for element counts. • Docstrings were added or updated to clarify expected units. Also did some code cleanup — removed unused functions, fixed typos in comments, and did some general housekeeping. This should make the API more intuitive and reduce friction for developers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159136 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: #158515, #158718	2025-08-08 05:20:09 +00:00
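The units distinction above (`nelems` for typed APIs, `nbytes` for byte-based ones) comes down to one multiplication. A small sketch, assuming a hypothetical `nbytes_for` helper and a hand-written element-size table rather than anything from the PR:

```python
# Element sizes in bytes for a few dtypes (standard widths).
ELEMENT_SIZE_BYTES = {
    "int8": 1, "int16": 2, "int32": 4, "int64": 8,
    "float16": 2, "bfloat16": 2, "float32": 4, "float64": 8,
}


def nbytes_for(nelems: int, dtype_name: str) -> int:
    """Convert an element count into the byte count a byte-based API expects."""
    if nelems < 0:
        raise ValueError("nelems must be non-negative")
    return nelems * ELEMENT_SIZE_BYTES[dtype_name]


# Passing an element count where bytes are expected under-transfers by this factor.
print(nbytes_for(4, "int64"), nbytes_for(4, "float16"))
```

Naming the parameter after its unit makes the off-by-a-factor mistake visible at the call site instead of inside the kernel.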
codingwithsurya	b0b229b197	[SymmMem] Use _get_default_group() instead of group.WORLD for group_name access (#158718 ) Both approaches functionally return the default process group created by `init_process_group()` but `_get_default_group()` is a dedicated function with [better error handling and type safety](`4869f71170/torch/distributed/distributed_c10d.py (L1300-L1310)`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/158718 Approved by: https://github.com/Skylion007, https://github.com/fduwjj ghstack dependencies: #158515	2025-08-08 05:20:02 +00:00
codingwithsurya	b5c937259b	[SymmMem] Add NVSHMEM Reduction support (sum, min, max) into Triton (#158515 ) Implements sum_reduce, min_reduce, and max_reduce collective operations for NVSHMEM Triton kernels. Enables parallel reduction computations across PE teams for int64 data types. Tests: `python test/distributed/test_nvshmem_triton.py` <details> <summary> Quick debug print for sanity check </summary> ```markdown ============================================================ [Rank 1] Starting min/max reduction test with world_size=2 ============================================================ ============================================================ [Rank 0] Starting min/max reduction test with world_size=2 ============================================================ [Rank 0] Source data for min/max: [10, 20] [Rank 1] Source data for min/max: [15, 5] [Rank 1] All values across PEs: [Rank 0] All values across PEs: - Position 0: [10, 15] - Position 0: [10, 15] - Position 1: [20, 5] - Position 1: [20, 5] [Rank 1] Expected min: [10, 5] [Rank 0] Expected min: [10, 5] [Rank 1] Expected max: [15, 20] [Rank 0] Expected max: [15, 20] [Rank 0] Executing MIN reduction... [Rank 1] Executing MIN reduction... [Rank 0] Executing MAX reduction... [Rank 1] Executing MAX reduction... /data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. warnings.warn( # warn only once /data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. 
warnings.warn( # warn only once [Rank 1] Results: [Rank 0] Results: [Rank 1] MIN reduction result: [10, 5] [Rank 1] MAX reduction result: [15, 20] [Rank 0] MIN reduction result: [10, 5] [Rank 0] MAX reduction result: [15, 20] [Rank 1] ============================================================ [Rank 1] Min/Max reduction test PASSED ✓ [Rank 1] ============================================================ [Rank 0] ============================================================ [Rank 0] Min/Max reduction test PASSED ✓ [Rank 0] ============================================================ ...... ============================================================ ============================================================ [Rank 0] Starting sum reduction test with world_size=2 [Rank 1] Starting sum reduction test with world_size=2 ============================================================ ============================================================ [Rank 0] Configuration: [Rank 1] Configuration: - nreduce: 3 (number of separate reductions) - nreduce: 3 (number of separate reductions) - dtype: torch.int64 - dtype: torch.int64 [Rank 1] Source data: [2, 4, 6] [Rank 1] Contribution explanation: [Rank 0] Source data: [1, 2, 3] [Rank 0] Contribution explanation: - Element 0: 2 = (rank=1+1) * (index=0+1) - Element 0: 1 = (rank=0+1) * (index=0+1) - Element 1: 4 = (rank=1+1) * (index=1+1) - Element 1: 2 = (rank=0+1) * (index=1+1) - Element 2: 6 = (rank=1+1) * (index=2+1) - Element 2: 3 = (rank=0+1) * (index=2+1) [Rank 1] Initial destination: [-1, -1, -1] [Rank 0] Initial destination: [-1, -1, -1] [Rank 0] Expected results after reduction: [3, 6, 9] [Rank 1] Expected results after reduction: [3, 6, 9] [Rank 0] Executing sum reduction... [Rank 1] Executing sum reduction... [Rank 1] Sum reduction completed /data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. 
Using the current device set by the user. warnings.warn( # warn only once [Rank 0] Sum reduction completed /data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. warnings.warn( # warn only once [Rank 0] Results after reduction: [Rank 0] Destination buffer: [3, 6, 9] [Rank 1] Results after reduction: [Rank 0] Verification: - Reduction 0: PE0: 1 + PE1: 2 = 3 Result: 3, Match: ✓ - Reduction 1: PE0: 2 + PE1: 4 = 6 Result: 6, Match: ✓ [Rank 1] Destination buffer: [3, 6, 9] - Reduction 2: PE0: 3 + PE1: 6 = 9 [Rank 1] Verification: - Reduction 0: PE0: 1 + PE1: 2 = 3 Result: 9, Match: ✓ Result: 3, Match: ✓ - Reduction 1: PE0: 2 + PE1: 4 = 6 Result: 6, Match: ✓ - Reduction 2: PE0: 3 + PE1: 6 = 9 Result: 9, Match: ✓ [Rank 0] ============================================================ [Rank 0] Sum reduction test PASSED ✓ [Rank 0] All 3 reductions computed correctly across 2 PEs [Rank 0] ============================================================ [Rank 1] ============================================================ [Rank 1] Sum reduction test PASSED ✓ [Rank 1] All 3 reductions computed correctly across 2 PEs [Rank 1] ============================================================ ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158515 Approved by: https://github.com/mandroid6, https://github.com/ngimel	2025-08-08 05:19:55 +00:00
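The expected values in the debug output above can be reproduced without any GPU. The sketch below recomputes them in plain Python; the helper names are illustrative, and only the formula `(rank + 1) * (index + 1)` and the sample inputs are taken from the log.

```python
def source_data(rank: int, nreduce: int) -> list:
    """Per-rank contribution used in the sum test: (rank + 1) * (index + 1)."""
    return [(rank + 1) * (index + 1) for index in range(nreduce)]


def reduce_across_pes(per_pe_values, op):
    """Apply `op` position by position across the lists contributed by each PE."""
    return [op(column) for column in zip(*per_pe_values)]


sum_inputs = [source_data(rank, 3) for rank in range(2)]   # [[1, 2, 3], [2, 4, 6]]
minmax_inputs = [[10, 20], [15, 5]]                        # ranks 0 and 1 in the log

print(reduce_across_pes(sum_inputs, sum))
print(reduce_across_pes(minmax_inputs, min))
print(reduce_across_pes(minmax_inputs, max))
```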
PyTorch UpdateBot	24257f5bfa	[vllm hash update] update the pinned vllm hash (#159822 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159822 Approved by: https://github.com/pytorchbot	2025-08-08 04:13:48 +00:00
Yiming Zhou	017259f9c6	[benchmarks] Add nativert benchmark (#159922 ) Add NativeRT as an option in the PT2 OSS benchmark ``` python ./benchmarks/dynamo/huggingface.py --performance --inference --export-nativert python ./benchmarks/dynamo/timm_models.py --performance --inference --export-nativert python ./benchmarks/dynamo/torchbench.py --performance --inference --export-nativert ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159922 Approved by: https://github.com/angelayi	2025-08-08 03:38:32 +00:00
xinan.lin	2ea40fba84	[Linter] Improve device-bias linter by adding detection for `with torch.device("cuda")`. (#159926 ) ``` For example, detect the following situation: >>>Lint for test/dynamo/test_modes.py: Error (TEST_DEVICE_BIAS) [device-bias] `@requires_gpu` function should not hardcode `with torch.device('cuda')`, suggest to use torch.device(GPU_TYPE) 687 \| flex_attention as flex_attention_eager, 688 \| ) 689 \| >>> 690 \| with torch.device("cuda"): 691 \| flex_attention = torch.compile(flex_attention_eager, dynamic=False) 692 \| 693 \| with self.assertRaisesRegex( ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159926 Approved by: https://github.com/EikanWang, https://github.com/jansel ghstack dependencies: #159759	2025-08-08 03:20:42 +00:00
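A rough idea of how such a check could be written with Python's `ast` module is sketched below. It is not the linter from the PR: the function `find_hardcoded_cuda_device` is hypothetical, and unlike the real rule it does not restrict itself to `@requires_gpu` functions.

```python
import ast


def find_hardcoded_cuda_device(source: str) -> list:
    """Return line numbers of `with torch.device("cuda...")` statements."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.With):
            continue
        for item in node.items:
            call = item.context_expr
            if (
                isinstance(call, ast.Call)
                and isinstance(call.func, ast.Attribute)
                and call.func.attr == "device"
                and isinstance(call.func.value, ast.Name)
                and call.func.value.id == "torch"
                and call.args
                and isinstance(call.args[0], ast.Constant)
                and isinstance(call.args[0].value, str)
                and call.args[0].value.startswith("cuda")
            ):
                hits.append(node.lineno)
    return sorted(hits)


sample = (
    "import torch\n"
    "with torch.device(\"cuda\"):\n"
    "    pass\n"
    "with torch.device(GPU_TYPE):\n"
    "    pass\n"
)
print(find_hardcoded_cuda_device(sample))
```

Working on the syntax tree rather than on raw text avoids false positives from comments and string literals that merely mention the pattern.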
Aaron Gokaslan	beb4d7816d	[BE]: ruff PLC0207 - use maxsplit kwarg (#160107 ) Automatically replaces split with rsplit where relevant and only performs the split up to the first (or last) value. This lets the split function return early and improves efficiency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160107 Approved by: https://github.com/albanD	2025-08-08 03:14:59 +00:00
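The rewrite that PLC0207 asks for can be sketched in plain Python. This is an illustration only: the dotted name below is a made-up example string, not code taken from the PR.

```python
# Example input (hypothetical, chosen only for illustration).
path = "torch.distributed.tensor.DTensor"

# Before: split the whole string, then keep a single piece.
first_slow = path.split(".")[0]
last_slow = path.split(".")[-1]

# After: stop after one split. rsplit scans from the right,
# so it is the natural choice when only the last piece is needed.
first_fast = path.split(".", maxsplit=1)[0]
last_fast = path.rsplit(".", maxsplit=1)[-1]

print(first_fast, last_fast)
```

Both forms return the same piece; the `maxsplit=1` variants simply avoid building a list of every segment.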
Guilherme Leobas	3fcd79e023	Fix infinite loop when iterating over an empty zip (#159673 ) Dynamo would enter an infinite recursion when `ZipVariable.next_variable(tx)` was called and there was no iterable to iterate over Pull Request resolved: https://github.com/pytorch/pytorch/pull/159673 Approved by: https://github.com/williamwen42	2025-08-08 02:50:21 +00:00
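For reference, this is the eager Python behavior that a traced `zip()` with no iterables would be expected to reproduce. It is a minimal sketch in plain Python with no Dynamo involved, and `next_or_none` is a hypothetical helper introduced here.

```python
def next_or_none(iterator):
    """Return the next item, or None once the iterator is exhausted."""
    try:
        return next(iterator)
    except StopIteration:
        return None

# zip() with no arguments is a valid iterator that is exhausted immediately.
empty = zip()
first = next_or_none(empty)
as_list = list(zip())

print(first, as_list)
```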
bobrenjc93	05c417715f	integrate kernacle into inductor (#160121 ) This adds integration into inductor in two parts 1) It kicks off the best config lookup at lowering time within mm.py 2) It awaits the future at scheduling time in select_algorithm.py Notably this does not do the following 1) Support for enumerating between mm, addmm and bmm 2) Support for enumerating between exhaustive/max 3) Enumerating different hardware SKUs eg. H100, A100, etc. those will come in the next diffs Differential Revision: [D79824921](https://our.internmc.facebook.com/intern/diff/D79824921/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160121 Approved by: https://github.com/izaitsevfb	2025-08-08 02:14:44 +00:00
Georgia Phillips	ba4ccf5d67	turn on execution frame cleanup by default (#160110 ) Summary: Turning execution frame cleanup back on since D78621408 is done Test Plan: See D78621408 Rollback Plan: Differential Revision: D79730674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160110 Approved by: https://github.com/jingsh	2025-08-08 02:13:48 +00:00
Wenyuan Chi	d68c323692	Log max_autotune exceptions (#159687 ) (#159688 ) Summary: Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures. Currently, exceptions are dumped to the console in the following format: ``` [0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help. [0/0] Runtime error during autotuning: [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.. [0/0] Ignoring this choice. ``` The exception tracebacks: ``` # inner exception traceback: File "/torch/_inductor/runtime/triton_heuristics.py", line 603, in _make_launchers launchers.append(result.make_launcher()) ^^^^^^^^^^^^^^^^^^^^^^ File "/torch/_inductor/runtime/triton_heuristics.py", line 1503, in make_launcher self.kernel.load_kernel(device) File "/torch/_inductor/runtime/static_cuda_launcher.py", line 113, in load_kernel (self.function, self.n_regs, self.n_spills) = _StaticCudaLauncher._load_kernel( # wrapped exception traceback: File "/usr/local/fbcode/platform010/lib/python3.12/concurrent/futures/thread.py", line 59, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 2596, in precompile_with_captured_stdout choice.precompile() File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 1881, in precompile self.bmreq.precompile() File "<trimmed>#link-tree/torch/_inductor/autotune_process.py", line 660, in precompile getattr(mod, self.kernel_name).precompile() File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 440, in precompile self._make_launchers() File 
"<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 608, in _make_launchers raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}") ``` With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event. The format: ``` { "exceptions": [ { "choice_type": "triton", "choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0", "exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.", "exception": "OutOfMemoryError", "required_memory": "262144", "hardware_limit": "232448" } ] } ``` Test Plan: buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt Rollback Plan: Differential Revision: D79420953 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159688 Approved by: https://github.com/stashuk-olek	2025-08-08 01:30:08 +00:00
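The `required_memory` and `hardware_limit` fields in the event metadata above could plausibly be recovered from the exception message with a regular expression. The sketch below is an assumption about how that might be done, not the PR's actual implementation; `parse_oom_message` is a hypothetical helper name.

```python
import re

def parse_oom_message(message):
    """Hypothetical helper: pull the two sizes out of an out-of-resource message."""
    match = re.search(r"Required:\s*(\d+)\s+Hardware limit:\s*(\d+)", message)
    if match is None:
        return {}
    # Keep the values as strings, matching the JSON shown in the commit message.
    return {"required_memory": match.group(1), "hardware_limit": match.group(2)}

# Message text copied from the commit description.
msg = (
    "No valid triton configs. OutOfMemoryError: out of resource: triton_mm "
    "Required: 262144 Hardware limit:232448 "
    "Reducing block sizes or `num_stages` may help."
)
fields = parse_oom_message(msg)
print(fields)
```

Returning an empty dict for messages that do not match keeps the helper safe to call on exceptions that are not out-of-resource errors.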
Edward Z. Yang	03b254e49f	Extend torch function support to ALL arguments, not just scalar type (but not insides of list) (#145089 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145089 Approved by: https://github.com/albanD, https://github.com/zou3519	2025-08-07 23:43:53 +00:00
PyTorch MergeBot	195b5c2e27	Revert "dynamo: Remove passing or deleted dynamo_expected_failures (#159691 )" This reverts commit 36f46d082a4954921cb8493223f000f2aab79ed7. Reverted https://github.com/pytorch/pytorch/pull/159691 on behalf of https://github.com/izaitsevfb due to breaking dynamo tests ([comment](https://github.com/pytorch/pytorch/pull/159691#issuecomment-3166067241))	2025-08-07 22:55:51 +00:00
Anshul Sinha	f077c2402e	[replicate][be] improved readability of test case description (#160128 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160128 Approved by: https://github.com/mori360	2025-08-07 22:51:58 +00:00
Patrick C. Toulme	d46768db04	[MTIA] Allow users who know what they are doing to ignore all device mismatches in tracing and take a preferred device. (#159931 ) Summary: Device mismatches in tracing can most often be ignored. These are only logical mismatches not physical. Take any intermediate computation, and that computation will not actually materialize in a compiled binary execution. So a device mismatch in the middle of the program is not real. The runtime will never materialize those tensors on CPU device during the execution, as they are temporary allocations. If a user knows his tensors at graph input are all on the correct device, then he can ignore all tracing errors. Users who know what they are doing should have an escape hatch to ignore any device mismatch in tracing. Users can set ``` torch._functorch.config.fake_tensor_prefer_device_type = 'mtia' ``` to forcefully override any mismatch and prefer the non cpu device. This unblocks vLLM graph mode for MTIA. Test Plan: Added two unit tests. Rollback Plan: Differential Revision: D79698438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159931 Approved by: https://github.com/jansel	2025-08-07 22:37:15 +00:00
clr	36f46d082a	dynamo: Remove passing or deleted dynamo_expected_failures (#159691 ) partially generated with ``` for TESTCASE in $(ls | cut -f1 -d'.' | grep -v CPython | uniq); do if grep "$TESTCASE" -m 1 .. -r; then echo; else sl rm "$TESTCASE"* ; fi; done ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159691 Approved by: https://github.com/xmfan	2025-08-07 21:41:50 +00:00
Sherlock Huang	8147370733	Fix qembeddingbag_byte_prepack_meta to use sym_sizes (#159985 ) Summary: In qembeddingbag_byte_prepack_meta, weight.sizes() would return a concrete int. We should use .sym_size() to return a SymInt instead. Test Plan: CI Rollback Plan: Reviewed By: kqfu, henryoier Differential Revision: D79744512 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159985 Approved by: https://github.com/jerryzh168, https://github.com/henryoier	2025-08-07 21:22:29 +00:00
Angela Yi	e619c6bb90	[export] Apply move_to_device_pass to all submodules (#159992 ) Previously we only applied this move_to_device_pass to the toplevel graph. However if we have HOO, this pass will not be applied on the HOO submodules. This PR modifies the pass to run on all submodules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159992 Approved by: https://github.com/yiming0416	2025-08-07 18:51:15 +00:00
Will Constable	3cf7b4024e	[DTensor] Support user-supplied Generator for random ops (#159933 ) If the user provides a generator kwarg to a random op (e.g. nn.init.uniform_(..., generator=my_generator)), we can still advance that generator's state in a SPMD-global way so that each local-tensor gets appropriate values and the generator advances to the same state as if it had operated on the full tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159933 Approved by: https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/wanchaol	2025-08-07 18:47:22 +00:00
Xu Han	21392c0e06	[inductor] disable flex decoding on Windows. (#160072 ) Discussed with @jianan-gu and @Valentine233 , disable flex decoding on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160072 Approved by: https://github.com/angelayi	2025-08-07 18:07:36 +00:00
Aleksei Nikiforov	ee1fb43450	Fix docker image creation (#158634 ) Since switching from wheel 0.34.2 to wheel 0.45.1 python symlinks are no longer correctly created. Migrate to packaging package for symlink creation Pull Request resolved: https://github.com/pytorch/pytorch/pull/158634 Approved by: https://github.com/malfet	2025-08-07 17:41:47 +00:00
Aidyn-A	0bd3af4fb8	Further fix failing tests in test/inductor/test_analysis.py (#160070 ) This is a follow up on #159800 as other tests are still failing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160070 Approved by: https://github.com/aorenste	2025-08-07 17:32:58 +00:00
Ankita George	8399cf88ce	Use only safetensors APIs in HFStorageReader (#159681 ) Get rid of the logic to read the metadata from the header of the safetensors file manually and use the functions as part of safe_open() to get the metadata. This is much cleaner and allows us to not rely on our own custom methods to get metadata, but use safetensors provided APIs Differential Revision: [D79460272](https://our.internmc.facebook.com/intern/diff/D79460272/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159681 Approved by: https://github.com/saumishr ghstack dependencies: #159405, #159406	2025-08-07 17:23:03 +00:00
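To make concrete what "reading the metadata from the header manually" involves, here is a minimal sketch based on the published safetensors file layout: an 8-byte little-endian header length, followed by a JSON header, followed by raw tensor bytes. It is not the code this commit removed; `read_safetensors_header` and the tiny file it parses are made up for illustration.

```python
import json
import os
import struct
import tempfile

def read_safetensors_header(path):
    """Hypothetical helper: parse the JSON header of a safetensors file by hand."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        return json.loads(f.read(header_len))

# Build a tiny file in the same layout so the sketch is self-contained.
header = {
    "__metadata__": {"format": "pt"},
    "w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]},
}
header_bytes = json.dumps(header).encode("utf-8")
payload = struct.pack("<2f", 1.0, 2.0)  # two float32 values, 8 bytes

with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as tmp:
    tmp.write(struct.pack("<Q", len(header_bytes)) + header_bytes + payload)
    tmp_path = tmp.name

parsed = read_safetensors_header(tmp_path)
os.remove(tmp_path)
print(parsed["__metadata__"], parsed["w"]["shape"])
```

Relying on the library's own `safe_open()` accessors, as the commit does, avoids maintaining this kind of byte-level parsing.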
Ankita George	0b187b3114	DCP HF reader: use safe_open instead of reading the bytes (#159406 ) Reading the bytes and converting to tensors is much slower than using safe_open. For a 8B model across 8 ranks, took ~30s to load before this change and ~4s after. Differential Revision: [D78994259](https://our.internmc.facebook.com/intern/diff/D78994259/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159406 Approved by: https://github.com/saumishr ghstack dependencies: #159405	2025-08-07 17:23:03 +00:00
Ankita George	69cc606fda	HF component update to not use fsspec components (#159405 ) Update HF components to not inherit from fsspec components and instead use filesystem writer/reader. There doesn't seem to be much need for fsspec, since users are using mounted storage. Using local storage allows for performance improvements because we can take advantage of the safe_open API provided by HF safetensors (30s vs 4s to load an 8B model), which is a significant performance win over reading bytes and converting them to tensors, which is what we are doing now. Also, we can use the official methods provided by HF instead of relying on reading the metadata as bytes and loading it Differential Revision: [D78993550](https://our.internmc.facebook.com/intern/diff/D78993550/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159405 Approved by: https://github.com/saumishr	2025-08-07 17:22:54 +00:00
Markus Hoehnerbach	57f738b635	[inductor] move all cpu scalars using pinned memory for graph partition (#155360 ) (#158983 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158983 Approved by: https://github.com/eellison ghstack dependencies: #158758	2025-08-07 17:07:26 +00:00
Markus Hoehnerbach	e167c7d0f3	[inductor] allocate non-blocking copy destinations in pinned memory (#155121 ) (#158758 ) Fixes #155121 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758 Approved by: https://github.com/EikanWang, https://github.com/eellison	2025-08-07 17:07:26 +00:00
Shivam Raikundalia	b1a602762e	[Profiler] Update README (#159816 ) Summary: Updated README with code structure and explanation of core features within profiler Test Plan: N/A Rollback Plan: Differential Revision: D79604189 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159816 Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi	2025-08-07 16:44:41 +00:00
Han, Xu	e1cf0d496e	[inductor] unification for inductor debug. (#159998 ) Unify the inductor debug build, following @desertfire 's suggestion: https://github.com/pytorch/pytorch/pull/159938#pullrequestreview-3093803196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159998 Approved by: https://github.com/angelayi	2025-08-07 16:38:00 +00:00
Xu Han	06824f3c72	[inductor] fix test_dynamo_timed on Windows. (#159981 ) Fixed `test_dynamo_timed`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159981 Approved by: https://github.com/angelayi	2025-08-07 16:37:52 +00:00
PyTorch MergeBot	f3a4d742ec	Revert "Add DeviceAllocator as the base device allocator (#138222 )" This reverts commit f7a66da5f9f6b8b75119b1ee8ce9ddc23e15570e. Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))	2025-08-07 16:34:36 +00:00
PyTorch MergeBot	74da2604c9	Revert "Add unified memory APIs for torch.accelerator (#152932 )" This reverts commit 15f1173e5d72d6d45faba4cecd135e0160f06c6f. Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))	2025-08-07 16:34:36 +00:00
PyTorch MergeBot	c4e64467b5	Revert "Add UT for torch.accelerator memory-related API (#155200 )" This reverts commit 4604f0482c2b4a3001b62e5bc5085149a9bb053c. Reverted https://github.com/pytorch/pytorch/pull/155200 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))	2025-08-07 16:34:36 +00:00
Zain Rizvi	90b78ee50f	Move xla jobs to unstable workflow (#159272 ) Disables the job on PRs completely, so that we don't litter people's CI signals and use machines unnecessarily. If you want to run these xla tests, add the ciflow/unstable label to your PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/159272 Approved by: https://github.com/atalman, https://github.com/malfet	2025-08-07 16:22:52 +00:00
Xilun Wu	e248719ac0	[DTensor] support _StridedShard in view op (#159656 ) Summary Some thoughts on view-op and `_StridedShard` interaction: 1. `_StridedShard` has no impact on sharding (i.e. how tensor is partitioned) compared to `Shard`. It only changes how shards permute across the devices. 2. `view()` op on DTensor strictly forbids shard redistribution which means if `view()` may cause shard permutation across devices, it should be rejected. This is enforced in today's sharding prop for `view()`. 3. Since DTensor `view()` won't introduce any redistribution, it's certain that `placements` won't change except the inner `dim` attribute of `Shard` or `_StridedShard`. Therefore, to support `_StridedShard` in `view()` op, the only change required is to keep `_StridedShard` as `_StridedShard` in the output spec. Test `pytest test/distributed/tensor/test_view_ops.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159656 Approved by: https://github.com/wconstab	2025-08-07 15:59:25 +00:00
Aleksei Nikiforov	f60454cce8	S390X: update test dependencies (#158636 ) numba currently doesn't build from source due to https://github.com/numba/numba/pull/10073 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158636 Approved by: https://github.com/malfet	2025-08-07 15:58:30 +00:00
rzou	8ab5868a21	Actually run the einops tests in CI (#159776 ) The test filter was wrong, it should not start with "test/". Test Plan: - wait for CI - Tested locally with `python test/run_test.py --einops --verbose` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159776 Approved by: https://github.com/atalman, https://github.com/StrongerXi	2025-08-07 15:23:06 +00:00
Wang, Chuanqi	d20c4c20e6	[CI] Update xpu ci to use rolling driver for new features (#158340 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158340 Approved by: https://github.com/seemethere Co-authored-by: xinan.lin <xinan.lin@intel.com>	2025-08-07 15:18:51 +00:00
Zhengxu Chen	83875cdb55	[nativert] Expose ModelRunner to public through pimpl type ModelRunnerHandle. (#159989 ) Summary: Today users outside of pytorch core cannot `#include <torch/nativert/ModelRunner.h>`. It turns out that we should place a header inside `torch/csrc/api/include/`. Placing every single nativert header here would pollute the namespace a lot and that's not what we want in general. Therefore here we just create a Handle type which holds a pointer, to decouple the actual type from the header definition. Test Plan: CI Rollback Plan: Differential Revision: D79751098 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159989 Approved by: https://github.com/dolpm	2025-08-07 14:23:21 +00:00
PyTorch MergeBot	a53d14d5f8	Revert "unskipped mobilenet_v3 quantization and mobilenet_v2 quantization plus tests from https://github.com/pytorch/pytorch/issues/125438 (#157786 )" This reverts commit 3a2c3c8ed365eb4e4cf4620c25d70b2f70483762. Reverted https://github.com/pytorch/pytorch/pull/157786 on behalf of https://github.com/albanD due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/157786#issuecomment-3164126250))	2025-08-07 13:09:33 +00:00
Dev Sashidhar	8cb91e20bc	Renaming HAS_XPU to HAS_XPU_AND_TRITON (#159908 ) This PR follows up on the discussion in #159399 where @Akabbaj and @janeyx99 mentioned renaming HAS_XPU to HAS_XPU_AND_TRITON for consistency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159908 Approved by: https://github.com/janeyx99, https://github.com/guangyey	2025-08-07 11:24:44 +00:00
Huy Do	b0df7715e8	Remove benchmark dependencies from regular ROCm CI images (#160047 ) Instead, use a new `pytorch-linux-jammy-rocm-n-py3-benchmarks` image for Docker benchmark job. This addresses 2 issues: * The current ROCm failures in trunk w.r.t librosa version https://github.com/pytorch/pytorch/actions/runs/16789466749/job/47549950994 that TorchBench pulls in. * Reduce the size of the regular ROCm CI images by removing TorchBench models, which is needed only for benchmarking jobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160047 Approved by: https://github.com/malfet, https://github.com/izaitsevfb	2025-08-07 09:26:58 +00:00
Avik Chaudhuri	422bd6808b	dataclass pytree fix (#159916 ) Differential Revision: D79687243 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159916 Approved by: https://github.com/XuehaiPan, https://github.com/angelayi	2025-08-07 08:22:41 +00:00
thenumberouscode	24f43d0da7	[inductor] [cpu] fix the dtype hardcoded to int64 in store_reduction (#157904 ) ## Fixes https://github.com/pytorch/pytorch/issues/157683 ## mini repro * Just copy the code from the issue to reproduce it. ```python import torch device = "cpu" # Input tensors v2_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device) v3_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device) def my_model(v2_0, v3_0): v6_0 = -v3_0 v4_0 = v2_0 * v3_0 v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1) v0_0 = v2_0.to(torch.int32) v5_0 = v0_0.amax(dim=0) return v6_0, v4_0, v1_0, v0_0, v5_0 v6_0, v4_0, v1_0, v0_0, v5_0 = my_model(v2_0, v3_0) print("v6_0", v6_0.shape) print("v4_0", v4_0.shape) compiled_model = torch.compile(my_model, backend="inductor") v6_0, v4_0, v1_0, v0_0, v5_0 = compiled_model(v2_0, v3_0) print("v6_0", v6_0.shape) print("v4_0", v4_0.shape) print("v1_0", v1_0.shape) print("v0_0", v0_0.shape) print("v5_0", v5_0.shape) ``` error_stack ``` /home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’ 41 | convert(const Vectorized<src_t>& src) { | ^~~~~~~ /home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: template argument deduction/substitution failed: /tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: error: wrong number of template arguments (4, should be 2) 37 | auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec); ``` ## summary The C++ kernel generated by the Inductor had the wrong data type for the output variable; it should be int32_t instead of int64_t. This incorrect data type led to an incompatible data type conversion, which caused the g++ compilation to fail. 
The original code that caused the problem. ``` def my_model(v2_0, v3_0): v6_0 = -v3_0 v4_0 = v2_0 * v3_0 v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1) v0_0 = v2_0.to(torch.int32) # The original code that caused the problem. v5_0 = v0_0.amax(dim=0) ``` ## proof procedure The C++ kernel generated by inductor: ```c++ #include <torch/csrc/inductor/cpp_prefix.h> extern "C" void kernel(const int32_t* in_ptr0, int32_t* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(1416L); x0+=static_cast<int64_t>(16L)) { { int32_t tmp_acc0_arr[16]; for (int i = 0; i < 16; i++) { tmp_acc0_arr[i] = std::numeric_limits<int32_t>::min(); } int32_t tmp_acc0 = std::numeric_limits<int32_t>::min(); at::vec::Vectorized<int32_t> tmp_acc0_vec = at::vec::Vectorized<int32_t>(std::numeric_limits<int32_t>::min()); for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(1L)) { { if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L))) { auto tmp0 = at::vec::Vectorized<int32_t>::loadu(in_ptr0 + static_cast<int64_t>(x0 + 1416L*x1), static_cast<int64_t>(16)); tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0); } if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L))) { for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++) { auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail + 1416L*x1)]; tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)] = max_propagate_nan(tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)], tmp0); } } } } if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L))) { // impossible data type conversion which would cause the g++ compilation to fail. 
auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec); int32_t_tmp_acc0_vec.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); } if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L))) { for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++) { out_ptr0[static_cast<int64_t>(x0_tail)] = tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)]; } } } } } } ``` The compiler complains: ```text /home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’ 41 | convert(const Vectorized<src_t>& src) { | ^~~~~~~ /home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: template argument deduction/substitution failed: /tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: error: wrong number of template arguments (4, should be 2) 37 | auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec); ``` So the following line is the problem: ```c++ // this line means that tmp_acc0_vec should be Vectorized<int64_t>, and it will convert it to Vectorized<int32_t>. auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec); ``` The issue is that tmp_acc0_vec is of type Vectorized<int32_t>, but the template parameters expect it to be Vectorized<int64_t> and would convert it to Vectorized<int32_t>. This is a conflict: the conversion should not exist, because tmp_acc0_vec is already Vectorized<int32_t>. The following line hardcodes the output variable type to int64, which causes unnecessary and incorrect type conversions. 
`d89f30ad45/torch/_inductor/codegen/cpp.py (L2985-L2993)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/157904 Approved by: https://github.com/jgong5	2025-08-07 08:03:05 +00:00
Sherlock Huang	aa75e917bd	[Export Schema] Remove deviceAllocationMap field (#159653 ) Summary: This field is not used today, and it's not useful either. The device allocation is configured at model loading time, specified by user. It shouldn't be part of the model definition. Test Plan: CI Rollback Plan: Differential Revision: D79385513 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159653 Approved by: https://github.com/zhxchen17	2025-08-07 07:31:42 +00:00
PyTorch UpdateBot	3f1636ebef	[audio hash update] update the pinned audio hash (#160046 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160046 Approved by: https://github.com/pytorchbot	2025-08-07 04:16:35 +00:00
IlyasMoutawwakil	c859ba7114	Make onnx export SDPA match aten behavior (#159973 ) This PR makes onnx sdpa export match the behavior of aten sdpa when boolean mask is used. @justinchuby ```python import onnxruntime as ort import torch class ScaledDotProductAttention(torch.nn.Module): def forward(self, query, key, value, attn_mask): return torch.nn.functional.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask) model = ScaledDotProductAttention() attn_mask = torch.ones(2, 4, 8, 8).bool() # boolean mask for attention attn_mask[0, 0, 0, :] = False # masking an entire row (padding token) query = key = value = torch.randn(2, 4, 8, 16) output = model(query, key, value, attn_mask) torch.onnx.export( model, (query, key, value, attn_mask), "scaled_dot_product_attention.onnx", input_names=["query", "key", "value", "attn_mask"], output_names=["output"], dynamo=False, # or True ) ort_session = ort.InferenceSession("scaled_dot_product_attention.onnx") np_inputs = {"query": query.numpy(), "key": key.numpy(), "value": value.numpy(), "attn_mask": attn_mask.numpy()} onnx_outputs = ort_session.run(None, np_inputs)[0] torch.testing.assert_close(output, torch.tensor(onnx_outputs), equal_nan=True) ``` fails the assertion because the ort model outputs NaNs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159973 Approved by: https://github.com/xadupre, https://github.com/titaiwangms	2025-08-07 04:06:07 +00:00
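One plausible mechanism for the NaNs is a softmax applied to a row whose entries were all masked to negative infinity. The plain-Python sketch below illustrates only that arithmetic. It does not reproduce either the aten or the ONNX Runtime implementation, and `masked_softmax` is a hypothetical helper.

```python
import math

def masked_softmax(scores, keep):
    """Naive softmax where positions with keep=False are set to -inf first."""
    masked = [s if k else float("-inf") for s, k in zip(scores, keep)]
    peak = max(masked)
    # If every entry is -inf, then (-inf) - (-inf) is NaN and it propagates.
    exps = [math.exp(m - peak) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]

partial = masked_softmax([1.0, 2.0, 3.0], [True, True, False])
full = masked_softmax([1.0, 2.0, 3.0], [False, False, False])
print(partial, full)
```

A partially masked row still normalizes to 1, while a fully masked row (such as the padding-token row in the reproducer above) comes out as all NaN under this naive formulation.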
Simon Fan	d4c1a08c89	Relax unclaimed successes in dtype op tests when running under TEST_WITH_DYNAMO/TEST_WITH_INDUCTOR (#159976 ) This PR changes the behavior for compile wrapped op tests: - supported_but_unclaimed_forward - supported_but_unclaimed_backward These typically manifest when the op doesn't support inputs of certain dtypes. But under torch.compile, Dynamo/AOTAutograd will trace the graph with FakeTensors, which @ezyang and @eellison tell me need to run decomps before op dispatch. The decomp may map this test to a different op, one that does support the dtype. I suspect all of our failures here are due to decomps, and so I propose to just disable this check for compile. ~~TODO: re-enable all the failed tests.~~ jk there were no failed tests outside of compiled autograd due to this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159976 Approved by: https://github.com/ezyang	2025-08-07 02:38:45 +00:00
Nikita Shulga	81d72fb1f7	Move smoke binary builds to 3.12 (#159993 ) And limit them just to stable CUDA version (as there weren't any recent instances when only one of those jobs failed to build) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159993 Approved by: https://github.com/ngimel ghstack dependencies: #159986, #159990	2025-08-07 01:59:30 +00:00
Nikita Shulga	d0226719a9	[BE][EZ] Delete remains of split-build logic (#159990 ) Hopefully last piece of https://github.com/pytorch/pytorch/issues/138750 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159990 Approved by: https://github.com/atalman ghstack dependencies: #159986	2025-08-07 01:59:30 +00:00
Edward Yang	38d65c6465	Add a USE_NIGHTLY option to setup.py (#159965 ) If you run python setup.py develop with USE_NIGHTLY, instead of actually building PyTorch we will just go ahead and download the corresponding nightly version you specified and dump its binaries. This is intended to obsolete tools/nightly.py. There's some UX polish for detecting what the latest nightly is if you pass in a blank string. I only tested on OS X. Coded with claude code. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159965 Approved by: https://github.com/malfet	2025-08-07 01:44:20 +00:00
Yu, Guangye	2ba2f598f3	[Dynamo] Add torch.xpu.stream to trace rules (#159844 ) # Motivation Previously, I thought using `with stream:` was sufficient. However, many older scripts still use `torch.xpu.stream` as the context manager. To maintain backward compatibility, I had to include `torch.xpu.stream` in the trace rules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159844 Approved by: https://github.com/jansel	2025-08-07 01:35:50 +00:00
Laith Sakka	1bb5e6c076	update expected results (#159867 ) refresh due to https://github.com/pytorch/pytorch/pull/159696 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159867 Approved by: https://github.com/masnesral	2025-08-07 01:18:36 +00:00
Denghui Dong	8b0be7b65a	[Profiler] Fix unexpected C return events (#159574 ) The fix in https://github.com/pytorch/pytorch/pull/155446 addressed the "stack empty" issue that's easily reproducible on CPython 3.12.0-4. While this issue can also appear in other versions, it's not as easy to reproduce there. I recently found a new cause for this problem. `1df5d00145/Python/ceval.c (L5807-L5836)` In the CPython 3.10 implementation, PyTrace_C_CALL and PyTrace_C_RETURN/PyTrace_C_EXCEPTION are supposed to appear in pairs. However, when c_profilefunc is changed, unexpected PyTrace_C_RETURN/PyTrace_C_EXCEPTION events can occur. Here is the code to reproduce this problem. ``` import threading import time import torch from threading import Event, Lock lock = Lock() lock.acquire() event1 = Event() event2 = Event() event3 = Event() def run(): event1.set() event2.wait() lock.acquire() event3.set() threading.Thread(target=run).start() with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True): event1.wait() event2.set() time.sleep(1) with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True): lock.release() event3.wait() ``` To fix this problem, we can record active_frames_ and remaining_start_frames_ for each thread, and when a PyTrace_C_RETURN/PyTrace_C_EXCEPTION event occurs, we can determine whether to record this event based on these two fields. In reality, even without this fix, the final data appears to be right since the match process can handle this case (it would just result in an exception log being printed). Do you think the fix is necessary? Pull Request resolved: https://github.com/pytorch/pytorch/pull/159574 Approved by: https://github.com/sraikund16	2025-08-07 01:17:55 +00:00
Xuehai Pan	5cedc5a0ff	[BE][PYFMT] migrate PYFMT for `torch/[p-z]*/` to `ruff format` (#144552 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144552 Approved by: https://github.com/ezyang	2025-08-07 00:09:56 +00:00
William Wen	fd606a3a91	[dynamo] update pytorch-labs -> meta-pytorch in graph break URLs (#159975 ) Related PR: https://github.com/meta-pytorch/compile-graph-break-site/pull/30 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159975 Approved by: https://github.com/Lucaskabela	2025-08-06 23:57:31 +00:00
Animesh Jain	3daef4d128	[dynamo] Trace nn.Module __delattr__ (#159969 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159969 Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/StrongerXi	2025-08-06 23:43:19 +00:00
PyTorch MergeBot	cb4b29b754	Revert "[pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#159874 )" This reverts commit 9fd5b5f73589cf08dca60910368cc0f05c7906c8. Reverted https://github.com/pytorch/pytorch/pull/159874 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/159874#issuecomment-3161896978))	2025-08-06 23:21:29 +00:00
drisspg	a6bc296207	[FlexAttention] Update the guard semantics for divisibility (#159884 ) We don't add guards unless we know (and another guard has ensured this) that this is a safe optimization Pull Request resolved: https://github.com/pytorch/pytorch/pull/159884 Approved by: https://github.com/Chillee	2025-08-06 23:12:44 +00:00
Thomas Bohnstingl	64dc30c213	[HOP, map] Rework of map autograd to the new interface (#153343 ) This PR reworks the current autograd implementation of map to the new interface. @pytorchbot label "topic: not user facing" Pull Request resolved: https://github.com/pytorch/pytorch/pull/153343 Approved by: https://github.com/ydwu4	2025-08-06 23:02:42 +00:00
Nathan Brown	93da9952a7	gloo: fix building system gloo with CUDA/HIP (#146637 ) Fix incorrect linking of Gloo's libraries when building with system Gloo. Previously, either Gloo's native library or Gloo's CUDA library were linked. However, Gloo had changed such that all users of Gloo must link the native library, and can optionally link the CUDA or HIP library for Gloo + CUDA/HIP support. This had been updated when building/linking with vendored Gloo, but not when using system Gloo. Fixes: #146239 Reported-by: Adam J Stewart <ajstewart426@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/146637 Approved by: https://github.com/malfet	2025-08-06 22:56:31 +00:00
christinaburge	3a2c3c8ed3	unskipped mobilenet_v3 quantization and mobilenet_v2 quantization plus tests from https://github.com/pytorch/pytorch/issues/125438 (#157786 ) These tests now pass on AArch64 in our downstream CI. `test_quantization.py::TestNumericSuiteEager::test_mobilenet_v2 <- test/quantization/eager/test_numeric_suite_eager.py PASSED [2.4434s] [ 35%]` Pull Request resolved: https://github.com/pytorch/pytorch/pull/157786 Approved by: https://github.com/jerryzh168, https://github.com/malfet	2025-08-06 22:41:07 +00:00
Jovian Anthony Jaison	9fd5b5f735	[pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#159874 ) Summary: Writing torch.compile worker logs to dedicated_log_rank{RANK} if we're running on mast. Test Plan: See: D79456310 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159874 Approved by: https://github.com/c00w	2025-08-06 22:33:04 +00:00
Xiaochang Wu	2507ae63f2	Partitioner: Fix to align partition node order with original graph (#157892 ) Fixes #157891 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157892 Approved by: https://github.com/ezyang	2025-08-06 22:12:47 +00:00
Lucas Kabela	40c4d61f9a	[Dynamo][Better Engineering] Typing `torch/_dynamo/guards.py` (#159315 ) As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo. This PR adds strict typing support to `torch/_dynamo/guards.py`. Running ``` mypy torch/_dynamo/guards.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Annotated \| Lines Total \| % lines covered \| Funcs Annotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 2030 \| 3945 \| 51.46% \| 70 \| 138 \| 50.72% \| \| This PR \| 4055 \| 4055 \| 100.00% \| 138 \| 138 \| 100.00% \| \| Delta \| +2025 \| +90 \| +48.54% \| +68 \| 0 \| +49.28% \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/159315 Approved by: https://github.com/williamwen42, https://github.com/Skylion007	2025-08-06 21:52:14 +00:00
Tom Ritchford	a5725965ea	Remove unnecessary "# noqa: set_linter" comments (#159467 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159467 Approved by: https://github.com/eellison	2025-08-06 21:31:52 +00:00
Ruben Rodriguez Buchillon	289f62ce8a	[inductor][ez] fixup scaled_mm (#159948 ) Summary: This reverts the part of #159383 for scaled_mm where now, like before, we pass through the normal input_nodes (not the triton_input_nodes) to select_algorithm - #159383 refactored how kwargs are retrieved - it introduced this notion of KernelInputs that wrap input_nodes - scaled_mm uses unsqueezed input nodes for triton to retrieve params - the issue: it uses a squeezed (regular) bias for select_algorithm instead This fixes that by passing the original input nodes rather than the triton input nodes. Test Plan: ``` buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_False (caffe2.test.inductor.test_fp8.TestFP8Lowering)' buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_True (caffe2.test.inductor.test_fp8.TestFP8Lowering)' ``` This set of tests was failing, and is passing now Side note: these tests were failing I believe because the unsqueezed bias made the ATEN choice no longer eligible, and there is some minor numerical discrepancy between ATEN and Triton for this. I'm not sure the test should be written like that, as we're implicitly relying on ATEN being the choice here. Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D79717654](https://our.internmc.facebook.com/intern/diff/D79717654) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159948 Approved by: https://github.com/izaitsevfb, https://github.com/eellison	2025-08-06 21:25:48 +00:00
Nikita Shulga	512b4730e3	[EZ] Remove useless `cross_compile_arm64` (#159986 ) As we don't have any Intel Mac runners in CI for last 2+ years Pull Request resolved: https://github.com/pytorch/pytorch/pull/159986 Approved by: https://github.com/atalman	2025-08-06 21:01:05 +00:00
Xia, Weiwen	d2368aa6f3	[CPUBLAS] add macros for brgemm APIs for versioning (#158629 ) Summary Add macros for brgemm, so that callers (e.g., Torchao's cpp kernels) know which APIs are available. It is useful when callers need to co-work with old versions of PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158629 Approved by: https://github.com/CaoE, https://github.com/Valentine233, https://github.com/ezyang	2025-08-06 20:54:05 +00:00
Mwiza Kunda	0afaeb7c4e	Improve `extract_test_fn` (#158637 ) The current implementation assumes test functions are resolved as test_module.TestClass.test_fn, however this would not work for modules nested in directories e.g. inductor.test_torchinductor.TestClass.test_fn Pull Request resolved: https://github.com/pytorch/pytorch/pull/158637 Approved by: https://github.com/jbschlosser	2025-08-06 20:45:21 +00:00
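The nested-module case described in #158637 can be illustrated with a small sketch. `split_test_id` is a hypothetical helper, not the actual `extract_test_fn` implementation; it assumes only that the last two dot-separated components are the test class and the test function, so any number of leading package components is kept as the module path.

```python
def split_test_id(test_id: str) -> tuple[str, str, str]:
    """Split 'pkg.module.TestClass.test_fn' into (module, class, function).

    Hypothetical helper for illustration only.
    """
    # Split from the right so nested module paths stay in one piece.
    parts = test_id.rsplit(".", 2)
    if len(parts) != 3:
        raise ValueError(f"expected module.Class.test_fn, got {test_id!r}")
    module, cls, fn = parts
    return module, cls, fn
```

Splitting from the right handles both `test_module.TestClass.test_fn` and `inductor.test_torchinductor.TestClass.test_fn` with the same code path.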
Alan Du	50580b5053	Add minimal nn.functional.log_softmax support for NestedTensor (#159662 ) This only works for the jagged layout and for the non-batch and non-jagged dimensions. I did this mostly by copy-pasting from the existing softmax implementation, but it seems fairly straightforward and I think it should work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159662 Approved by: https://github.com/jbschlosser	2025-08-06 20:34:02 +00:00
Frank Seide	b8ef60b6bc	Enable XNNPACK aarch64 builds (#159762 ) Summary: This fixes the build of TorchScript's XNNPACK dependency for our aarch64 device. Thanks to andrewjcg for proposing this fix. Rollback Plan: Reviewed By: andrewjcg Differential Revision: D79497613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159762 Approved by: https://github.com/frankseide, https://github.com/malfet Co-authored-by: Frank Seide <seide@meta.com>	2025-08-06 20:20:32 +00:00
Nikita Shulga	0de2a45a48	[BE] Merge 3 CUDA build jobs into one (#159890 ) Before this change there were build+test jobs: - s89 build+tests - sm75 build+distributed_test - sm_75 build+pr_time_benchmark test This change compiles all 3 builds into one (for 2 architectures) and skips testing sm86 as it never found any new regressions that were not found at the same time on sm89 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159890 Approved by: https://github.com/clee2000, https://github.com/seemethere	2025-08-06 20:09:55 +00:00
xinan.lin	12a54e4ac1	[Inductor UT][Fix XPU CI] Fix case failures introduced by community. (#159759 ) Fixes #159631 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159759 Approved by: https://github.com/EikanWang, https://github.com/jansel	2025-08-06 20:02:20 +00:00
Nikita Shulga	d10e9e4781	[MPS] Remove all pre-MacOS14 logic (#159912 ) Delete older enums, checks for MacOS-13.3+ for int64 support, etc Fixes https://github.com/pytorch/pytorch/issues/159275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159912 Approved by: https://github.com/manuelcandales	2025-08-06 19:48:12 +00:00
Xu Han	c71950907d	[inductor] add _get_inductor_debug_symbol_cflags for debug symbol control. (#159938 ) We need to add inductor debug symbol support for crash case debug. When we turn on generate debug symbol. On Windows, it should create a [module_name].pdb file. It helps debug by WinDBG. On Linux, it should create some debug sections in binary file. I added UT for it also. It works well on Windows inductor debug. <img width="1648" height="833" alt="image" src="https://github.com/user-attachments/assets/5282a7de-cef3-4a38-9cd4-a0e63482c8b6" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159938 Approved by: https://github.com/jansel, https://github.com/angelayi	2025-08-06 19:31:45 +00:00
Divyansh Khanna	6fa3592dc6	Dataloader benchmark script (#159432 ) This script adds a simple dataloading benchmark tracking throughput and memory. The output looks like this ``` System Information: PyTorch version: 2.9.0a0+gitf87d117 PyTorch location: /home/divyanshkhanna/pytorch/torch/__init__.py Torchvision version: 0.24.0a0+f52c4f1 Torchvision location: /home/divyanshkhanna/pytorch/vision/torchvision/__init__.py CUDA available: True CUDA device: NVIDIA PG509-210 CPU count: 192 Physical CPU cores: 96 Total system memory: 1510.11 GB Loading dataset from imagenet/val (1 copies) Dataset size: 50000 --- Benchmarking DataLoader with worker_method=multiprocessing --- Memory before DataLoader creation: 500.59 MB Detailed memory information: USS (Unique Set Size): 499.00 MB PSS (Proportional Set Size): 500.74 MB RSS (Resident Set Size): 497.39 MB Memory after DataLoader creation: 1127.61 MB Memory increase: 627.02 MB Starting training loop with 1 epochs (max 100 batches per epoch) Epoch 1, Batch 10, Time: 0.2910s, Memory: 12044.50 MB Epoch 1, Batch 20, Time: 0.2909s, Memory: 12185.71 MB Epoch 1, Batch 30, Time: 0.2909s, Memory: 10654.93 MB Epoch 1, Batch 40, Time: 0.2909s, Memory: 12378.26 MB Epoch 1, Batch 50, Time: 0.2907s, Memory: 12402.28 MB Epoch 1, Batch 60, Time: 0.2909s, Memory: 10559.35 MB Epoch 1, Batch 70, Time: 0.2907s, Memory: 12644.69 MB Epoch 1, Batch 80, Time: 0.2909s, Memory: 12654.65 MB Epoch 1, Batch 90, Time: 0.2909s, Memory: 12727.20 MB Epoch 1, Batch 100, Time: 0.2908s, Memory: 12722.09 MB Results: Worker method: multiprocessing DataLoader init time: 0.1553 seconds Average batch time: 0.3408 seconds Samples per second: 375.53 Peak memory usage: 12738.76 MB Memory increase: 12238.17 MB ``` > TODO: This script right now is CPU-only friendly and GPU friendly. But it might be worth upgrading it to test against a canonical DistributedDataParallel setup on say a 1x8 node. 
Or maybe we can keep that as a separate script inside `benchmarks` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159432 Approved by: https://github.com/ramanishsingh	2025-08-06 19:05:19 +00:00
PyTorch MergeBot	ba37f589d4	Revert "[dynamo] Be consistent with storing func source for UserMethodVariable (#159696 )" This reverts commit ee62177c196d716fc3a2d641370bed8a673a45d3. Reverted https://github.com/pytorch/pytorch/pull/159696 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/159696#issuecomment-3161196192))	2025-08-06 18:41:05 +00:00
Bin Bao	44dd3684d2	[AOTI] Fix memory leak from all_reduce (#159818 ) Summary: This PR solves two issues: 1. When lowering the all_reduce op, Inductor expects to convert it to the in-place version, all_reduce_, but it was calling ir._AllReduceKernel.create_inplace instead of ir._AllReduce_Kernel.create_inplace. This triggers a tricky bug in AOTI because it generates a cpp call to the functional version aoti_torch_cpu__c10d_functional_all_reduce, but the corresponding wait operation later still waits on the input to aoti_torch_cpu__c10d_functional_all_reduce instead of the output from aoti_torch_cpu__c10d_functional_all_reduce. This leaves an unwaited tensor, leading to a memory leak. 2. AOTI now generates the in-place version aoti_torch_cpu__c10d_functional_all_reduce_, and the tensor returned from aoti_torch_cpu__c10d_functional_all_reduce_ doesn't get used. It will be released when the program exits, so it's not a memory leak, but it unnecessarily holds that tensor, which causes a high memory water mark. This PR generates a tensor delete operation right after calling aoti_torch_cpu__c10d_functional_all_reduce_. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159818 Approved by: https://github.com/henryhu6, https://github.com/yushangdi	2025-08-06 18:11:14 +00:00
Georgia Phillips	c669b0ab87	Fix execution frame cleanup logic (#158717 ) Summary: This fixes a bug in the execution frame cleanup logic - previously, whenever we hit the time interval to clear out the frames, we were removing any cached execution frames beyond the configured minimum number (frameEntry.used was unused). Instead, we only want to clear frames that were NOT USED during the last time interval. This diff refactors the executor to have the correct logic. Test Plan: ``` buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details ``` Rollback Plan: Differential Revision: D78621408 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158717 Approved by: https://github.com/dolpm	2025-08-06 18:04:24 +00:00
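The cleanup rule described in #158717 can be sketched roughly as follows. This is not the executor's code: `FrameEntry`, `cleanup_frames`, and the `min_frames` parameter are hypothetical names, and the sketch assumes the configured minimum is still honored and that the `used` flag is reset at the end of each interval.

```python
from dataclasses import dataclass


@dataclass
class FrameEntry:
    # Hypothetical stand-in for a cached execution frame.
    name: str
    used: bool = False  # set whenever the frame is handed out during an interval


def cleanup_frames(frames, min_frames):
    """Drop frames that were not used in the last interval, keeping at least min_frames."""
    removable = len(frames) - min_frames  # how many frames may be dropped at most
    kept = []
    for entry in frames:
        if not entry.used and removable > 0:
            removable -= 1  # unused in the last interval: drop it
            continue
        kept.append(entry)
    # Start the next interval with a clean slate.
    for entry in kept:
        entry.used = False
    return kept
```

The point of the sketch is the `not entry.used` condition: a frame that was used recently survives even when the cache holds more than the minimum number of frames.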
Luca Wehrstedt	d7a855d67d	[async-TP] Make scaled-mm + reduce-scatter preserve alignment of scales (#159957 ) After https://github.com/pytorch/pytorch/pull/157905 started using cuBLAS for row-wise scaling on CUDA 12.9+, this broke some downstream tests for fp8 which were testing "odd" shapes. After checking in with the cuBLAS team this turned out to be due to the scale tensors' starting addresses not being aligned to 16 bytes. PyTorch storages are always aligned at 256 bytes, hence this came from a "slicing" of the scale tensor being done inside async-TP when chunking a matmul in order to overlap it with reduce-scatter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159957 Approved by: https://github.com/vkuzo, https://github.com/danielvegamyhre	2025-08-06 17:42:26 +00:00
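The alignment problem described in #159957 comes down to byte-offset arithmetic. The sketch below is illustrative only: `chunk_start_alignment` is a hypothetical helper, and it assumes one 4-byte (float32) scale per row and equal-sized chunks. Because the storage base is aligned to 256 bytes, a chunk's starting address is 16-byte aligned exactly when its byte offset into the scale tensor is a multiple of 16.

```python
def chunk_start_alignment(num_rows, num_chunks, itemsize=4, alignment=16):
    """Return, per chunk, whether its starting byte offset is a multiple of `alignment`.

    Assumes one scale of `itemsize` bytes per row (hypothetical layout).
    """
    rows_per_chunk = num_rows // num_chunks
    return [
        (i * rows_per_chunk * itemsize) % alignment == 0
        for i in range(num_chunks)
    ]
```

With 6 rows split into 2 chunks, the second chunk starts at byte 12, which is not 16-byte aligned; with 8 rows it starts at byte 16, which is. This is why "odd" shapes were the ones that failed.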
Meet Vadakkanchery	4c01991b38	[DCP][Prototype] Checkpoint replication via PGTransport (#157963 ) (#159801 ) Summary: ### PR Context Introduce simple replication logic via PGTransport. The goal is to showcase a working prototype of replication via PGTransport, in this impl we assume world_sizes are equal allowing us to create perfect bi-directional pairs for the purpose of choosing replica "partners". Test Plan: CI Rollback Plan: Differential Revision: D79590797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159801 Approved by: https://github.com/saumishr	2025-08-06 16:52:03 +00:00
Bin Bao	a4b07fe8f6	[AOTI] Add more default options to compile_standalone (#158560 ) Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560 Approved by: https://github.com/yushangdi	2025-08-06 15:59:27 +00:00
Mikayla Gawarecki	d87161c3c8	[Easy] Fix wrong propagation of fallback_ops_dict in gen_aoti_c_shim (#159904 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159904 Approved by: https://github.com/janeyx99	2025-08-06 15:09:18 +00:00
Zhengxu Chen	79eca4677b	[precompile] Skip serializing unnecessary objects for guards. (#158926 ) Summary: The following types of objects don't need to be serialized for precompile: 1. PyCapsule because we don't guard on C binding objects in meaningful ways. 2. Code object because we only do id matching on these but id matches will always be dropped for precompile. 3. Nested function objects since we also ban CLOSURE_MATCH. Test Plan: buck run mode/opt test/dynamo:test_dynamo -- -k test_skipped_objects Rollback Plan: Differential Revision: D78816888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158926 Approved by: https://github.com/jamesjwu	2025-08-06 15:00:28 +00:00
PyTorch MergeBot	2855688a1d	Revert "Replace C array with std::array in formatSockAddr (#159812 )" This reverts commit e7feedf6a9bb346ad205796aa4084c8dcfb18072. Reverted https://github.com/pytorch/pytorch/pull/159812 on behalf of https://github.com/malfet due to Looks like it broke distributed tests, see `2231c3ca3a/1` ([comment](https://github.com/pytorch/pytorch/pull/159812#issuecomment-3160513656))	2025-08-06 14:55:48 +00:00
Nikita Shulga	2231c3ca3a	[CI][CD] Fix `install_nvshem` function (#159907 ) When one builds CD docker, all CUDA dependencies must be installed into `/usr/local/cuda/` folder Test plan: Looks at the binary build logs, for example [here](https://github.com/pytorch/pytorch/actions/runs/16768141521/job/47477380147?pr=159907): ``` 2025-08-06T05:58:00.7347471Z -- NVSHMEM_HOME set to: '' 2025-08-06T05:58:00.7348378Z -- NVSHMEM wheel installed at: '' 2025-08-06T05:58:00.7392528Z -- NVSHMEM_HOST_LIB: '/usr/local/cuda/lib64/libnvshmem_host.so' 2025-08-06T05:58:00.7393251Z -- NVSHMEM_DEVICE_LIB: '/usr/local/cuda/lib64/libnvshmem_device.a' 2025-08-06T05:58:00.7393792Z -- NVSHMEM_INCLUDE_DIR: '/usr/local/cuda/include' 2025-08-06T05:58:00.7394252Z -- NVSHMEM found, building with NVSHMEM support ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159907 Approved by: https://github.com/Skylion007, https://github.com/ngimel	2025-08-06 14:44:37 +00:00
can-gaa-hou	c03a734ba1	[OpenReg] Disable automatic inclusion of data files (#159845 ) # Background After I built torch_openreg, I noticed that the wheel package contained the stub.c file under the csrc directory, which was not used in the runtime. # Motivation This PR aims to remove the stub.c file and any unused file when running torch_openreg. Changes: - Setting include_package_data keyword to false in the setup function Pull Request resolved: https://github.com/pytorch/pytorch/pull/159845 Approved by: https://github.com/albanD	2025-08-06 10:35:13 +00:00
Benji Beck	98316e5896	[WOQ] Add CUDA kernel for _weight_int8pack_mm (#159325 ) Summary This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849 Motivation A fused GPU kernel for aten._weight_int8pack_mm would: - Eliminate reliance on the .mul().sum() fallback in quantization.py - Improve performance for quantized inference on CUDA - Extend Inductor’s GPU quantization support across more workloads Implementation - Implement a Triton kernel for: ``` out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n] where: x: [B, K] float32 w: [N, K] int8 scale: [N] float32 out: [B, N] float32 ``` - Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py - Route it conditionally in quantization.py where GPU currently falls back to .mul().sum() - Add unit tests comparing results to the reference fallback path Test Plan: ``` buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda ``` Log: P1882799769 ``` buck2 test 'fbcode//mode/opt' caffe2/test:linalg ``` https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/ Benchmark Results: ``` [Shape B=256, K=1024, N=512] CPU and CUDA outputs match Max abs diff: 2.59e-04, max rel diff: 0.75 CPU: 144.14 ms, CUDA: 303.67 µs Speedup: ×474.6 [Shape B=512, K=2048, N=1024] CPU and CUDA outputs match Max abs diff: 5.49e-04, max rel diff: 0.15 CPU: 1173.27 ms, CUDA: 2.40 ms Speedup: ×488.5 ``` Rollback Plan: Differential Revision: D79042656 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325 Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168	2025-08-06 10:28:08 +00:00
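The formula in #159325 can be checked against a plain-Python reference. The function below is only an unfused, readable sketch of `out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]` over nested lists; `weight_int8pack_mm_reference` is a hypothetical name and this is not the Triton kernel added by the PR.

```python
def weight_int8pack_mm_reference(x, w, scale):
    """Reference for out[b][n] = sum_k(x[b][k] * w[n][k]) * scale[n].

    x: B rows of K floats, w: N rows of K int8 values, scale: N floats.
    """
    out = []
    for x_row in x:
        out_row = []
        for n, w_row in enumerate(w):
            # Dot product over K, then apply the per-output-channel scale.
            acc = sum(xv * wv for xv, wv in zip(x_row, w_row))
            out_row.append(acc * scale[n])
        out.append(out_row)
    return out
```

A reference like this is the kind of baseline a fused kernel's output would be compared against in a unit test, within a floating-point tolerance.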
angelayi	23cf241039	[aoti][mps] Initialize mps kernels first (#159753 ) In some cases we have mps kernels which are reused across higher-order-op subgraphs and the toplevel code. However, currently we initialize the variable for the mps kernel the first time we use it, which runs into an issue if we run into the mps kernel within a subgraph since the kernel will only be initialized within the subgraph scope. For instance: ``` if ... auto mps_lib_0_func = ... mps_lib_0_func->run() // since we already used mps_lib_0 once, we don't re-initialize it mps_lib_0_func->run() // error, mps_lib_0_func not initialized ``` So the solution we took here is to initialize all the kernels at the beginning: ``` const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() { static const auto func = mps_lib_0.getKernelFunction("generated_kernel"); return func; } AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() { static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get()); return handle; } ... if ... get_mps_lib_0()->run() get_mps_lib_0()->run() // success ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159753 Approved by: https://github.com/malfet ghstack dependencies: #159456, #159695	2025-08-06 07:54:29 +00:00
Will Constable	e7feedf6a9	Replace C array with std::array in formatSockAddr (#159812 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159812 Approved by: https://github.com/Skylion007	2025-08-06 07:44:29 +00:00
Will Constable	dad2a05bec	[DTensor] Set up DTensorContinuousTestBase (#159885 ) Also migrate `test_common_rules.py` since it was a short file `python test/distributed/tensor/test_common_rules.py` Before: Ran 10 tests in 91.516s After: Ran 10 tests in 5.604s Pull Request resolved: https://github.com/pytorch/pytorch/pull/159885 Approved by: https://github.com/ezyang	2025-08-06 07:40:31 +00:00
Colin L Reliability Rice	0495cab545	Wire in pt2_triton_builds (#159897 ) Summary: This allows us to start seeing the failure rate on these models (and potentially alert on it). Test Plan: ``` FORCE_LOG_TRITON_BUILDS_TO_PROD=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run @//mode/opt :compile 2>&1 \| tee out ``` P1889607054 Waiting for scuba table to generate, but manual logging show it should show up at https://fburl.com/scuba/pt2_triton_builds_inc_archive/7852kt8h soon. Rollback Plan: Reviewed By: masnesral Differential Revision: D79308333 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159897 Approved by: https://github.com/masnesral	2025-08-06 07:39:51 +00:00
Mengtian Xu	abfe403981	[AIDIR] Internal util function to insert MLHub debugging insight for dynamic shape (#159391 ) Summary: This feature is Meta internal only. Add a util function to put dynamic shape-related suggestions to MLHubDebugInsightService, which will then be surfaced to users in the MLHub. The rollout will be controlled by JK. Test Plan: MAST job aps-omnifmv3_dev_baseline_test-a34fdccf21 {F1980593060} * If you're not able to see the insight, please add yourself to this gk 'mlhub_debugging_insights_dev_visibility' * The URL link should route to a new Job Inspector page that will provide details and straightforward instructions on how to config the ds. The page is currently still in development so here we use the general PT2 compile JI page. * Test fails because of the export checks. I'll export after addressing all the comments from reviewers. Rollback Plan: Reviewed By: pianpwk Differential Revision: D78526522 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159391 Approved by: https://github.com/jingsh	2025-08-06 07:39:39 +00:00
Jane Xu	1690c0c3a0	[Reland] Migrate ScalarType to headeronly (#159911 ) The non ghstack version of #159416, to make sure we don't get reverted again Pull Request resolved: https://github.com/pytorch/pytorch/pull/159911 Approved by: https://github.com/mikaylagawarecki	2025-08-06 07:36:37 +00:00
Aidyn-A	e9d27aa8fd	[CUDA 13] CMake/Dependencies: no need to call find_package(CUB) (#159854 ) CUB library is the part of CCCL of the CUDA Toolkit 13. If CUDA Found, CUB is found as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159854 Approved by: https://github.com/eqy	2025-08-06 06:03:58 +00:00
PyTorch MergeBot	2457e62c90	Revert "Set PYTHONHOME for inductor subprocesses using torch (#159382 )" This reverts commit fe8984a9f43bde10d1956abe7cb40710ed7ceed2. Reverted https://github.com/pytorch/pytorch/pull/159382 on behalf of https://github.com/malfet due to Broke MacOS testing see `d0fccbc99c/1` ([comment](https://github.com/pytorch/pytorch/pull/159382#issuecomment-3157455367))	2025-08-06 05:30:20 +00:00
Nikita Shulga	d0fccbc99c	[CI] Delete sm86 tests from pull (#159903 ) And delete sm89+cuda12.4 builds from periodic (as sm86+legacy driver should be enough) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159903 Approved by: https://github.com/huydhn	2025-08-06 05:16:55 +00:00
PyTorch UpdateBot	3461988a4b	[audio hash update] update the pinned audio hash (#159823 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159823 Approved by: https://github.com/pytorchbot	2025-08-06 05:02:35 +00:00
Will Constable	9764981116	Pass fw/bw compilers to aot_export_joint_with_descriptors (#159814 ) Allow overriding nop compilers with real ones when using this flow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159814 Approved by: https://github.com/fmassa	2025-08-06 04:50:56 +00:00
Michael Lazos	704594eb23	[Dynamo] make HOPs hashable (#159910 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159910 Approved by: https://github.com/yf225	2025-08-06 04:02:17 +00:00
eqy	bfc27cf468	[Distributed] Fix `@parametrize` on unordered iterable in distributed test (#159793 ) seems to fix https://github.com/pytorch/pytorch/issues/145807 sets aren't ordered so `@parametrize` can cause two processes to spawn with different settings originally debugged thanks to @k-artem, see https://github.com/pytorch/pytorch/issues/145807#issuecomment-2971009451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159793 Approved by: https://github.com/Skylion007, https://github.com/wconstab	2025-08-06 03:51:42 +00:00
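The hazard described in #159793 is that iteration order over a set of strings can differ between processes, since string hashing is randomized per process unless `PYTHONHASHSEED` is fixed. One way to get a stable parameter list is to sort before parametrizing. `stable_params` is a hypothetical helper shown only to illustrate the idea, and the backend names are example values; the PR itself may simply use an ordered container.

```python
def stable_params(values):
    """Return the values in a deterministic order, independent of set iteration order."""
    return sorted(values)


# Example values only; every process sees the same order after sorting.
BACKENDS = stable_params({"nccl", "gloo", "ucc"})
```

Each spawned process that builds its test list from `BACKENDS` then agrees on which parameter belongs to which generated test.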
bobrenjc93	311f74089a	remove print (#159917 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159917 Approved by: https://github.com/laithsakka	2025-08-06 03:48:23 +00:00
Tianhao Huang	14c7358c64	Enable fr_trace to read local traces from multiple hosts. (#159490 ) Summary: For training jobs particularly from GenAI, NCCL trace dumps are generated in the format of `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces. The current prefix matching logic can't handle this case. Test Plan: Create a local folder `dumps` and several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1` inside it. Then run ``` buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps ``` Before this diff, fr_trace cannot locate any trace files, giving the following assertion error: ``` AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_ ``` After this diff, fr_trace is able to locate the trace files, resulting in the exceptions like ``` dump = pickle.load(infile) ^^^^^^^^^^^^^^^^^^^ EOFError: Ran out of input ``` (since the trace files are fake and empty). Rollback Plan: Differential Revision: D79224727 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490 Approved by: https://github.com/fduwjj	2025-08-06 03:15:34 +00:00
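The file-matching problem in #159490 can be sketched with a regular expression that tolerates an optional `<hostname>.` in front of the prefix. `match_trace_files` is a hypothetical function, not the actual `fr_trace` loader, and it assumes the rank is the trailing run of digits.

```python
import re


def match_trace_files(filenames, prefix="pci3_rank_"):
    """Return (host, rank, filename) for names like '<host>.<prefix><rank>' or '<prefix><rank>'."""
    pattern = re.compile(
        r"^(?:(?P<host>.+)\.)?" + re.escape(prefix) + r"(?P<rank>\d+)$"
    )
    matches = []
    for name in filenames:
        m = pattern.match(name)
        if m:
            # host is None when the file has no hostname component.
            matches.append((m.group("host"), int(m.group("rank")), name))
    # Sort by rank, then filename, so the result is deterministic.
    return sorted(matches, key=lambda item: (item[1], item[2]))
```

A pure prefix match (`name.startswith(prefix)`) finds nothing for `host0.pci3_rank_0`, which matches the assertion error quoted in the test plan; the optional host group accepts both naming schemes.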
Dave Lei	8ce81bcee1	[Torch Package] Make get names of OrderedImporters support fallback to importers (#155743 ) Summary: OrderedImporters is supposed to be an importer which tries out every single importer in self._importers. However, the get_name API does not follow this behavior and only uses the get_name from the basic Importer class. This change updates the OrderedImporters get_name API so that it tries the get_name API of every single importer. Differential Revision: D76463252 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155743 Approved by: https://github.com/jcwchen, https://github.com/jingsh	2025-08-06 02:26:10 +00:00
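The fallback behavior described in #155743 follows a common "try each delegate in order" pattern. The sketch below uses hypothetical names (`NameLookupError`, `get_name_with_fallback`) and plain callables in place of real importers; it is not the `torch.package` implementation and makes no claim about which exception type that code uses.

```python
class NameLookupError(Exception):
    """Hypothetical error raised when a lookup cannot name an object."""


def get_name_with_fallback(lookups, obj):
    """Try each lookup callable in order and return the first successful result."""
    last_error = None
    for lookup in lookups:
        try:
            return lookup(obj)
        except NameLookupError as err:
            last_error = err  # remember the failure and try the next lookup
    raise NameLookupError(f"no lookup could name {obj!r}") from last_error
```

The ordering of `lookups` matters: the first delegate that succeeds wins, which mirrors how an ordered importer is expected to resolve modules.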
Yu, Guangye	4604f0482c	Add UT for torch.accelerator memory-related API (#155200 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155200 Approved by: https://github.com/albanD ghstack dependencies: #138222, #152932	2025-08-06 02:22:18 +00:00
Yu, Guangye	15f1173e5d	Add unified memory APIs for torch.accelerator (#152932 ) # Motivation The following API will be put under torch.accelerator - empty_cache - max_memory_allocated - max_memory_reserved - memory_allocated - memory_reserved - memory_stats - reset_accumulated_memory_stats - reset_peak_memory_stats Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932 Approved by: https://github.com/albanD ghstack dependencies: #138222	2025-08-06 02:22:18 +00:00
henrylhtsang	e16c48ae97	[BE] Fix type hint in AOTIRunnerUtil (#159577 ) Not sure why it was labelled as list in the first place. In test_aot_inductor.py, I scanned a few use cases and they are tuple as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159577 Approved by: https://github.com/Skylion007	2025-08-06 01:20:45 +00:00
Yu, Guangye	f7a66da5f9	Add DeviceAllocator as the base device allocator (#138222 ) # Motivation In line with [RFC] [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories, such as HuggingFace [so many if-else conditional code](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under torch.accelerator namespace to generalize these user cases. <div align="center"> <table> <tr> <td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td> </tr> <tr> <td> ```python torch.xxx.empty_cache ``` </td> <td> ```python torch.accelerator.empty_cache ``` </td> </tr> <tr> <td> ```python torch.xxx.reset_peak_memory_stats ``` </td> <td> ```python torch.accelerator.reset_peak_memory_stats ``` </td> </tr> <tr> <td> ```python torch.xxx.reset_accumulated_memory_stats ``` </td> <td> ```python torch.accelerator.reset_accumulated_memory_stats ``` </td> </tr> <tr> <td> ```python torch.xxx.memory_stats ``` </td> <td> ```python torch.accelerator.memory_stats ``` </td> </tr> <tr> <td> ```python torch.xxx.memory_allocated ``` </td> <td> ```python torch.accelerator.memory_allocated ``` </td> </tr> <tr> <td> ```python torch.xxx.max_memory_allocated ``` </td> <td> ```python torch.accelerator.max_memory_allocated ``` </td> </tr> <tr> <td> ```python torch.xxx.memory_reserved ``` </td> <td> ```python torch.accelerator.memory_reserved ``` </td> </tr> <tr> <td> ```python torch.xxx.max_memory_reserved ``` </td> <td> ```python torch.accelerator.max_memory_reserved ``` </td> </tr> </table> </div> # Solution This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. 
This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222 Approved by: https://github.com/albanD, https://github.com/Camyll	2025-08-06 00:40:29 +00:00
Animesh Jain	3eb3da9b4b	[dynamo][guards] Skip ID_MATCH guard on self.__class__.__closure__ (#159888 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159888 Approved by: https://github.com/williamwen42	2025-08-06 00:36:43 +00:00
Jane Xu	3ddfd46bd2	Cut a version of TORCH_ERROR_CODE_CHECK in headeronly from AOTI (#159604 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159604 Approved by: https://github.com/albanD, https://github.com/desertfire	2025-08-06 00:29:56 +00:00
Zhengxu Chen	6a82da392e	[export] Fix generated schema for C++20/23 (#159871 ) Summary: Fixing the issue from https://github.com/pytorch/pytorch/issues/159838 Test Plan: buck run caffe2/:export_update_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/ Rollback Plan: Differential Revision: D79647167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159871 Approved by: https://github.com/malfet	2025-08-06 00:23:05 +00:00
Simon Fan	22bedc429f	Extract some HOP utils to be importable (#159705 ) Useful helper function for stage 1 export -> manual partitioner -> stage 2 compile users Pull Request resolved: https://github.com/pytorch/pytorch/pull/159705 Approved by: https://github.com/zou3519 ghstack dependencies: #159134	2025-08-05 23:59:47 +00:00
Huy Do	49abc0e3f8	[Take 2] Setup TorchBench in Docker (#159300 ) Fix and reland https://github.com/pytorch/pytorch/pull/158613, I keep `checkout_install_torchbench` in `.ci/pytorch/macos-test.sh` script because it's still used there, and there is no Docker. ### Testing MacOS perf nightly run https://github.com/pytorch/pytorch/actions/runs/16580798470 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159300 Approved by: https://github.com/ZainRizvi	2025-08-05 23:47:42 +00:00
Xu Han	1052604acd	fix logging setup issue for Windows.. (#159887 ) When we set up the logging config as described in the guide: https://docs.pytorch.org/docs/stable/logging.html Such as: TORCH_LOGS="+schedule,+inductor,+output_code" On Linux, it shows as: ```cmd declare -x SSH_TTY="/dev/pts/0" declare -x TERM="xterm" declare -x TORCH_LOGS="+schedule,+inductor,+output_code" declare -x USER="xu" ``` On Windows, it shows as: ```cmd TORCHINDUCTOR_WINDOWS_TESTS=1 TORCH_LOGS="+schedule,+inductor,+output_code" UCRTVersion=10.0.22000.0 ``` Linux shows quotes by default, and Windows does not show quotes. Besides that, Windows includes the quotes in the value when processing the env var. On Linux, we will get the variable: "+schedule,+inductor,+output_code" On Windows, we will get the variable: '"+schedule,+inductor,+output_code"' So, we need to remove the outer quotes on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159887 Approved by: https://github.com/angelayi	2025-08-05 23:44:38 +00:00
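The quote handling described in the entry above could be sketched as follows. This is a minimal illustration and not the code from #159887: the helper name `strip_outer_quotes` is hypothetical, and the sketch only assumes that the value may arrive wrapped in one pair of matching quotes.

```python
def strip_outer_quotes(value: str) -> str:
    """Remove one pair of matching outer quotes, if present (hypothetical helper)."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ('"', "'"):
        return value[1:-1]
    return value


# On Linux the shell has already consumed the quotes; on Windows they may
# still be part of the value, so normalising makes both cases look the same.
linux_value = "+schedule,+inductor,+output_code"
windows_value = '"+schedule,+inductor,+output_code"'
normalized = [strip_outer_quotes(v) for v in (linux_value, windows_value)]
```

Stripping only when both ends carry the same quote character leaves values with a stray single quote untouched.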
Alex Malyshev	fe8984a9f4	Set PYTHONHOME for inductor subprocesses using torch (#159382 ) Summary: This is needed for subprocesses that are trying to call back into torch functionality, i.e. anything that's also setting `PYTHONPATH`. There are more `sys.executable` subprocesses in torch/ but it seems like they're fine. Test Plan: Local inference runs. Reviewed By: aorenste Differential Revision: D79124705 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159382 Approved by: https://github.com/aorenste	2025-08-05 23:32:48 +00:00
angelayi	74a754aae9	Add meta kernel for sdpa_math_for_mps (#159695 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159695 Approved by: https://github.com/malfet ghstack dependencies: #159456	2025-08-05 22:27:06 +00:00
angelayi	b1ec088113	[mps] Turn on inductor dynamic shapes tests (#159456 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-08-05 22:27:06 +00:00
angelayi	fb35a9ea4a	[export] Improve error messages (#159881 ) Originally, if the PT2 errored when loading, we would try to load using the old loader to handle BC issues. However, this hides the error message when an up-to-date PT2 fails to load for some other reason. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159881 Approved by: https://github.com/yushangdi	2025-08-05 22:26:48 +00:00
Sandeep Narendranath Karjala	8034b2a732	[inductor] Add TLParse artifact for logging runtime of collective and compute ops (#159730 ) Summary: - debug.py: Added log_runtime_estimates() function to dump runtime estimation data as structured tlparse artifacts in JSON format - test_structured_trace.py: Added comprehensive test coverage with testing compute and collective ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/159730 Approved by: https://github.com/yushangdi ghstack dependencies: #159190	2025-08-05 22:06:32 +00:00
anwang	64cc6f06b1	[Inductor] Revert minimal changes to avoid internal test failures (#159809 ) The diff/PR https://github.com/pytorch/pytorch/pull/159211 caused a bunch of test failures for graph compiler(T232684410). But I couldn't figure out a forward fix so far. So with this diff/PR, I'm proposing to revert the minimal changes to resolve the test failures. I'll continue the debugging, and re-land the reverted changes once we find out a forward fix. Differential Revision: [D79221721](https://our.internmc.facebook.com/intern/diff/D79221721/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159809 Approved by: https://github.com/blaine-rister, https://github.com/eellison	2025-08-05 22:05:26 +00:00
PyTorch MergeBot	410812763b	Revert "[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777 )" This reverts commit bbc0df1094b5a4dcd2cce83f8402127b07913231. Reverted https://github.com/pytorch/pytorch/pull/159777 on behalf of https://github.com/izaitsevfb due to breaking inductor test on ROCm ([comment](https://github.com/pytorch/pytorch/pull/159777#issuecomment-3156770098))	2025-08-05 22:00:24 +00:00
Michael Lazos	bdb07a2bc5	[Cutlass] Allow offsets to be passed as arguments to kernel (#159761 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159761 Approved by: https://github.com/henrylhtsang ghstack dependencies: #159760	2025-08-05 21:59:07 +00:00
Simon Fan	8085edc8f9	[autograd] torch._C._set_view_replay_enabled state leaking into other tests (#159840 ) This was causing view_fns to pop up in tests that ran after `TestAutograd.test_view_replay_enabled` where it isn't used as a context manager. It is unclear to me why we would want `_force_original_view_tracking` to mutate global state on __init__ rather than on __enter__, that could be an alternative fix. FIXES https://github.com/pytorch/pytorch/issues/156306 https://github.com/pytorch/pytorch/issues/156289 https://github.com/pytorch/pytorch/issues/156265 https://github.com/pytorch/pytorch/issues/156209 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159840 Approved by: https://github.com/albanD	2025-08-05 21:57:49 +00:00
Nikita Shulga	882d50c5bf	[C10] Add `Scalar::isUnsigned()` method (#159877 ) That returns true if the Scalar holds an unsigned integral value, with the implications of `Tag::HAS_u` semantics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159877 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2025-08-05 21:43:21 +00:00
Catherine Lee	b52a4d0821	[ez][CI] Remove some unused docker images (#159171 ) Removes unused docker images from the docker build workflow. Then removes unused definitions in build.sh. The only one I left is the vllm one because I'm pretty sure it's going to be used in the future. I assume everything not mentioned is old and we forgot to remove them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159171 Approved by: https://github.com/yangw-dev	2025-08-05 21:31:53 +00:00
Nikita Shulga	a45a840926	[CI] Disable check-labels and check_mergeability (#159900 ) See https://github.com/pytorch/pytorch/issues/159825 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159900 Approved by: https://github.com/clee2000	2025-08-05 21:16:12 +00:00
Nikita Shulga	9b953bb3fb	[BE] Update TensorPipe pin (#159834 ) No functional changes, just: - Update C++ standard to C++17 - Update `cmake` min version to 3.18 - Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10) - Replace boost optional implementation with `std::optional` wrapper - Make it compilable with gcc-14.x plus by including `cstddef` in few headers - Avoid using deprecated enums for MacOS builds Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834 Approved by: https://github.com/Skylion007	2025-08-05 20:45:09 +00:00
eellison	eb25a95a6e	Fix inductor memory estimation when a single buf has multiple mutations. Add runtime verification of mem tracking (#159569 ) With fsdp, we sometimes have multiple, non-overlapping views of a single buffer which are all mutated. Previously we considered the original buffer as an allocation, and made the mutated buffer the deallocation. With multiple mutations of the same buffer, we need to consider the original buffer as deallocated only when all of its aliases die (and avoid double counting the input buffer size). See comment inline: ``` When an operation mutates a buffer in-place, the scheduler creates a new buffer name to track the "before" and "after" states, even though they share the same memory. The mutated buffer represents a rename with zero allocation and deallocation cost. During dependency tracking, we transfer dependencies from the mutated name back to the original buffer, ensuring the original memory is only freed when all aliases are done. This handles cases where a buffer has multiple non-overlapping aliases - rather than trying to assign free costs to individual aliases, we forward all alias dependencies to the original buffer. Consider: buf0 = op0() buf1 = mutation_op_(buf0) del buf0 ... op(buf1) del buf1 The only memory events are the creation prior to op0, and the deletion following buf1. ``` As @IvanKobzarev 's logs in https://github.com/pytorch/pytorch/pull/158361/files#diff-e173a1d52aff49959c9f6d17ecc09946d8a616fc5909df884e62a15e1ebd1d41R1776-R1807 show, it can be a bit of a pain to pinpoint which part of our memory calculation is incorrect. This PR also adds a runtime verifier `config.test_configs.track_memory_lifecycle` which tracks buffer allocation and deallocation, and errors if their lifetime does not match our expectations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159569 Approved by: https://github.com/IvanKobzarev	2025-08-05 19:58:11 +00:00
eqy	9884d0351e	[CUDA] Decrease launch bounds of CTCLoss backward for blackwell (#159522 ) Otherwise we see `CUDA error: too many resources requested for launch` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159522 Approved by: https://github.com/janeyx99	2025-08-05 19:26:25 +00:00
Eli Uriegas	d7c83972d5	tools: Add mode to find python automatically (#159820 ) Add support for automatically finding Python interpreters in manylinux environments to our wheel building script. Scaffolding for sequential builds Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159820 Approved by: https://github.com/malfet	2025-08-05 19:19:22 +00:00
Nikita Shulga	e06b110f73	[Testing] Add MPS to NATIVE_DEVICES (#153835 ) This would allow me to enable more opinfo tests against the MPS device eventually. It was supposed to be a very simple change, but actually required minor adjustments to lots of test files, namely: - Introduce `all_mps_types_and` that is very similar to `all_types_and`, but skips `float64` - Decorate lots of tests with `@dtypesIfMPS(*all_mps_types())` - Skip `test_from_dlpack_noncontinguous` as it currently crashes (needs to be fixed) - Add lots of `expectedFailureIfMPS` - Delete all `@onlyNativeDeviceTypesAnd("mps")` <sarcasm> I love how well documented this variable is </sarcasm> Pull Request resolved: https://github.com/pytorch/pytorch/pull/153835 Approved by: https://github.com/Skylion007	2025-08-05 18:57:35 +00:00
Zheng, Zhaoqiong	0ba09a6d34	fix link for tutorial of inductor on windows (#159853 ) fix link issue from https://docs.pytorch.org/tutorials/prototype/inductor_windows.html to https://docs.pytorch.org/tutorials/unstable/inductor_windows.html due to structure change with pr https://github.com/pytorch/tutorials/pull/3489 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159853 Approved by: https://github.com/sekyondaMeta Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com> Co-authored-by: Zesheng Zong <zesheng.zong@outlook.com>	2025-08-05 18:37:47 +00:00
Luca Wehrstedt	aeb5321b63	Allow controlling PG backend and options via init_device_mesh (#159371 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159371 Approved by: https://github.com/wconstab, https://github.com/fduwjj, https://github.com/wanchaol	2025-08-05 12:44:14 +00:00
Ruben Rodriguez Buchillon	625108ede2	[inductor] consolidate common GEMM triton param retrieval (#159383 ) \# Why - Make loop iteration simpler - Have a common spot to make modifications that affect all the GEMM Triton templates, avoiding missed spots \# What - pull out the common logic of taking the BaseConfig objects and turning them into kwargs to feed into maybe_append_choice for Triton GEMM templates Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383 Approved by: https://github.com/jansel	2025-08-05 11:42:25 +00:00
Edward Z. Yang	09e5a93fcb	Improve graph output alias with subclass error message (#159619 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159619 Approved by: https://github.com/albanD	2025-08-05 06:47:31 +00:00
Yu, Guangye	908c5cc4c0	Generalize torch._C._set_allocator_settings to be generic (#156175 ) # Motivation This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`. Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175 Approved by: https://github.com/albanD ghstack dependencies: #159629, #150312, #156165	2025-08-05 04:08:42 +00:00
Yu, Guangye	c1145852a5	Deprecate overlapping functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165 Approved by: https://github.com/albanD ghstack dependencies: #159629, #150312	2025-08-05 04:08:42 +00:00
Yu, Guangye	ae1a706444	Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312 ) # Motivation Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We would deprecate those options that overlap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312 Approved by: https://github.com/albanD ghstack dependencies: #159629	2025-08-05 04:08:04 +00:00
Yu, Guangye	56d19a5ced	Fix AllocatorConfig potential SIO issue (#159629 ) # Motivation As @ScottTodd identified in this [comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3141524874), using STL containers like `std::string` and `std::unordered_set` at static init time can cause static initialization order issues. This PR is based on and modified from his original PR: https://github.com/pytorch/pytorch/pull/159607. I’m stacking this PR here to help facilitate the landing and validation process. Co-authored-by: @ScottTodd Pull Request resolved: https://github.com/pytorch/pytorch/pull/159629 Approved by: https://github.com/ScottTodd, https://github.com/albanD	2025-08-05 04:07:51 +00:00
Lucas Kabela	b6c53383fe	[Dynamo][Better Engineering] Type annotation for `torch/_dynamo/output_graph.py` (#159602 ) As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo. This PR adds strict typing support to `torch/_dynamo/output_graph.py` Running ``` mypy torch/_dynamo/output_graph.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Annotated \| Lines Total \| % lines covered \| Funcs Annotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 2163 \| 4792 \| 45.14% \| 121 \| 268 \| 45.15% \| \| This PR \| 4818 \| 4818 \| 100.00% \| 268 \| 268 \| 100.00% \| \| Delta \| +2655 \| +26 \| +54.84% \| +147 \| 0 \| +54.85% \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/159602 Approved by: https://github.com/Skylion007	2025-08-05 03:50:54 +00:00
Divyansh Khanna	4fd5fabee9	skip XPU for dataloader CPU only unit test (#159811 ) Fixes [#159802](https://github.com/pytorch/pytorch/issues/159802) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159811 Approved by: https://github.com/izaitsevfb	2025-08-05 03:44:01 +00:00
Nick Riasanovsky	bbc0df1094	[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777 ) Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs. Test Plan: Relying on CI. Should be a NFC. Rollback Plan: Reviewed By: davidberard98 Differential Revision: D79378792 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159777 Approved by: https://github.com/davidberard98	2025-08-05 03:29:13 +00:00
Mark Harfouche	33ec6e3e9a	Remove pin on libuv from instructions (#159504 ) This package doesn't exist at conda-forge and causes some confusion for users. see https://anaconda.org/conda-forge/libuv/files?version=1.39.0 libuv is quite stable, so the newer versions should be fine. we build with them anyway at conda-forge. see: https://github.com/conda-forge/libuv-feedstock/issues/80 Hopefully this can help future users. Fixes https://github.com/conda-forge/libuv-feedstock/issues/80 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159504 Approved by: https://github.com/seemethere	2025-08-05 03:18:42 +00:00
CaoE	efc4b460b3	Add cascade sum support for Inductor CPP backend (#156296 ) Fixes #154703 Add cascade summation support for Inductor CPP backend to improve precision for large size summation. Currently, Inductor CPP directly do reduction for sum. As shown in #154703, when the size of the sum is large and the number of parallel is small, direct reduction will cause an intolerable precision loss: ``` extern "C" void kernel(float* in_out_ptr0, const float* in_ptr0) { auto out_ptr0 = in_out_ptr0; { { float tmp_acc0 = 0; at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0); for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(3000000000L); x0+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(3000000000L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); tmp_acc0_vec = tmp_acc0_vec + tmp0; } } } tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float, 1>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec); out_ptr0[static_cast<int64_t>(0L)] = static_cast<float>(tmp_acc0); } } { { { auto tmp0 = out_ptr0[static_cast<int64_t>(0L)]; auto tmp1 = static_cast<float>(3000000000.0); auto tmp2 = tmp0 / tmp1; in_out_ptr0[static_cast<int64_t>(0L)] = tmp2; } } } } ``` After adding cascade sum support: ``` extern "C" void kernel(float* in_out_ptr0, const float* in_ptr0) { auto out_ptr0 = in_out_ptr0; { { float tmp_acc0 = 0; at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0); at::vec::Vectorized<float> masked_tmp_acc0_vec = at::vec::Vectorized<float>(0); CascadeSumHelper<float, 65536> scalar_cascade_helper0(static_cast<int64_t>(3000000000L)); CascadeSumHelper<at::vec::Vectorized<float>, 65536> cascade_helper0(static_cast<int64_t>(187500000L)); CascadeSumHelper<at::vec::Vectorized<float>, 65536> masked_cascade_helper0(static_cast<int64_t>(0L)); for(int64_t 
x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(3000000000L); x0+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(3000000000L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); tmp_acc0_vec = cascade_sum_combine(tmp0, &cascade_helper0); } } } tmp_acc0 = cascade_sum_final(&scalar_cascade_helper0); tmp_acc0_vec = cascade_sum_final(&cascade_helper0); masked_tmp_acc0_vec = cascade_sum_final(&masked_cascade_helper0); tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float, 1>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec + masked_tmp_acc0_vec); out_ptr0[static_cast<int64_t>(0L)] = static_cast<float>(tmp_acc0); } } { { { auto tmp0 = out_ptr0[static_cast<int64_t>(0L)]; auto tmp1 = static_cast<float>(3000000000.0); auto tmp2 = tmp0 / tmp1; in_out_ptr0[static_cast<int64_t>(0L)] = tmp2; } } } } ``` This will inevitably reduce performance when cascade sum is turned on. For the case shown in #154703: performance reduced by ~3%. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156296 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2025-08-05 02:54:32 +00:00
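The precision argument behind cascade summation can be illustrated with a small sketch. This is generic pairwise (cascade) summation written in Python for readability; it is not the `CascadeSumHelper` from the entry above, and the function names and the chunk size of 128 are assumptions made for illustration.

```python
import math


def naive_sum(values):
    """Left-to-right accumulation; rounding error can grow with the length."""
    total = 0.0
    for v in values:
        total += v
    return total


def cascade_sum(values, lo=0, hi=None, chunk=128):
    """Pairwise (cascade) summation: sum small chunks directly, then combine
    the partial sums in a tree so each addend passes through few additions."""
    if hi is None:
        hi = len(values)
    if hi - lo <= chunk:
        total = 0.0
        for i in range(lo, hi):
            total += values[i]
        return total
    mid = (lo + hi) // 2
    return cascade_sum(values, lo, mid, chunk) + cascade_sum(values, mid, hi, chunk)


data = [0.1] * 100_000
reference = math.fsum(data)  # correctly rounded sum, used only as a yardstick
naive_error = abs(naive_sum(data) - reference)
cascade_error = abs(cascade_sum(data) - reference)
```

Comparing `naive_error` with `cascade_error` shows how the two strategies accumulate rounding error; the extra bookkeeping for partial sums is also where a cost like the ~3% reported above would come from.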
Nikita Shulga	1ca8388442	[BE][MPS] Remove unused size12 variable (#159832 ) Fixes following compilation warning ``` /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:433:8: warning: unused variable 'size12' [-Wunused-variable] auto size12 = input_sizes[1] * input_sizes[2]; ^ 1 warning generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159832 Approved by: https://github.com/dcci	2025-08-05 02:32:06 +00:00
dolpm	b69497351d	[nativert] force resize to zero. (#159683 ) Summary: this was quite a miserable bug. there are a few kernels that don't explicitly resize outputs to zero, which led to some weird UB. Rollback Plan: Differential Revision: D79476454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159683 Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier	2025-08-05 02:25:31 +00:00
Will Constable	482f069c41	[C10D] fix slow init due to repeated dns resolution failure (#159596 ) It can be very slow to repeatedly hit DNS resolution failure, but it's very helpful to have DNS names in logs by default. So we try to use DNS but if we hit a transient failure we just disable it for the remainder of the job, logging IP addresses instead. Fixes #159007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159596 Approved by: https://github.com/d4l3k	2025-08-05 02:15:26 +00:00
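The fallback policy in the entry above (try DNS, and after the first failure log IP addresses for the rest of the job) could be sketched like this. The class `HostnameLogger` and its injectable `resolve` parameter are hypothetical and exist only for illustration; the actual change lives in the C10D code and is not reproduced here.

```python
import socket


class HostnameLogger:
    """Hypothetical helper: prefer DNS names in logs, fall back to IPs after a failure."""

    def __init__(self, resolve=socket.gethostbyaddr):
        self._resolve = resolve      # injectable so the sketch can be exercised offline
        self._dns_enabled = True

    def describe(self, ip: str) -> str:
        if self._dns_enabled:
            try:
                # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
                return self._resolve(ip)[0]
            except OSError:
                # One failure disables lookups for the rest of the job,
                # so later calls do not pay for repeated failures.
                self._dns_enabled = False
        return ip
```

Catching `OSError` covers `socket.herror` and `socket.gaierror`, which are both subclasses of it.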
Benjamin Hottell	85d931f29e	Use uppercase OR when checking for system XNNPACK (#159527 ) This PR fixes `cmake/Dependencies.cmake` to work when compiling with `USE_SYSTEM_XNNPACK=ON` by changing a lowercase `or` to an uppercase `OR`. --- For a personal project, I was building pytorch with a customized build of XNNPACK. When trying to do so I encountered the following error: ``` CMake Error at cmake/Dependencies.cmake:566 (if): if given arguments: "NOT" "XNNPACK_LIBRARY" "or" "NOT" "microkernels-prod_LIBRARY" Unknown arguments specified Call Stack (most recent call first): CMakeLists.txt:868 (include) ``` Upon making the change in this PR (changing `or` to `OR`), the process continued as expected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159527 Approved by: https://github.com/janeyx99	2025-08-05 02:10:53 +00:00
Jack Taylor	8a2f53c523	Recursively sync fbgemm submodules before build (#159477 ) ROCm inductor benchmark builds failing fbgemm build stage https://ossci-raw-job-status.s3.amazonaws.com/log/46800456622 ``` 2025-07-27T08:00:32.3443858Z /var/lib/jenkins/pytorch/fbgemm/src/RowWiseSparseAdagradFused.cc:389:18: error: no matching function for call to ‘asmjit::v1_17::x86::Vec::Vec(uint32_t)’ 2025-07-27T08:00:32.3444080Z 389 \| x86::Xmm partial_sum_xmm(partial_sum_vreg.id()); ``` It looks like asmjit fails to build, this seems to be due to submodules of fbgemm not being updated after checking out to new commit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159477 Approved by: https://github.com/pruthvistony, https://github.com/eqy	2025-08-05 02:00:54 +00:00
Kurt Mohler	b59b61a099	Add `avg_pool3d` backward pass for MPS (#159089 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159089 Approved by: https://github.com/malfet	2025-08-05 01:55:38 +00:00
Cui, Yifeng	57ab39f7e4	Update torch-xpu-ops commit pin (#159621 ) Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1f7a57](`1f7a57f507`) includes: - Add Template Parameter to the function `gpu_kernel` for Controlling Broadcasting Vectorization - Add optional NaN checks to XCCL - Fix NllLossForwardReduce2DKernelFunctor accuracy - Extend the existing communication logging to include the reduction operation for collective calls - [Reland] Install xpu codegen header to torch/include Pull Request resolved: https://github.com/pytorch/pytorch/pull/159621 Approved by: https://github.com/EikanWang	2025-08-05 01:46:15 +00:00
Michael Lazos	182975e01a	[Dynamo] Enable torch function dispatch on HOPs (#159708 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159708 Approved by: https://github.com/zou3519, https://github.com/XilunWu ghstack dependencies: #159707	2025-08-05 01:43:22 +00:00
Michael Lazos	9f8cfe7476	[Dynamo] Fix arg ordering in tf modes (#159707 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159707 Approved by: https://github.com/zou3519	2025-08-05 01:43:21 +00:00
Oguz Ulgen	e273ff028a	Fix failing test (#159800 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159800 Approved by: https://github.com/aorenste	2025-08-05 00:28:51 +00:00
David Berard	5e0fc2c9a9	[AOTI] don't allow int32 indices if {non-inf, > int32_max} upper bound is provided (#159433 ) Motivation / Context: (what I _think_ is happening here) In "eager"/just-in-time PT2 usage, dynamo/inductor will guard on whether indices fit in int32 or not. So it's generally safe in Inductor code to rely on the example values for symbolic ints in order to determine whether indices fit in int32, because the indices will be guarded on anyway; and if the inputs ever increase to `>int32_max`, dynamo will cause a recompilation. But with AOTI, those int32 guards aren't respected; so if the example input is `< int32_max` but can be `> int32_max` during future execution, then the future execution might fail / IMA. Solution space Export allows users to specify which dimensions are dynamic, and to provide ranges of valid sizes. One solution idea is to always respect the upper bound of the dynamic shape range when doing AOTI; if the index's range includes values `>int32_max`, then don't use the hint and assume that this index doesn't fit in int32. However, the problem with this is that many users may specify dynamism without specifying a range of values - the upper bound of the range will be set to the default of `inf`. Such use cases could potentially experience a perf regression if we implemented the idea above. To prevent any such regressions, this implementation will rely solely on the specified range only if the upper bound of the range isn't inf. In other words, we'll ignore the hints/example values for AOTI (and rely only on the specified range) only if the upper bound of the range isn't inf - if users explicitly specify a range that extends past int32, we can be fairly sure that they actually do need values `>int32_max`. If we continue to see correctness issues even with this implementation, we could consider more aggressively relying on the ranges. 
Differential Revision: [D79220301](https://our.internmc.facebook.com/intern/diff/D79220301) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159433 Approved by: https://github.com/jingsh, https://github.com/ColinPeppler	2025-08-05 00:17:09 +00:00
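The decision rule described in the entry above can be paraphrased as a small sketch. The function `index_fits_int32` and its parameters are hypothetical names chosen for illustration; this is a reading of the commit message, not the Inductor implementation.

```python
import math

INT32_MAX = 2**31 - 1


def index_fits_int32(hint: int, upper_bound: float, aot: bool) -> bool:
    """Hypothetical decision rule paraphrasing the commit message."""
    if aot and not math.isinf(upper_bound):
        # An explicit finite range was given: trust the range, ignore the hint.
        return upper_bound <= INT32_MAX
    # JIT use (guards trigger recompilation) or an unbounded range:
    # fall back to the example value.
    return hint <= INT32_MAX
```

For example, with a hint of 1000 and a user-specified upper bound of `2**40`, the sketch rejects int32 indices under AOT but still accepts them in the guarded JIT case.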
Shangdi Yu	bc4b04e058	DeviceCopy should have the same layout as input (#159615 ) Summary: Fix https://github.com/pytorch/pytorch/issues/159612 - Fix the meta implementation of `nan_to_num`, it should preserve the stride of the input - The DeviceCopy IR node should always preserve the input's layout, so we don't end up with a contiguous call during device copy Test Plan: ``` buck2 run @mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_d2h_copy ``` Rollback Plan: Differential Revision: D79411407 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159615 Approved by: https://github.com/eellison	2025-08-04 23:56:58 +00:00
David Berard	6b414f56a4	Revert "[inductor] add lowering for repeat_interleave.Tensor with output size specified (#147160 ) (#158462 )" (#159798 ) This reverts commit 305a03727672de42870f956ddf4ad9fa424443e1. Reason: causes device-side assertion failures when running with this repro (a minimized version of a failure seen in a real model) ``` import torch def ri(inp, repeats, output_size): return torch.repeat_interleave(inp, repeats, output_size=output_size) inp = torch.arange(0, 4, device="cuda").reshape(-1, 1) x = torch.tensor([1, 2, 3, 4], device="cuda") ri_c = torch.compile(ri) print(ri(inp, x, 10)) print(ri_c(inp, x, 10)) ``` which leads to errors like ``` /tmp/torchinductor_dberard/3h/c3hlb22fpptebupstsuhl6kexa6z3upgbnyxln7c24gfcr5747iu.py:30: unknown: block: [0,0,0], thread: [10,0,0] Assertion `index out of bounds: 0 <= tmp5 < 4` failed. ``` Differential Revision: [D79591561](https://our.internmc.facebook.com/intern/diff/D79591561) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159798 Approved by: https://github.com/danzimm	2025-08-04 23:39:20 +00:00
PyTorch MergeBot	fb8f32ef52	Revert "[mps] Turn on inductor dynamic shapes tests (#159456 )" This reverts commit 19f1f9960db7f29f2110a7f49f06a1a23c651ecf. Reverted https://github.com/pytorch/pytorch/pull/159456 on behalf of https://github.com/davidberard98 due to Sorry - this causes a merge conflict with https://github.com/pytorch/pytorch/pull/159798, which I'm trying to land with co-dev to resolve a sev ([comment](https://github.com/pytorch/pytorch/pull/159456#issuecomment-3152751821))	2025-08-04 23:11:05 +00:00
Michael Lazos	7ba996bbaa	[Cutlass] Fix wrapper code generation breakage (#159760 ) Fixes issues introduced by https://github.com/pytorch/pytorch/pull/159355 The issue got past OSS CI because the H100 tag wasn't added, not sure how to prevent these kinds of issues in the future, perhaps we should run H100 on Inductor PRs? Pull Request resolved: https://github.com/pytorch/pytorch/pull/159760 Approved by: https://github.com/angelayi	2025-08-04 23:03:03 +00:00
henrylhtsang	ddbdcdc710	[cutlass backend][test] Expand FP8 tests to FP16 (#159538 ) Differential Revision: [D79317343](https://our.internmc.facebook.com/intern/diff/D79317343/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159538 Approved by: https://github.com/mlazos	2025-08-04 23:01:55 +00:00
angelayi	19f1f9960d	[mps] Turn on inductor dynamic shapes tests (#159456 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-08-04 22:44:31 +00:00
yewentao256	fd6655a0f5	Feature: Implement support for `cudnn_batch_norm_out` kernel to replace the autogen approach. (#123020 ) Fixes #115611 Autogen kernel may cause redundant copy, so we develop the kernel to improve efficiency. Test Case: ```c++ #include <torch/torch.h> #include <iostream> #include <ATen/ATen.h> #include <ATen/cuda/CUDAContext.h> int main() { auto input = torch::rand({2, 3, 4, 4}, torch::device(torch::kCUDA)); auto weight = torch::randn({3}, torch::device(torch::kCUDA)); auto bias = torch::randn({3}, torch::device(torch::kCUDA)); auto running_mean = torch::zeros({3}, torch::device(torch::kCUDA)); auto running_var = torch::ones({3}, torch::device(torch::kCUDA)); bool training = true; double exponential_average_factor = 0.1; double epsilon = 1e-5; auto output = torch::empty_like(input); auto save_mean = torch::empty({3}, torch::device(torch::kCUDA)); auto save_var = torch::empty({3}, torch::device(torch::kCUDA)); auto reserve = torch::empty({0}, torch::device(torch::kCUDA)); // empty place-holder at::native::cudnn_batch_norm_out(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon, output, save_mean, save_var, reserve); auto outputs = at::native::cudnn_batch_norm(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon); bool is_close_output = torch::allclose(output, std::get<0>(outputs)); bool is_close_save_mean = torch::allclose(save_mean, std::get<1>(outputs)); bool is_close_save_var = torch::allclose(save_var, std::get<2>(outputs)); bool is_close_reserve = torch::allclose(reserve, std::get<3>(outputs)); std::cout << "Is output close: " << is_close_output << std::endl; std::cout << "Is save_mean close: " << is_close_save_mean << std::endl; std::cout << "Is save_var close: " << is_close_save_var << std::endl; std::cout << "Is reserve close: " << is_close_reserve << std::endl; return 0; } ``` Please CC @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/123020 Approved by: https://github.com/andrewor14, https://github.com/eqy, https://github.com/albanD	2025-08-04 22:40:33 +00:00
Lucas Kabela	a7f3bdf550	[Dynamo][Better Engineering] Type coverage for `torch/_dynamo/utils.py` (#159580 ) As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo. This PR adds strict typing support to `torch/_dynamo/utils.py` Running ``` mypy torch/_dynamo/utils.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Annotated \| Lines Total \| % lines covered \| Funcs Annotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 2163 \| 4792 \| 45.14% \| 121 \| 268 \| 45.15% \| \| This PR \| 4818 \| 4818 \| 100.00% \| 268 \| 268 \| 100.00% \| \| Delta \| +2655 \| +26 \| +54.84% \| +147 \| 0 \| +54.85% \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/159580 Approved by: https://github.com/williamwen42	2025-08-04 21:51:53 +00:00
Xu Han	510e8b4ae0	[inductor] use writable temp file on windows (#159738 ) Use `WritableTempFile` on Windows, reference to: https://github.com/pytorch/pytorch/pull/159342 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159738 Approved by: https://github.com/angelayi, https://github.com/Skylion007	2025-08-04 21:51:02 +00:00
PyTorch MergeBot	83ba3f1101	Revert "[inductor] allocate non-blocking copy destinations in pinned memory (#155121 ) (#158758 )" This reverts commit 6085bf7565fec0d2ed26e8590001f09c05adbbe4. Reverted https://github.com/pytorch/pytorch/pull/158758 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371))	2025-08-04 21:47:11 +00:00
PyTorch MergeBot	1fad16aacb	Revert "[inductor] move all cpu scalars using pinned memory for graph partition (#155360 ) (#158983 )" This reverts commit 444e2381d07a14cb501c00d11f9e63a3f1d2c86e. Reverted https://github.com/pytorch/pytorch/pull/158983 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371))	2025-08-04 21:47:11 +00:00
Markus Hoehnerbach	444e2381d0	[inductor] move all cpu scalars using pinned memory for graph partition (#155360 ) (#158983 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158983 Approved by: https://github.com/eellison ghstack dependencies: #158758	2025-08-04 21:42:05 +00:00
Markus Hoehnerbach	6085bf7565	[inductor] allocate non-blocking copy destinations in pinned memory (#155121 ) (#158758 ) Fixes #155121 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758 Approved by: https://github.com/EikanWang, https://github.com/eellison	2025-08-04 21:22:11 +00:00
Natalia Gimelshein	8201dbf4bc	check driver to be >=12.4 to use fabric handles (#159697 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/159697 Approved by: https://github.com/malfet	2025-08-04 21:05:39 +00:00
atalman	26d045bb60	Linux py 3.14 wheel builds (#157559 ) Related to https://github.com/pytorch/pytorch/issues/156856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157559 Approved by: https://github.com/malfet, https://github.com/albanD	2025-08-04 20:55:19 +00:00
PyTorch MergeBot	356ac3103a	Revert "Stop parsing command line arguments every time common_utils is imported. (#156703 )" This reverts commit 310f901a71e53688866b14bb2f2b4c8eef9979b3. Reverted https://github.com/pytorch/pytorch/pull/156703 on behalf of https://github.com/izaitsevfb due to breaking tests internally with `assert common_utils.SEED is not None` ([comment](https://github.com/pytorch/pytorch/pull/156703#issuecomment-3152337518))	2025-08-04 20:37:39 +00:00
Kurt Mohler	d4109a0f99	[MPS] Add max_unpool1d/2d/3d (#159789 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159789 Approved by: https://github.com/malfet	2025-08-04 20:00:59 +00:00
Arsh Zahed	7ea789ccfb	Revert #156868 : Bring back symint check for sharding propagation cache (#159671 ) Fixes #159601 Unfortunately #156868 introduced a couple regressions (see #159590 and #159601). This reverts the commit while I am working on a permanent fix. This means the `in_compiled_autograd_initial_trace` global flag will be removed and the `_are_we_tracing()` will instead be replaced with the symint preprocessing step during sharding prop post init. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159671 Approved by: https://github.com/xmfan	2025-08-04 19:58:48 +00:00
PyTorch MergeBot	7e8197e34d	Revert "Migrate ScalarType to headeronly (#159416 )" This reverts commit 1371a98b0e727f8a8916dd473b6dd0cff78c0449. Reverted https://github.com/pytorch/pytorch/pull/159416 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D79452481 ([comment](https://github.com/pytorch/pytorch/pull/159416#issuecomment-3152138508))	2025-08-04 19:55:09 +00:00
Benjamin Glass	50eac811a6	[typing] Constrain OrderedSet generic to be Hashable (#159684 ) Ran across this typing bug while creating an OrderedSet from a type I didn't realize wasn't hashable, which failed at runtime. With this constraint, typing would've failed pre-runtime. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159684 Approved by: https://github.com/Skylion007	2025-08-04 18:08:01 +00:00
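The effect of the Hashable constraint described in the entry above can be sketched in plain Python. This is a minimal illustration, not the actual OrderedSet from the PyTorch tree: `TinyOrderedSet` is a hypothetical stand-in, and the assumption is that the real change amounts to bounding the element type variable by `Hashable`.

```python
from typing import Dict, Generic, Hashable, Iterable, Iterator, Optional, TypeVar

# Assumption: the constraint is expressed as a bound on the element TypeVar.
T = TypeVar("T", bound=Hashable)


class TinyOrderedSet(Generic[T]):
    """Hypothetical insertion-ordered set backed by a dict (illustration only)."""

    def __init__(self, items: Optional[Iterable[T]] = None) -> None:
        # dict keys must be hashable, so unhashable elements fail right here.
        self._items: Dict[T, None] = dict.fromkeys(() if items is None else items)

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


ok = TinyOrderedSet([3, 1, 3, 2])  # ints are hashable, duplicates collapse

try:
    # A type checker would flag this call because list is not Hashable;
    # without the bound, the mistake only shows up when the code runs.
    TinyOrderedSet([[1], [2]])  # type: ignore[type-var]
    runtime_failed = False
except TypeError:
    runtime_failed = True
```

With the bound in place, a checker such as mypy would be expected to reject the second construction before the program runs, which is the pre-runtime failure the commit message asks for.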
ILCSFNO	4e0f179d0b	Update the signature and test of torch.hamming_window() (#152682 ) Fixes #146590 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152682 Approved by: https://github.com/albanD	2025-08-04 17:50:42 +00:00
Tan Hoang	36e59d9b12	[c10d][nvshmem] fix missing override compilation error for nvshmem symmetric code (#159557 ) Summary: Fix error when compiling nvshmem code section `NVSHMEMSymmetricMemory.cu` with BUCK ``` fbcode/caffe2/torch/csrc/distributed/c10d/symm_mem/NVSHMEMSymmetricMemory.cu:154:20: error: 'get_buffer' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override] 154 \| virtual at::Tensor get_buffer(int \| ^ fbcode/caffe2/torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp:56:20: note: overridden virtual function is here 56 \| virtual at::Tensor get_buffer(int rank, c10::IntArrayRef sizes, c10::ScalarType dtype, int64_t storage_offset) = 0; ``` Test Plan: Build test + CI Rollback Plan: Differential Revision: D78813586 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159557 Approved by: https://github.com/kwen2501	2025-08-04 17:46:30 +00:00
angelayi	fc340d0ca3	[export] Allow comparing device w/o index with device w/ index (#159665 ) In the case where we have expected device "cuda" and given device "cuda:0" I think we should succeed? Pull Request resolved: https://github.com/pytorch/pytorch/pull/159665 Approved by: https://github.com/yushangdi	2025-08-04 17:00:07 +00:00
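The comparison rule proposed in the entry above can be sketched without torch. The helper names `parse_device` and `devices_match` are hypothetical, and the handling of an expected device that carries an index is an assumption; the commit message only states that expected "cuda" with given "cuda:0" should succeed.

```python
from typing import Optional, Tuple


def parse_device(spec: str) -> Tuple[str, Optional[int]]:
    """Split a device string such as "cuda:0" into (type, index or None)."""
    kind, sep, index = spec.partition(":")
    return kind, int(index) if sep else None


def devices_match(expected: str, given: str) -> bool:
    """Hypothetical check: an expected device without an index accepts any index."""
    exp_kind, exp_index = parse_device(expected)
    giv_kind, giv_index = parse_device(given)
    if exp_kind != giv_kind:
        return False
    if exp_index is None:
        # Case from the commit message: expected "cuda", given "cuda:0".
        return True
    # Assumption: when the expectation names an index, it must match exactly.
    return exp_index == giv_index
```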
Animesh Jain	53e47af0f7	[dynamo][guards] Read the attr name from GetAttrGuardAccessor (#159754 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159754 Approved by: https://github.com/jansel ghstack dependencies: #159752	2025-08-04 16:51:27 +00:00
Animesh Jain	66ad881fc7	[dynamo][guards][refactor] Simplify type extraction from GuardManager (#159752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159752 Approved by: https://github.com/jansel	2025-08-04 16:51:27 +00:00
amdfaa	1d3eef27ac	[ROCm CI] Migrate to MI325 Capacity (#159649 ) Migrate mi300s to gfx942. Related to https://github.com/pytorch/pytorch/pull/159059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159649 Approved by: https://github.com/huydhn	2025-08-04 16:48:12 +00:00
Xu Han	dd95900cec	[AOTI] normalize_path_separator file path for Windows. (#159726 ) `normalize_path_separator` file path for Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159726 Approved by: https://github.com/angelayi, https://github.com/jansel	2025-08-04 15:57:19 +00:00
yuchengliu1	1cdd665526	fix test_verbose_logs_dynamic_shapes with MSVC (#159573 ) The `typeid` operator produces different output under different compilers. There is a good example in [cppreference](https://www.en.cppreference.com/w/cpp/language/typeid.html). Pull Request resolved: https://github.com/pytorch/pytorch/pull/159573 Approved by: https://github.com/angelayi, https://github.com/jansel	2025-08-04 15:56:53 +00:00
Tan Hoang	7cb2dcd2dd	[c10d][nvshmem] modify is_nvshmem_available runtime check to work with static-linked library (#159558 ) (#159561 ) Summary: Currently this function relies on the assumption that we link `libnvshmem_device.a` statically and load `libnvshmem_host.so` at runtime. When `libnvshmem.a` (the two combined) is linked statically, this check fails. Add a section that checks whether a symbol from the host API exists at runtime, to detect that nvshmem is linked statically. Test Plan: CI + sample run Rollback Plan: Differential Revision: D79177525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159561 Approved by: https://github.com/kwen2501	2025-08-04 15:40:29 +00:00
Aleksei Nikiforov	e5a81aa7ba	Fix conversion of values in libtorch agnostic tests (#155115 ) Due to different byteorder, when copying data, it has to be put into last bytes to ensure that int32_t converted to int64_t keeps same value. Same has to be done when it's converted back. This change fixes test TestLibtorchAgnosticCPU::test_my_ones_like_cpu from cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py on s390x. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155115 Approved by: https://github.com/huydhn	2025-08-04 13:40:22 +00:00
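The byte-order issue described in the entry above can be reproduced with the standard `struct` module. This is an illustrative sketch, not the test code from the PR; it assumes a zero-filled 8-byte slot and a non-negative value (a negative value would also need sign extension).

```python
import struct

value = 7

# Big-endian layout (as on s390x): the most significant byte comes first.
raw32_be = struct.pack(">i", value)

slot_first = raw32_be + bytes(4)  # int32 copied into the first 4 bytes of the slot
slot_last = bytes(4) + raw32_be   # int32 copied into the last 4 bytes of the slot

wrong = struct.unpack(">q", slot_first)[0]  # value lands in the high half
right = struct.unpack(">q", slot_last)[0]   # value is preserved

# Little-endian layout: copying into the first 4 bytes happens to work,
# which is why the problem only shows up on big-endian machines.
raw32_le = struct.pack("<i", value)
le_first = struct.unpack("<q", raw32_le + bytes(4))[0]

print(wrong, right, le_first)
```

On a big-endian layout the low-order bytes of a 64-bit integer sit at the end of the slot, so the 32-bit value has to be written there, and read back from there when converting in the other direction.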
Andrey Talman	3e2aa4b0e3	Update pin to include Python 3.14 support (#159725 ) Update Triton Pin to top of rel/3.4 branch : https://github.com/triton-lang/triton/tree/rel/3.4 . This is the same as release/3.4.x branch but also includes Python 3.14 support This should unblock enablement of Python 3.14 support in this PR: https://github.com/pytorch/pytorch/pull/157559 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159725 Approved by: https://github.com/davidberard98	2025-08-04 13:30:12 +00:00
Aleksei Nikiforov	6646461764	S390X: fix detection of magic number placeholder in inductor (#157784 ) This change fixes multiple tests in test/inductor/test_aot_inductor_arrayref.py such as test_cond_with_parameters_cpu_with_stack_allocation, test_issue_140766_cpu_with_stack_allocation, test_model_modified_weights_cpu_with_stack_allocation, test_nested_tensor_from_jagged_cpu_with_stack_allocation. Enable tests in test/inductor/test_aot_inductor_arrayref.py This change is split off from https://github.com/pytorch/pytorch/pull/150116 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157784 Approved by: https://github.com/huydhn	2025-08-04 12:42:31 +00:00
PyTorch UpdateBot	f74da2a136	[xla hash update] update the pinned xla hash (#159758 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159758 Approved by: https://github.com/pytorchbot	2025-08-04 11:21:45 +00:00
eqy	d35b27dde5	[CUDA] Add some more missing `@serialTest` decorators (#159672 ) Seems to fix #159663 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159672 Approved by: https://github.com/Skylion007	2025-08-04 07:44:35 +00:00
anwang	a9dc1566d4	[MTIA Aten Backend] Migrate arange.start_out (#159540 ) Differential Revision: [D79317519](https://our.internmc.facebook.com/intern/diff/D79317519/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159540 Approved by: https://github.com/malfet, https://github.com/nautsimon	2025-08-04 07:38:05 +00:00
Jiang, Yanbing	33a1996714	Fix perf downgrad by reverting template use in use_mkldnn_matmul (#159024 ) This PR fixes the performance regression by reverting the template use in `use_mkldnn_matmul` introduced in #157520 . Fixes https://github.com/pytorch/pytorch/issues/159031 and https://github.com/pytorch/pytorch/issues/159551. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159024 Approved by: https://github.com/mingfeima	2025-08-04 05:49:46 +00:00
Animesh Jain	ee62177c19	[dynamo] Be consistent with storing func source for UserMethodVariable (#159696 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159696 Approved by: https://github.com/jansel ghstack dependencies: #159534	2025-08-04 05:12:44 +00:00
Animesh Jain	64cbaa876c	[dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159534 Approved by: https://github.com/jansel	2025-08-04 05:12:44 +00:00
Animesh Jain	4516c59f5f	[dynamo][source] Add special source for __code__ and __closure__ (#159722 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159722 Approved by: https://github.com/jansel	2025-08-04 05:02:05 +00:00
PyTorch UpdateBot	8bc843a9ec	[vllm hash update] update the pinned vllm hash (#159610 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159610 Approved by: https://github.com/pytorchbot	2025-08-04 04:06:09 +00:00
Jason Ansel	e39a62c70d	Fix warnings in triton_helpers.py (#159719 ) ``` /home/jansel/pytorch/torch/_inductor/runtime/triton_helpers.py:152: UserWarning: Logical operators 'and' and 'or' are deprecated for non-scalar tensors; please use '&' or '\|' instead equal \|= a_isnan and b_isnan ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159719 Approved by: https://github.com/Skylion007	2025-08-04 03:21:09 +00:00
Laith Sakka	978e3a9142	refresh expected results (#159727 ) Just regular update due to recent <10% changes CI is stable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159727 Approved by: https://github.com/anijain2305	2025-08-03 22:47:50 +00:00
Nikita Shulga	e2a5c42e7e	[BE][MPS] Build metal kernels of MacOS-14+ (#159733 ) Which makes `#if __METAL_VERSION__ >= 310` guards for `bfloat` use support unnecessary. Rename `kernels_bfloat.metallib` into `kernels_basic` and remove custom build/selection logic. Part of https://github.com/pytorch/pytorch/issues/159275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159733 Approved by: https://github.com/dcci ghstack dependencies: #159731, #159732	2025-08-03 20:53:58 +00:00
Nikita Shulga	5116c49b52	[BE] Remove macos-13 guard from bench_mps_ops (#159732 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159732 Approved by: https://github.com/dcci ghstack dependencies: #159731	2025-08-03 20:53:58 +00:00
Nikita Shulga	fecdebe385	[CI][MPS] Fix compile benchmark correctness (#159731 ) By passing `fullgraph=True` attribute and increasing cache size limit to 2**16 Otherwise, compiler might decide not to fall back to eager to avoid recompilations Pull Request resolved: https://github.com/pytorch/pytorch/pull/159731 Approved by: https://github.com/dcci	2025-08-03 20:53:50 +00:00
Nikita Shulga	e136a9175b	[BE] Fix dev warning in `Dependencies.cmake` (#159702 ) Namely ``` CMake Warning (dev) in cmake/Dependencies.cmake: A logical block opening on the line /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:261 (if) closes on the line /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:263 (endif) with mis-matching arguments. ``` Introduced by https://github.com/pytorch/pytorch/pull/143846 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159702 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-08-03 18:45:07 +00:00
Francisco Massa	9a680e14b7	[bucketing] Reduce CPU overhead for reduce_scatter_merge_fn_to_trace (#159723 ) The previous implementation was creating `n_gpu * n_tensors` intermediate tensors, which added a lot of CPU overhead, especially given that inductor was generating a number of individual tensor copy kernels for `torch.cat`. This PR changes the implementation so that only `n_tensors` are created, making the CPU overhead proportional to the number of tensors being bucketed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159723 Approved by: https://github.com/IvanKobzarev	2025-08-03 09:16:55 +00:00
PyTorch MergeBot	805a102beb	Revert "[dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534 )" This reverts commit 1616777cd2a3170ff76afa3e7860b0969420c445. Reverted https://github.com/pytorch/pytorch/pull/159534 on behalf of https://github.com/malfet due to Broke some inductor test and lint among other things, see `9c18901bfd/1` ([comment](https://github.com/pytorch/pytorch/pull/159534#issuecomment-3146983186))	2025-08-03 04:58:32 +00:00
PyTorch MergeBot	6e8d705a22	Revert "[dynamo] Be consistent with storing func source for UserMethodVariable (#159696 )" This reverts commit be71000ff5292293d1976f313218e2df4d5046d3. Reverted https://github.com/pytorch/pytorch/pull/159696 on behalf of https://github.com/malfet due to Broke some inductor test and lint among other things, see `9c18901bfd/1` ([comment](https://github.com/pytorch/pytorch/pull/159534#issuecomment-3146983186))	2025-08-03 04:58:32 +00:00
anwang	9c18901bfd	[MTIA Aten Backend] Migrate all.out (#159539 ) Differential Revision: [D79317033](https://our.internmc.facebook.com/intern/diff/D79317033/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159539 Approved by: https://github.com/malfet ghstack dependencies: #159098	2025-08-03 02:08:35 +00:00
Oguz Ulgen	a29ed5e1ac	Add torch compile force disable caches alias (#158072 ) A bunch of people keep thinking the current alias only disables the inductor cache because it has inductor in the name. Let's globalize the name. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072 Approved by: https://github.com/ezyang	2025-08-02 23:23:17 +00:00
Francisco Massa	d2792f51b2	[bucketing] Use max of input/output size for bucketing (#159717 ) The output of a reduce_scatter is n_gpu times smaller than its input, while the output of an all_gather is n_gpu times larger than its input. This means that in the current heuristic for bucketing reduce_scatter, we would need to use a bucket size which is n_gpu times larger than the bucket for all_gather, making it GPU-dependent and less intuitive. This PR proposes using the max of the input and output sizes instead, so that one can use the same bucket_size value for both passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159717 Approved by: https://github.com/wconstab	2025-08-02 22:42:22 +00:00
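The size rule from the entry above comes down to a few lines of arithmetic. This is a sketch under stated assumptions: `bucketing_size` is a hypothetical helper name, and sizes are plain byte counts rather than anything read from an FX graph.

```python
def bucketing_size(input_bytes: int, output_bytes: int) -> int:
    """Hypothetical helper: size a collective by the larger of its two sides."""
    return max(input_bytes, output_bytes)


n_gpu = 8
shard_bytes = 1 << 20  # 1 MiB per-rank shard, chosen only for illustration

# reduce_scatter: the input is n_gpu times larger than the output.
reduce_scatter_size = bucketing_size(n_gpu * shard_bytes, shard_bytes)

# all_gather: the output is n_gpu times larger than the input.
all_gather_size = bucketing_size(shard_bytes, n_gpu * shard_bytes)

print(reduce_scatter_size, all_gather_size)
```

Both collectives are measured at 8 MiB here, so a single bucket_size threshold has the same meaning in both passes regardless of the number of GPUs.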
Animesh Jain	be71000ff5	[dynamo] Be consistent with storing func source for UserMethodVariable (#159696 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159696 Approved by: https://github.com/jansel ghstack dependencies: #159186, #159534	2025-08-02 21:40:38 +00:00
Aaron Orenstein	3f86076775	gc before warming up benchmarking (#159670 ) #158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark is leaving around before the benchmark starts - so GC before warming up the model. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670 Approved by: https://github.com/oulgen	2025-08-02 19:37:24 +00:00
Animesh Jain	1616777cd2	[dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159534 Approved by: https://github.com/jansel ghstack dependencies: #159186	2025-08-02 18:04:35 +00:00
rajeshvshiyal	38895c0ac2	Update RuntimeError message in is_nonzero(input) method from bool to Boolean (#159712 ) RuntimeError message updated in is_nonzero(input) method from bool to Boolean. Case 1: t = torch.tensor([]) torch.is_nonzero(t) Case 2: t = torch.tensor([1,2]) torch.is_nonzero(t) Existing Error message in documentation: for case 1: RuntimeError: bool value of Tensor with no values is ambiguous for case 2: RuntimeError: bool value of Tensor with more than one value is ambiguous Proposed Error message in documentation: for case 1: RuntimeError: Boolean value of Tensor with no values is ambiguous for case 2: RuntimeError: Boolean value of Tensor with more than one value is ambiguous Fixes #159710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159712 Approved by: https://github.com/malfet	2025-08-02 17:23:45 +00:00
Anthony Barbier	310f901a71	Stop parsing command line arguments every time common_utils is imported. (#156703 ) Last PR in the series to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs: https://github.com/pytorch/pytorch/pull/154612 https://github.com/pytorch/pytorch/pull/154628 https://github.com/pytorch/pytorch/pull/154715 https://github.com/pytorch/pytorch/pull/154716 https://github.com/pytorch/pytorch/pull/154725 https://github.com/pytorch/pytorch/pull/154728 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156703 Approved by: https://github.com/clee2000	2025-08-02 16:38:54 +00:00
Nichols A. Romero	e11b1cd97e	[ROCm] fix nightly wheel due to rocBLAS environment variable (#159570 ) Fixes #159070 The TunableOp failure is due to missing rocBLAS files in our manywheels packaging. This bug has been present since June 7-8 time frame. It was caused by a typo in the rocBLAS environment variable that stores the list of files. It was introduced in this PR: https://github.com/pytorch/pytorch/pull/155388 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159570 Approved by: https://github.com/malfet	2025-08-02 06:54:43 +00:00
Wenyuan Chi	b599d91738	Log autotune choices and benchmark result to scuba/chrome trace (#159496 ) Summary: Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels. Example Event: mm_template_autotuning Column: autotune_choices ```json { "num_choices": 52, "num_triton_choices": 19, "best_kernel": "cutlass_f6c25cf2", "best_kernel_desc": "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8", "best_time": 0.6283040046691895, "best_triton_pos": 26, "best_triton_time": 0.6832960247993469, "best_triton_kernel": "triton_mm_17", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0" } ``` Test Plan: ``` TORCHINDUCTOR_MAX_AUTOTUNE_REPORT_CHOICES_STATS =1 buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt ``` Rollback Plan: Differential Revision: D79235037 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159496 Approved by: https://github.com/masnesral	2025-08-02 05:34:17 +00:00
Xiao, Wang	fd6a6658c3	Enable _int_mm on Intel GPU (#157769 ) # Motivation This PR enables _int_mm on Intel GPU. _int_mm is used by int8 quantization in torchao. # Model Test Result: We ran meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8-dynamic-quantization. The model config is as follows: Precision : torch.bfloat16 quantization configuration : Int8DynamicActivationInt8WeightConfig dataset : wikitext Result: The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157769 Approved by: https://github.com/EikanWang, https://github.com/desertfire	2025-08-02 05:16:01 +00:00
PyTorch UpdateBot	04973496a8	[audio hash update] update the pinned audio hash (#159611 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159611 Approved by: https://github.com/pytorchbot	2025-08-02 05:15:47 +00:00
Sam Larsen	1548b011ea	Fix rand_like decomposition to preserve strides (#159294 ) Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898. Test Plan: New unit test (fails before this PR; but fixed after) Differential Revision: [D79472604](https://our.internmc.facebook.com/intern/diff/D79472604) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294 Approved by: https://github.com/eellison	2025-08-02 03:54:41 +00:00
angelayi	e57a92734d	[export] Fix nn_module_stack of assert_tensor_metadata nodes (#159625 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/159625 Approved by: https://github.com/yushangdi	2025-08-02 02:52:42 +00:00
Dylan Maloy	79ff3b320b	Back out "[ez] get rid of unused var" (#159677 ) Summary: turns out i added this to reduce the frequency we'd call try_update_max_size_at_index when a new maximum is found before the replan is called. oops. Test Plan: backout Rollback Plan: Differential Revision: D79474114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159677 Approved by: https://github.com/georgiaphillips	2025-08-02 01:50:16 +00:00
nandesuka	426f249f20	Fix launch grid calculation (#159497 ) Summary: The launch grid calculation code uses a Python trick to achieve CeilDiv() through negative integer division with FloorDiv(). This is language-dependent behaviour that doesn't apply to all languages. In the FXIR backend we undo this trick and replace the expression with a CeilDiv() operation so the computation is correct regardless of the language used. We do not directly change the original computation, as that leads to a performance degradation. Test Plan: CI Rollback Plan: Differential Revision: D79275534 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159497 Approved by: https://github.com/blaine-rister	2025-08-02 01:12:58 +00:00
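The language dependence mentioned in the entry above can be shown with plain integers. This sketch is not the inductor code; the function names are hypothetical, and `trunc_div` emulates the truncating integer division used by C-family languages.

```python
def ceil_div_floor_trick(n: int, d: int) -> int:
    """Python idiom: floor division rounds toward negative infinity."""
    return -(-n // d)


def trunc_div(n: int, d: int) -> int:
    """Emulates C-style integer division, which rounds toward zero."""
    q = abs(n) // abs(d)
    return q if (n < 0) == (d < 0) else -q


def ceil_div_trick_with_truncation(n: int, d: int) -> int:
    """The same expression evaluated with truncating division."""
    return -trunc_div(-n, d)


def ceil_div_explicit(n: int, d: int) -> int:
    """Form that does not rely on rounding direction (for n >= 0, d > 0)."""
    return (n + d - 1) // d


numel, block = 10, 4
print(ceil_div_floor_trick(numel, block))            # 3 blocks cover 10 elements
print(ceil_div_trick_with_truncation(numel, block))  # 2 blocks, one short
print(ceil_div_explicit(numel, block))               # 3
```

The negation trick yields a grid that is one block short whenever the division is inexact and the target language truncates, which is consistent with the commit replacing it by an explicit CeilDiv() in the FXIR backend.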
Edward Z. Yang	d33a484763	Use boxed_nop_preserve_node_meta for aot_export_joint_with_descriptors (#159545 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159545 Approved by: https://github.com/xmfan, https://github.com/wconstab ghstack dependencies: #159336, #159337	2025-08-02 00:33:41 +00:00
Natalia Gimelshein	a81ffbc5f5	improve shape checks for grouped_mm (#159666 ) Check that contraction dimension matches between tensors if it's known, and do device-side checks for correct offsets Pull Request resolved: https://github.com/pytorch/pytorch/pull/159666 Approved by: https://github.com/danielvegamyhre, https://github.com/eqy	2025-08-02 00:12:25 +00:00
Huy Do	465fe4d9f7	Enable sample nightly PT2 benchmark on B200 (#158011 ) Per the discussion with @nWEIdia, this resumes the work on https://github.com/pytorch/pytorch/pull/157870 to enable PT2 benchmark on B200 ### Testing https://github.com/pytorch/pytorch/actions/runs/16615101382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158011 Approved by: https://github.com/nWEIdia, https://github.com/atalman	2025-08-01 23:47:44 +00:00
Natalia Gimelshein	9477af1063	fix compilation on cuda < 12.3 (#159657 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/159657 Approved by: https://github.com/kwen2501	2025-08-01 23:40:55 +00:00
Lucas Kabela	dcc36e38bb	[Graph Breaks] Remove unsupported Additional Info field (#159658 ) Race condition when landing PR#158800 caused us to add this field when it is deprecated, so remove it Pull Request resolved: https://github.com/pytorch/pytorch/pull/159658 Approved by: https://github.com/williamwen42	2025-08-01 23:25:50 +00:00
Zain Rizvi	efd78584a8	[EZ] Add linux-aarch64.yml workflow to the viable/strict blocking set (#159668 ) Since it's required to be run on every PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/159668 Approved by: https://github.com/malfet	2025-08-01 23:19:08 +00:00
Oguz Ulgen	135762ea20	Unpin helion (#159579 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159579 Approved by: https://github.com/jansel	2025-08-01 23:08:06 +00:00
Sherlock Huang	e2ee9cfaa2	[NativeRT] Turn on enableStaticCPUKernels by default (#159422 ) Summary: As title. Test Plan: Need to manual test on production models. Rollback Plan: Differential Revision: D78747742 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159422 Approved by: https://github.com/dolpm	2025-08-01 22:27:07 +00:00
Andy Lugo	06d28de17a	Update CK Kernel generation and update ck submodule (#157964 ) changes required to reduce the number of ck kernels generated. This change depends on https://github.com/ROCm/composable_kernel/pull/2480 to be merged first. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157964 Approved by: https://github.com/842974287	2025-08-01 22:24:27 +00:00
anwang	df9720b8b5	[MTIA Aten Backend] Migrate all foreach ops (#159098 ) # Context See the first PR https://github.com/pytorch/pytorch/pull/153670 # This diff Migrate all foreach operators to in-tree, including: - _foreach_abs - _foreach_abs_ - _foreach_add.List - _foreach_add_.List - _foreach_add_.Scalar - _foreach_add_.Tensor - _foreach_addcmul.Scalar - _foreach_addcmul_.Scalar - _foreach_copy - _foreach_copy_ - _foreach_mul.List - _foreach_mul_.List - _foreach_mul_.Scalar - _foreach_mul.Tensor - _foreach_mul_.Tensor - _foreach_norm.Scalar - _foreach_sqrt_ Differential Revision: [D78913847](https://our.internmc.facebook.com/intern/diff/D78913847/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159098 Approved by: https://github.com/malfet	2025-08-01 22:10:12 +00:00
Sandeep Narendranath Karjala	85e74d5ace	[inductor] Add logging for distributed collective ops for multi‑rank diagnostics (#159190 ) This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs. - Iterates over scheduler.nodes, filters for _CollectiveKernel nodes - Extracts each op’s python_kernel_name - Emits a structured JSON payload under the inductor_collective_schedule artifact name - Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact - Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops). Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190 Approved by: https://github.com/yushangdi, https://github.com/xmfan	2025-08-01 21:51:42 +00:00
Sheng Fu	0450f05658	Output tensor meta data for FX graph node (#159311 ) FX graph segment in CompiledFxGraph does not include tensor meta data, for example, tensor shape, tensor stride, tensor data type, tensor device. AI system co-design team requested to include these information in FX graph segment so they can use FX graph segment to project the performance on different hardware. This DIFF is to modify the Graph::Node::format_node to include tensor meta data. Before this DIFF, the triton kernel FX graph segment looks like the following: ``` # %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm] # %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1] # %sin : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {}) # %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {}) # %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 1111), kwargs = {}) # %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {}) # %cos : cuda:0"[num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%add,), kwargs = {}) # return %cos After this DIFF: # %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm] # %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1] # %sin : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {}) # %permute_1 : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {}) # %mul : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 1111), kwargs = {}) # %add : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = 
{}) # %cos : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%add,), kwargs = {}) # return %cos ``` If format_node can not be changed, I can copy the code to caffe2/torch/_inductor/utils.py. Differential Revision: D77973076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159311 Approved by: https://github.com/angelayi	2025-08-01 21:40:29 +00:00
zeshengzong	595a65f5c2	[dynamo] Replace unimplemented with unimplemented_v2 in `torch/_dynamo/variables/script_object.py` (#159343 ) Fixes part of #147913 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159343 Approved by: https://github.com/williamwen42 Co-authored-by: William Wen <william.wen42@gmail.com>	2025-08-01 21:30:41 +00:00
tiandeyu-cs	8c6c2e40eb	Edit a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend (#159542 ) As suggested in the pull request #158903 by @H-huang, this pull request edits a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159542 Approved by: https://github.com/d4l3k, https://github.com/H-Huang	2025-08-01 21:20:25 +00:00
henrylhtsang	32840d19f9	[cutlass backend] skip stream k if shape is dynamic (#159442 ) Differential Revision: [D79229210](https://our.internmc.facebook.com/intern/diff/D79229210/) Motivation: the workspace size is hard to determine and varies with the shape. I observed that the workspace can sometimes grow even when the shape gets smaller, so it is hard to put an upper bound on it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159442 Approved by: https://github.com/ColinPeppler	2025-08-01 20:42:24 +00:00
Xuehai Pan	2040f00112	[BE][Easy] respect `os.environ` in subprocess calls in tools/nightly.py (#159572 ) Respect parent shell's envvars, such as `UV_INDEX_STRATEGY`, `http{,s}_proxy`, etc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159572 Approved by: https://github.com/Skylion007	2025-08-01 20:40:31 +00:00
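A minimal sketch of the pattern described in the commit above, not the code in `tools/nightly.py`. The helper `run_with_parent_env` and the `DEMO_*` variable names are hypothetical. The idea is to layer extra variables on top of a copy of `os.environ` rather than passing a fresh dictionary, so proxy and index settings from the parent shell still reach the child process.

```python
import os
import subprocess
import sys


def run_with_parent_env(cmd, extra_env=None):
    """Run cmd with the parent's environment plus optional overrides (hypothetical helper)."""
    env = os.environ.copy()       # start from the parent shell's variables
    env.update(extra_env or {})   # explicit overrides win over inherited values
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Hypothetical variable names, used only for illustration.
os.environ["DEMO_PARENT_VAR"] = "from-parent"
child_code = "import os; print(os.environ['DEMO_PARENT_VAR'], os.environ['DEMO_EXTRA'])"
result = run_with_parent_env(
    [sys.executable, "-c", child_code],
    extra_env={"DEMO_EXTRA": "override"},
)
print(result.stdout.strip())
```

Passing `env={"DEMO_EXTRA": "override"}` directly to `subprocess.run` would instead drop every inherited variable, which is the failure mode the commit message refers to.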
Lucas Kabela	c137f9da0b	[Dynamo][Better Engineering] Add type coverage to dynamo/compiled_autograd.py (#159518 ) As part of the better engineering effort, we would like to improve our type support to improve dev experience in dynamo. This PR adds strict typing support to `torch/_dynamo/compiled_autograd.py` Running ``` mypy torch/_dynamo/compiled_autograd.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Annotated \| Lines Total \| % lines covered \| Funcs Annotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 425 \| 1553 \| 27.37% \| 17 \| 62 \| 27.42% \| \| This PR \| 1623 \| 1623 \| 100.00% \| 62 \| 62 \| 100.00% \| \| Delta \| +1198\| +0 \| +72.63% \| +45 \| 0 \| +72.58% \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/159518 Approved by: https://github.com/xmfan	2025-08-01 20:24:58 +00:00
Howard Huang	5e8b95605f	[PP] Support OVERLAP_F_B computation type (#158978 ) Some changes to validation code and visualizer to support a new computation type that will be used in DualPipeV (see https://github.com/pytorch/pytorch/pull/159591) The IR looks like: ``` [0F0, 0F1, 0F2, 0F3, 0F4, 0F5, 0F6, 7F0, 7I0, 7W0, 7F1, 7I1, 7W1, 7F2, 7I2, 7W2, 7F3, (0F7;7B3)OVERLAP_F_B, (7F4;0B0)OVERLAP_F_B, (0F8;7B4)OVERLAP_F_B, (7F5;0B1)OVERLAP_F_B, (0F9;7B5)OVERLAP_F_B, (7F6;0B2)OVERLAP_F_B, 7B6, (7F7;0B3)OVERLAP_F_B, 7B7, (7F8;0B4)OVERLAP_F_B, 7B8, (7F9;0B5)OVERLAP_F_B, 7B9, 0I6, 0W6, 0I7, 0W7, 0I8, 0W8, 0I9, 0W9] [1F0, 1F1, 1F2, 1F3, 1F4, 6F0, 1F5, 6F1, 6I0, 6W0, 6F2, 6I1, 6W1, 6F3, (1F6;6B2)OVERLAP_F_B, (6F4;1B0)OVERLAP_F_B, (1F7;6B3)OVERLAP_F_B, (6F5;1B1)OVERLAP_F_B, (1F8;6B4)OVERLAP_F_B, (6F6;1B2)OVERLAP_F_B, (1F9;6B5)OVERLAP_F_B, (6F7;1B3)OVERLAP_F_B, 6B6, (6F8;1B4)OVERLAP_F_B, 6B7, (6F9;1B5)OVERLAP_F_B, 6B8, 1B6, 6I9, 1I7, 6W9, 1I8, 1W7, 1I9, 1W8, 1W9] [2F0, 2F1, 2F2, 5F0, 2F3, 5F1, 2F4, 5F2, 5I0, 5W0, 5F3, (2F5;5B1)OVERLAP_F_B, (5F4;2B0)OVERLAP_F_B, (2F6;5B2)OVERLAP_F_B, (5F5;2B1)OVERLAP_F_B, (2F7;5B3)OVERLAP_F_B, (5F6;2B2)OVERLAP_F_B, (2F8;5B4)OVERLAP_F_B, (5F7;2B3)OVERLAP_F_B, (2F9;5B5)OVERLAP_F_B, (5F8;2B4)OVERLAP_F_B, 5B6, (5F9;2B5)OVERLAP_F_B, 5B7, 2B6, 5B8, 2I7, 5I9, 2I8, 2W7, 2I9, 5W9, 2W8, 2W9] [3F0, 4F0, 3F1, 4F1, 3F2, 4F2, 3F3, 4F3, 3F4, 4B0, (4F4;3B0)OVERLAP_F_B, (3F5;4B1)OVERLAP_F_B, (4F5;3B1)OVERLAP_F_B, (3F6;4B2)OVERLAP_F_B, (4F6;3B2)OVERLAP_F_B, (3F7;4B3)OVERLAP_F_B, (4F7;3B3)OVERLAP_F_B, (3F8;4B4)OVERLAP_F_B, (4F8;3B4)OVERLAP_F_B, (3F9;4B5)OVERLAP_F_B, (4F9;3B5)OVERLAP_F_B, 4B6, 3B6, 4B7, 3B7, 4I8, 3I8, 4I9, 3I9, 4W8, 3W8, 4W9, 3W9] ``` In this PR, the schedule execution will just treat the OVERLAP_F_B as two separate operations of F and B (so there is no actual overlap). The next step is to allow users to create a custom function to plug in what this operation does. 
`814629043a/torch/distributed/pipelining/schedules.py (L1205-L1216)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158978 Approved by: https://github.com/wconstab	2025-08-01 20:22:30 +00:00
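The statement above that `OVERLAP_F_B` is treated as two separate operations can be sketched as follows. This is an illustration only: the regular expressions and the `expand_actions` helper are assumptions derived from the textual IR printed in the commit message, not the parser in `torch/distributed/pipelining/schedules.py`.

```python
import re

# Assumed shape of a plain action in the printed IR: "<stage><type><microbatch>".
_ACTION = re.compile(r"^(\d+)([FBIW])(\d+)$")
# Assumed shape of an overlapped pair: "(<action>;<action>)OVERLAP_F_B".
_OVERLAP = re.compile(r"^\((.+);(.+)\)OVERLAP_F_B$")


def expand_actions(actions):
    """Flatten OVERLAP_F_B entries into their sub-actions, preserving order."""
    flat = []
    for text in actions:
        overlap = _OVERLAP.match(text)
        parts = overlap.groups() if overlap else (text,)
        for part in parts:
            m = _ACTION.match(part)
            if m is None:
                raise ValueError(f"unrecognized action: {text!r}")
            stage, kind, microbatch = m.groups()
            flat.append((int(stage), kind, int(microbatch)))
    return flat


schedule = ["7F3", "(0F7;7B3)OVERLAP_F_B", "7B6"]
print(expand_actions(schedule))
```

With this expansion the overlapped entry runs as a forward followed by a backward, so no actual overlap happens, which matches the behavior the commit describes for this first step.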
Jane Xu	8ea86a6e31	Actually test STD_TORCH_CHECK, add testfile to CMake (#159603 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159603 Approved by: https://github.com/Skylion007, https://github.com/albanD	2025-08-01 19:53:41 +00:00
PyTorch MergeBot	acad808545	Revert "[inductor] consolidate common GEMM triton param retrieval (#159383 )" This reverts commit e7cc42df58a86bee05944f6e80c535aa1d099443. Reverted https://github.com/pytorch/pytorch/pull/159383 on behalf of https://github.com/jataylo due to sorry but rocm CI is broken due to this PR ([comment](https://github.com/pytorch/pytorch/pull/159383#issuecomment-3145604831))	2025-08-01 19:49:21 +00:00
PyTorch MergeBot	c687446374	Revert "Fix rand_like decomposition to preserve strides (#159294 )" This reverts commit 2c46922ce4b33c39b1c48c302604805510a3f889. Reverted https://github.com/pytorch/pytorch/pull/159294 on behalf of https://github.com/yangw-dev due to breaking internal test ([comment](https://github.com/pytorch/pytorch/pull/159294#issuecomment-3145541845))	2025-08-01 19:19:51 +00:00
Will Constable	dd22ba09b4	[C10D] Document barrier interaction with device_id (#159389 ) Addresses #159262 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159389 Approved by: https://github.com/malfet, https://github.com/H-Huang, https://github.com/kwen2501, https://github.com/fduwjj	2025-08-01 18:12:21 +00:00
Yu, Guangye	c0e0126399	Remove unused input parameter in ExpandableSegment (#159356 ) # Motivation While refactoring the caching allocator, I noticed that the `ExpandableSegment` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion. # Additional Context `ExpandableSegment` is defined in a cpp file, so it should be safe to make this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159356 Approved by: https://github.com/ngimel, https://github.com/albanD ghstack dependencies: #159159	2025-08-01 17:47:51 +00:00
Ivan Zaitsev	e4b123b5e4	Revert direct updates (#159654 ) reverts: ``` commit 5711a8f06948eeee56ed5f53f171fa519f78491c (tag: trunk/5711a8f06948eeee56ed5f53f171fa519f78491c, origin/main, main) Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com> Date: Fri Aug 1 09:32:52 2025 -0700 Update test_utils.py commit b4b71d011ed07a41c2086ff0dec2988a63662877 (tag: trunk/b4b71d011ed07a41c2086ff0dec2988a63662877) Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com> Date: Fri Aug 1 09:27:54 2025 -0700 Update utils.py commit 52376b9b6fbf9fe24f5d82038dc520f0c64b6f8d (tag: trunk/52376b9b6fbf9fe24f5d82038dc520f0c64b6f8d) Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com> Date: Fri Aug 1 09:26:05 2025 -0700 ``` (commits pushed directly to main by mistake) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159654 Approved by: https://github.com/atalman	2025-08-01 16:54:51 +00:00
Jovian Anthony Jaison	5711a8f069	Update test_utils.py	2025-08-01 09:32:52 -07:00
Jovian Anthony Jaison	b4b71d011e	Update utils.py	2025-08-01 09:27:54 -07:00
Jovian Anthony Jaison	52376b9b6f	Update convert_frame.py	2025-08-01 09:26:05 -07:00
Jane Xu	1371a98b0e	Migrate ScalarType to headeronly (#159416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159416 Approved by: https://github.com/albanD ghstack dependencies: #159415, #159411	2025-08-01 16:07:01 +00:00
linmin	2a286cbdf4	Allow register_buffer with Tensor-like object (#159455 ) As torch allows extending the tensor with `__torch_function__`, it would be desirable to allow registering it as a buffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159455 Approved by: https://github.com/mikaylagawarecki	2025-08-01 15:31:38 +00:00
Scott Todd	7c37b8e1e0	[ROCm][Windows] Switch __builtin_clz ifdef from WIN32 to MSC_VER. (#159273 ) PyTorch with ROCm on Windows is built with clang-cl and not MSVC. This code path is specific to the MSVC compiler so it should be checking for MSC_VER, not just WIN32. The change here is similar to https://github.com/pytorch/pytorch/pull/146606. This fixes downstream build errors using clang-cl like https://github.com/ROCm/TheRock/actions/runs/16569646709/job/46858176812 (patched and tested downstream at https://github.com/ROCm/TheRock/pull/1140): ``` [7099/7147] Building CXX object functorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj FAILED: functorch/CMakeFiles/functorch.dir/csrc/dim/dim.cpp.obj C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\clang-cl.exe /nologo -TP -DEXPORT_AOTI_FUNCTIONS -DFUNCTORCH_BUILD_MAIN_LIB -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNOMINMAX -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DROCM_ON_WINDOWS -DROCM_USE_FLOAT16 -DROCM_VERSION=70000 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -DTORCH_HIP_VERSION=700 -DUSE_EXTERNAL_MZCRC -DUSE_MIMALLOC -DUSE_PROF_API=1 -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_UCRT_LEGACY_INFINITY -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -Dfunctorch_EXPORTS -IB:\src\torch\build\aten\src -IB:\src\torch\aten\src -IB:\src\torch\build -IB:\src\torch -IB:\src\torch\nlohmann -IB:\src\torch\moodycamel -IB:\src\torch\third_party\mimalloc\include -IB:\src\torch\functorch -IB:\src\torch\torch\csrc\api -IB:\src\torch\torch\csrc\api\include -IB:\src\torch\c10\.. -IB:\src\torch\c10\hip\..\.. -IB:\src\torch\torch\.. 
-IB:\src\torch\torch\..\aten\src -IB:\src\torch\torch\..\aten\src\TH -IB:\src\torch\build\caffe2\aten\src -IB:\src\torch\build\third_party -IB:\src\torch\build\third_party\onnx -IB:\src\torch\torch\..\third_party\valgrind-headers -IB:\src\torch\torch\..\third_party\gloo -IB:\src\torch\torch\..\third_party\onnx -IB:\src\torch\torch\..\third_party\flatbuffers\include -IB:\src\torch\torch\..\third_party\kineto\libkineto\include -IB:\src\torch\torch\..\third_party\cpp-httplib -IB:\src\torch\torch\..\third_party\nlohmann\include -IB:\src\torch\torch\csrc -IB:\src\torch\torch\lib -IB:\src\torch\torch\standalone -IB:\src\torch\torch\lib\libshm_windows -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include -imsvcB:\src\torch\third_party\protobuf\src -imsvcB:\src\torch\third_party\XNNPACK\include -imsvcB:\src\torch\third_party\ittapi\include -imsvcB:\src\torch\cmake\..\third_party\eigen -imsvcB:\src\torch\third_party\ideep\mkl-dnn\include\oneapi\dnnl -imsvcB:\src\torch\third_party\ideep\include -imsvcB:\src\torch\INTERFACE -imsvcB:\src\torch\third_party\nlohmann\include -imsvcB:\src\torch\third_party\concurrentqueue -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\hiprand -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\rocrand -imsvcB:\src\torch\cmake\..\third_party\pybind11\include -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\include /DWIN32 /D_WINDOWS /EHsc /Zc:__cplusplus /bigobj /FS /utf-8 -DUSE_PTHREADPOOL -DNDEBUG -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE /wd4624 /wd4068 /wd4067 /wd4267 /wd4661 /wd4717 /wd4244 /wd4804 /wd4273 /O2 /Ob2 /DNDEBUG /bigobj -DNDEBUG -std:c++17 -MD -Z7 -Wmissing-prototypes -Werror=missing-prototypes /permissive- /d2implyavx512upperregs- /EHsc /bigobj -fms-runtime-lib=dll -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 
-DTORCH_HIP_VERSION=700 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -DHIP_ENABLE_WARP_SYNC_BUILTINS -fms-extensions -Wno-ignored-attributes /showIncludes /Fofunctorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj /Fdfunctorch\CMakeFiles\functorch.dir\ -c -- B:\src\torch\functorch\csrc\dim\dim.cpp clang-cl: warning: unknown argument ignored in clang-cl: '-std=c++17' [-Wunknown-argument] clang-cl: warning: argument unused during compilation: '/d2implyavx512upperregs-' [-Wunused-command-line-argument] In file included from B:\src\torch\functorch\csrc\dim\dim.cpp:36: B:\src\torch\functorch\csrc\dim\arena.h(14,21): error: functions that differ only in their return type cannot be overloaded 14 \| inline unsigned int __builtin_clz(unsigned int x) { \| ~~~~~~~~~~~~ ^ C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\lib\clang\20\include\ia32intrin.h(60,15): note: '__builtin_clz' is a builtin with type 'int (unsigned int) noexcept' 60 \| return 31 - __builtin_clz((unsigned int)__A); \| ^ 1 error generated. [7100/7147] Building CXX object caffe2\torch\CMakeFiles\torch_python.dir\csrc\utils\tensor_list.cpp.obj ``` > [!NOTE] > I haven't been able to reproduce those errors locally, but we have CI jobs that consistently fail when building for Python 3.11 but not 3.12 or 3.13. I'm not sure what is different between those builds, but the code fix seems correct. 
There are a few other variations on fixes to this floating around, such as: * `a97a957af0/lz4.c (L34-L43)` (checking with `__has_builtin`) * `c98c55ec7e/lj92.c (L31-L46)` (the same code as here, but with `_MSC_VER`) * `2760e5a2bb/def.h (L23-L25)` (using `__lzcnt` instead of a custom implementation) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159273 Approved by: https://github.com/Skylion007, https://github.com/m-gallus	2025-08-01 15:21:26 +00:00
Raphael Reme	ee2649219c	Fix max_width computation in _tensor_str._Formatter (#126859 ) The previous version of `torch._tensor_str._Formatter` did not use `PRINT_OPTS.sci_mode` for the `max_width` computation but did use it when formatting values, leading to a discrepancy. Now the code first checks whether it should be in sci_mode, then computes `max_width`. Here is an example to test the behavior: ```python import torch A = torch.tensor([10, 1e-1, 1e-2]) B = torch.tensor([10, 1e-1, 1e-1]) print("================= Default =================") print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}") print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}") print("================= sci_mode=False =================") with torch._tensor_str.printoptions(sci_mode=False): print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}") print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}") print("================= sci_mode=True =================") with torch._tensor_str.printoptions(sci_mode=True): print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}") print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}") ``` In the current version this prints: ``` ================= Default ================= tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10 tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7 ================= sci_mode=False ================= tensor([ 10.0000, 0.1000, 0.0100]) Formatter max_width: 10 tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7 ================= sci_mode=True ================= tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10 tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7 ``` One can see that with `sci_mode=False`, the values of A are prefixed with unneeded 0 and do not have the same `max_width` as B (A keeps the `max_width` from `sci_mode = None`).
Also, with `sci_mode = True`, the `max_width` for B is 7 but each value takes 10 chars. (This is fine in practice because the code that uses `max_width` does not rely much on it, but it is still misleading.) After this commit, this will print ``` ================= Default ================= tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10 tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7 ================= sci_mode=False ================= tensor([10.0000, 0.1000, 0.0100]) Formatter max_width: 7 tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7 ================= sci_mode=True ================= tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10 tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10 ``` This also aligns A with B for `sci_mode=False`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859 Approved by: https://github.com/malfet	2025-08-01 15:05:41 +00:00
Howard Huang	b0b3e6e48b	[PP] Refactor test_schedule_multiproc (#158780 ) This refactors the pipelining schedule tests since a lot of them have the same repeated code of: 1. Create pipelined model and reference model 2. Run reference model and pipelined model 3. compare gradients So this refactors those parts above into helper methods and reduces ~300 LOC. Also adds a better gradient check to resolve flakiness (fixes https://github.com/pytorch/pytorch/issues/154408). Pull Request resolved: https://github.com/pytorch/pytorch/pull/158780 Approved by: https://github.com/wconstab	2025-08-01 15:02:18 +00:00
Xilun Wu	3967dbedf4	[ContextParallel][FlexAttention] Prototype of supporting FlexAttention in Context Parallel (#158692 ) Summary This PR adds an all-gather based FlexAttention and uses TorchFunctionMode to dispatch `FlexAttentionHOP.__call__` to it. This PR makes the following changes: - add a user-facing API `create_cp_block_mask` for creating CP-specific `BlockMask` which masks over the attention result of Q shard and KV global. - add `_ContextParallelGlobalVars` to store all necessary global vars that CP FlexAttention requires. `torch_function_mode` is critical to maintain singleton mode to avoid dynamo recompilations. - add a dispatch path for `FlexAttentionForwardHOP.__call__` (TorchFunctionMode dispatch won't work correctly without this line) What's not in this PR: - QKV load balancing - Test on other masking besides `causal_mask`. - Support on small attention (i.e. qkv size is smaller than 128) because the block mask rewrite function requires `Q_BLOCK_SIZE == KV_BLOCK_SIZE == 128`. Test `pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention` Followup 1. create an issue to reproduce the error in `create_fw_bw_graph()` when trying to call `create_block_mask` to re-write `block_mask` in `FlexAttentionHOP` dispatch in `TorchFunctionMode`. 2. Merge `_ContextParallelGlobalVars` and `_cp_options`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158692 Approved by: https://github.com/drisspg	2025-08-01 06:49:01 +00:00
gaoyufeng	4396b15aa7	remove co_lnotab in favor of co_linetable (#159227 ) Fixes #158833 DeprecationWarning: remove co_lnotab in favor of co_linetable Pull Request resolved: https://github.com/pytorch/pytorch/pull/159227 Approved by: https://github.com/ezyang	2025-08-01 06:34:38 +00:00
zpcore	bb6766053b	fix strategy hashing arg mismatch (#159506 ) Reland https://github.com/pytorch/pytorch/pull/159289. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159506 Approved by: https://github.com/XilunWu	2025-08-01 05:42:40 +00:00
tiandeyu-cs	a4fc051c9a	Fix a bug of distributed 'gather' with noncontiguous tensors on the NCCL backend. (#159549 ) Fixes #159548 * Throw an error message when the input tensors for the distributed `gather` are noncontiguous. This behaviour is consistent with the distributed `all_gather`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159549 Approved by: https://github.com/d4l3k	2025-08-01 03:26:06 +00:00
PyTorch MergeBot	5cc6a0abc1	Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312 )" This reverts commit dfacf11f66d6512396382bdf5088f0ba9de00406. Reverted https://github.com/pytorch/pytorch/pull/150312 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))	2025-08-01 03:24:54 +00:00
PyTorch MergeBot	90f13f3b2a	Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 )" This reverts commit 1fc010a9d8ea95bb74e54b31d17eba56ef16c27c. Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))	2025-08-01 03:24:54 +00:00
PyTorch MergeBot	cb9b74872b	Revert "Generalize torch._C._set_allocator_settings to be generic (#156175 )" This reverts commit d3ce45012ed42cd1e13d5048b046b781f0feabe0. Reverted https://github.com/pytorch/pytorch/pull/156175 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))	2025-08-01 03:24:54 +00:00
Catherine Lee	c964204829	[CI] Disable executorch jobs (#159595 ) The current executorch pin needs to be updated The next time the docker image gets rebuilt, the executorch docker build is going to fail like https://github.com/pytorch/pytorch/actions/runs/16626853655/job/47137807966 The failure is that the pin uses a version of the nightly that has been removed from the nightly index ``` #62 72.30 ERROR: Could not find a version that satisfies the requirement torch==2.8.0.dev20250601 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.1, 2.6.0, 2.7.0, 2.7.1, 2.8.0.dev20250602+cpu, 2.8.0.dev20250603+cpu, 2.8.0.dev20250604+cpu, 2.8.0.dev20250605+cpu, 2.8.0.dev20250606+cpu, 2.8.0.dev20250607+cpu, 2.8.0.dev20250608+cpu, 2.8.0.dev20250609+cpu, 2.8.0.dev20250610+cpu, 2.8.0.dev20250611+cpu, 2.8.0.dev20250612+cpu, 2.8.0.dev20250613+cpu, 2.8.0.dev20250614+cpu, 2.8.0.dev20250615+cpu, 2.8.0.dev20250616+cpu, 2.8.0.dev20250617+cpu, 2.8.0.dev20250618+cpu, 2.8.0.dev20250619+cpu, 2.8.0.dev20250620+cpu, 2.8.0.dev20250621+cpu, 2.8.0.dev20250622+cpu, 2.8.0.dev20250623+cpu, 2.8.0.dev20250624+cpu, 2.8.0.dev20250625+cpu, 2.8.0.dev20250626+cpu, 2.8.0.dev20250627+cpu, 2.9.0.dev20250628+cpu, 2.9.0.dev20250629+cpu, 2.9.0.dev20250630+cpu, 2.9.0.dev20250701+cpu, 2.9.0.dev20250702+cpu, 2.9.0.dev20250703+cpu, 2.9.0.dev20250704+cpu, 2.9.0.dev20250705+cpu, 2.9.0.dev20250706+cpu, 2.9.0.dev20250707+cpu, 2.9.0.dev20250708+cpu, 2.9.0.dev20250709+cpu, 2.9.0.dev20250710+cpu, 2.9.0.dev20250711+cpu, 2.9.0.dev20250712+cpu, 2.9.0.dev20250713+cpu, 2.9.0.dev20250714+cpu, 2.9.0.dev20250715+cpu, 2.9.0.dev20250716+cpu, 2.9.0.dev20250717+cpu, 2.9.0.dev20250718+cpu, 2.9.0.dev20250719+cpu, 2.9.0.dev20250720+cpu, 2.9.0.dev20250722+cpu, 2.9.0.dev20250723+cpu, 2.9.0.dev20250724+cpu, 2.9.0.dev20250725+cpu, 2.9.0.dev20250726+cpu, 2.9.0.dev20250727+cpu, 2.9.0.dev20250728+cpu, 2.9.0.dev20250729+cpu, 2.9.0.dev20250730+cpu, 
2.9.0.dev20250731+cpu) #62 72.30 ERROR: No matching distribution found for torch==2.8.0.dev20250601 ``` The executorch hash update currently fails due to https://github.com/pytorch/pytorch/actions/runs/16636773244/job/47079169392 ``` 2025-07-31T01:56:57.0249165Z + echo 'expecting triton to not be installed, but it is' 2025-07-31T01:56:57.0249614Z expecting triton to not be installed, but it is 2025-07-31T01:56:57.0249969Z + exit 1 2025-07-31T01:58:27.6764352Z ##[error]Final attempt failed. Child_process exited with error code 1 ``` I believe the cause is https://github.com/pytorch/executorch/pull/11653 where the nightly pytorch is installed from our index, but then requirements-examples installs timm from pypi, which reinstalls pytorch, except that it appears to be the release build for CUDA from pypi, which then causes triton to be installed. I don't know what the intended behavior is, so I'm disabling the executorch docker build, executorch build, and the nightly hash update; apparently the test was already disabled because it was failing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159595 Approved by: https://github.com/malfet	2025-08-01 02:18:03 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	2ac45c2752	Fix autocast context manager when there is exception (#159565 ) Summary: When an exception occurs inside a context manager, we need to either return False or properly propagate the exception via __exit__(exc_type, exc_val). Previously, while tracing, we did not actually run the exit node, so we ended up swallowing the exception in a very weird way, as outlined in https://github.com/pytorch/pytorch/issues/153202. This PR fixes it. Test Plan: new test case Rollback Plan: Differential Revision: D79348382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159565 Approved by: https://github.com/zou3519, https://github.com/yushangdi	2025-08-01 02:12:24 +00:00
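The `__exit__` contract referenced in the summary above can be illustrated in plain Python. This sketch uses two hypothetical context managers, `Propagate` and `Swallow`, and says nothing about how the autocast tracing code is implemented: a falsy return from `__exit__` lets the exception continue, while a truthy return suppresses it.

```python
class Propagate:
    """Context manager whose __exit__ lets exceptions through."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        return False  # falsy: the exception raised in the body is re-raised


class Swallow:
    """Context manager whose __exit__ suppresses exceptions."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        return True  # truthy: the exception is silently dropped


def run(cm):
    """Report whether an error raised inside `with cm:` reaches the caller."""
    try:
        with cm:
            raise ValueError("boom")
    except ValueError:
        return "propagated"
    return "swallowed"


print(run(Propagate()), run(Swallow()))
```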
Xia, Weiwen	83e2ea8135	[CPU] fix _weight_int8pack_mm with large output shape (#158341 ) Summary `_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). This is because the kernel computes the output buffer address as ```c++ auto* C_ptr = C_data + mb_start * N + nb_start; ``` where both `mb_start` and `N` are `int`, so their product may overflow when they are large. The fix is simple: declare these variables as `int64_t` so that the product cannot overflow. Test plan ``` pytest -sv test/test_linalg.py -k test__int8_mm_large_shape ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341 Approved by: https://github.com/mingfeima, https://github.com/drisspg	2025-08-01 01:55:48 +00:00
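To make the overflow above concrete, the following sketch mimics the offset arithmetic with `ctypes`, which truncates to the requested width. The shape values are made up for illustration, and the helper names are hypothetical. In C++ a signed `int` overflow is undefined behavior, so the wrapped value shown here is only what two's-complement hardware typically produces.

```python
import ctypes


def offset_as_int32(mb_start, n, nb_start):
    """Evaluate mb_start * N + nb_start as if stored in a 32-bit signed int."""
    return ctypes.c_int32(mb_start * n + nb_start).value


def offset_as_int64(mb_start, n, nb_start):
    """Evaluate the same expression with 64-bit signed arithmetic."""
    return ctypes.c_int64(mb_start * n + nb_start).value


# Illustrative sizes only: 50_000 * 50_000 exceeds 2**31 - 1.
mb_start, n, nb_start = 50_000, 50_000, 0
print(offset_as_int32(mb_start, n, nb_start))  # wraps to a negative offset
print(offset_as_int64(mb_start, n, nb_start))  # keeps the true value
```

A negative offset added to `C_data` points before the start of the buffer, which is consistent with the segmentation fault described in the summary.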
Raman Kumar	d994027a41	[Doc fix] fix spelling of enough (#159587 ) fixes typo in word `enought` to correct `enough` at 3 places in these files ``` aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu aten/src/ATen/native/cuda/CuFFTPlanCache.h ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159587 Approved by: https://github.com/ezyang	2025-08-01 01:50:57 +00:00
PyTorch MergeBot	cb4f41e125	Revert "[dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#157566 )" This reverts commit 8e07c9870d07c5a318ab21bb16b3fa27576851e6. Reverted https://github.com/pytorch/pytorch/pull/157566 on behalf of https://github.com/yangw-dev due to failed an odd internal test, please reach out to metamate to fix it, D79112610 ([comment](https://github.com/pytorch/pytorch/pull/157566#issuecomment-3141840110))	2025-08-01 01:27:45 +00:00
rzou	690fc9cf88	[merge_rules] add some expected failure and skips (#159581 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159581 Approved by: https://github.com/anijain2305	2025-08-01 01:18:40 +00:00
henrylhtsang	eb853e222b	[cutlass upgrade] Ignore unused-but-set-variable for AsyncMM.cu (#159578 ) Fixes inductor-perf-nightly-h100. This was caused by cutlass upgrade https://github.com/pytorch/pytorch/pull/158854. I missed it in https://github.com/pytorch/pytorch/pull/159276 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159578 Approved by: https://github.com/Skylion007	2025-08-01 00:10:59 +00:00
Sam Larsen	06395276e4	Remove dynamo_timed from the CachingAutotuner.coordinate_descent_tuning() hot path. (#159588 ) Summary: When coordinate_descent_tuning==True, CachingAutotuner.coordinate_descent_tuning() is called for every call of CachingAutotuner.run() (at least for Triton templates), but immediately returns the launcher. Move the dynamo_timed call after the check for triton template so we don't incur the context manager overhead on every call. Fixes https://github.com/pytorch/pytorch/issues/159525 Test Plan: Used the repro in https://github.com/pytorch/pytorch/issues/159525 to make sure the overhead goes away. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159588 Approved by: https://github.com/eellison	2025-07-31 23:33:10 +00:00
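The reordering described above, checking for the early return before entering the timing context manager, can be sketched like this. The `timed` context manager is a hypothetical stand-in for `dynamo_timed`, and `tune` is a toy function, not `CachingAutotuner.coordinate_descent_tuning()`.

```python
import contextlib
import time

calls = {"timed": 0}  # counts how often the timing context manager is entered


@contextlib.contextmanager
def timed(name):
    """Hypothetical stand-in for a timing context manager."""
    calls["timed"] += 1
    start = time.perf_counter()
    try:
        yield
    finally:
        _elapsed = time.perf_counter() - start  # a real timer would record this


def tune(launcher, is_template):
    """Return early on the hot path so no context manager overhead is paid."""
    if is_template:
        return launcher  # fast path: nothing to tune, nothing to time
    with timed("coordinate_descent_tuning"):
        return launcher + "-tuned"


print(tune("kernel", True), calls["timed"])
print(tune("kernel", False), calls["timed"])
```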
Rob Timpe	8becf646ef	[dynamo] Make filter handle None as filter function (#159500 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159500 Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519 ghstack dependencies: #158774, #159102	2025-07-31 23:28:57 +00:00
Rob Timpe	fa68216ca1	[itertools] Implement itertools.cycle with a polyfill (#159102 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159102 Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519 ghstack dependencies: #158774	2025-07-31 23:28:57 +00:00
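As background for the polyfill mentioned above, this is a pure-Python equivalent of `itertools.cycle`, following the rough equivalent given in the Python documentation. It is a sketch and not necessarily the polyfill that the PR adds to Dynamo.

```python
import itertools


def cycle(iterable):
    """Yield items from iterable, then repeat the saved items indefinitely."""
    saved = []
    for item in iterable:
        yield item
        saved.append(item)
    while saved:  # an empty input terminates instead of looping forever
        for item in saved:
            yield item


print(list(itertools.islice(cycle("abc"), 7)))
```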
angelayi	25ef3d315d	[aoti][mps] Dynamic reductions (#159355 ) Dynamic kernel: ```cpp [[max_total_threads_per_threadgroup(1024)]] kernel void generated_kernel( device float* out_ptr0, constant float* in_ptr0, constant long& r0_numel, uint2 thread_pos [[thread_position_in_grid]], uint2 group_pos [[thread_position_in_threadgroup]] ) { auto xindex = thread_pos.x; auto r0_index = thread_pos.y; int x0 = xindex; threadgroup float tmp_acc_0[32]; float tmp_acc_1 = 0; for(auto r0_1_cnt = 0; r0_1_cnt < static_cast<int>(metal::floor(static_cast<float>(0.99902343750000000 + 0.00097656250000000000*r0_numel))); ++r0_1_cnt) { int r0_1 = 1024 * r0_1_cnt + r0_index; if (r0_1 >= r0_numel) break; auto tmp0 = in_ptr0[x0 + 5*r0_1]; tmp_acc_1 += tmp0; } auto tmp1 = c10::metal::threadgroup_sum(tmp_acc_0, tmp_acc_1, r0_index * 1, metal::min(static_cast<decltype(1024+r0_numel)>(1024), static_cast<decltype(1024+r0_numel)>(r0_numel))); if (r0_index == 0) out_ptr0[x0] = static_cast<float>(tmp1); } void AOTInductorModel::run_impl(...) { ... 
auto arg0_1_size = arg0_1.sizes(); int64_t s77 = arg0_1_size[0]; inputs.clear(); [[maybe_unused]] auto& kernels = static_cast<AOTInductorModelKernels&>(this->kernels_.get()); static constexpr int64_t int_array_0[] = {5LL, }; static constexpr int64_t int_array_1[] = {1LL, }; AtenTensorHandle buf0_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_float32, cached_torch_device_type_mps, this->device_idx_, &buf0_handle)); RAIIAtenTensorHandle buf0(buf0_handle); auto mps_lib_0_func = mps_lib_0.getKernelFunction("generated_kernel"); auto mps_lib_0_func_handle = AOTIMetalKernelFunctionHandle(mps_lib_0_func.get()); mps_lib_0_func->runCommandBlock([&] { mps_lib_0_func->startEncoding(); aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 0, buf0); aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 1, arg0_1); aoti_torch_mps_set_arg_int(mps_lib_0_func_handle, 2, s77); mps_lib_0_func->dispatch({static_cast<uint64_t>(5LL), static_cast<uint64_t>(std::min(static_cast<int64_t>(1024LL), static_cast<int64_t>(s77)))}, {static_cast<uint64_t>(1), static_cast<uint64_t>(std::min(static_cast<int64_t>(1024LL), static_cast<int64_t>(s77)))}); }); arg0_1.reset(); output_handles[0] = buf0.release(); } // AOTInductorModel::run_impl ``` Static kernel: ```cpp kernel void generated_kernel( device float* out_ptr0, constant float* in_ptr0, uint xindex [[thread_position_in_grid]] ) { int x0 = xindex; auto tmp0 = in_ptr0[x0]; auto tmp1 = in_ptr0[5 + x0]; auto tmp3 = in_ptr0[10 + x0]; auto tmp5 = in_ptr0[15 + x0]; auto tmp2 = tmp0 + tmp1; auto tmp4 = tmp2 + tmp3; auto tmp6 = tmp4 + tmp5; out_ptr0[x0] = static_cast<float>(tmp6); } void AOTInductorModel::run_impl(...) { ... 
static constexpr int64_t int_array_0[] = {5LL, }; static constexpr int64_t int_array_1[] = {1LL, }; AtenTensorHandle buf0_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_float32, cached_torch_device_type_mps, this->device_idx_, &buf0_handle)); RAIIAtenTensorHandle buf0(buf0_handle); auto mps_lib_0_func = mps_lib_0.getKernelFunction("generated_kernel"); auto mps_lib_0_func_handle = AOTIMetalKernelFunctionHandle(mps_lib_0_func.get()); mps_lib_0_func->runCommandBlock([&] { mps_lib_0_func->startEncoding(); aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 0, buf0); aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 1, arg0_1); mps_lib_0_func->dispatch({static_cast<uint64_t>(5LL)}); }); arg0_1.reset(); output_handles[0] = buf0.release(); } // AOTInductorModel::run_impl ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159355 Approved by: https://github.com/malfet	2025-07-31 23:15:02 +00:00
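Note on 25ef3d315d: the loop bound in the dynamic kernel, `floor(0.99902343750000000 + 0.00097656250000000000*r0_numel)`, appears to be a ceiling division by 1024, since the two constants equal 1023/1024 and 1/1024. The sketch below checks that identity with exact arithmetic. It assumes the bound multiplies the second constant by `r0_numel`; `trip_count`, `ceil_div`, and `BLOCK` are illustrative names, not part of the generated code.

```python
import math
from fractions import Fraction

BLOCK = 1024  # reduction-axis threads per threadgroup, taken from the kernel attribute


def trip_count(r0_numel: int) -> int:
    """Loop bound as written in the kernel, evaluated with exact rationals."""
    return math.floor(Fraction(1023, 1024) + Fraction(1, 1024) * r0_numel)


def ceil_div(n: int, d: int) -> int:
    """Integer ceiling division for non-negative n and positive d."""
    return -(-n // d)


# Collect every reduction length where the two formulas disagree.
mismatches = [n for n in range(0, 10_000) if trip_count(n) != ceil_div(n, BLOCK)]
print(mismatches)
```

The kernel evaluates the bound in `float`, so for very large `r0_numel` rounding could differ from this exact version; the sketch only covers the integer identity.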
Xu Han	7e00f2ec9d	[AOTI] add zero size consts asm handler (#159225 ) Add `get_zero_consts_asm_code` to emit zero-size consts to an object file. This function handles the zero-size consts case, because the C++ standard does not allow zero-size arrays: https://stackoverflow.com/questions/9722632/what-happens-if-i-define-a-0-size-array-in-c-c 1. On Windows, MSVC reports error C2466: https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2466?view=msvc-170 So we can use the assembler to handle this case. 2. On Windows, why not use Win32 asm for all paths? Because ml64 only supports alignment up to `16`, which does not match PyTorch's `64`. Reference: https://learn.microsoft.com/en-us/cpp/assembler/masm/ml-and-ml64-command-line-reference?view=msvc-170 ``` Packs structures on the specified byte boundary. The alignment can be 1, 2, 4, 8, or 16. ``` 3. The function handles the zero-size case on both Windows and Linux: A. On Linux, we added `-pedantic` to disallow zero-size arrays in the C++ compiler. `8e07c9870d/torch/_inductor/cpp_builder.py (L580)` B. On Windows, MSVC does not support zero-size arrays by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159225 Approved by: https://github.com/desertfire	2025-07-31 22:46:33 +00:00
PyTorch MergeBot	490cb3f1a4	Revert "[inductor] Add logging for distributed collective ops for multi‑rank diagnostics (#159190 )" This reverts commit bb62e1f769ef51e2ec149d7256c135d09425aaa0. Reverted https://github.com/pytorch/pytorch/pull/159190 on behalf of https://github.com/clee2000 due to broke [GH job link](https://github.com/pytorch/pytorch/actions/runs/16658705097/job/47150840171) [HUD commit link](`bb62e1f769`) on mac ([comment](https://github.com/pytorch/pytorch/pull/159190#issuecomment-3141513921))	2025-07-31 22:22:13 +00:00
Jane Xu	b95cf5c91d	Move complex to headeronly (#159411 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159411 Approved by: https://github.com/albanD ghstack dependencies: #159415	2025-07-31 22:05:43 +00:00
Jane Xu	5e2ef2a465	Move Float8 variations to headeronly (#159415 ) This PR is a big copy pasta from `c10/util/Float8*` -> `torch/headeronly/util/` which is why we are breaking PR sanity :C (sorry @albanD!). Why is it not a clean copy paste? - For BC reasons, we have to keep the old c10 file around so that OSS devs relying on those files can still get the same APIs - Because we reexpose APIs that are headeronly through torch::headeronly, so there is an extra chunk of code in the new torch::headeronly files to do that. Outside of the copy paste, I: - changed the tests to call torch::headeronly instead of c10 - updated header_only_apis.txt - added `// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)` to pass lint (which was previously skipped for -inl.h files) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159415 Approved by: https://github.com/albanD	2025-07-31 22:05:43 +00:00
zpcore	9f753f8c0d	[DTensor] Improve `sort` strategy (#159189 ) - Sort strategy now supports sharding on non sorted dim. ~~- Fix histc xfail.~~ - ~~Previously `python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32` will fail with `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18`. However, if we run `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18 python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32`, the test will pass. This kind of error is due to DTensor reuses the strategy schema hashing. It turns out that not only the strategy, the result correctness also depends on `static_argnum` or the op will reuse the previous args from hashed schema and output wrong results. I updated the document also.~~ (fixed in https://github.com/pytorch/pytorch/pull/159289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159189 Approved by: https://github.com/XilunWu	2025-07-31 21:52:42 +00:00
Jane Xu	db437690d1	Add myself as a reviewer for when someone touches headeronly or stable (#159583 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159583 Approved by: https://github.com/mikaylagawarecki	2025-07-31 21:30:05 +00:00
Simon Fan	669009bcd1	[inductor] respect layout tags for ops with registered lowerings (#159134 ) scaled_grouped_mm's kernel only supports column-major on the second operand. I -think- this is just for efficiency reasons. But inductor treats that buffer as flexible and may tweak the strides to be row-major instead, as seen in the issue. ~Tagging the op as "needs_fixed_stride_order"/"needs_exact_strides" does not work. Inductor only considers those tags for ops that don't have registered lowering (not sure if this is intended). scaled_grouped_mm does have a lowering, so we never check its tags.~ From discussion below, the op tags are expected to work. FIXES https://github.com/pytorch/pytorch/issues/159097 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159134 Approved by: https://github.com/eellison	2025-07-31 21:29:40 +00:00
Svetlana Karslioglu	e4e2701429	Add the RunLLM widget to the website (#152055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152055 Approved by: https://github.com/albanD	2025-07-31 20:53:53 +00:00
Rob Timpe	64cc649275	[itertools] Fix accumulate (#158774 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158774 Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519	2025-07-31 20:32:02 +00:00
PyTorch MergeBot	b1fb552974	Revert "Fix ep deepcopy when there is python builitin name (#159478 )" This reverts commit de7376537f2a11783169fee2b3bc276d266898bf. Reverted https://github.com/pytorch/pytorch/pull/159478 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/159478#issuecomment-3141228423))	2025-07-31 20:20:53 +00:00
Sandeep Narendranath Karjala	bb62e1f769	[inductor] Add logging for distributed collective ops for multi‑rank diagnostics (#159190 ) This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs. - Iterates over scheduler.nodes, filters for _CollectiveKernel nodes - Extracts each op’s python_kernel_name - Emits a structured JSON payload under the inductor_collective_schedule artifact name - Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact - Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops). Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190 Approved by: https://github.com/yushangdi, https://github.com/xmfan	2025-07-31 19:58:07 +00:00
Dylan Maloy	327e2ca580	[ez] get rid of unused var (#159571 ) Summary: att Test Plan: ci Rollback Plan: Differential Revision: D79320299 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159571 Approved by: https://github.com/houseroad, https://github.com/georgiaphillips	2025-07-31 19:11:57 +00:00
Neil Tenenholtz	1ebcba4e1b	Fix typo in link to torch memory_viz tool (#159214 ) Fixes a small typo in the torch_cuda_memory docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/159214 Approved by: https://github.com/yewentao256, https://github.com/HDCharles, https://github.com/Skylion007	2025-07-31 18:50:54 +00:00
Divyansh Khanna	5f7eae697d	Deprecate DataLoader pin_memory_device param (#158323 ) Build on top of https://github.com/pytorch/pytorch/pull/146821 - Moves enabling pin_memory back inside `_BaseDataLoaderIter` - This is required for `StatefulDataloader` which leveraged `_BaseDataLoaderIter` directly and not the `Dataloader` class init - Add a simple test for CPU only env where setting `pin_memory=True` is a no-op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158323 Approved by: https://github.com/ramanishsingh Co-authored-by: zeshengzong <zesheng.zong@outlook.com>	2025-07-31 18:42:07 +00:00
Sherlock Huang	c1722db0f7	[NativeRT] Make VariadicOpConverter and FuseListUnpackConverter for cpu nodes only (#159519 ) Summary: VariadicOpConverter and FuseListUnpackConverter would introduce ops that only have CPU kernels. Currently, the graph passes are run if static_dispatch is enabled. As we plan to enable static_dispatch by default, this diff adds an additional check so that the graph passes only work on nodes whose inputs and outputs are all on CPU. Test Plan: CI Rollback Plan: Differential Revision: D79295640 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159519 Approved by: https://github.com/dolpm, https://github.com/henryoier	2025-07-31 18:17:21 +00:00
PyTorch MergeBot	8a233d6000	Revert "[ContextParallel][FlexAttention] Prototype of supporting FlexAttention in Context Parallel (#158692 )" This reverts commit 07fad04181321d18963b71e9566d44f86a25c9f7. Reverted https://github.com/pytorch/pytorch/pull/158692 on behalf of https://github.com/yangw-dev due to failed some internal testapf.metrics.tests.generate_graph_def_test.GenerateGraphDefTest: test_aps_generate_inference_graph_def_with_justknobs1) AssertionError: Expected 'check' to be called once. Called 3 times., please fix the internal test and reland it ([comment](https://github.com/pytorch/pytorch/pull/158692#issuecomment-3140873894))	2025-07-31 18:00:30 +00:00
Aleksandar Samardžić	bf3ebd7ad4	Fix grouped MM load along K when TMA loads are not used (#159485 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159485 Approved by: https://github.com/ngimel	2025-07-31 17:58:02 +00:00
PyTorch MergeBot	c07bb277a0	Revert "fix strategy hashing arg mismatch (#159506 )" This reverts commit 3a556762002ec0027b2120a7e6675182c0e50dbd. Reverted https://github.com/pytorch/pytorch/pull/159506 on behalf of https://github.com/yangw-dev due to failed the internal tests test_get_bwd_hook (torch.equal(output * 2, input_tensor.grad)) ([comment](https://github.com/pytorch/pytorch/pull/159506#issuecomment-3140858905))	2025-07-31 17:54:29 +00:00
Markus Hoehnerbach	f89c28cc6b	[inductor] add lowering for repeat_interleave.Tensor with output size specified (#147160 ) (#158462 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158462 Approved by: https://github.com/eellison	2025-07-31 17:00:32 +00:00
Pian Pawakapan	8fedcfa59a	[export] _ccode for PythonMod (#158851 ) Summary: Adds ccode impl to PythonMod Test Plan: test_export Rollback Plan: Differential Revision: D76463347 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158851 Approved by: https://github.com/kalpit-meta-1	2025-07-31 16:46:51 +00:00
henrylhtsang	6662a76f59	[cutlass backend] Fix EVT tests post buf name change (#159541 ) Differential Revision: [D79317791](https://our.internmc.facebook.com/intern/diff/D79317791/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159541 Approved by: https://github.com/mlazos	2025-07-31 16:39:49 +00:00
eqy	05aade1b6d	[CUDA] Add `serialTest` decorator to `largeTensorTest` in `test_cuda.py` (#159271 ) Hopefully helps with disabled tests due to OOM such as https://github.com/pytorch/pytorch/issues/159069 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159271 Approved by: https://github.com/Skylion007, https://github.com/ngimel	2025-07-31 16:27:16 +00:00
Nikita Shulga	f946b25865	[MPS] Speedup `argmax`/`argmin` (#159524 ) By using efficient `threadgroup_arg[max|min]` primitives. - Fixed a bug in `simd_argmax` where the result of `simd_ballot` was prematurely cast to `ushort`, and adjusted the unit test - Fixed nan handling in compiled argmax, but can't reliably test it as the MPS (eager) implementation of argmax is buggy Now according to `bench_mps_ops.py`, `max(x, dim=0)` is reliably faster than the eager implementation: ```
[------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
                      | eager-512x512 | compile-512x512 | eager-1024x1024 | compile-1024x1024 | eager-2048x2048 | compile-2048x2048 | eager-4096x4096 | compile-4096x4096
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  max (torch.float16) |     285.8     |      272.2      |      422.3      |       354.5       |      721.6      |       683.5       |     2224.0      |      1979.1
  max (torch.float32) |     300.2     |      267.0      |      389.6      |       342.5       |      769.4      |       682.6       |     2995.7      |      2609.8
  max (torch.int32)   |     299.6     |      275.4      |      390.0      |       361.7       |      758.7      |       686.1       |     3103.4      |      2646.5
  max (torch.int64)   |     297.5     |      275.5      |      417.0      |       382.1       |      856.1      |       722.6       |     5467.7      |      3156.8
``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159524 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #158990	2025-07-31 16:18:32 +00:00
Bin Bao	d2e02585b8	[AOTI] Explicitly delete wait_tensor returned tensor (#159502 ) Summary: In the Python wrapper codegen, the returned tensor from wait_tensor is not assigned or used anywhere, because wait_tensor always returns its input, see more discussion in https://github.com/pytorch/pytorch/issues/126773. Similarly, we should just immediately delete the returned tensor handle from aoti_torch_cpu__c10d_functional_wait_tensor in the cpp wrapper codegen, otherwise it may cause tensor's lifetime expansion and even cause OOM in some cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159502 Approved by: https://github.com/yushangdi, https://github.com/jingsh ghstack dependencies: #159476, #159487	2025-07-31 15:33:36 +00:00
Bin Bao	3dd7ebf418	[BE] Fix buf name mismatch in test_c10d_functional_native.py (#159487 ) Summary: test_c10d_functional_native.py uses hard-coded buf names to check the generated code string. This is fragile given that Inductor can update its buffer naming implementation freely. Thus this PR uses name regex matching to find buffer names at the run time. This will solve issues like https://github.com/pytorch/pytorch/issues/147754. Currently we do name matching based on empty_strided_ calls. We can expand it later if needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159487 Approved by: https://github.com/yushangdi ghstack dependencies: #159476	2025-07-31 15:33:36 +00:00
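Note on 3dd7ebf418: the idea of discovering buffer names at run time instead of hard-coding them can be sketched as below. This is not the test's actual helper; the `generated` string is a made-up stand-in for Inductor output, and `BUF_RE` is a hypothetical pattern keyed on `empty_strided_` calls as the commit message describes.

```python
import re

# Hypothetical snippet of generated wrapper code; real Inductor output differs.
generated = """
buf0 = empty_strided_cuda((4, 4), (4, 1), torch.float32)
buf7 = empty_strided_cpu((8,), (1,), torch.float32)
torch.ops._c10d_functional.wait_tensor.default(buf0)
"""

# Capture whatever name is assigned from an empty_strided_* call.
BUF_RE = re.compile(r"^\s*(\w+)\s*=\s*empty_strided_\w+\(", re.MULTILINE)

buf_names = BUF_RE.findall(generated)
print(buf_names)

# Build the expected string from the discovered name rather than a literal "buf0".
expected = f"wait_tensor.default({buf_names[0]})"
found = expected in generated
```

Because the expected string is assembled from the captured name, the check keeps passing if the buffer numbering scheme changes.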
Bin Bao	8273ee0646	[BE] Fix global config leak in test_c10d_functional_native.py (#159476 ) Summary: test_c10d_functional_native.py tests torch._inductor.config.cpp_wrapper as True and False. Currently torch._inductor.config.cpp_wrapper is set globally which can cause a problem when running the whole test file. This PR changes it to use patch context. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159476 Approved by: https://github.com/yushangdi	2025-07-31 15:33:36 +00:00
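Note on 8273ee0646: the difference between setting a config flag globally and patching it inside a context can be illustrated with the standard library alone. The `config` object below is a stand-in, not `torch._inductor.config`, and `run_with_cpp_wrapper` is a hypothetical helper; the commit itself may use a different patching utility.

```python
from types import SimpleNamespace
from unittest import mock

# Stand-in for a module-level config object (hypothetical).
config = SimpleNamespace(cpp_wrapper=False)


def run_with_cpp_wrapper(flag: bool) -> bool:
    # patch.object restores the previous value on exit, even if the body raises,
    # so the setting cannot leak into tests that run afterwards.
    with mock.patch.object(config, "cpp_wrapper", flag):
        return config.cpp_wrapper


seen = [run_with_cpp_wrapper(True), run_with_cpp_wrapper(False)]
print(seen, config.cpp_wrapper)
```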
Jane Xu	c57382a493	Move BFloat16.h to headeronly (#159412 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159412 Approved by: https://github.com/desertfire	2025-07-31 15:29:17 +00:00
Ruben Rodriguez Buchillon	e7cc42df58	[inductor] consolidate common GEMM triton param retrieval (#159383 ) # Why - Make loop iteration simpler - Have a common spot where to make modifications that affect all the GEMM Triton templates, avoiding missed spots # What - pull out the common logic of taking the BaseConfig objects and turning them into kwargs to feed into maybe_append_choice for Triton GEMM templates Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383 Approved by: https://github.com/jansel	2025-07-31 13:05:04 +00:00
cyy	72c69e731f	set MSVC debug information only on debug builds (#159533 ) Fixes: https://github.com/pytorch/pytorch/issues/159515 To reduce the binary size increment in release builds by removing debug information. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159533 Approved by: https://github.com/atalman	2025-07-31 12:57:33 +00:00
Tom Ritchford	78b9dea754	[inductor] Fix set_linter's handling of f-strings for Python 3.12 and up (fix #159056 ) (#159252 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159252 Approved by: https://github.com/Skylion007	2025-07-31 12:56:09 +00:00
LifengWang	838924436e	update the baseline for nightly max_autotune tests (#154973 ) Hi @desertfire, according to the latest test [results](https://github.com/pytorch/pytorch/actions/runs/15385952839) from the inductor nightly for max_autotune tests, we plan to update the baseline data. In the latest nightly test, two models require baseline updates: - vision_maskrcnn: This model shows improved graph breaks, so I’ve updated the baseline accordingly. - detectron2_fcos_r_50_fpn: This model has a different number of graph breaks. However, since its accuracy result still shows fail_accuracy, I skipped the graph break check for this model. ``` vision_maskrcnn IMPROVED: graph_breaks=29, expected=30 Improvement: 1 models have fixed dynamo graph breaks: vision_maskrcnn ``` ``` detectron2_fcos_r_50_fpn XFAIL detectron2_fcos_r_50_fpn FAIL: graph_breaks=24, expected=22 Error: 1 models have new dynamo graph breaks: detectron2_fcos_r_50_fpn ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/154973 Approved by: https://github.com/desertfire	2025-07-31 11:38:55 +00:00
xinan.lin	2ffb510942	[Break XPU][Indutor UT] Fix failures introduced by community. (#159463 ) Fixes #159000, Fixes #159335, Fixes #159334, Fixes #159332, Fixes #159331, Fixes #159330 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159463 Approved by: https://github.com/jansel	2025-07-31 08:37:41 +00:00
Michael Lazos	20b5f694f8	[Dynamo] Make frozen dataclasses hashable (#159529 ) Fixes https://github.com/pytorch/pytorch/issues/159424 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159529 Approved by: https://github.com/oulgen ghstack dependencies: #159513	2025-07-31 07:03:01 +00:00
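Note on 20b5f694f8: the plain-Python behavior that traced code presumably needs to match is that a dataclass declared with `frozen=True` (and the default `eq=True`) gets a generated `__hash__`, while a non-frozen one does not. The sketch shows only this eager behavior; `Shape` and `MutableShape` are illustrative classes, and nothing here exercises Dynamo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    rows: int
    cols: int


@dataclass  # eq=True without frozen=True sets __hash__ to None
class MutableShape:
    rows: int
    cols: int


a, b = Shape(2, 3), Shape(2, 3)
lookup = {a: "cached"}
print(lookup[b])  # instances with equal fields hash equally

try:
    hash(MutableShape(2, 3))
    mutable_hashable = True
except TypeError:
    mutable_hashable = False
```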
Michael Lazos	447e300d55	[Dynamo] Frozen dataclass attr access test (#159513 ) Verifies https://github.com/pytorch/pytorch/issues/159424, but perhaps the issue is not fixed yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159513 Approved by: https://github.com/oulgen	2025-07-31 07:03:01 +00:00
Pian Pawakapan	5b2ad9279c	[draft export] logging (#159004 ) Summary: adds logging for draft export Test Plan: loggercli stage actualize-stage TorchDraftExportUsageLoggerConfig Rollback Plan: Differential Revision: D78308105 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159004 Approved by: https://github.com/angelayi	2025-07-31 05:52:13 +00:00
Georgia Phillips	78d7f0cdec	disable execution frame cleanup (#159531 ) Summary: Want to disable execution frame cleanup until fix in D78621408 is merged Test Plan: CI Rollback Plan: Differential Revision: D79306602 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159531 Approved by: https://github.com/SherlockNoMad	2025-07-31 05:02:36 +00:00
Xu Han	d5c719ec3c	[inductor] fix open temp file failed on Windows. (#159342 ) Fix the failure to open a temp file on Windows. Error message (screenshot): https://github.com/user-attachments/assets/e4a6f438-cb06-44c6-959b-0a6a49d2f44f There are two options to fix this issue: https://stackoverflow.com/questions/66744497/python-tempfile-namedtemporaryfile-cant-use-generated-tempfile 1. `tempfile.NamedTemporaryFile` must be created with `delete=False` on Windows 2. Use `WritableTempFile` to handle this case on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159342 Approved by: https://github.com/jansel	2025-07-31 04:58:02 +00:00
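Note on d5c719ec3c: the first option (`delete=False`) can be sketched with the standard library as follows. This is a generic pattern and not the code from the PR; `write_then_reopen` is a hypothetical helper, and the caller takes over responsibility for removing the file.

```python
import os
import tempfile


def write_then_reopen(payload: str) -> str:
    # delete=False keeps the file on disk after close(), so it can be reopened
    # by name. On Windows, a NamedTemporaryFile that is still open generally
    # cannot be opened a second time.
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
    try:
        tmp.write(payload)
        tmp.close()  # release the handle before reopening by name
        with open(tmp.name) as f:
            return f.read()
    finally:
        tmp.close()  # no-op if already closed
        os.unlink(tmp.name)  # manual cleanup, since delete=False


print(write_then_reopen("hello"))
```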
yewentao256	c44efc3755	[Refactor] Fix Compile Warning: `possibly dangling reference to a temporary` (#159517 ) ```bash
DEBUG pytorch/torch/csrc/dynamo/compiled_autograd.h:1388:25: warning: possibly dangling reference to a temporary [-Wdangling-reference]
DEBUG  1388 | for (const at::IValue& elt : lst) {
DEBUG       | ^~~
DEBUG pytorch/torch/csrc/dynamo/compiled_autograd.h:1388:1: note: the temporary was destroyed at the end of the full expression ‘__for_begin .c10::impl::ListIterator<c10::IValue, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator*().c10::impl::ListElementReference<c10::IValue, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator std::conditional_t<true, const c10::IValue&, c10::IValue>()’
DEBUG  1388 | for (const at::IValue& elt : lst) {
DEBUG       | ^
``` This PR fixes this warning Pull Request resolved: https://github.com/pytorch/pytorch/pull/159517 Approved by: https://github.com/xmfan	2025-07-31 04:49:43 +00:00
Boyuan Feng	6b9473469f	[Graph Partition] add log for graph partition reasons and #partitions (#159425 ) Previously, we log `skipping cudagraphs due to [xxx reasons]` when there are cudagraph-unsafe ops. With graph partition, we will split off these ops and cudagraph remaining parts. But the log message is also skipped. In this PR, we add logs for graph partition reasons and the number of partitions to better understand the workload. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159425 Approved by: https://github.com/eellison	2025-07-31 04:21:06 +00:00
Natalia Gimelshein	7a4167a164	support fabric handles with symmetric memory (#159319 ) enable fabric handles for symmetric memory Enables handle exchange via CU_MEM_HANDLE_TYPE_FABRIC on the systems that support it. This is needed to enable symmetric memory on NVLS72 systems. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159319 Approved by: https://github.com/malfet, https://github.com/kwen2501	2025-07-31 04:16:20 +00:00
PyTorch UpdateBot	8e67a6ae89	[vllm hash update] update the pinned vllm hash (#159320 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159320 Approved by: https://github.com/pytorchbot	2025-07-31 04:08:14 +00:00
Animesh Jain	c68ad1bd6a	[dynamo][guards] Always record user.stack for informative tlparse guards (#159526 ) Before (screenshot): https://github.com/user-attachments/assets/4ddb11b2-dec8-4010-a28d-63b3cd4a7929 After (screenshot): https://github.com/user-attachments/assets/8aafc5be-92cd-4468-bb8f-ad966de8c717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159526 Approved by: https://github.com/Lucaskabela	2025-07-31 03:18:33 +00:00
PyTorch MergeBot	3e5e094615	Revert "Fix large_tensor_test skipping cpu (#158617 )" This reverts commit debc0591b888f211bfe846bdc7cfa0626a5f6f6a. Reverted https://github.com/pytorch/pytorch/pull/158617 on behalf of https://github.com/ZainRizvi due to Sorry but this seems to be breaking trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/16631113381/job/47062415099) [HUD commit link](`debc0591b8`) ([comment](https://github.com/pytorch/pytorch/pull/158617#issuecomment-3138387762))	2025-07-31 02:57:22 +00:00
clr	c65efc8ea1	torch.compile: Record a pt2_compile_event for combo kernels (#159306 ) This is off by default, but some jobs have it on. Having this show up in perfetto and be globally queryable would be useful to see how expensive this is. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159306 Approved by: https://github.com/masnesral	2025-07-31 02:51:38 +00:00
Animesh Jain	a9049413e2	[dynamo] Turn on recursive dict tag optimization (#159186 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159186 Approved by: https://github.com/jansel	2025-07-31 02:36:37 +00:00
ILCSFNO	d7a5ec9355	Fix the Doc of `padding` in `avg_poolnd` (#159142 ) Fixes #159141 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159142 Approved by: https://github.com/mikaylagawarecki	2025-07-31 02:02:48 +00:00
Sam Larsen	2c46922ce4	Fix rand_like decomposition to preserve strides (#159294 ) Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898. Test Plan: New unit test (fails before this PR; but fixed after) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294 Approved by: https://github.com/eellison	2025-07-31 01:36:50 +00:00
wengshiy	668d414ae7	[CPU] Fix bias dtype issue for FP8 qlinear (#159125 ) Fixes `RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Float` With bf16 autocast, the bias is converted into BFloat16, but fp8_qlinear_onednn_ref does not support a bf16 bias. This PR converts the bias into bf16 in fp8_qlinear_onednn_ref and adds this case to the unit test. To reproduce: `python test/test_quantization.py -k test_qlinear_fp8` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159125 Approved by: https://github.com/Xia-Weiwen, https://github.com/cyyever, https://github.com/CaoE	2025-07-31 01:26:45 +00:00
Nick Riasanovsky	4541509237	[Triton] [Inductor] Fix an incorrect descriptor (#159407 ) Summary: Fixes a clear template typo where `a_desc_ptr` was passed instead of `b_desc_ptr` to define `b_desc`. Test Plan: Found by inspection. Rollback Plan: Reviewed By: NoamPaz Differential Revision: D79178538 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159407 Approved by: https://github.com/NikhilAPatel	2025-07-31 00:34:19 +00:00
eellison	6c7f88c2c9	Check addmm dtypes (#159509 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159509 Approved by: https://github.com/eqy	2025-07-31 00:15:46 +00:00
Chris Thi	c400c8e2e0	[ROCm] Add FP8 rowwise support to _scaled_grouped_mm + Submodule update (#159075 ) Summary: In this PR we integrate the [FBGEMM AMD FP8 rowwise scaling grouped GEMM kernel](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped) to add support for the `_scaled_grouped_mm` API on AMD. `_scaled_grouped_mm` is [currently supported on Nvidia](`9faef3d17c/aten/src/ATen/native/cuda/Blas.cpp (L1614)`), this PR aims to bring parity to AMD. Related: [[RFC]: PyTorch Low-Precision GEMMs Public API](https://github.com/pytorch/pytorch/issues/157950#top) #157950. The kernel is developed using the Composable Kernel framework. Only MI300X is currently supported. In the near future we plan to add support for MI350X as well. For data types we support FP8 e4m3. The kernel support will be gated with the `USE_FBGEMM_GENAI` flag. We hope to enable this by default for relevant AMD builds. Note we also update submodule `third_party/fbgemm` to 0adf62831 for the required updates from fbgemm. 
Test Plan: Hipify & build ```
python tools/amd_build/build_amd.py
USE_FBGEMM_GENAI=1 python setup.py develop
``` Unit tests ```
python test/test_matmul_cuda.py -- TestFP8MatmulCUDA
Ran 488 tests in 32.969s
OK (skipped=454)
``` Performance Sample
| G | M | N | K | Runtime (ms) | GB/s | TFLOPS |
| -- | -- | -- | -- | -- | -- | -- |
| 128 | 1 | 2048 | 5120 | 0.37 | 3590 | 7.17 |
| 128 | 64 | 2048 | 5120 | 0.51 | 2792 | 338.34 |
| 128 | 128 | 2048 | 5120 | 0.66 | 2272 | 522.72 |
| 128 | 1 | 5120 | 1024 | 0.21 | 3224 | 6.43 |
| 128 | 64 | 5120 | 1024 | 0.29 | 2590 | 291.40 |
| 128 | 128 | 5120 | 1024 | 0.40 | 2165 | 434.76 |
| 128 | 1 | 4096 | 4096 | 0.69 | 3126 | 6.25 |
| 128 | 64 | 4096 | 4096 | 0.85 | 2655 | 324.66 |
| 128 | 128 | 4096 | 4096 | 1.10 | 2142 | 501.40 |
| 128 | 1 | 8192 | 8192 | 2.45 | 3508 | 7.01 |
| 128 | 64 | 8192 | 8192 | 3.27 | 2692 | 336.74 |
| 128 | 128 | 8192 | 8192 | 4.04 | 2224 | 543.76 |
| 16 | 1 | 2048 | 5120 | 0.04 | 3928 | 7.85 |
| 16 | 64 | 2048 | 5120 | 0.05 | 3295 | 399.29 |
| 16 | 128 | 2048 | 5120 | 0.07 | 2558 | 588.69 |
| 16 | 1 | 5120 | 1024 | 0.03 | 3119 | 6.23 |
| 16 | 64 | 5120 | 1024 | 0.03 | 2849 | 320.62 |
| 16 | 128 | 5120 | 1024 | 0.05 | 2013 | 404.11 |
| 16 | 1 | 4096 | 4096 | 0.06 | 4512 | 9.02 |
| 16 | 64 | 4096 | 4096 | 0.09 | 3124 | 381.95 |
| 16 | 128 | 4096 | 4096 | 0.13 | 2340 | 547.67 |
| 16 | 1 | 8192 | 8192 | 0.32 | 3374 | 6.75 |
| 16 | 64 | 8192 | 8192 | 0.42 | 2593 | 324.28 |
| 16 | 128 | 8192 | 8192 | 0.53 | 2120 | 518.36 |

- Using ROCm 6.4.1 - Collected through `triton.testing.do_bench_cudagraph` Binary size with gfx942 arch Before: 116103856 Jul 23 14:12 build/lib/libtorch_hip.so After: 118860960 Jul 23 14:29 build/lib/libtorch_hip.so The difference is 2757104 bytes (~2.6 MiB). 
Reviewers: @drisspg @ngimel @jwfromm @jeffdaily Pull Request resolved: https://github.com/pytorch/pytorch/pull/159075 Approved by: https://github.com/drisspg	2025-07-30 23:53:58 +00:00
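Note on c400c8e2e0: the binary-size delta and the TFLOPS column can be sanity-checked with a few lines of arithmetic. The size figures come from the commit message. The TFLOPS formula (2·G·M·N·K floating-point operations divided by runtime) is an assumption about how the column was derived, and `grouped_gemm_tflops` is a hypothetical helper; the listed runtimes are rounded, so only approximate agreement is expected.

```python
# libtorch_hip.so sizes in bytes, as listed in the commit message.
before, after = 116_103_856, 118_860_960
delta = after - before
delta_mib = delta / 2**20
print(delta, round(delta_mib, 2))


def grouped_gemm_tflops(g: int, m: int, n: int, k: int, runtime_ms: float) -> float:
    """Assumed metric: 2*G*M*N*K floating-point operations per second, in units of 1e12."""
    flops = 2 * g * m * n * k
    return flops / (runtime_ms * 1e-3) / 1e12


# Row G=128, M=128, N=8192, K=8192, 4.04 ms (listed as 543.76 TFLOPS).
estimate = grouped_gemm_tflops(128, 128, 8192, 8192, 4.04)
print(round(estimate, 1))
```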
Eddie Yan	25c3a7e317	[CUDA][CUDA Graphs] Move cuda graphs test to subprocess to avoid polluting mempool tests (#159305 ) Otherwise mempool test will fail as the previous graph capture failed but doesn't have its state in the caching allocator fully cleaned up. See also #159301 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159305 Approved by: https://github.com/eellison, https://github.com/BoyuanFeng, https://github.com/naromero77amd	2025-07-30 23:31:38 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	de7376537f	Fix ep deepcopy when there is python builitin name (#159478 ) Summary: title Test Plan: CI Rollback Plan: Differential Revision: D79261007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159478 Approved by: https://github.com/pianpwk	2025-07-30 23:14:31 +00:00
Shangdi Yu	fd2c64e286	Fix duplicated sources in inductor provenance tracking (#159484 ) Summary: The `replace_hook` is called once for each user of the replaced node. This fix avoids adding duplicated node sources. This also means that if there are two nested passes like: ``` with GraphTransformObserver(gm, "outer"): with GraphTransformObserver(gm, "inner"): ..... ``` We'll only see the outer pass's name recorded for the replaced node in the "from_node" node meta. I think this is fine. In practice, the outer pass usually has a more meaningful name, e.g. `decompose_auto_functionalized`, and the inner pass name is just a default pass name like `pattern_matcher`. Test Plan: ``` buck2 run @mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer_replace ``` Rollback Plan: Differential Revision: D79203058 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159484 Approved by: https://github.com/angelayi	2025-07-30 23:03:11 +00:00
Lucas Kabela	2b1ae29960	[Dynamo][Better Engineering] Add typing annotations to guard and source (#158397 ) (#159491 ) Summary: X-link: https://github.com/pytorch/executorch/pull/12986 As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py` Running ``` mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Unannotated \| Lines Total \| % lines covered \| Funcs Unannotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 1227 \| 2208 \| 55.57% \| 207 \| 362 \| 57.18% \| \| This PR \| 2217 \| 2217 \| 100.00% \| 362 \| 362 \| 100.00% \| \| Delta \| +990 \| +9 \| +44.43% \| +155 \| 0 \| +42.82% \| cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 jerryzh168 voznesenskym penguinwu EikanWang Guobing-Chen zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben Test Plan: Imported from GitHub, without a `Test Plan:` line. Rollback Plan: Reviewed By: JacobSzwejbka, yangw-dev Differential Revision: D79199389 Pulled By: Lucaskabela Pull Request resolved: https://github.com/pytorch/pytorch/pull/159491 Approved by: https://github.com/anijain2305, https://github.com/yangw-dev	2025-07-30 22:57:50 +00:00
Nikita Shulga	1293405c8d	[MPS] Add `simd_[arg][max\|min]` (#158990 ) And add eager tests for those. Re-implement `threadgroup_[max\|min]` using those function as they are significantly faster (though much slower than eager, due to the arg part) than before, which could be verified by running the following script ```python import itertools import timeit import torch from torch.utils.benchmark import Compare, Measurement, Timer def bench_unary_op(func, x, label) -> Measurement: sync_cmd = "torch.mps.synchronize()" if "mps" in str(x.device) else "" t = Timer( stmt=f"f(x);{sync_cmd}", globals={"f": func, "x": x}, language="python", timer=timeit.default_timer, sub_label=f"{func.__name__} ({str(x.dtype)})", description=label, env=torch.__version__, ) return t.blocked_autorange() def bench_reduction( reduction_func, device: str = "mps", dtype: torch.dtype = torch.float32 ) -> list[Measurement]: rc = [] # Bench 2D with reduction over dim=0 def f(t): return reduction_func(t, dim=0)[0] f.__name__ = reduction_func.__name__ f_c = torch.compile(f, dynamic=False, fullgraph=True) for size in (512, 1024, 2048, 4096): x = torch.testing.make_tensor(size, size, device=device, dtype=dtype) rc_c, rc_e = f(x), f_c(x) rc_c, rc_e = (rc_c[0], rc_e[0]) if isinstance(rc_c, tuple) else (rc_c, rc_e) rc.append(bench_unary_op(f, x, f"eager-{size}x{size}")) rc.append(bench_unary_op(f_c, x, f"compile-{size}x{size}")) return rc def main() -> None: #dtypes = [torch.float16, torch.float32, torch.bfloat16, torch.int32, torch.int64] dtypes = [torch.float32, torch.int32, torch.int64] # Profile reduction ops rc = [] for op, dtype in itertools.product([torch.max], dtypes): rc.extend(bench_reduction(op, dtype=dtype)) Compare(rc).print() if __name__ == "__main__": torch._dynamo.config.cache_size_limit = 2**16 main() ``` Produces the following table before ``` [--------------------------------------------------------------------------------------------- 
--------------------------------------------------------------------------------------------] \| eager-512x512 \| compile-512x512 \| eager-1024x1024 \| compile-1024x1024 \| eager-2048x2048 \| compile-2048x2048 \| eager-4096x4096 \| compile-4096x4096 1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- max (torch.float32) \| 297.3 \| 531.6 \| 394.1 \| 2550.5 \| 773.0 \| 4904.7 \| 3647.2 \| 9682.0 max (torch.int32) \| 297.8 \| 359.2 \| 387.7 \| 1179.4 \| 768.2 \| 2175.0 \| 3677.1 \| 4495.9 max (torch.int64) \| 278.7 \| 541.4 \| 410.2 \| 2873.3 \| 858.9 \| 5620.4 \| 6107.2 \| 11176.1 Times are in microseconds (us). ``` And after ``` [--------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------] \| eager-512x512 \| compile-512x512 \| eager-1024x1024 \| compile-1024x1024 \| eager-2048x2048 \| compile-2048x2048 \| eager-4096x4096 \| compile-4096x4096 1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- max (torch.float32) \| 307.9 \| 265.3 \| 401.0 \| 340.8 \| 766.5 \| 661.9 \| 3463.5 \| 2829.5 max (torch.int32) \| 293.5 \| 263.1 \| 405.0 \| 338.8 \| 761.4 \| 672.5 \| 3050.0 \| 2688.6 max (torch.int64) \| 308.2 \| 255.7 \| 417.4 \| 341.4 \| 877.0 \| 695.0 \| 5812.2 \| 5762.2 ``` `argmax`/`argmin` are much trickier due to the NaN-handling logic that needs to be added there. Also fixes `torch.max/min` compilation for half-precision types and adds regression tests for it. 
This PR also introduces a bunch of helper functions, such as `simd_broadcast` that works for int64 and `c10::metal::pair` template, which are used by `simd_argmax` to return both value and index Pull Request resolved: https://github.com/pytorch/pytorch/pull/158990 Approved by: https://github.com/dcci, https://github.com/Skylion007	2025-07-30 21:57:25 +00:00
William Wen	3a65ff84b6	[dynamo, easy] add comment on skipping sys.monitoring frames (#159493 ) Add a comment so we know why we're doing this code (followup to https://github.com/pytorch/pytorch/pull/159369) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159493 Approved by: https://github.com/azahed98, https://github.com/Lucaskabela, https://github.com/zou3519, https://github.com/jingsh ghstack dependencies: #159369	2025-07-30 21:54:38 +00:00
tiandeyu-cs	acf13a9b75	Fix a bug of distributed 'gather' with uncontiguous tensors on the Gloo backend (#158903 ) Fixes #158902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158903 Approved by: https://github.com/H-Huang	2025-07-30 21:44:29 +00:00
zpcore	3a55676200	fix strategy hashing arg mismatch (#159506 ) Reland https://github.com/pytorch/pytorch/pull/159289. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159506 Approved by: https://github.com/XilunWu	2025-07-30 21:37:13 +00:00
Sam Larsen	af39144a93	Don't use torch.backends.cuda.matmul.allow_tf32 in inductor cache key (#159480 ) Summary: According to https://github.com/pytorch/pytorch/pull/158209, the API is deprecated and we should be using torch.backends.cuda.matmul.fp32_precision instead. Fixes https://github.com/pytorch/pytorch/issues/159440 Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/159480 Approved by: https://github.com/xmfan, https://github.com/oulgen	2025-07-30 21:29:38 +00:00
Aidyn-A	25343b343e	[ATen][CUDA][cuFFT] Guard against deprecated error codes (#159466 ) This PR adds a guard based on CUDA version, per latest cuFFT [documentation](https://docs.nvidia.com/cuda/cufft/index.html#return-value-cufftresult): >The following error codes are deprecated and will be removed in a future release: `CUFFT_INCOMPLETE_PARAMETER_LIST`, `CUFFT_PARSE_ERROR`, `CUFFT_LICENSE_ERROR`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159466 Approved by: https://github.com/albanD, https://github.com/eqy, https://github.com/Skylion007	2025-07-30 21:10:32 +00:00
Xilun Wu	07fad04181	[ContextParallel][FlexAttention] Prototype of supporting FlexAttention in Context Parallel (#158692 ) Summary This PR adds an all-gather based FlexAttention and uses TorchFunctionMode to dispatch `FlexAttentionHOP.__call__` to it. This PR makes the following changes: - add a user-facing API `create_cp_block_mask` for creating CP-specific `BlockMask` which masks over the attention result of Q shard and KV global. - add `_ContextParallelGlobalVars` to store all necessary global vars that CP FlexAttention requires. `torch_function_mode` is critical to maintain singleton mode to avoid dynamo recompilations. - add a dispatch path for `FlexAttentionForwardHOP.__call__` (TorchFunctionMode dispatch won't work correctly without this line) What's not in this PR: - QKV load balancing - Test on other masking besides `causal_mask`. - Support on small attention (i.e. qkv size is smaller than 128) because the block mask rewrite function requires `Q_BLOCK_SIZE == KV_BLOCK_SIZE == 128`. Test `pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention` Followup 1. create an issue to reproduce the error in `create_fw_bw_graph()` when trying to call `create_block_mask` to re-write `block_mask` in `FlexAttentionHOP` dispatch in `TorchFunctionMode`. 2. Merge `_ContextParallelGlobalVars` and `_cp_options`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158692 Approved by: https://github.com/drisspg	2025-07-30 21:01:53 +00:00
PyTorch MergeBot	7ac70ac4cd	Revert "Fix rand_like decomposition to preserve strides (#159294 )" This reverts commit a3a51282dbabe0220c2c3947a89f7d2ecc514d33. Reverted https://github.com/pytorch/pytorch/pull/159294 on behalf of https://github.com/yangw-dev due to failed internal build Failed to load config ([comment](https://github.com/pytorch/pytorch/pull/159294#issuecomment-3137796767))	2025-07-30 20:59:19 +00:00
drisspg	e221a1c853	[Code Motion]Restructure flex attention kernel into flex subdirectory (#159437 ) Mostly code motion, updating relative paths, moving some imports that had to be lazy before to top level scope now that we are free from the curse. This will make it easier to add newer templates and provide some organization Pull Request resolved: https://github.com/pytorch/pytorch/pull/159437 Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng, https://github.com/eellison, https://github.com/Skylion007	2025-07-30 20:12:35 +00:00
Junjie Wang (PyTorch)	4defea1e2c	[c10d] Fix setGroupName and setGroupDesc in `group_split` and `merge_remote_group` (#159429 ) Summary: We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it. We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when split_group is used in DeviceMesh Also ncclx needs to be aware of that its Option is a subclass of BackendOption Test Plan: CI Rollback Plan: Differential Revision: D79201132 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429 Approved by: https://github.com/xunnanxu	2025-07-30 19:55:55 +00:00
saienduri	53d68b95de	[ROCm CI] Migrate to MI325 Capacity. (#159059 ) This PR moves PyTorch CI capacity from mi300 to a new, larger mi325 cluster. Both of these GPUs are the same architecture gfx942 and our testing plans don't change within an architecture, so we pool them under the same label `linux.rocm.gpu.gfx942.<#gpus>` with this PR as well to reduce overhead and confusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159059 Approved by: https://github.com/jithunnair-amd, https://github.com/atalman Co-authored-by: deedongala <deekshitha.dongala@amd.com>	2025-07-30 19:47:59 +00:00
Howard Huang	f74842d57f	[PP] Fix zero bubble schedules for eval() (#159475 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159475 Approved by: https://github.com/tianyu-l, https://github.com/Skylion007	2025-07-30 19:46:10 +00:00
Simon Fan	644fee2610	Fix TestAutogradFallback flaky tests under Dynamo: migrate to lib._destroy() (#159443 ) Under Dynamo, the libraries couldn't be properly cleared unless we manually called `gc.collect()`, but that's slow. Using the `_destroy()` method to tear them down also works. Fixes #159398 #159349 #159254 #159237 #159153 #159114 #159040 #158910 #158841 #158763 #158735 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159443 Approved by: https://github.com/zou3519, https://github.com/Skylion007	2025-07-30 19:30:55 +00:00
Jane Xu	7821fbc560	[BE] Clarify comment to not revert when command has been edited (#159495 ) This is mostly a nit. I was a bit confused when I saw <img width="1032" height="183" alt="image" src="https://github.com/user-attachments/assets/7a18f167-78c1-4c33-ba6f-3588914c642e" /> in https://github.com/pytorch/pytorch/pull/159172 So I decided I should clean up this message a bit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159495 Approved by: https://github.com/yangw-dev, https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/malfet	2025-07-30 19:23:33 +00:00
Justin Chu	73ee323380	[ONNX] RMS Norm (#159377 ) - Implement rms norm using onnx RMSNormalization-23 - Use the correct eps for float32 `eaadd1282c/aten/src/ATen/native/cuda/layer_norm_kernel.cu (L1844-L1866)` <img width="743" height="107" alt="image" src="https://github.com/user-attachments/assets/a6fd45aa-01d9-4667-924d-3012232cfcde" /> - Created facility to run tests with the reference runtime by extending ONNXProgram and assert_onnx_program. Fix https://github.com/pytorch/pytorch/issues/159257 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159377 Approved by: https://github.com/titaiwangms	2025-07-30 18:55:47 +00:00
Justin Chu	176c6446f8	Update CODEOWNERS for ONNX (#159390 ) Update CODEOWNERS for ONNX to reflect current maintainers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159390 Approved by: https://github.com/titaiwangms, https://github.com/malfet	2025-07-30 18:54:25 +00:00
drisspg	debc0591b8	Fix large_tensor_test skipping cpu (#158617 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158617 Approved by: https://github.com/BoyuanFeng	2025-07-30 18:48:07 +00:00
Nikita Shulga	0df78f0c11	Remove `/d2implyavx512upperregs-` flag (#159431 ) And reopen https://github.com/pytorch/pytorch/issues/145702 since this flag is not documented anywhere, slows down sccache-accelerated builds and, per https://developercommunity.visualstudio.com/t/Invalid-code-gen-when-using-AVX2-and-SSE/10527298#T-N10562579 it does not work around a compiler bug, but rather disables some optimizations of AVX512 instructions that are invoked in the AVX2 codepath. Fixes https://github.com/pytorch/pytorch/issues/159082 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159431 Approved by: https://github.com/clee2000	2025-07-30 18:47:03 +00:00
Guilherme Leobas	d0e8a0ec4c	Add CPython test for heapq (#159370 ) Not used directly but used internally by `collections.Counter` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159370 Approved by: https://github.com/zou3519, https://github.com/Skylion007	2025-07-30 18:43:06 +00:00
Aaron Gokaslan	22492848b6	[BE]: Update CUTLASS submodule to 4.1.0 (#158854 ) Update the CUTLASS submodule to the latest version with new supported architectures and new features we can use. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158854 Approved by: https://github.com/henrylhtsang	2025-07-30 17:44:38 +00:00
Rohit Manav	5c14315b05	fixed typo error (#159451 ) Fixes #159375 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159451 Approved by: https://github.com/albanD	2025-07-30 17:41:30 +00:00
PaliC	1b99c1859c	[BE] Make PyObjectSlot use a global PyInterpreter and remove (#158427 ) This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter. 1) For PyObjectSlot we now use a global pyinterpreter since there only is one. From here we change all of the call sites to rely on this assumption. 2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`. For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from the `check_pyobj` call. ``` mpy::handle handle_from_tensor(Arena& A, TensorRef t) { - // fast case: tensor is live in python - std::optional<PyObject> mb_obj = - t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /ignore_hermetic_tls=/false); - if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) { - return mb_obj; - } - return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(t))); -} -} + // fast case: tensor is live in python + std::optional<PyObject> mb_obj = + t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj( + /ignore_hermetic_tls=/false); + if (mb_obj.has_value() && + !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) { + return mb_obj; + } + return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(t))); +} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427 Approved by: https://github.com/albanD	2025-07-30 17:29:43 +00:00
Boyuan Feng	435edbcb5d	[Graph Partition] add graph partition doc (#159450 ) This pr adds doc for graph partition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159450 Approved by: https://github.com/eellison	2025-07-30 17:01:10 +00:00
PyTorch MergeBot	6c6e11c206	Revert "Fix max_width computation in _tensor_str._Formatter (#126859 )" This reverts commit 1465757959dd7e63715b7621650896eca977aefa. Reverted https://github.com/pytorch/pytorch/pull/126859 on behalf of https://github.com/yangw-dev due to broke trunk with test distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_reduce_single - RuntimeError: Expected to find buf7 = empty but did not find it ([comment](https://github.com/pytorch/pytorch/pull/126859#issuecomment-3137137030))	2025-07-30 16:56:32 +00:00
Denghui Dong	a775c8e73e	[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446 ) Hi team, Please help review this patch. This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable. I found the root cause is not that we cannot get C function frames by `PyFrame_GetBack` when PythonTracer is filling start frames, but the c call event loss problem bug on Python 3.12.0-3.12.4. And that problem was fixed by `257c413cd1` on 3.12.5. So I think the https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, this patch reverts the change of it. There are solutions to fix the problem correctly, such as we can add a new monitoring callback to compensate call events of methods with C function or we can override the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain. ~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446 Approved by: https://github.com/sraikund16	2025-07-30 16:35:51 +00:00
Arsh Zahed	24d07b3a67	[inductor] Fix mm decomposition evaluating symints (#158998 ) Fixes #154111 Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor. The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998 Approved by: https://github.com/jansel, https://github.com/BoyuanFeng	2025-07-30 16:34:15 +00:00
James Wu	90fd06be71	Various bugfixes for running NanoGPT training (#159166 ) Fix various small bugs with running nanogpt on torchbenchmark in OSS under python 3.10. After these changes, the following now succeeds: ``` tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance --training --backend inductor --caching-precompile --warm-start-latency ``` Cold start: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp12LuZ5/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 Warm start (we are investigating the recompile): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpT5YTB2/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159166 Approved by: https://github.com/zhxchen17	2025-07-30 16:30:22 +00:00
Meet Vadakkanchery	002f18807e	[DCP] Improve error handling for process based async checkpointing (#159374 ) Summary: ### PR Context - Kill background process only when PG init fails or there is an explicit `TERMINATE` signal from main process. - When a checkpoint fails to save, log and return the error but continue the serving loop. Test Plan: CI Rollback Plan: Differential Revision: D79177410 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159374 Approved by: https://github.com/sibuachu	2025-07-30 16:25:28 +00:00
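The error-handling policy described in the DCP commit above can be sketched in a few lines of plain Python. This is a minimal illustration and not the code from the PR: `serve`, `TERMINATE`, `init_pg` and `save_fn` are hypothetical names, and a `queue.Queue` stands in for whatever channel the real background process uses.

```python
import queue

# Hypothetical sentinel standing in for the explicit TERMINATE signal.
TERMINATE = object()


def serve(requests, save_fn, init_pg):
    """Serve checkpoint requests until TERMINATE; return (status, detail) pairs."""
    try:
        init_pg()
    except Exception as exc:
        # Process-group init failure is fatal: stop before serving anything.
        return [("fatal", repr(exc))]

    results = []
    while True:
        request = requests.get()
        if request is TERMINATE:
            break
        try:
            save_fn(request)
            results.append(("ok", request))
        except Exception as exc:
            # A failed save is reported, but the serving loop keeps running.
            results.append(("error", repr(exc)))
    return results


def flaky_save(name):
    # Illustrative save function that fails for one request.
    if name == "bad":
        raise RuntimeError("disk full")


pending = queue.Queue()
for item in ("ckpt-1", "bad", "ckpt-2", TERMINATE):
    pending.put(item)

outcome = serve(pending, flaky_save, init_pg=lambda: None)
statuses = [status for status, _ in outcome]
```

Here the failed `"bad"` request produces an `"error"` entry while `"ckpt-2"` is still served afterwards.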
Jane Xu	259e79e3ff	Move Half to headeronly (#159172 ) Essence of this copypasta: - combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h - Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy - Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly. - Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly). Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172 Approved by: https://github.com/albanD, https://github.com/desertfire	2025-07-30 16:11:58 +00:00
Aidyn-A	ee343ce60c	[RPC][TensorPipe] Fix import torch if compiled without TensorPipe (#159461 ) This is a follow up on the PR #154382, as the issue still persists: ``` File "/opt/pytorch/pytorch/torch/distributed/rpc/__init__.py", line 81, in <module> from . import api, backend_registry, functions File "/opt/pytorch/pytorch/torch/distributed/rpc/api.py", line 35, in <module> from .constants import DEFAULT_SHUTDOWN_TIMEOUT, UNSET_RPC_TIMEOUT File "/opt/pytorch/pytorch/torch/distributed/rpc/constants.py", line 3, in <module> from torch._C._distributed_rpc import ( ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159461 Approved by: https://github.com/lw	2025-07-30 16:04:02 +00:00
Avik Chaudhuri	ea5369113a	unflatten closure (#159418 ) Summary: Sometimes the call history recorded in a `nn_module_stack` does not have the stack property, where each FQN is a prefix of the next FQN. This can cause errors during `unflatten`. Instead of erroring we now drop entries from such a `nn_module_stack` to restore the stack property. This effectively leads to less unflattening: the last FQN in the call history before the stack property was broken keeps the entire flat subgraph of its call. Test Plan: added test, updated another Rollback Plan: Differential Revision: D79204669 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159418 Approved by: https://github.com/angelayi	2025-07-30 15:42:18 +00:00
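The "stack property" in the unflatten commit above can be illustrated with a short pure-Python sketch. It assumes that "dropping entries" means truncating the call history at the first FQN that is not nested under the previous one; the actual PR may choose which entries to drop differently, and both function names here are hypothetical.

```python
def is_fqn_prefix(parent, child):
    """Return True if `parent` is a dotted-name prefix of `child`."""
    if parent == "":
        # The root module is a prefix of every FQN.
        return True
    return child == parent or child.startswith(parent + ".")


def restore_stack_property(fqns):
    """Keep the longest leading run of FQNs where each is a prefix of the next."""
    kept = []
    for fqn in fqns:
        if kept and not is_fqn_prefix(kept[-1], fqn):
            # Stack property broken: drop this entry and everything after it.
            break
        kept.append(fqn)
    return kept


history = ["", "layers", "layers.0", "other.proj"]
cleaned = restore_stack_property(history)
```

With this input, `cleaned` ends at `"layers.0"`, so that module would keep the whole flat subgraph of its call, matching the "less unflattening" effect the commit message describes.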
Jane Xu	b268f22ab2	Move Float4 to headeronly (#159414 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159414 Approved by: https://github.com/desertfire	2025-07-30 15:34:01 +00:00
Animesh Jain	52a52d1b78	[dynamo][guards] Skip no tensor aliasing guard on inbuilt nn module buffers (#159453 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159453 Approved by: https://github.com/jansel	2025-07-30 15:31:07 +00:00
PyTorch MergeBot	eaadd1282c	Revert "Move Half to headeronly (#159172 )" This reverts commit 6d0f4566e2b6e05369d8bb6c0d0e83a0eee982aa. Reverted https://github.com/pytorch/pytorch/pull/159172 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/16613893793/job/47002486679) [HUD commit link](`6d0f4566e2`). Note to self: why isn't Dr. CI updating ([comment](https://github.com/pytorch/pytorch/pull/159172#issuecomment-3136769493))	2025-07-30 15:10:26 +00:00
Raphael Reme	1465757959	Fix max_width computation in _tensor_str._Formatter (#126859 ) The previous version of `torch._tensor_str._Formatter` was not using `PRINT_OPTS.sci_mode` for the `max_width` computation but was using it for the formatting of values, leading to a weird discrepancy. Now the code first checks whether it should be in sci_mode, then computes `max_width`. Here is an example to test the behavior: ```python A = torch.tensor([10, 1e-1, 1e-2]) B = torch.tensor([10, 1e-1, 1e-1]) print("================= Default =================") print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}") print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}") print("================= sci_mode=False =================") with torch._tensor_str.printoptions(sci_mode=False): print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}") print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}") print("================= sci_mode=True =================") with torch._tensor_str.printoptions(sci_mode=True): print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}") print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}") ``` In the current version this prints: ``` ================= Default ================= tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10 tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7 ================= sci_mode=False ================= tensor([ 10.0000, 0.1000, 0.0100]) Formatter max_width: 10 tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7 ================= sci_mode=True ================= tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10 tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7 ``` One can see that in `sci_mode=False`, the values of A are prefixed with an unneeded 0 and A does not have the same `max_width` as B (it keeps the `max_width` from `sci_mode = None`). Also, in `sci_mode = True`, for B, the `max_width` is 7 but each value takes 10 chars. (This is fine in practice, as the code that uses `max_width` does not rely much on it, but it is still misleading.) After this commit, this will print ``` ================= Default ================= tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10 tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7 ================= sci_mode=False ================= tensor([10.0000, 0.1000, 0.0100]) Formatter max_width: 7 tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7 ================= sci_mode=True ================= tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10 tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10 ``` This also aligns A with B for `sci_mode=False`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859 Approved by: https://github.com/malfet	2025-07-30 14:01:00 +00:00
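The ordering described in the `_Formatter` commit above (settle on the notation first, then measure the width) can be sketched without torch. This is a minimal illustration, not the `_Formatter` implementation: `format_values` and `max_width` are hypothetical helpers, `sci_mode` is passed explicitly instead of being derived from the data, and the precision of 4 is assumed to match the default print options.

```python
def format_values(values, sci_mode, precision=4):
    """Format every value with one notation, chosen before any width is measured."""
    spec = "{:." + str(precision) + ("e}" if sci_mode else "f}")
    return [spec.format(v) for v in values]


def max_width(values, sci_mode, precision=4):
    """Width of the widest string produced by the notation actually in use."""
    return max(len(s) for s in format_values(values, sci_mode, precision))


A = [10.0, 1e-1, 1e-2]
B = [10.0, 1e-1, 1e-1]

widths = {
    ("A", False): max_width(A, sci_mode=False),
    ("A", True): max_width(A, sci_mode=True),
    ("B", False): max_width(B, sci_mode=False),
    ("B", True): max_width(B, sci_mode=True),
}
```

Because the width is always measured on strings produced by the chosen notation, the two can no longer disagree, which is the discrepancy the commit message reports for the old ordering.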
Ke Wen	17b9c618dd	[a2av] not returning out tensor from ops (#159435 ) torch.compile of `all_to_all_vdev_2d` hits the following error: ``` torch._dynamo.exc.BackendCompilerFailed: backend='aot_eager' raised: RuntimeError: Found a custom (non-ATen) operator whose output has alias annotations: symm_mem::all_to_all_vdev_2d(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name, int? major_align=None) -> Tensor(a!). We only support functionalizing operators whose outputs do not have alias annotations (e.g. 'Tensor(a)' is a Tensor with an alias annotation whereas 'Tensor' is a Tensor without. The '(a)' is the alias annotation). The alias annotation specifies that the output Tensor shares storage with an input that has the same annotation. Please check if (1) the output needs to be an output (if not, don't return it), (2) if the output doesn't share storage with any inputs, then delete the alias annotation. (3) if the output indeed shares storage with an input, then add a .clone() before returning it to prevent storage sharing and then delete the alias annotation. Otherwise, please file an issue on GitHub. ``` This PR selects option (1). Pull Request resolved: https://github.com/pytorch/pytorch/pull/159435 Approved by: https://github.com/ngimel, https://github.com/xmfan	2025-07-30 08:30:25 +00:00
Yu, Guangye	d3ce45012e	Generalize torch._C._set_allocator_settings to be generic (#156175 ) # Motivation This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`. Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175 Approved by: https://github.com/albanD ghstack dependencies: #149601, #157908, #150312, #156165	2025-07-30 06:37:15 +00:00
Yu, Guangye	1fc010a9d8	Deprecate overlapping functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165 Approved by: https://github.com/albanD ghstack dependencies: #149601, #157908, #150312	2025-07-30 06:37:15 +00:00
Yu, Guangye	dfacf11f66	Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312 ) # Motivation Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312 Approved by: https://github.com/albanD ghstack dependencies: #149601, #157908	2025-07-30 06:37:06 +00:00
Yu, Guangye	c8cf811995	Enable AcceleratorAllocatorConfig key check (#157908 ) # Motivation Add a mechanism that raises an error when a key in the allocator config is unrecognized. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157908 Approved by: https://github.com/albanD ghstack dependencies: #149601	2025-07-30 06:36:56 +00:00
Yu, Guangye	914b1a3873	Introduce AcceleratorAllocatorConfig as the common class (#149601 ) # Motivation This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The other name, `AllocatorConfig`, is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path. # Design Rule ## Overall This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`). Introduce a new class `ConfigTokenizer` to help process the key-value pairs of the environment variable config. ## Naming Convention: - Public API names in `AcceleratorAllocatorConfig` should be device-generic. - Members prefixed with `pinned_` are specific to the host/pinned allocator. - Environment variable names should be generic across backends. - Comma-separated key-value pairs in the format: `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]` ## Environment Variables: - The default environment variable for configuration is `PYTORCH_ALLOC_CONF`. - For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority. Differential Revision: [D79011786](https://our.internmc.facebook.com/intern/diff/D79011786) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601 Approved by: https://github.com/albanD	2025-07-30 06:36:46 +00:00
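To make the configuration format from the commit above concrete, here is a rough Python sketch of parsing `key1:123, key2:[val1,val2]` and of choosing which environment variable wins. It is not the C++ `ConfigTokenizer`: `parse_alloc_conf` and `resolve_alloc_conf` are hypothetical names, values are kept as strings, and the relative order of the two backward-compatible variables is an assumption, since the message only says both rank below `PYTORCH_ALLOC_CONF`.

```python
def parse_alloc_conf(conf):
    """Parse 'key:value' pairs separated by commas; '[a,b]' becomes a list."""
    result = {}
    i, n = 0, len(conf)
    while i < n:
        # Skip separators between pairs.
        while i < n and conf[i] in ", ":
            i += 1
        if i >= n:
            break
        colon = conf.find(":", i)
        if colon == -1:
            raise ValueError("expected 'key:value' near " + repr(conf[i:]))
        key = conf[i:colon].strip()
        i = colon + 1
        while i < n and conf[i] == " ":
            i += 1
        if i < n and conf[i] == "[":
            # List value: everything up to the closing bracket.
            close = conf.find("]", i)
            if close == -1:
                raise ValueError("unterminated list for key " + repr(key))
            items = conf[i + 1:close].split(",")
            result[key] = [item.strip() for item in items if item.strip()]
            i = close + 1
        else:
            # Scalar value: everything up to the next comma.
            comma = conf.find(",", i)
            end = n if comma == -1 else comma
            result[key] = conf[i:end].strip()
            i = end
    return result


def resolve_alloc_conf(env):
    """Return the highest-priority config string found in `env`, or ''."""
    # Assumption: CUDA is checked before HIP among the legacy names.
    for name in ("PYTORCH_ALLOC_CONF", "PYTORCH_CUDA_ALLOC_CONF", "PYTORCH_HIP_ALLOC_CONF"):
        if name in env:
            return env[name]
    return ""


parsed = parse_alloc_conf("key1:123, key2:[val1,val2]")
chosen = resolve_alloc_conf({"PYTORCH_CUDA_ALLOC_CONF": "a:1", "PYTORCH_ALLOC_CONF": "b:2"})
```

Passing `os.environ` as `env` would apply the same lookup to the real process environment.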
Animesh Jain	7eb5fdb358	[dynamo][guards] Recursive dict tag optimization (#159183 ) Design doc here - https://docs.google.com/document/d/1W29DrWID5miGWlZXspsQVN5U0zydE3kjZpziOXrhuaY/edit?tab=t.0#bookmark=id.sba04iw9sp68 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159183 Approved by: https://github.com/jansel	2025-07-30 06:01:32 +00:00
Sheng Fu	f1fb57d854	Add user annotation for FX graph cache key (#159318 ) Summary: AI system co-design team requested to add user annotation for FX graph cache key in PyTorch Kineto trace and Execution trace. With this annotation, they can know the FX graph to which the kernels belong. Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA Rollback Plan: Differential Revision: D79019069 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159318 Approved by: https://github.com/sraikund16, https://github.com/jansel	2025-07-30 05:52:50 +00:00
Jane Xu	6d0f4566e2	Move Half to headeronly (#159172 ) Essence of this copypasta: - combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h - Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy - Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly. - Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly). Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172 Approved by: https://github.com/albanD, https://github.com/desertfire	2025-07-30 05:02:13 +00:00
PyTorch UpdateBot	e785c087c5	[audio hash update] update the pinned audio hash (#159321 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159321 Approved by: https://github.com/pytorchbot	2025-07-30 04:35:01 +00:00
Svetlana Karslioglu	d214901133	Add a title to distributed._dist2.md (#159385 ) Sphinx likes titles and complains about them when they are not there. So adding a title to address this warning in the build: ``` WARNING: toctree contains reference to document 'distributed._dist2' that doesn't have a title: no link will be generated ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159385 Approved by: https://github.com/d4l3k	2025-07-30 04:09:41 +00:00
Jane Xu	96ac64d00c	Migrate easy q(u)int/bits stuff to torch/headeronly (#159302 ) Straightup copy pasta. Keeps APIs in c10 and reexposes them to torch::headeronly. It is arguable that we should just get rid of some of these unused dtypes but that is outside the scope of this PR, which is meant to build up to ScalarType moving to headeronly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159302 Approved by: https://github.com/malfet, https://github.com/albanD	2025-07-30 03:41:27 +00:00
Colin Peppler	46d34d6766	(should_fold) gso to guard_or_false when checking folding whether to 3d bmm into 2d mm (#159184 ) Switch from guard_size_oblivious to guard_or_false if you encounter a DDE, this would then avoid folding this 3d bmm into a mm. `806d9e3fe7/torch/_decomp/decompositions.py (L4506-L4512)` ## DDE ``` File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul elif should_fold(tensor1, tensor2, is_out): File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4472, in should_fold if guard_size_oblivious(t1.numel() == 0): torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(12((u0//2)), 0) (unhinted: Eq(12((u0//2)), 0)). (Size-like symbols: none) Caused by: (_decomp/decompositions.py:4472 in should_fold) ``` ``` File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul elif should_fold(tensor1, tensor2, is_out): File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4483, in should_fold return all( torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3((u0//2)), 3) (unhinted: Eq(3((u0//2)), 3)). (Size-like symbols: none) Caused by: (_decomp/decompositions.py:4483 in should_fold) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159184 Approved by: https://github.com/ezyang ghstack dependencies: #158894	2025-07-30 03:12:14 +00:00
clr	880249adbc	dynamo: handle AttributeErrors from nn_module when infer_paramaters throws. (#158501 ) This only handles AttributeError, but in general, any exception coming from here is a user exception. let me know if we prefer to catch all exceptions, and then reraise them as observed exceptions. ``` File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 2200, in CALL_FUNCTION self.call_function(fn, args, {}) File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 1210, in call_function self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type] File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/lazy.py", line 201, in realize_and_forward return getattr(self.realize(), name)(args, kwargs) File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 472, in call_function initialize_lazy_module(tx, mod, args, kwargs) File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 104, in initialize_lazy_module mod._infer_parameters(mod, fake_args, fake_kwargs) File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/lazy.py", line 261, in _infer_parameters module.initialize_parameters(args, *kwargs) ..., File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/module.py", line 1962, in __getattr__ raise AttributeError( torch._dynamo.exc.InternalTorchDynamoError: AttributeError: '...' object has no attribute '...' ``` Note that we crash with a sligthly different exception trace in the other test I added. Let me know if we want this to not throw directly to the end user. 
``` ====================================================================== ERROR: test_lazy_module_bad_params (__main__.NNModuleTests.test_lazy_module_bad_params) ---------------------------------------------------------------------- Traceback (most recent call last): File "/data/users/clr/pytorch/torch/testing/_internal/common_utils.py", line 3223, in wrapper method(args, *kwargs) ~~~~~~^^^^^^^^^^^^^^^^^ File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 1683, in test_lazy_module_bad_params exp_res = opt_m(x, y) File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 411, in __call__ return super().__call__(args, *kwargs) ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(args, *kwargs) ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(args, *kwargs) File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 473, in _call_lazy_check self._orig_mod._infer_parameters(self._orig_mod, args, kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/clr/pytorch/torch/nn/modules/lazy.py", line 261, in _infer_parameters module.initialize_parameters(args, **kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 711, in initialize_parameters self.foo += 1 ^^^^^^^^ File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1962, in __getattr__ raise AttributeError( f"'{type(self).__name__}' object has no attribute '{name}'" ) AttributeError: 'LazyModuleBadInferParams' object has no attribute 'foo' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158501 Approved by: https://github.com/williamwen42, https://github.com/jansel	2025-07-30 02:41:41 +00:00
Xu Han	846ada4973	[AOTI] disable crashed AOTI UTs on Windows. (#159427 ) disable crashed AOTI UTs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159427 Approved by: https://github.com/angelayi	2025-07-30 02:23:27 +00:00
Yu, Guangye	badd0618e4	Remove unused paramter on CUDA AllocParams (#159159 ) # Motivation While refactoring the caching allocator, I noticed that the `AllocParams` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion. # Additional Context I noticed that `AllocParams` is defined in cpp file, so it should be safe to make this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159159 Approved by: https://github.com/cyyever, https://github.com/albanD	2025-07-30 02:05:25 +00:00
PaliC	a753a72b14	[BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407 ) This PR makes some less risky changes to PyObjectSlot as there is a lot of stuff we do not need since there is only one interpreter. Specifically `check_interpreter` and `has_pyobj_nonhermetic` are removed Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407 Approved by: https://github.com/albanD ghstack dependencies: #158290, #158291	2025-07-30 01:36:03 +00:00
PaliC	b57d1ef110	[BE] Remove __reduce_deploy__ (#158291 ) This PR removes the integration point torch.fx had with torch::deploy (and another minor change). Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291 Approved by: https://github.com/albanD ghstack dependencies: #158290	2025-07-30 01:36:03 +00:00
PaliC	dd7c996d5c	[BE] Remove torch deploy \| remove torch deploy specific files (#158290 ) This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290 Approved by: https://github.com/albanD	2025-07-30 01:36:03 +00:00
Kurt Mohler	70d2e9ba45	[MPS] Avoid outputing zeros from `exponential_` for MPS (#159386 ) Fixes #159103 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159386 Approved by: https://github.com/malfet	2025-07-30 00:20:31 +00:00
eqy	62f98dbb44	[CUDA][Convolution] Add `tf32_on_and_off` decorator to `test_deconv_freezing_cuda` (#159280 ) Blackwell seems to select TF32 kernels for this case Pull Request resolved: https://github.com/pytorch/pytorch/pull/159280 Approved by: https://github.com/zou3519, https://github.com/jingsh, https://github.com/Skylion007	2025-07-29 23:44:10 +00:00
PyTorch MergeBot	e288c258f7	Revert "Remove tensorexpr tests (#158928 )" This reverts commit d742a2896c571a535003d5928fe80397325575a5. Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/yangw-dev due to this breaks bunch of internal dependency since some tests are still using the deleted test files from this pr, the internal reviewer please help fix this using codev ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3134378616))	2025-07-29 23:32:07 +00:00
William Wen	df58db8831	[dynamo, docs] add recompilation, observability, reporting issues docs (#159062 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159062 Approved by: https://github.com/svekars, https://github.com/zou3519, https://github.com/anijain2305	2025-07-29 23:23:51 +00:00
Nikita Shulga	15bb81ea4f	[2/N][CI] Remove MacOS-13 workarounds from tests (#159304 ) Part of https://github.com/pytorch/pytorch/issues/159275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159304 Approved by: https://github.com/dcci, https://github.com/cyyever ghstack dependencies: #159277, #159278	2025-07-29 23:12:13 +00:00
saienduri	8d37073bac	[ROCm] Update jit_utils.cpp trait modification based on HIP version. (#159292 ) The mi355 ci regression and hiprtc kernel compilation is failing due to duplicate definitions of traits leading to errors like `error: redefinition of 'integral_constant'`. This seems to be the culprit: https://github.com/pytorch/pytorch/pull/158868. Checking if using hip version instead of rocm version for the check would help with resolution here as rocm version and hip version aren't synced. ROCm 7.0 Alpha build used in CI is still on HIP 6.5. Confirmed that this patch works here: https://github.com/pytorch/pytorch/actions/runs/16579227179?pr=159292 Also, this PR increases the frequency of this MI355 CI to twice a day so we can catch and identify regressions easier if they happen for now. Jeff is on vacation, so Jithun asked me to reach out to y'all. Please help stamp and approve, so we can resolve the recent MI355 CI regression/timeout (https://github.com/pytorch/pytorch/actions/workflows/rocm-mi355.yml) :) @huydhn @malfet @atalman @seemethere Pull Request resolved: https://github.com/pytorch/pytorch/pull/159292 Approved by: https://github.com/malfet	2025-07-29 22:45:27 +00:00
AaronWang04	dc286aef61	Fused RMSNorm Housekeeping (#159317 ) Small PR to address comments that were made from the original fused rmsnorm PR that were not landed Changes: - Warning message when input.dtype doesn't match weight.dtype - Ensure default epsilon value is correct Comments: https://github.com/pytorch/pytorch/pull/153666#discussion_r2114735005 https://github.com/pytorch/pytorch/pull/153666#discussion_r2223518064 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159317 Approved by: https://github.com/ngimel, https://github.com/Skylion007, https://github.com/eqy	2025-07-29 22:39:18 +00:00
Oguz Ulgen	b4619f0272	Pin Helion to 0.0.10 in PyTorch CI (#159420 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159420 Approved by: https://github.com/aorenste, https://github.com/malfet	2025-07-29 22:06:50 +00:00
William Wen	477c2273e1	[dynamo] better way to skip tracing sys.monitoring callables (#159369 ) Better approach to https://github.com/pytorch/pytorch/pull/158171, according to https://github.com/python/cpython/issues/137178#issuecomment-3131617493. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159369 Approved by: https://github.com/Skylion007	2025-07-29 21:54:58 +00:00
Will Constable	2176d481c1	[DTensor] dispatch to sharding prop over decomps (#159324 ) Fixes #159110 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159324 Approved by: https://github.com/ezyang	2025-07-29 21:28:36 +00:00
Guilherme Leobas	b97274e8ac	[iter] Raise TypeError if iter arg cannot be iterable (#158410 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158410 Approved by: https://github.com/XuehaiPan, https://github.com/zou3519 ghstack dependencies: #156371, #156416, #156460	2025-07-29 21:24:21 +00:00
Guilherme Leobas	f9be65cea4	[iter] Wrap iter(..) call in a ObjectIteratorVariable (#156460 ) This object keeps track of when the iterator is exhausted (raises StopIteration). Pull Request resolved: https://github.com/pytorch/pytorch/pull/156460 Approved by: https://github.com/zou3519 ghstack dependencies: #156371, #156416	2025-07-29 21:24:20 +00:00
Guilherme Leobas	4e3e3dc0a7	[iter] support `iter(callable, sentinel)` (#156416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156416 Approved by: https://github.com/XuehaiPan, https://github.com/zou3519 ghstack dependencies: #156371	2025-07-29 21:24:20 +00:00
Guilherme Leobas	fcf59df2b6	[iter] Add support for sequence protocol in `iter(..)` (#156371 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156371 Approved by: https://github.com/zou3519	2025-07-29 21:24:20 +00:00
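The four `[iter]` commits above concern how Dynamo traces the built-in `iter`. The sketch below shows only the plain CPython behaviors involved (sequence protocol, `iter(callable, sentinel)`, `TypeError` for non-iterables, and exhaustion); it does not use Dynamo, and the class and helper names are made up for illustration. The assumption is that traced code is meant to match these semantics.

```python
class Squares:
    """Iterable via the sequence protocol: __getitem__ plus IndexError, no __iter__."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, i):
        if i >= self.n:
            raise IndexError(i)
        return i * i


# iter() falls back to calling __getitem__ with 0, 1, 2, ... until IndexError.
squares = list(iter(Squares(4)))


def make_counter():
    """Return a zero-argument callable that yields 1, 2, 3, ... on each call."""
    state = {"n": 0}

    def step():
        state["n"] += 1
        return state["n"]

    return step


# iter(callable, sentinel) calls the callable until it returns the sentinel.
until_three = list(iter(make_counter(), 3))

# A non-iterable argument raises TypeError.
try:
    iter(42)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Once exhausted, an iterator keeps raising StopIteration (here mapped to a default).
it = iter([1])
first = next(it)
after_exhaustion = (next(it, "done"), next(it, "done"))
```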
Nick Riasanovsky	1bcb2f41e0	[BE] Eliminate workspace info in templates with new API (#159055 ) Summary: Moves the workspace info calculations to the old TMA API. Test Plan: NFC Rollback Plan: Differential Revision: D78904434 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159055 Approved by: https://github.com/NikhilAPatel	2025-07-29 21:22:36 +00:00
Zhengxu Chen	8460131087	[nativert] Add OSS version of ModelRunner (#159268 ) Summary: Implement a ModelRunner from scratch with the minimum features for OSS only Test Plan: test_export -r NativeRT Rollback Plan: Differential Revision: D78979812 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159268 Approved by: https://github.com/dolpm	2025-07-29 21:08:14 +00:00
PyTorch MergeBot	c0c24b61ff	Revert "Partitioner: Fix to align partition node order with original graph (#157892 )" This reverts commit 2d1e92307d3e67622f4fe8058d62e44fe4fa2f4e. Reverted https://github.com/pytorch/pytorch/pull/157892 on behalf of https://github.com/yangw-dev due to fails internal tests : [executorch/backends/xnnpack/partition/xnnpack_partitioner.py:101:24] Incompatible parameter type [6]: In call `Partition.__init__`, for argument `nodes`, expected `Optional[Iterable[Tuple[Node, Optional[int]]]]` but got `dict_keys[Node, str]`. ([comment](https://github.com/pytorch/pytorch/pull/157892#issuecomment-3134004881))	2025-07-29 20:41:45 +00:00
Sahan Paliskara	4fac43b21f	[BE] Move _freeze.py to torch/fb/utils (#159307 ) Summary: We are trying to deprecate torch deploy externally. However, a bunch of legacy stuff still uses it. This PR allows the legacy tests to still run if necessary. Test Plan: It's a targets change so CI should suffice Rollback Plan: Differential Revision: D78910653 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159307 Approved by: https://github.com/albanD	2025-07-29 20:07:17 +00:00
Aaron Orenstein	b794e77b7b	Disable cudagraph GCs by default (#158649 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158649 Approved by: https://github.com/eellison ghstack dependencies: #158193	2025-07-29 19:56:11 +00:00
PyTorch MergeBot	d987a6f7f0	Revert "[Dynamo][Better Engineering] Add typing annotations to guard and source (#158397 )" This reverts commit abcb24f4de11f8fedf2c2c9ff53b6092ef42306d. Reverted https://github.com/pytorch/pytorch/pull/158397 on behalf of https://github.com/yangw-dev due to Suggested to fix failing internal signals on D78911890 ([comment](https://github.com/pytorch/pytorch/pull/158397#issuecomment-3133823766))	2025-07-29 19:49:40 +00:00
PyTorch MergeBot	5d93127c87	Revert "[HOP, map] Rework of map autograd to the new interface (#153343 )" This reverts commit 24b1f10ca13d682430725c511812e43a35fcd6a6. Reverted https://github.com/pytorch/pytorch/pull/153343 on behalf of https://github.com/yangw-dev due to a older pr this pr dependes on needed to revert, rebase it after it's in ([comment](https://github.com/pytorch/pytorch/pull/153343#issuecomment-3133816812))	2025-07-29 19:46:42 +00:00
Sam Larsen	a3a51282db	Fix rand_like decomposition to preserve strides (#159294 ) Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898. Test Plan: New unit test (fails before this PR; but fixed after) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294 Approved by: https://github.com/eellison	2025-07-29 19:26:20 +00:00
PyTorch MergeBot	e557b3d5e5	Revert "[inductor] Fix mm decomposition evaluating symints (#158998 )" This reverts commit 52e180c3799a7638ee668b1291a711865ab8cfec. Reverted https://github.com/pytorch/pytorch/pull/158998 on behalf of https://github.com/yangw-dev due to it broke trunk with pr_time_benchmark test ([comment](https://github.com/pytorch/pytorch/pull/158998#issuecomment-3133696775))	2025-07-29 19:04:11 +00:00
eellison	f3a9e99036	Fix inductor cuda sort nan behavior (#159308 ) Fix for https://github.com/pytorch/pytorch/issues/152423 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159308 Approved by: https://github.com/isuruf	2025-07-29 19:02:45 +00:00
Animesh Jain	f7d6e9f500	[dynamo][guards] More small guard optimizations (#159345 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159345 Approved by: https://github.com/williamwen42 ghstack dependencies: #159288	2025-07-29 18:36:49 +00:00
Animesh Jain	e43e09e6c1	[dynamo][guards] Use lambda guards for object aliasing to improve object aliasing guards (#159288 ) # Note - On Lambda guarding of object aliasing # We previously installed object‑aliasing guards as relational guards, # but that undermined the recursive‑dict guard optimization: placing the # aliasing guard at a leaf prevented the parent dict node from # qualifying as a recursive‑dict guard root. Because aliasing guards are # rare, we now emit them as epilogue guards via a small Python lambda. # This repeats the access in Python—adding a bit of work—but the # overhead is outweighed by the gains from enabling recursive‑dict guard # optimization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159288 Approved by: https://github.com/StrongerXi	2025-07-29 18:36:49 +00:00
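The note in the commit above describes emitting object-aliasing guards as a small Python lambda that repeats the attribute access. A rough, pure-Python illustration of that idea follows; `make_aliasing_guard` and the `scope` dict are hypothetical stand-ins and do not reflect Dynamo's actual guard classes.

```python
def make_aliasing_guard(access_a, access_b):
    """Return a guard that re-runs both accessors and checks object identity."""
    return lambda scope: access_a(scope) is access_b(scope)


shared = [1, 2, 3]
scope = {"x": shared, "cfg": {"inner": shared}}

# The guard records that scope["x"] and scope["cfg"]["inner"] were the same object.
guard = make_aliasing_guard(lambda s: s["x"], lambda s: s["cfg"]["inner"])
holds_initially = guard(scope)

# Rebinding to an equal but distinct list breaks the aliasing, so the guard fails.
scope["cfg"]["inner"] = [1, 2, 3]
holds_after_rebind = guard(scope)
```

The guard walks both access paths again on every check, which is the extra Python work the commit message mentions trading for the recursive-dict optimization.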
milliechen	2004f8aa10	FXConverter handling of generic output in inductor fallback kernel (#159002 ) (#159297 ) Summary: A fallback kernel's output may be a non-list/tuple but a `MultiOutput` with empty indices. Allow the `FXConverter` to handle such case. Test Plan: Modified the fxir test for fallbacks, then ran `buck2 test mode/dev-nosan caffe2/test/inductor:fxir_backend -- test_fallback`. Before this diff the modified test would fail with ``` File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 341, in generate line.codegen_fx(self)(line) File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 489, in _generate_multi_output inds = line.indices[0][1:] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: IndexError: list index out of range ``` (Full error paste in P1878839403) With this diff the error is no longer present. Rollback Plan: Differential Revision: [D79126619](https://our.internmc.facebook.com/intern/diff/D79126619) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159297 Approved by: https://github.com/blaine-rister	2025-07-29 18:29:01 +00:00
Edward Z. Yang	31b3b38e3a	Ensure export joint with descriptors + compile works (#159337 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159337 Approved by: https://github.com/wconstab ghstack dependencies: #159336	2025-07-29 17:43:52 +00:00
Edward Z. Yang	2f0db0444e	Track previous MetricsContext edits for ease of debugging. (#159336 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159336 Approved by: https://github.com/wconstab	2025-07-29 17:43:52 +00:00
PaliC	6162e650b0	[BE] remove torch deploy - conditionals (#158288 ) This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started. 1. Remove test_deploy_interaction as we no longer need to worry about this 2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1) 3. Remove `USE_DEPLOY` and switch to the default path always Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288 Approved by: https://github.com/albanD	2025-07-29 17:40:49 +00:00
Lucas Kabela	5d89634ca8	Graph break with error message (#158800 ) Fixes #157452 Test with ``` python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks ``` ### Release Notes Change to nn.Parameter Constructor Behavior in Dynamo Semantic change introduced in the nn.Parameter constructor; previously, if the constructor lacked a clean source, the system would attempt to infer arguments to construct a clone and lift this synthetic proxy in the computation graph. This approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source. Users are now advised to manually move the constructor out of the graph in such cases. This change improves clarity and reduces complexity in graph construction and debugging. If this cannot be done, users can fall back to the old semantics with `torch._dynamo.config.graph_break_on_nn_param_ctor=False` as an escape hatch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800 Approved by: https://github.com/anijain2305	2025-07-29 17:34:49 +00:00
Arsh Zahed	52e180c379	[inductor] Fix mm decomposition evaluating symints (#158998 ) Fixes #154111 Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor. The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998 Approved by: https://github.com/jansel, https://github.com/BoyuanFeng	2025-07-29 17:29:38 +00:00
anwang	c55e72bea1	[Re-land][Inductor] Support native Inductor as backend for MTIA (#159211 ) The previous [diff/PR] (https://github.com/pytorch/pytorch/pull/158526) was reverted due to this docstring lint error: <img width="1736" height="722" alt="image" src="https://github.com/user-attachments/assets/216b1720-4002-48da-b5f3-32b5d48aaa54" /> I didn't add the docstring cause I thought I'm not supposed to add docstring for an EXISTING function. So this diff/PR is an exactly copy of the previous one, except for adding the docstring. ------------- This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code(triton kernel + python wrapper code) similar to CUDA. And the triton kernels can be launched eagerly. The changes include: - Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc. - Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc. - MTIA specific codegen logic, for example, loading MTIA dynamic_library. - Other necessary changes to integrate with Inductor codegn, following other devices like CUDA, XPU. - Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend. - A change in Inductor runtime to avoid re-initialize MTIADriver. - BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag. - Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag. - Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose. Note: - This approach(option 3) aims to provide a pytorch native approach of Inductor integration for MTIA, minimizing the onboarding overhead. 
The downside of this approach is that it doesn't leverage MTIA specific graph optimization, and is limited by eager launch overhead. - MTIA will support another approach(option 2) to provide best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, stream/event APIs, etc, especially as WrapperFxCodegen inherits PythonWrapperCodegen. Internal: References: - [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/) - [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb) - [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w) - [early prototyping diff](https://www.internalfb.com/diff/D75110196) - [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959) - [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678) Differential Revision: [D79040806](https://our.internmc.facebook.com/intern/diff/D79040806/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159211 Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/jansel	2025-07-29 17:03:24 +00:00
Sherlock Huang	750348b579	[NativeRT] Clean up use of TargetDevice in KernelFactory (#159298 ) Summary: Remove use of targetDevice in KernelFactory. AOTI would infer device when creating AOTIDelegateExecutor. Test Plan: CI Rollback Plan: Reviewed By: dolpm Differential Revision: D79007317 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159298 Approved by: https://github.com/dolpm	2025-07-29 16:24:33 +00:00
Kurt Mohler	52b9af163c	Add `avg_pool3d` for MPS (#158877 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158877 Approved by: https://github.com/malfet	2025-07-29 15:22:22 +00:00
James Wu	f4bfac11c7	[Precompile] [easy] API For Editable PrecompileCacheArtifacts (#158586 ) This adds an option for backend precompile artifacts to be editable, i.e. to not serialize them right away, but instead be able to apply a Callable edit_fn to them. This allows us to support editing the precompile artifact with more updated autotune results at a later time in the next PR. The goal flow here is: - User runs AOTAutograd -> Inductor -> Triton - User saves to AOTAutogradCache the normal results - User runs autotuning - User calls serialize(), it takes the new autotuning results at runtime and saves only the necessary triton kernels. This PR just implements the API for editing the cache artifacts. The next PR actually adds the autotuning saving support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158586 Approved by: https://github.com/zhxchen17	2025-07-29 14:53:21 +00:00
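The flow described in the commit above (keep the artifact unserialized, apply a Callable `edit_fn` later, then serialize) might look roughly like the following. This is a sketch under stated assumptions, not the API added by the PR: `EditableArtifact`, its methods, and the dict layout are all hypothetical, and `pickle` stands in for whatever serialization the cache actually uses.

```python
import pickle


class EditableArtifact:
    """Holds cache content in memory so it can be edited before serialization."""

    def __init__(self, key, content):
        self.key = key
        self.content = content

    def edit(self, edit_fn):
        # Apply a user-supplied callable to the not-yet-serialized content.
        self.content = edit_fn(self.content)

    def serialize(self):
        # Serialization is deferred until the caller asks for it.
        return pickle.dumps((self.key, self.content))


# Initial compile result: several candidate kernels, no autotune winner yet.
artifact = EditableArtifact("graph_0", {"kernels": ["k_a", "k_b", "k_c"], "best": None})


def keep_autotuned_winner(content, winner="k_b"):
    """Hypothetical edit: keep only the kernel that autotuning selected."""
    return {"kernels": [winner], "best": winner}


# After autotuning has run, edit the artifact and only then serialize it.
artifact.edit(keep_autotuned_winner)
key, content = pickle.loads(artifact.serialize())
```

Deferring serialization is what lets the later autotuning results replace the initial content, so only the necessary kernels end up saved.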
Howard Huang	8d00833fdb	[PP] Fix eval step under no_grad() (#159293 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159293 Approved by: https://github.com/tianyu-l, https://github.com/wconstab	2025-07-29 14:42:33 +00:00
Justin Chu	de529ef002	[ONNX] onnx.md to simplify deprecated entities (#159312 ) Simplify documentation of deprecated entities and remove the auto-generated page for JitScalarType Pull Request resolved: https://github.com/pytorch/pytorch/pull/159312 Approved by: https://github.com/titaiwangms	2025-07-29 14:24:17 +00:00
PyTorch MergeBot	61aa2ae20f	Revert "[CPU] fix _weight_int8pack_mm with large output shape (#158341 )" This reverts commit e469414b59ceeaae2860e36708de8852b9892776. Reverted https://github.com/pytorch/pytorch/pull/158341 on behalf of https://github.com/albanD due to Breaks slowtest ([comment](https://github.com/pytorch/pytorch/pull/158341#issuecomment-3132641530))	2025-07-29 13:56:20 +00:00
Mark Harfouche	9d32aa9789	Help fix numpy detection in cross compiled layouts (#137084 ) We had trouble at conda-forge getting numpy to get detected on aarch64 due to our splayed layout and cross compilation needs. see: * https://github.com/conda-forge/pytorch-cpu-feedstock/pull/256 * https://github.com/conda-forge/pytorch-cpu-feedstock/issues/266 * https://github.com/conda-forge/pytorch-cpu-feedstock/pull/267 This is my attempt at making an "upstreamable patch" that tries to follow your structure. It could introduce a new environment variable `Python_NumPy_INCLUDE_DIR` if you want, but CMake doesn't use it as an environment variable, so I feel like that would be weird. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137084 Approved by: https://github.com/atalman	2025-07-29 12:08:56 +00:00
Francisco Massa	5cf77a0ea2	Fix redistribution costs for slice_scatter (#159223 ) We were previously assuming that the `input_strategy == src_strategy`, which is not true in all cases. This should fix this. On the side, I also realized that for `slice_scatter` some DTensorSpecs don't have TensorMeta, e.g., https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_tensor_ops.py#L524 It would be good to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159223 Approved by: https://github.com/ezyang, https://github.com/wconstab	2025-07-29 12:00:39 +00:00
Xuehai Pan	efcf87654e	[CI] update flake8 and mypy lint dependencies (#158720 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158720 Approved by: https://github.com/Skylion007	2025-07-29 08:05:56 +00:00
Laith Sakka	2523e58781	unbacked handling for view_copy (#159244 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159244 Approved by: https://github.com/bobrenjc93	2025-07-29 07:10:46 +00:00
Jane Xu	222fa451a2	Move some of vec into headeronly in preparation for Half.h (#158976 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158976 Approved by: https://github.com/albanD, https://github.com/desertfire	2025-07-29 05:43:53 +00:00
Jason Ansel	6de24135e5	Fix flaky test_inductor_multiple_specializations (#159264 ) Summary: This test was using do_bench, so it was flaky because performance is non-deterministic. Test Plan: buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:compile_subprocess -- --exact 'caffe2/test/inductor:compile_subprocess - test_inductor_multiple_specializations_cuda (caffe2.test.inductor.test_compile_subprocess.GPUTests)' --run-disabled Rollback Plan: Differential Revision: D79098692 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159264 Approved by: https://github.com/jingsh	2025-07-29 05:16:55 +00:00
henrylhtsang	27ae72036d	[cutlass] Prep for cutlass upgrade by ignoring Wunused-but-set-variable (#159276 ) Differential Revision: [D79106238](https://our.internmc.facebook.com/intern/diff/D79106238/) This is in prep for cutlass upgrade. More context: https://github.com/NVIDIA/cutlass/issues/2487 Tested in https://github.com/pytorch/pytorch/pull/159115 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159276 Approved by: https://github.com/adamomainz, https://github.com/njriasan, https://github.com/Skylion007	2025-07-29 04:40:24 +00:00
Sherlock Huang	e924df23a6	[NativeRT] Strengthen matcher check for StaticDispatch kernel (#159187 ) Summary: Strengthen the matcher for StaticDispatch kernels: all input and output tensors must be on CPU, and all Device-typed attributes must be CPU. Previously, we only checked that the output tensor was on CPU. This missed the case where we do a DeviceToHost aten._to_copy. Prepares for turning on the static dispatch kernel by default. Test Plan: I should add some test before land. Rollback Plan: Differential Revision: D78747600 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159187 Approved by: https://github.com/dolpm	2025-07-29 04:03:49 +00:00
fduwjj	67e68e0785	[c10d] Cleanup split_group logic using the newly built splitGroup (#158488 ) with https://github.com/pytorch/pytorch/pull/157716 merged we want to further clean up the code on the python side for `split_group` API. We do need to keep some old global book keeping for bc. The rest of logic is now all in cpp. Regarding the change brought in https://github.com/pytorch/pytorch/pull/152175, we did clean up in https://github.com/pytorch/pytorch/pull/158790 (including internal changes) so that we can safely remove it. Differential Revision: [D78777152](https://our.internmc.facebook.com/intern/diff/D78777152) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158488 Approved by: https://github.com/d4l3k	2025-07-29 03:27:11 +00:00
Xuehai Pan	775788f93b	[BE][PYFMT] migrate PYFMT for `test/[i-z]*/` to `ruff format` (#144556 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144556 Approved by: https://github.com/ezyang	2025-07-29 03:26:09 +00:00
Mu-Chu Lee	19ce1beb05	[AOTInductor] Add test for enabling CUDACachingAllocator for AOTInductor's Weight (#159279 ) Summary: Add test for enabling CUDACachingAllocator for AOTInductor's Weight. Implementation TBD Test Plan: N/A, commit is adding a test. Rollback Plan: Differential Revision: D79107507 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159279 Approved by: https://github.com/desertfire, https://github.com/jingsh	2025-07-29 02:52:10 +00:00
Guilherme Leobas	a91ddea61f	Add CPython tests for `collections` module (#158950 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158950 Approved by: https://github.com/zou3519	2025-07-29 02:24:27 +00:00
William Wen	ffccb90ff4	[dynamo, docs] add fullgraph=False docs (#159050 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159050 Approved by: https://github.com/svekars, https://github.com/anijain2305 ghstack dependencies: #157985, #158055, #158531	2025-07-29 01:53:47 +00:00
William Wen	f916f34739	[dynamo, docs] non-strict programming model docs (#158531 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158531 Approved by: https://github.com/AlannaBurke, https://github.com/mlazos, https://github.com/anijain2305 ghstack dependencies: #157985, #158055 Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-07-29 01:53:47 +00:00
William Wen	c32994ce4b	[docs, dynamo] add fullgraph=True, common graph breaks docs (#158055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158055 Approved by: https://github.com/AlannaBurke, https://github.com/anijain2305 ghstack dependencies: #157985 Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-07-29 01:53:41 +00:00
William Wen	433e43cbec	[dynamo, docs] programming model dynamo core concepts (#157985 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157985 Approved by: https://github.com/svekars, https://github.com/anijain2305	2025-07-29 01:53:34 +00:00
Xia, Weiwen	e469414b59	[CPU] fix _weight_int8pack_mm with large output shape (#158341 ) Summary `_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). This is because the kernel computes the output buffer address as ```c++ auto* C_ptr = C_data + mb_start * N + nb_start; ``` where both `mb_start` and `N` are `int`, and when they are large their product may overflow. The solution is simple: declare these variables as `int64_t` so that the product won't overflow. Test plan ``` pytest -sv test/test_linalg.py -k test__int8_mm_large_shape ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341 Approved by: https://github.com/mingfeima, https://github.com/drisspg	2025-07-29 01:14:50 +00:00
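The overflow described in #158341 can be illustrated with a small Python sketch that emulates fixed-width integer arithmetic via `ctypes`. The values of `mb_start` and `N` below are hypothetical, chosen only so that their product exceeds the 32-bit signed range; the helper names are illustrative and this is not the kernel's code.

```python
import ctypes

# Hypothetical values (assumptions, not taken from the PR):
# their product is larger than 2**31 - 1.
mb_start = 70_000
N = 40_000


def offset_int32(mb_start, n, nb_start=0):
    """Emulate `mb_start * N + nb_start` stored in a 32-bit signed int.

    ctypes truncates without overflow checking, which mimics the typical
    wrap-around seen in practice (signed overflow is undefined in C++).
    """
    return ctypes.c_int32(mb_start * n + nb_start).value


def offset_int64(mb_start, n, nb_start=0):
    """Emulate the same expression with 64-bit signed operands."""
    return ctypes.c_int64(mb_start * n + nb_start).value


wrapped = offset_int32(mb_start, N)
exact = offset_int64(mb_start, N)
print("32-bit offset:", wrapped)
print("64-bit offset:", exact)
```

With 32-bit operands the offset wraps to a negative number, so the derived pointer lands outside the output buffer; with 64-bit operands the offset is the intended one.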
rzou	657e5e9aa6	All custom operators go through Inductor's graph.call_function (#159174 ) Fixes #158892 All custom operators should go through the graph.call_function path. The other fallback path is for aten/prim operations that don't have support for things (like torch.float8_e8m0fn). Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/159174 Approved by: https://github.com/eellison	2025-07-29 00:31:57 +00:00
Nikita Shulga	f02b783aae	[1/N] Remove MacOS-13 MPS testing (#159278 ) Starts addressing https://github.com/pytorch/pytorch/issues/159275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159278 Approved by: https://github.com/dcci ghstack dependencies: #159277	2025-07-28 23:52:47 +00:00
Xu Han	8ad96a563c	[inductor] normalize path of the code. (#159255 ) Error stack: <img width="1361" height="345" alt="image" src="https://github.com/user-attachments/assets/50fb2baa-34fd-4a48-a3e7-76e3185391d4" /> After fix: <img width="1103" height="398" alt="image" src="https://github.com/user-attachments/assets/ece5a9ba-a085-46fe-b061-0c2ebda3a2df" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159255 Approved by: https://github.com/desertfire	2025-07-28 23:42:11 +00:00
PyTorch MergeBot	59e261bbd8	Revert "[CI] update flake8 and mypy lint dependencies (#158720 )" This reverts commit f5130bf339f12ccf5c6296130c47685bdc4858e4. Reverted https://github.com/pytorch/pytorch/pull/158720 on behalf of https://github.com/yangw-dev due to this PR failed internally when building torchgen with error: fail: Unknown PyPI project: pyyaml. It seems this is caused by changing PyYAML into pyyaml, please fix it ([comment](https://github.com/pytorch/pytorch/pull/158720#issuecomment-3129995414))	2025-07-28 22:02:10 +00:00
Catherine Lee	08ea8fccaf	[ez][docker] Remove some unused vars and scripts (#158680 ) `CUDNN_VERSION` isn't used in any Dockerfiles, it's picked automatically based on the cuda version in `install_cuda.sh` `install_cudnn.sh` isn't used anywhere, cudnn installation happens in `install_cuda.sh` I didn't find any mentions of `GRADLE_VERSION` or `TENSORRT_VERSION` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158680 Approved by: https://github.com/janeyx99, https://github.com/atalman, https://github.com/malfet	2025-07-28 21:44:47 +00:00
atalman	41754539be	Add 3.14 triton wheel build (#159261 ) Related to https://github.com/pytorch/pytorch/issues/156856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159261 Approved by: https://github.com/malfet, https://github.com/albanD	2025-07-28 20:34:16 +00:00
Nikita Shulga	716d52779f	[BE] Delete non-existing labels (#159277 ) As no such runners have been online for the last 2+ months Pull Request resolved: https://github.com/pytorch/pytorch/pull/159277 Approved by: https://github.com/clee2000	2025-07-28 20:28:57 +00:00
Michael Lazos	3bf41f26c8	[cutlass] rename EVT args within kernels for code caching (#159243 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159243 Approved by: https://github.com/henrylhtsang	2025-07-28 19:01:40 +00:00
Eddie Yan	19aa8eb4f5	[TF32][Flex Attention] Turn off TF32 for reference computation in `test_flex_decoding` (#158979 ) Seems to avoid threshold (fudge factor) twiddling games as this causes the checks to go down the "very small ref error" path instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158979 Approved by: https://github.com/drisspg, https://github.com/BoyuanFeng, https://github.com/nWEIdia	2025-07-28 18:38:23 +00:00
Animesh Jain	8c0c5c58c7	[benchmarks] Set model name early to keep warmup and main model same (#159231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159231 Approved by: https://github.com/williamwen42 ghstack dependencies: #159209	2025-07-28 18:18:16 +00:00
Xiaochang Wu	2d1e92307d	Partitioner: Fix to align partition node order with original graph (#157892 ) Fixes #157891 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157892 Approved by: https://github.com/ezyang	2025-07-28 17:36:29 +00:00
Lucca Bertoncini	399c89e15c	fix torch/distributed contributing doc (#158934 ) both pointers are pointing to a page of empty github issues. I'm moving this to point to all issues tagged with `pt_distributed_rampup` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158934 Approved by: https://github.com/d4l3k	2025-07-28 17:01:05 +00:00
PyTorch MergeBot	14d67eec05	Revert "[dynamo][fsdp] Consistent behavior of int attributes (#157262 )" This reverts commit 9b4d938f04c95cebe0fbd96974f64c935567e039. Reverted https://github.com/pytorch/pytorch/pull/157262 on behalf of https://github.com/ZainRizvi due to This was reverted internally. Somehow this PR didn't get reverted alongside it. See D78772867. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/157262#issuecomment-3128148475))	2025-07-28 16:58:27 +00:00
Benson Ma	9ad7dd54f9	[fbgemm_gpu] Upgrade KernelLauncher kernelLaunchCheck to print help string (#158896 ) Summary: - Upgrade KernelLauncher kernelLaunchCheck to print help string, following D78440016 Test Plan: ``` buck test 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher ``` Rollback Plan: Differential Revision: D78572009 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158896 Approved by: https://github.com/atalman	2025-07-28 16:11:13 +00:00
Deepak Seshadri	387db86ef1	Name Inductor's Subproc pool threads. (#158815 ) Differential Revision: D78710371 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158815 Approved by: https://github.com/d4l3k	2025-07-28 16:08:08 +00:00
dolpm	e5a1d839c5	[nativert] ensure planner once flag is class-local, not static. (#159116 ) Summary: att - otherwise only one global planner will be made even though we need it to be per-model if models are colocated. Differential Revision: D78939141 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159116 Approved by: https://github.com/SherlockNoMad	2025-07-28 16:06:21 +00:00
zhxchen17	c06164a9c5	[nativert][ez] Remove unused dist collectives ops. (#159220 ) Removing dependency to c10d/ in ExecutionFrame.h. We don't need c10d::Work in the frame. Differential Revision: [D79041618](https://our.internmc.facebook.com/intern/diff/D79041618/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159220 Approved by: https://github.com/SherlockNoMad, https://github.com/dolpm	2025-07-28 16:03:14 +00:00
Karim Abou Zeid	c7586d4ed3	typo (#156560 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/156560 Approved by: https://github.com/albanD, https://github.com/Skylion007	2025-07-28 15:40:06 +00:00
thenumberouscode	8e07c9870d	[dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#157566 ) An inner function inside a torch.compile.disable function always triggers recompilation, because a user's inner function decorated with torch._dynamo.disable is used as an argument in the resume_in_xx function. In the current implementation, it is always a new object, so the ID_MATCH guard always fails and triggers recompilation. Fixes https://github.com/pytorch/pytorch/issues/157399 @xmfan Pull Request resolved: https://github.com/pytorch/pytorch/pull/157566 Approved by: https://github.com/mlazos, https://github.com/anijain2305	2025-07-28 12:44:22 +00:00
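The recompilation problem in #157566 comes down to object identity: a guard that checks identity fails whenever a fresh object is created on each call. The pure-Python sketch below shows that idea only. `identity_guard`, `make_inner_uncached`, and `make_inner_cached` are hypothetical names, and the guard is a simplified stand-in for Dynamo's ID_MATCH guard, not its implementation.

```python
import functools


def make_inner_uncached():
    """Return a new function object on every call."""
    def inner(x):
        return x + 1
    return inner


@functools.lru_cache(maxsize=None)
def make_inner_cached():
    """Return the same function object on every call (cached)."""
    def inner(x):
        return x + 1
    return inner


def identity_guard(expected):
    """Simplified stand-in for an ID_MATCH-style guard: passes only
    when it sees the exact object that was recorded."""
    return lambda obj: obj is expected


# Without caching: the second object is new, so the guard fails.
guard_uncached = identity_guard(make_inner_uncached())
uncached_passes = guard_uncached(make_inner_uncached())

# With caching: the same object comes back, so the guard passes.
guard_cached = identity_guard(make_inner_cached())
cached_passes = guard_cached(make_inner_cached())

print("uncached guard passes:", uncached_passes)
print("cached guard passes:", cached_passes)
```

In this toy model, a failing guard corresponds to a recompilation, which is why caching the wrapped object avoids it.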
PyTorch UpdateBot	a76147c9e0	[xla hash update] update the pinned xla hash (#158223 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158223 Approved by: https://github.com/pytorchbot	2025-07-28 11:19:05 +00:00
pzzp	f3913ea641	[CUDA] fix nansum in non-JIT build (#158633 ) This change fixes a crash in the following code:
```
import torch

a = torch.tensor([[1, 2]], dtype=torch.complex32).to('cuda')
b = torch.nansum(a, dim=0)
print(b)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158633 Approved by: https://github.com/ngimel	2025-07-28 08:11:32 +00:00
Sherlock Huang	1abff80fae	Reland D78841818 (#159216 ) Summary: Relanding D78841818 with fixes Test Plan: Tested all failing tests buck build --config fbcode.use_link_groups=true --flagfile fbcode//mode/dev-nosan fbcode//sigmoid/core/executor/memory/test:layout_planner_tests buck test 'fbcode//mode/opt' fbcode//sigmoid/inference/test:test_passes Rollback Plan: Reviewed By: hl475 Differential Revision: D79038615 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159216 Approved by: https://github.com/dolpm	2025-07-28 07:39:35 +00:00
zeshengzong	799303f655	Fix atleast_{1,2,3}d() with no arguments description (#156042 ) Fixes #130667 ## Test Result ### Before ![image](https://github.com/user-attachments/assets/7e3a6764-872a-4573-8bec-e7219f920a15) ![image](https://github.com/user-attachments/assets/194be00c-9a29-44cf-b6bc-4d261a12d04e) ![image](https://github.com/user-attachments/assets/21cd6a4f-0793-44e3-9073-7b8b801f997c) ### After ![image](https://github.com/user-attachments/assets/fdbaa2ff-f13c-4fa9-bf52-0810faa698bd) ![image](https://github.com/user-attachments/assets/0374b474-4c6b-4b7d-abea-70e3df0c0a06) ![image](https://github.com/user-attachments/assets/9f9dc188-60e2-4c0f-9e23-36a39310008c) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156042 Approved by: https://github.com/zou3519	2025-07-28 06:25:23 +00:00
PyTorch MergeBot	d26ab281d2	Revert "Setup TorchBench in Docker (#158613 )" This reverts commit d72ebefe3fa7d3ee0e9c9b399f5c07611e790664. Reverted https://github.com/pytorch/pytorch/pull/158613 on behalf of https://github.com/XuehaiPan due to checkout_install_torchbench function is removed but still referenced in trunk ([comment](https://github.com/pytorch/pytorch/pull/158613#issuecomment-3125695250))	2025-07-28 06:19:00 +00:00
PyTorch MergeBot	1cffb217ef	Revert "[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446 )" This reverts commit e88f804a2eecf967dbbf95c5643248352626dafd. Reverted https://github.com/pytorch/pytorch/pull/155446 on behalf of https://github.com/XuehaiPan due to Breaks Windows wheels ([comment](https://github.com/pytorch/pytorch/pull/155446#issuecomment-3125566269))	2025-07-28 05:29:37 +00:00
PyTorch UpdateBot	c8342b7231	[vllm hash update] update the pinned vllm hash (#159235 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159235 Approved by: https://github.com/pytorchbot	2025-07-28 04:16:31 +00:00
Animesh Jain	f63673626d	[dynamo][guards] Skip guards on constant func.__defaults__ elements (#159209 ) Func.__defaults__ is a tuple. Therefore, we can skip guards on immutable elements. Mutable elements are still guarded. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159209 Approved by: https://github.com/jansel	2025-07-27 22:46:17 +00:00
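The reasoning in #159209 can be illustrated in plain Python: `func.__defaults__` is a tuple, so its elements cannot be swapped in place, and only elements that are themselves mutable can change underneath it. The sketch below is illustrative only. `IMMUTABLE_TYPES` and `needs_guard` are hypothetical, and the set of types treated as immutable is an assumption rather than Dynamo's actual rule.

```python
def f(x, scale=2.0, name="relu", history=[]):
    """Example function with two immutable defaults and one mutable default."""
    history.append(x)
    return x * scale


defaults = f.__defaults__  # (2.0, "relu", [])

# Assumption for illustration: these types are considered immutable constants.
IMMUTABLE_TYPES = (int, float, bool, str, bytes, type(None))


def needs_guard(value):
    """Hypothetical predicate: only mutable defaults still need a guard."""
    return not isinstance(value, IMMUTABLE_TYPES)


to_guard = [i for i, value in enumerate(defaults) if needs_guard(value)]
print("defaults:", defaults)
print("indices that still need guards:", to_guard)
```

Here only the list default at index 2 would remain guarded, since it can be mutated without replacing the tuple.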
Sampath Victor	37638c303e	Addressing some linter errors (#158670 ) Summary: Addressing the linter errors reported in the changed files. Test Plan: ``` buck test mode/opt deeplearning/fbgemm:QuantUtilsTest ``` https://www.internalfb.com/intern/testinfra/testrun/11821949118528688 ``` buck test mode/opt caffe2/torch/fb/model_transform/splitting/tests:split_dispatcher_test ``` https://www.internalfb.com/intern/testinfra/testrun/7881299627525465 Rollback Plan: Differential Revision: D78352311 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158670 Approved by: https://github.com/excelle08, https://github.com/cyyever, https://github.com/digantdesai	2025-07-27 21:55:50 +00:00
Max Podkorytov	ee2edf3d37	[ROCm][CK][Inductor] enable gfx950 for max autotune with CK (#159195 ) + update inductor config for new gfx arch + fixes in codegen for conv2d and ck-tile matmul + use appropriate fp8 dtypes + test cleanup Pull Request resolved: https://github.com/pytorch/pytorch/pull/159195 Approved by: https://github.com/chenyang78	2025-07-27 20:47:13 +00:00
Chinmay Shrivastava	51eb41a57e	Enable dynamic shapes for foreach operations by default (#158985 ) ## Summary This PR changes the default value of `combo_kernel_foreach_dynamic_shapes` from `False` to `True` in `torch/_inductor/config.py`. ## Context The `combo_kernel_foreach_dynamic_shapes` configuration was introduced in PR #134477 (August 2024) to support dynamic shapes for foreach and combo kernels. It was initially disabled by default as a conservative approach to avoid disrupting production workflows. ## Why This Change? After several months of the feature being available and stable, it's time to enable it by default. This improves the user experience for developers using `torch.compile(dynamic=True)` with foreach operations. ### Current behavior: - Users must manually discover and enable `combo_kernel_foreach_dynamic_shapes` - Without this flag, foreach operations may fail with dynamic shapes - This creates friction and confusion ### With this change: - Foreach operations work seamlessly with dynamic compilation - No manual configuration needed - Better "it just works" experience ## Testing Extensive testing was performed with PyTorch 2.5.0+ and 2.7.1: - ✅ Various tensor sizes (8, 16, 32, 64, 128) - ✅ Multiple tensors in operations (tested up to 20) - ✅ Nested foreach operations - ✅ Mixed operations (foreach + standard operations) - ✅ Both CPU and CUDA devices - ✅ Symbolic shapes with dynamic compilation ## Impact Assessment - Performance: No impact - this only affects compilation behavior - Backward Compatibility: Fully maintained - users can still set to `False` - Risk: Minimal - feature has been stable since August 2024 ## References - Original implementation: PR #134477 by @qchip - This completes the feature rollout by making it available by default Pull Request resolved: https://github.com/pytorch/pytorch/pull/158985 Approved by: https://github.com/jansel, https://github.com/mlazos	2025-07-27 19:56:07 +00:00
Howard Huang	ede6186c86	[PP] Allow intermediate nodes in ZB to have multiple grads (#159084 ) Fixes a ZB regression (https://github.com/pytorch/torchtitan/actions/runs/16478292562/job/46585646792) Previously we only allowed an intermediate node to have 1 gradient. Recently a torchtitan ZB test started failing, and I tracked it back to the FusedRMSNorm grad_fn having two values `(grad, None)` (see https://github.com/pytorch/pytorch/pull/153666). This PR allows `stage_backward_weight` intermediate nodes to have multiple grads (it sums them together, or ignores a grad whose value is None). Here is an example where the backward would have two grad values (gI1, gI2):
```python
class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x, 2

    @staticmethod
    def backward(ctx, gI1, gI2):
        assert gI2 is None
        return gI1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159084 Approved by: https://github.com/tianyu-l	2025-07-27 19:16:51 +00:00
Nikita Shulga	6d071bd65d	Remove numpy dependency from onnx (#159177 ) One should not expect numpy to be there during onnx import Forward fix for : https://github.com/pytorch/pytorch/pull/157734 Added regression test to `test_without_numpy` function Test plan: Run `python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch; import torch.onnx"` with/without this fix Pull Request resolved: https://github.com/pytorch/pytorch/pull/159177 Approved by: https://github.com/atalman, https://github.com/justinchuby, https://github.com/titaiwangms, https://github.com/cyyever, https://github.com/Skylion007, https://github.com/andrewboldi	2025-07-27 13:23:03 +00:00
cyy	d742a2896c	Remove tensorexpr tests (#158928 ) The tests are not maintained. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928 Approved by: https://github.com/albanD, https://github.com/malfet	2025-07-27 07:13:27 +00:00
Xu Han	11d6559a58	[inductor] disable failed UTs of test_misc.py (#159210 ) Disable failed UTs. <img width="1195" height="118" alt="image" src="https://github.com/user-attachments/assets/da0933fb-3c4c-44c9-ba85-45971f03405f" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159210 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@jansel.net>	2025-07-27 05:41:44 +00:00
PyTorch UpdateBot	e7667e5702	[vllm hash update] update the pinned vllm hash (#159217 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159217 Approved by: https://github.com/pytorchbot	2025-07-27 04:16:35 +00:00
cyy	f6c89c1ef3	Detach tensor before clone in SGD optimiser and other code (#159204 ) Reverse the pattern of tensor clone followed by detach in SGD and other code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159204 Approved by: https://github.com/Skylion007	2025-07-27 03:31:12 +00:00
Huy Do	d72ebefe3f	Setup TorchBench in Docker (#158613 ) Signed-off-by: Huy Do <huydhn@gmail.com>	2025-07-26 12:56:03 -07:00
Anatoly Myachev	46b925681c	[inductor] Update `to(tl.int8).to(tl.uint8)` workaround from #94717 to handle entire range of `torch.uint8` (#158567 ) https://github.com/pytorch/pytorch/pull/94717/files#r2210265070 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158567 Approved by: https://github.com/ngimel, https://github.com/jansel	2025-07-26 19:11:37 +00:00
PyTorch MergeBot	fe0ff12dab	Revert "[Inductor] Support native Inductor as backend for MTIA (#158526 )" This reverts commit cd68559d0451185f8521912c23e77b83d76b87cf. Reverted https://github.com/pytorch/pytorch/pull/158526 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158526#issuecomment-3122186057))	2025-07-26 17:58:00 +00:00
Francisco Massa	7dafab6a93	Fix SDPA sharding when `return_debug_mask` is False (#159205 ) If `return_debug_mask` is False (which is the default value for SDPA), the attention tensor returned is an empty tensor (which has 0 dimensions). This means that the shardings that are passed for the batch and CP cases can yield invalid dimensions. This PR fixes it for `scaled_dot_product_flash_attention_strategy`. Note that `scaled_dot_product_cudnn_attention_strategy` doesn't have this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159205 Approved by: https://github.com/wconstab	2025-07-26 17:41:42 +00:00
Xuehai Pan	f5130bf339	[CI] update flake8 and mypy lint dependencies (#158720 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158720 Approved by: https://github.com/Skylion007	2025-07-26 17:12:29 +00:00
PyTorch MergeBot	f62772f365	Revert "Remove tensorexpr tests (#158928 )" This reverts commit 517eebc1dd4ae6430a95818b16c5f8b4b10fd1bc. Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks trunk test_jit_fuser_te.py::TestNNCOpInfoCPU::test_nnc_correctness_frac_cpu_bfloat16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/16534544469/job/46768022799) [HUD commit link](`517eebc1dd`) ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3122158944))	2025-07-26 17:01:54 +00:00
Xu Han	e2b2685f84	[inductor] enable compiled autograd on CPU windows - v2 (#159185 ) The first version: https://github.com/pytorch/pytorch/pull/158432 Compiled autograd on Windows was disabled in PR #144707 because CUDA on Windows cannot compile this code. However, the code can be compiled on CPU, and this PR enables it on CPU Windows. The first version changed the logic of the ifdef block and caused the torch audio build to fail: https://github.com/pytorch/audio/issues/3992 This is version two, which keeps the original logic. # Local test torch audio build pass: <img width="874" height="1043" alt="image" src="https://github.com/user-attachments/assets/9657be86-04f7-4c66-b8c6-802ec2a7c5c8" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159185 Approved by: https://github.com/xmfan	2025-07-26 16:21:28 +00:00
PyTorch MergeBot	3db8623dcb	Revert "[NativeRT] Apply Device placement once when loading the graph (#158996 )" This reverts commit 28ee8be5bfeebb2e44daace6551462b52557e451. Reverted https://github.com/pytorch/pytorch/pull/158996 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158996#issuecomment-3121540050))	2025-07-26 09:05:26 +00:00
anwang	cd68559d04	[Inductor] Support native Inductor as backend for MTIA (#158526 ) This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code (triton kernel + python wrapper code) similar to CUDA. And the triton kernels can be launched eagerly. The changes include: - Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc. - Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc. - MTIA specific codegen logic, for example, loading MTIA dynamic_library. - Other necessary changes to integrate with Inductor codegen, following other devices like CUDA, XPU. - Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend. - A change in Inductor runtime to avoid re-initializing MTIADriver. - BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag. - Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag. - Add a personal script (`scripts/anwang/run_native_inductor_script.py`) for testing purposes. Note: - This approach (option 3) aims to provide a pytorch native approach of Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA specific graph optimization, and is limited to eagerly launch overhead. - MTIA will support another approach (option 2) to provide best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, stream/event APIs, etc, especially as WrapperFxCodegen inherits PythonWrapperCodegen. 
Internal: References: - [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/) - [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb) - [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w) - [early prototyping diff](https://www.internalfb.com/diff/D75110196) - [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959) - [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678) Differential Revision: [D78458745](https://our.internmc.facebook.com/intern/diff/D78458745/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158526 Approved by: https://github.com/blaine-rister, https://github.com/jansel, https://github.com/eellison	2025-07-26 08:16:34 +00:00
PyTorch UpdateBot	62a49d929b	[vllm hash update] update the pinned vllm hash (#159198 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159198 Approved by: https://github.com/pytorchbot	2025-07-26 04:44:38 +00:00
Laith Sakka	c6b479bc09	remove guard_or_x from allowlist_for_publicAPI (#159181 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159181 Approved by: https://github.com/albanD	2025-07-26 01:22:17 +00:00
cyy	517eebc1dd	Remove tensorexpr tests (#158928 ) The tests are not maintained. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928 Approved by: https://github.com/albanD, https://github.com/malfet	2025-07-26 01:21:01 +00:00
zpcore	7f266020de	add softmax_backward_strategy missing field (#159167 ) Add input_specs in softmax_backward_strategy, as is needed by AutoParallel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159167 Approved by: https://github.com/XilunWu	2025-07-26 00:53:53 +00:00
Meet Patel	e06798191b	Split out C++ code from fused adagrad PR (#159008 ) The original fused Adagrad pull request was: PR#153038 This PR contains only the c++ code of that original PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159008 Approved by: https://github.com/janeyx99	2025-07-26 00:36:59 +00:00
eqy	c89fa88acb	[conv][cuDNN][64-bit indexing] reduce memory usage of depthwise conv 64-bit indexing test (#158981 ) Use half instead for reduced memory usage Pull Request resolved: https://github.com/pytorch/pytorch/pull/158981 Approved by: https://github.com/soulitzer, https://github.com/Skylion007	2025-07-25 23:58:45 +00:00
gaoyufeng	f5cf05c983	Throw invalid_argument instead of RuntimeError when parameters exceed… (#158267 ) Throw invalid_argument instead of RuntimeError when parameters exceed limits (for torch.int32 dtype) Fixes #157707 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158267 Approved by: https://github.com/albanD	2025-07-25 23:49:46 +00:00
NikhilAPatel	21a95bdf7c	[Inductor] [Triton] Enabling TMA for flex-attention for supported device types (#157822 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157822 Approved by: https://github.com/drisspg ghstack dependencies: #159123	2025-07-25 23:45:26 +00:00
Colin Peppler	fb029accb7	(is_non_overlapping_and_dense) gso to guard_or_false in when checking length 1 (#158894 ) Switch from `guard_size_oblivious` to `guard_or_false` if you encounter a DDE, this would then fallback to computing elementwise strides. `2dccff7dcf/torch/_prims/__init__.py (L1919-L1923)` We think it's safe because Laith tested whether this fallback would fail any tests. It did not. https://github.com/pytorch/pytorch/pull/158157 ## Data-dependent exceptions (DDE) ``` File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 2139, in _to_copy x_tensor = torch._prims.convert_element_type(x_tensor, dtype) ... File "/data/users/colinpeppler/pytorch/torch/_prims/__init__.py", line 1920, in _convert_element_type_meta if torch._prims_common.is_non_overlapping_and_dense(a): File "/data/users/colinpeppler/pytorch/torch/_prims_common/__init__.py", line 494, in is_non_overlapping_and_dense if guard_size_oblivious(length == 1): GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0 - 4, 1) (unhinted: Eq(u0 - 4, 1)). (Size-like symbols: u0) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158894 Approved by: https://github.com/pianpwk, https://github.com/laithsakka	2025-07-25 23:43:38 +00:00
drisspg	26f4dd5160	Scaled MM Fix NVfp4 (#159170 ) Fixes mm on B200: Before: ```Shell def _addmm_nvfp4_dispatch( a: NVFP4Tensor, b: NVFP4Tensor, aten_op, bias: Optional[torch.Tensor] = None ) -> torch.Tensor: """ Core implementation shared between nvfp4_mm, nvfp4_addmm, and nvfp4_linear. The only difference is whether bias is None or not. """ assert a._data.is_contiguous() assert b._data.t().is_contiguous() assert a._block_size == 16, f"NVFP4 requires block_size=16, got {a._block_size}" assert b._block_size == 16, f"NVFP4 requires block_size=16, got {b._block_size}" M, K = a.shape[0], a.shape[1] N = b.shape[1] # Swizzle Dizzle if a._is_swizzled_scales: a_scale_blocked = a._scale_e4m3 # Already swizzled else: a_scale = a._scale_e4m3.view(M, K // a._block_size) a_scale_blocked = to_blocked(a_scale) if b._is_swizzled_scales: b_scale_blocked = b._scale_e4m3 # Already swizzled else: b_scale = b._scale_e4m3.view(N, K // b._block_size) b_scale_blocked = to_blocked(b_scale) # Merge double quant scales into 1 scale for Scale_In^D if a._per_tensor_scale is not None: assert b._per_tensor_scale is not None scale_result = a._per_tensor_scale * b._per_tensor_scale else: assert b._per_tensor_scale is None and a._per_tensor_scale is None scale_result = None # THIS IS A WORKAROUND: # RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling # When we have per-tensor scaling, we need to apply it before bias # since bias is not quantized should_add_bias_separately = (scale_result is not None) and (bias is not None) # should_add_bias_separately = bias is not None > result = torch._scaled_mm( a._data.view(torch.float4_e2m1fn_x2), b._data.view(torch.float4_e2m1fn_x2), a_scale_blocked.view(torch.float8_e4m3fn), b_scale_blocked.view(torch.float8_e4m3fn), bias=None if should_add_bias_separately else bias, out_dtype=a._orig_dtype, # scale_result=scale_result, # Not supported yet ) E RuntimeError: Invalid scaling configuration. 
E - For TensorWise scaling, a and b should be float8, scales should be float and singletons. E - For RowWise scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be contiguous. E - For BlockWise 1x128 scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be outer-dim-major. E - For BlockWise 128x128 scaling, a and b should be float8, scales should be float, scale_a should be (2, 1) and scale_b should be (1, 2), and both should be near-inner-dim-major (with 16-byte aligned strides). E - For Blockwise 1x32 scaling, a and b should be float8, scales should be float8_e8m0fnu, scale_a should have 1024 elements and scale_b should have 1024 elements, and both should be contiguous. E - For Blockwise 1x16 scaling, a and b should be float4 (packed 2x), scales should be float8_e4m3fn, scale_a should have 3072 elements and scale_b should have 3072 elements, and both should be contiguous. E Got a.dtype()=Float4_e2m1fn_x2, scale_a.dtype()=Float8_e4m3fn, scale_a.size()=[256, 12], scale_a.stride()=[12, 1], b.dtype()=Float4_e2m1fn_x2, scale_b.dtype()=Float8_e4m3fn, scale_b.size()=[256, 12] and scale_b.stride()=[12, 1] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159170 Approved by: https://github.com/ngimel	2025-07-25 23:34:03 +00:00
Menglu Yu	b9e3eb64a7	[Optimus] Support decompose mm with dynamic shapes (#158821 ) Summary: The current implementation will not do the decompose for GEMM with dynamic shapes, thus we add one more option for users to enable this feature Test Plan: ### how to enable Step 1: Set decompose_mem_bound_mm = false Step 2: Add the decompose_mm_pass pattern to the post_grad_fusion_options json config example: "post_grad_fusion_options": { "decompose_mm_pass": { "min_first_dimension_decomposition": 10240, -> default value "max_other_dimention_decomposition": 32, -> default value "skip_dynamic_shape_dim_check": true, -> default is false } }, yaml config example ``` post_grad_fusion_options: decompose_mm_pass: skip_dynamic_shape_dim_check: true ``` Note that all these hyper-parameters can be set by the users, if nothing gives, a default value will be used ### unit test ``` buck2 test @mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm -- test_dynamic_shape_decompose_addmm ``` Buck UI: https://www.internalfb.com/buck2/a98eb4b3-da1d-4450-9e49-472ba98b2267 Test UI: https://www.internalfb.com/intern/testinfra/testrun/6473924745731095 Network: Up: 86KiB Down: 1.3MiB (reSessionID-96cf35cc-5189-4372-8f25-1fc6a52a3963) Executing actions. Remaining 0/3 1.4s exec time total Command: test. Finished 2 local Time elapsed: 2:00.6s Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. 
Build failure 0 ### E2E before: aps-DPA_new_v0_amd_20250716-e7927755df after: aps-DPA_new_v0_amd_20250716_optimus-f2175fc9fb tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-DPA_new_v0_amd_20250716_optimus-f2175fc9fb/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 ### qps and NE {F1980635506} {F1980635505} - 12.5% qps improvement with NE neutral ### trace analysis baseline:https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Faps-DPA_new_v0_amd_20250716-e7927755df%2F0%2Frank-1.Jul_22_22_28_01.4592.pt.trace.json.gz&bucket=aps_traces {F1980633952} proposal:https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Faps-DPA_new_v0_amd_20250716_optimus-f2175fc9fb%2F0%2Frank-1.Jul_24_14_37_59.4576.pt.trace.json.gz&bucket=aps_traces {F1980633966} ``` unsqueeze_default: "bf16[32s54, 8, 1][8, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(constant_pad_nd_default_2, 2) unsqueeze_default_1: "bf16[1, 8, 8][64, 8, 1]cuda:0" = torch.ops.aten.unsqueeze.default(constant_pad_nd_default_3, 0); constant_pad_nd_default_3 = None mul_tensor: "bf16[32s54, 8, 8][64, 8, 1]cuda:0" = torch.ops.aten.mul.Tensor(unsqueeze_default, unsqueeze_default_1); unsqueeze_default = unsqueeze_default_1 = None ``` ### what have been decomposed P1880443593 Rollback Plan: Differential Revision: D78716034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158821 Approved by: https://github.com/Yuzhen11	2025-07-25 23:19:53 +00:00
PyDevC	69cc99525c	[nn]: updated type alias for padding_mode in module/conv.py (#158843 ) Fixes #152280 Changed the type of `padding_mode` from `str` to `Literal["zeros", "reflect", "replicate", "circular"]` cc @Skylion007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158843 Approved by: https://github.com/mikaylagawarecki	2025-07-25 23:05:02 +00:00
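To illustrate what narrowing `padding_mode` from `str` to a `Literal` buys, here is a minimal sketch. The alias name `PaddingMode` and the helper `check_padding_mode` are hypothetical and are not taken from the PR; only the four literal values come from the commit message.

```python
from typing import Literal, get_args

# Hypothetical alias name; the commit message only lists the allowed values.
PaddingMode = Literal["zeros", "reflect", "replicate", "circular"]


def check_padding_mode(padding_mode: PaddingMode) -> PaddingMode:
    """Runtime check that mirrors the static annotation (illustrative only)."""
    valid = get_args(PaddingMode)  # the tuple of allowed string values
    if padding_mode not in valid:
        raise ValueError(
            f"padding_mode must be one of {valid}, got {padding_mode!r}"
        )
    return padding_mode
```

With the `Literal` annotation, a static type checker can flag a call such as `check_padding_mode("mirror")` before the code runs, whereas a plain `str` annotation would accept it.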
Edward Z. Yang	72af19dadf	Add aot_autograd.fx_utils (#159005 ) See docblock for details. The API here has been validated by use in autoparallel but I'm always open to suggestions for tweaks. One particular choice I made is to make most of the functions return dicts by default; this isn't strictly necessary for inputs but it is very convenient for outputs as the output desc lives on the output node, not the argument that feeds into the node. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159005 Approved by: https://github.com/wconstab	2025-07-25 22:52:33 +00:00
IvanKobzarev	8aebf01287	[bucketing] Rewrite all_gather, reduce_scatter passes via tracing merge_fn (#158663 ) Rewriting bucketing of all_gather and reduce_scatter with defining of "merge graph" via torch function. `all_gather_merge_fn_to_trace` `reduce_scatter_merge_fn_to_trace` (Instead of creating nodes and doing FakeTensor prop manually) This allows to experiment with merge function. Used foreach_copy_ in merging function for all_gather - added lowering for inductor for `foreach_copy_` Adding topological sort after bucketing passes (comment in post_grad.py): ``` # Fx collectives bucketing passes require topological sort for the cases: # when bucketed collectives have users before the last collective in the bucket # AND when inputs of bucketed collective have ancestors after the first collective in the bucket. # # In this case we can not manually pick the place for bucketed collective insertion. # But we are guaranteed by the bucketing (independent collectives in the bucket), # that it is possible to reorder nodes to satisfy all ordering requirements. # # --- before bucketing --- # in0 = ... # wait_ag0 = ag(in0) # user0(wait_ag0) # ... # pre_in1 = ... # in1 = transform(pre_in1) # wait_ag1 = ag(in1) # user1(wait_ag1) # # --- after bucketing --- # # in0 = ... # user(wait_ag0) <--- wait_ag0 is defined only after bucketed collective. # # pre_in1 = ... 
# in1 = transform(pre_in1) # ag_bucket(in0+in1) # wait_bucket # wait_ag0 = wait_bucket[0] # wait_ag1 = wait_bucket[1] # user1(wait_ag1) ```` Correctness of the passes verified by loss curve for llama3 8b for simple_fsdp and for autoparallel: <img width="1364" height="495" alt="Screenshot 2025-07-22 at 14 27 28" src="https://github.com/user-attachments/assets/67b2cabb-3206-450b-b529-e23c24292fc6" /> <img width="1355" height="509" alt="Screenshot 2025-07-22 at 14 27 56" src="https://github.com/user-attachments/assets/4d0e6b25-2eb1-47b2-8d68-dcec185239c4" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158663 Approved by: https://github.com/wconstab	2025-07-25 22:49:51 +00:00
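The reordering described in the comment above can be sketched with a stable topological sort (Kahn's algorithm with a priority queue keyed on the original position). This is an illustration on plain node names, not the pass in `post_grad.py`; the function name and the dependency dictionary are assumptions made for the example.

```python
import heapq


def stable_topological_sort(order, deps):
    """Reorder `order` so every node comes after the nodes it uses.

    order: node names in their current (possibly invalid) order.
    deps:  maps a node to the list of nodes it consumes.
    Ties are broken by original position, so nodes move as little as possible.
    """
    position = {name: i for i, name in enumerate(order)}
    indegree = {name: 0 for name in order}
    users = {name: [] for name in order}
    for node, inputs in deps.items():
        for inp in inputs:
            indegree[node] += 1
            users[inp].append(node)

    # Min-heap of original positions of nodes whose inputs are all placed.
    ready = [position[name] for name in order if indegree[name] == 0]
    heapq.heapify(ready)
    result = []
    while ready:
        node = order[heapq.heappop(ready)]
        result.append(node)
        for user in users[node]:
            indegree[user] -= 1
            if indegree[user] == 0:
                heapq.heappush(ready, position[user])
    if len(result) != len(order):
        raise ValueError("dependency cycle: no valid ordering exists")
    return result


# The "after bucketing" situation: user0 appears before wait_ag0 is defined.
after_bucketing = [
    "in0", "user0", "pre_in1", "in1", "ag_bucket",
    "wait_bucket", "wait_ag0", "wait_ag1", "user1",
]
dependencies = {
    "user0": ["wait_ag0"],
    "in1": ["pre_in1"],
    "ag_bucket": ["in0", "in1"],
    "wait_bucket": ["ag_bucket"],
    "wait_ag0": ["wait_bucket"],
    "wait_ag1": ["wait_bucket"],
    "user1": ["wait_ag1"],
}
sorted_nodes = stable_topological_sort(after_bucketing, dependencies)
```

In this sketch `user0` is moved to just after `wait_ag0`, and every other node keeps its relative position.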
Yu Guo	bc5dbbbb78	support scalar tensor for functional all_gather (#149913 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149913 Approved by: https://github.com/H-Huang ghstack dependencies: #149912	2025-07-25 22:38:08 +00:00
Mikayla Gawarecki	36cf8f1ed8	[BE] Use .md instead of .rst for nn.aliases doc (#158666 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158666 Approved by: https://github.com/janeyx99 ghstack dependencies: #158491, #158654	2025-07-25 22:03:55 +00:00
Mikayla Gawarecki	1e79872f2e	[BE] More torch.nn docs coverage test (except for torch.nn.parallel) (#158654 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158654 Approved by: https://github.com/janeyx99 ghstack dependencies: #158491	2025-07-25 22:03:55 +00:00
Mikayla Gawarecki	9e8f27cc79	[BE] Make torch.nn.modules.* satisfy the docs coverage test (#158491 ) Options to address the "undocumented python objects": 1. Reference the functions in the .rst via the torch.nn.modules namespace. Note that this changes the generated doc filenames / locations for most of these functions! 2. [Not an option] Monkeypatch `__module__` for these objects (broke several tests in CI due to `inspect.findsource` failing after this change) 3. Update the .rst files to also document the torch.nn.modules forms of these functions, duplicating docs. #### [this is the docs page added](https://docs-preview.pytorch.org/pytorch/pytorch/158491/nn.aliases.html) This PR takes option 3 by adding an rst page nn.aliases that documents the aliases in nested namespaces, removing all the torch.nn.modules.* entries from the coverage skiplist except - NLLLoss2d (deprecated) - Container (deprecated) - CrossMapLRN2d (what is this?) - NonDynamicallyQuantizableLinear This mostly required adding docstrings to `forward`, `extra_repr` and `reset_parameters`. Since forward arguments are already part of the module docstrings I just added a very basic docstring. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158491 Approved by: https://github.com/janeyx99	2025-07-25 22:03:55 +00:00
Mikayla Gawarecki	e65ab9a868	Enable generating generic c_shim that doesn't bypass dispatcher (#158974 ) Adds `c_shim_aten.{h/cpp}` and use this for `fill_` This is the generated `c_shim_aten.cpp` for reference ```cpp // WARNING: THIS FILE IS AUTOGENERATED BY torchgen. DO NOT MODIFY BY HAND. // See `7e86a7c015/torchgen/gen.py (L2424-L2436)` for details // This file corresponds to the aten_shimified_ops list in torchgen/aoti/fallback_ops.py #include <torch/csrc/inductor/aoti_torch/generated/c_shim_aten.h> #include <torch/csrc/inductor/aoti_torch/utils.h> #ifndef AT_PER_OPERATOR_HEADERS #include <ATen/Functions.h> #include <ATen/CompositeExplicitAutogradFunctions.h> #include <ATen/CompositeExplicitAutogradNonFunctionalFunctions.h> #include <ATen/CompositeImplicitAutogradFunctions.h> #else #include <ATen/ops/fill.h> #endif // AT_PER_OPERATOR_HEADERS using namespace torch::aot_inductor; AOTITorchError aoti_torch_aten_fill__Scalar(AtenTensorHandle self, double value) { AOTI_TORCH_CONVERT_EXCEPTION_TO_ERROR_CODE({ at::fill_( *tensor_handle_to_tensor_pointer(self), value ); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158974 Approved by: https://github.com/albanD, https://github.com/janeyx99	2025-07-25 21:59:14 +00:00
Pian Pawakapan	bfe6765d6b	[export] assert fix in serdes (#159060 ) Summary: catch asserts on True Test Plan: T232064560 Rollback Plan: Differential Revision: D78907485 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159060 Approved by: https://github.com/yiming0416	2025-07-25 21:46:20 +00:00
Denghui Dong	e88f804a2e	[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446 ) Hi team, Please help review this patch. This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable. I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but a bug in Python 3.12.0-3.12.4 that loses C call events. That bug was fixed by `257c413cd1` in 3.12.5. So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, and this patch reverts its change. There are ways to fix the problem correctly: we could add a new monitoring callback to compensate for the call events of methods backed by C functions, or we could override the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain. ~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest that users upgrade.~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446 Approved by: https://github.com/sraikund16	2025-07-25 21:44:57 +00:00
raghavhrishi	7ef3c3357d	NUMA binding integration with elastic agent and torchrun (#149334 ) Implements #148689 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149334 Approved by: https://github.com/d4l3k Co-authored-by: Paul de Supinski <pdesupinski@gmail.com>	2025-07-25 21:19:49 +00:00
Thomas Bohnstingl	24b1f10ca1	[HOP, map] Rework of map autograd to the new interface (#153343 ) This PR reworks the current autograd implementation of map to the new interface. @pytorchbot label "topic: not user facing" Pull Request resolved: https://github.com/pytorch/pytorch/pull/153343 Approved by: https://github.com/ydwu4	2025-07-25 21:17:06 +00:00
Yidi Wu	0006dd5c43	[test][torchbind] don't allow set torchbind attr at runtime (#158608 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158608 Approved by: https://github.com/zou3519 ghstack dependencies: #158583, #158606, #158607	2025-07-25 20:55:41 +00:00
Yidi Wu	0f31e9a656	[torchbind] fix fakifying a static tensor accidentally returning a dynamic one (#158607 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158607 Approved by: https://github.com/zou3519 ghstack dependencies: #158583, #158606	2025-07-25 20:55:41 +00:00
Yidi Wu	0427e439aa	[test][torchbind] turn on inductor backend for compile torchbind tests (#158606 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158606 Approved by: https://github.com/zou3519 ghstack dependencies: #158583	2025-07-25 20:55:41 +00:00
Yidi Wu	4aa69ae336	[torchbind] support register_autocast for torchbind custom op (#158583 ) Fix https://github.com/pytorch/pytorch/issues/158414 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158583 Approved by: https://github.com/zou3519	2025-07-25 20:55:41 +00:00
dolpm	14c314b30d	[nativert] make per-node benchmark work with memory planning (#159117 ) Summary: this will use-after-free otherwise Rollback Plan: Differential Revision: D78934104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159117 Approved by: https://github.com/SherlockNoMad	2025-07-25 20:46:17 +00:00
Pian Pawakapan	0b01e11416	[ez][export] add sym_sum to verified ops (#159111 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159111 Approved by: https://github.com/angelayi	2025-07-25 20:42:42 +00:00
NikhilAPatel	806d9e3fe7	[Inductor][TMA] Split config-gated and pure compatibility logic for TMA template eligibility checks (#159123 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159123 Approved by: https://github.com/drisspg	2025-07-25 20:35:49 +00:00
Yu Guo	d90ce83027	add a util function _make_all_gather_out_tensor to reduce code duplication (#149912 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149912 Approved by: https://github.com/H-Huang	2025-07-25 20:29:01 +00:00
Xu Han	dfcb07bdfa	[Inductor] Temporarily disable failing Windows UTs. (#159163 ) Temporarily disable the failing Windows UTs. <img width="1238" height="107" alt="image" src="https://github.com/user-attachments/assets/c8a40408-a793-4016-99bb-19c1bb09860a" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159163 Approved by: https://github.com/desertfire	2025-07-25 20:25:36 +00:00
Sam Larsen	fa0355c18d	Fix full_like decomposition to preserve strides (#158898 ) Summary: See original PR at: https://github.com/pytorch/pytorch/pull/144765, which landed internally but was reverted due to test failures. Addressing reviewer comments and trying again. Rollback Plan: Differential hack Revision: D78783627 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158898 Approved by: https://github.com/eellison	2025-07-25 20:21:36 +00:00
Sherlock Huang	28ee8be5bf	[NativeRT] Apply Device placement once when loading the graph (#158996 ) Summary: Placement is leaked to too many classes! In this diff, we consolidate all placement lookup into one place: Graph::ApplyDevicePlacement. After applying placement, the in-memory graph, tensorMeta, and weightMeta already have the re-mapped device. The subsequent weight loading, sample input loading, and target device inference look up the re-mapped device from the graph's tensorMeta. The graph's tensorMeta becomes the only ground truth! Test Plan: Need to add some tests before landing. This is a big change. Rollback Plan: Differential Revision: D78841818 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158996 Approved by: https://github.com/henryoier	2025-07-25 20:11:35 +00:00
Yidi Wu	ed472257d1	[associative_scan] stop manually set example inputs in dynamo (#159065 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159065 Approved by: https://github.com/zou3519 ghstack dependencies: #159063, #159064	2025-07-25 20:08:08 +00:00
Yidi Wu	57eea56a9a	[scan] stop manually set example inputs in dynamo (#159064 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159064 Approved by: https://github.com/zou3519 ghstack dependencies: #159063	2025-07-25 20:08:08 +00:00
Yidi Wu	dd681f7f59	[while_loop] stop manually setting example inputs in dynamo (#159063 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159063 Approved by: https://github.com/zou3519	2025-07-25 20:08:08 +00:00
Howard Huang	0d4d3e8a89	[TCPStore] Allow ping to be retried (#159165 ) On client setup we retry connections with server: `f8fafdc7a6/torch/csrc/distributed/c10d/TCPStore.cpp (L313-L350)` I noticed `ping()` raises `TORCH_INTERNAL_ASSERT` AKA a runtime error rather than a `DistNetworkError`. So updating that so it can be retried as well. We have seen this pop up internally: - https://fb.workplace.com/groups/319878845696681/permalink/1478849733132914/ - https://fb.workplace.com/groups/319878845696681/permalink/1479368959747658/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/159165 Approved by: https://github.com/d4l3k	2025-07-25 20:03:00 +00:00
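The reason the exception type matters can be shown with a small retry loop that only catches a dedicated network error. This is a sketch in plain Python, not the C++ code in `TCPStore.cpp`; `RetryableNetworkError` and `retry_on_network_error` are hypothetical names standing in for `DistNetworkError` and the client setup loop.

```python
import time


class RetryableNetworkError(Exception):
    """Hypothetical stand-in for a network error the caller may retry."""


def retry_on_network_error(attempt_fn, max_attempts=3, backoff_s=0.0):
    """Call attempt_fn until it succeeds or max_attempts is reached.

    Only RetryableNetworkError is caught; any other exception (for example
    a RuntimeError raised by an internal assert) propagates immediately.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return attempt_fn()
        except RetryableNetworkError as err:
            last_error = err
            time.sleep(backoff_s)
    raise last_error
```

Under this design, a ping that fails with the retryable type gets another attempt, while one that fails with a generic runtime error aborts the whole setup on the first failure.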
cz2h	ee4c5c7cd2	Add torchcheck for replication_pad3d_backward (#151986 ) Fixes #142833 Add a check on the channel dimension, using the same logic as the CUDA implementation `78bbb468c6/aten/src/ATen/native/cuda/ReplicationPadding.cu (L347)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/151986 Approved by: https://github.com/mikaylagawarecki	2025-07-25 19:48:51 +00:00
Sameer	51cd6697cd	Fix: Use memory_order_release instead of memory_order_relaxed (#159105 ) Addresses #159074 by using `memory_order_release` instead of `memory_order_relaxed` here: `9c10760662/c10/core/DeviceType.cpp (L161)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159105 Approved by: https://github.com/colesbury	2025-07-25 19:39:04 +00:00
Xu Han	ba949c54a7	[inductor] fix test_save_graph_repro on Windows. (#159148 ) The issue is caused by the Windows path separator acting as an escape character. Fixed by applying `normalize_path_separator` in the torch front-end codegen. Error message: <img width="855" height="542" alt="image" src="https://github.com/user-attachments/assets/ad08b521-05e6-4c93-9507-ad19c68ac7b5" /> Fixed: <img width="855" height="312" alt="image" src="https://github.com/user-attachments/assets/4a0a142a-2dbe-4226-a4cb-8eacfab2c3fc" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159148 Approved by: https://github.com/desertfire	2025-07-25 19:11:08 +00:00
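The failure mode is easy to reproduce without Inductor: when a Windows path is pasted into generated Python source, its backslashes are read as escape sequences. The sketch below assumes a simplified, unconditional helper named `normalize_path_separator_sketch`; the real `normalize_path_separator` may behave differently, and the example path is made up.

```python
def normalize_path_separator_sketch(path: str) -> str:
    """Simplified stand-in: replace backslashes with forward slashes."""
    return path.replace("\\", "/")


# A made-up Windows path whose components start with n, t and r.
windows_path = "C:\\new_dir\\temp\\repro.py"

# Naive codegen: the raw path is embedded in a string literal, so the
# generated source contains \n, \t and \r escape sequences.
bad_source = f'path = "{windows_path}"'
bad_namespace = {}
exec(bad_source, bad_namespace)

# Codegen with normalized separators round-trips cleanly.
good_source = f'path = "{normalize_path_separator_sketch(windows_path)}"'
good_namespace = {}
exec(good_source, good_namespace)
```

After running this, `bad_namespace["path"]` contains a newline, a tab and a carriage return instead of the directory separators, while `good_namespace["path"]` is the forward-slash form of the original path.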
morrison-turnansky	2a528e80ce	Add more type hints for _inductor/ir.py (#159049 ) Fixes #146167 Incremental step to add type hints for _inductor/ir.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/159049 Approved by: https://github.com/Skylion007	2025-07-25 18:56:30 +00:00
Edward Z. Yang	56c45f863b	Add aot_export_joint_with_descriptors and aot_compile_joint_with_descriptors (#158715 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158715 Approved by: https://github.com/fmassa, https://github.com/wconstab, https://github.com/xmfan ghstack dependencies: #158624, #158708, #158734	2025-07-25 18:49:00 +00:00
albanD	d30f89b9b8	Add host protoc script back (#159157 ) Following comment from https://github.com/pytorch/pytorch/pull/158475#issuecomment-3116518904 Also this is a fake issue as protoc is dead anyways: https://github.com/pytorch/pytorch/issues/159156 Also also, macos cross compilation is not something that is tested :/ But I guess we're ok with that given how niche it is... Pull Request resolved: https://github.com/pytorch/pytorch/pull/159157 Approved by: https://github.com/janeyx99	2025-07-25 18:44:20 +00:00
PyTorch MergeBot	3fb78501f0	Revert "enable compiled autograd on CPU windows (#158432 )" This reverts commit a369350065493109d1abfbb994695777ab11bcf4. Reverted https://github.com/pytorch/pytorch/pull/158432 on behalf of https://github.com/atalman due to Broke audio cuda windows builds see: https://github.com/pytorch/audio/issues/3992 ([comment](https://github.com/pytorch/pytorch/pull/158432#issuecomment-3119912177))	2025-07-25 18:29:16 +00:00
angelayi	8a0508335f	[export] Fix public bindings (#159109 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/159109 Approved by: https://github.com/jbschlosser	2025-07-25 18:18:52 +00:00
rajeshvshiyal	4c0d5ad4be	Fix docstring for clip_grads_with_norm_ to reflect clamping behavior (#158200 ) Fix docstring for clip_grads_with_norm_ to reflect clamping behavior This PR updates the docstring for torch.nn.utils.clip_grads_with_norm_ to accurately reflect the implementation behavior. The current documentation suggests that gradients are always scaled by: grad = grad * (max_norm / (total_norm + eps)) However, the actual implementation clamps the scale coefficient to a maximum of 1.0, ensuring gradients are only scaled down, not up. This PR corrects the formula and adds a clarifying note to avoid confusion for users. Updated the formula in the docstring to: grad = grad * min(max_norm / (total_norm + eps), 1.0) Added a note explaining the rationale for clamping (to prevent gradient amplification). Ensured consistency with the behavior of clip_grad_norm_. Fixes #151554 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158200 Approved by: https://github.com/mikaylagawarecki	2025-07-25 18:07:41 +00:00
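The difference between the two formulas in the message above can be checked with plain numbers. The following is a sketch on Python floats rather than tensors; the function names and the default `eps=1e-6` are assumptions for illustration, and only the formula itself comes from the commit message.

```python
def clip_coefficient(max_norm: float, total_norm: float, eps: float = 1e-6) -> float:
    """Scale factor from the corrected docstring: clamped to at most 1.0."""
    return min(max_norm / (total_norm + eps), 1.0)


def scale_grads(grads, max_norm, total_norm, eps=1e-6):
    """Apply the clamped coefficient to a list of scalar 'gradients'."""
    coef = clip_coefficient(max_norm, total_norm, eps)
    return [g * coef for g in grads]


# Gradients (3, 4) have an L2 norm of 5.
shrunk = scale_grads([3.0, 4.0], max_norm=1.0, total_norm=5.0)
# max_norm above the total norm: the coefficient clamps to 1.0, so the
# gradients are left unchanged instead of being scaled up.
unchanged = scale_grads([3.0, 4.0], max_norm=10.0, total_norm=5.0)
```

Without the `min(..., 1.0)` clamp, the second call would have doubled the gradients, which is the amplification the docstring note warns about.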
Joel Schlosser	316c188a5e	Remove torch.functional entries from the doc ignore list (#158581 ) Options to address the "undocumented python objects": 1. Reference the functions in the .rst via the `torch.functional` namespace. Note that this changes the generated doc filenames / locations for most of these functions! 2. Document these functions by referencing them from the `torch.` namespace instead, in line with common usage. This would also require setting the `__module__` for these functions and moving entries from `torch.functional`'s `__all__` -> `torch`'s `__all__`, which is BC-breaking. 3. Update the .rst files to also document the `torch.functional` forms of these functions, duplicating docs. This PR takes option (3) above and: * Removes all 20 `torch.functional` entries from the doc ignore list * Removes `torch.functional.align_tensors()` entirely, since we don't want to document it. * This is technically BC-breaking, although the previous impl simply errored out. This change could be moved to a separate isolated PR for safety. * Introduces `torch.aliases.md` as a hidden page for the `torch.functional` aliases to the `torch` analogue functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/158581 Approved by: https://github.com/janeyx99	2025-07-25 17:19:01 +00:00
Edward Z. Yang	191eca0bf0	Use simple_wraps instead of functools.wraps in AOTAutograd (#158734 ) Wrapping is load bearing for things that introspect argument signatures, but use of functools.wraps to do this is undesirable as this overrides the name/module of the wrapping function, which is bad for tracking down exactly what code is actually being run at runtime. simple_wraps is like wraps but it doesn't override the name information, so you still get an appropriate printout. To see the stack of all functions wrapping each other, there is now a helper fn_stack. I also make some assertions tighter in the descriptor PR. These didn't catch any bugs but I figure might as well. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158734 Approved by: https://github.com/wconstab ghstack dependencies: #158624, #158708	2025-07-25 17:08:54 +00:00
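The trade-off described above can be demonstrated with the standard library. The sketch below is an assumption about how such a helper could work (it only sets `__wrapped__`); it is not the `simple_wraps` implementation from the PR, and `simple_wraps_sketch` is a hypothetical name.

```python
import functools
import inspect


def simple_wraps_sketch(wrapped):
    """Record the wrapped function without copying its name or module."""
    def decorator(wrapper):
        # inspect.signature follows __wrapped__, so argument introspection
        # still sees the original signature.
        wrapper.__wrapped__ = wrapped
        return wrapper
    return decorator


def original(a, b=2):
    return a + b


@functools.wraps(original)
def full_wrapper(*args, **kwargs):
    return original(*args, **kwargs)


@simple_wraps_sketch(original)
def light_wrapper(*args, **kwargs):
    return original(*args, **kwargs)
```

Here `full_wrapper.__name__` is `"original"`, which hides which wrapper is actually running, while `light_wrapper.__name__` stays `"light_wrapper"`; both report the signature `(a, b=2)` through `inspect.signature`.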
Sheng Fu	74f64d3c84	Add inputs and outputs in Triton Kernel FX Graph segment (#158174 ) Summary: Add inputs and outputs in Triton Kernel FX Graph segment The FX graph segment in Triton kernel does not include the input tensors and return tensors, for example Python code: ``` @torchdynamo.optimize("inductor") def fn(a, b, c): x = torch.nn.functional.linear(a, b) x = x.sin() x = x.t() + c * 2 return x ``` ``` # %sin : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {}) # %permute_1 : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {}) # %mul : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 2), kwargs = {}) # %add : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {}) ``` The fix is to add the input and output tensors into FX graph segment ``` # %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm] # %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1] # %sin : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {}) # %permute_1 : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {}) # %mul : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 2), kwargs = {}) # %add : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {}) # return %add ``` Differential Revision: D78131358 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158174 Approved by: https://github.com/jansel	2025-07-25 17:01:17 +00:00
PyTorch MergeBot	f8fafdc7a6	Revert "[BE] remove torch deploy - conditionals (#158288 )" This reverts commit ab26d4fbeb5bc4b4e6ef1c37fbec9fab6e5a9edd. Reverted https://github.com/pytorch/pytorch/pull/158288 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks. @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))	2025-07-25 16:09:39 +00:00
PyTorch MergeBot	c8316d0e79	Revert "[BE] Remove torch deploy \| remove torch deploy specific files (#158290 )" This reverts commit 6ed2cb6ccd00e64f67fd414d42dff54393140c8f. Reverted https://github.com/pytorch/pytorch/pull/158290 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks. @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))	2025-07-25 16:09:39 +00:00
PyTorch MergeBot	a9f6770edd	Revert "[BE] Remove __reduce_deploy__ (#158291 )" This reverts commit 9c68c4d08f4c4da49f0086b80e382f0cdd518f60. Reverted https://github.com/pytorch/pytorch/pull/158291 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks. @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))	2025-07-25 16:09:39 +00:00
PyTorch MergeBot	5620e617c9	Revert "[BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407 )" This reverts commit 255c0545e7eac2ec6d00a41a3fc9d6d8201f8f39. Reverted https://github.com/pytorch/pytorch/pull/158407 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks. @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))	2025-07-25 16:09:39 +00:00
Huy Do	ee84ba42ea	[Experiment] Run PT2 benchmark twice a day (#159162 ) Running every 4 hours seems too many, lower it to twice a day. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159162 Approved by: https://github.com/desertfire, https://github.com/eellison	2025-07-25 15:58:29 +00:00
Catherine Lee	561193e5f2	[CI][testing] Use 3 processes for testing on sm89 and sm90 jobs (#158691 ) 3 procs were used for sm86, but we switched to sm89 and the check failed, so it switched back to 2. sm90 is H100; I don't know which unit tests run there, but I assume those runners also have a lot of memory. They use larger runners, which have more GPU memory, so it's usually OK. I think it's ~22GB -> 10GB per proc if 2, 6GB per proc if 3 (CUDA context maybe 1GB). I've applied skips to the ones that OOMed. Time decreases from ~2.7hr per test job -> ~2hr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691 Approved by: https://github.com/huydhn	2025-07-25 15:26:29 +00:00
PyTorch MergeBot	9535995bbc	Revert "Remove tensorexpr tests (#158928 )" This reverts commit a0bc865123dba047aa1507e281bf2462780cf271. Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/clee2000 due to broke cpp static runtime test? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16517697273/job/46715871457) [HUD commit link](`a0bc865123`) ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3118554478))	2025-07-25 15:22:51 +00:00
William Wen	6fcb2b4413	[dynamo] unimplemented -> unimplemented_v2 for user_defined.py (#156652 ) For https://github.com/pytorch/pytorch/issues/147913 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156652 Approved by: https://github.com/zou3519 Co-authored-by: Sidharth <ssubbarao8@meta.com>	2025-07-25 15:04:17 +00:00
Edward Z. Yang	204eb4da5e	Add expanded_def option for FX printing, render descriptor, update tests (#158708 ) ---- - First, we add a new expanded_def to FX, which will expand the definitions of variables into multiple lines, one per variable definition. This makes extremely long args/return lists much more readable. - Next, we extend this mechanism to also print out descriptors on placeholders and return values, as comments, if available. This is how we will test descriptors. - We update tlparse for AOTAutograd to use this format. - We update expect tests to use this format and update their formats, so you can inspect what it can look at. There may be other tests I should update, open to suggestions. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158708 Approved by: https://github.com/wconstab ghstack dependencies: #158624	2025-07-25 13:22:32 +00:00
Edward Z. Yang	bf311141d6	Track descriptors for all inputs/outputs of AOTAutograd traced graph (#158624 ) One of the recurring challenges of working with FX graphs produced by AOTAutograd is that there is a very intricate input/output calling convention that is essentially impossible to understand without actually reverse engineering the AOTAutograd code. It is so bad that there is a bit of logic for stashing indices of relevant arguments/outputs in TracingContext so Inductor can figure out what the correct arguments are. This PR introduces the necessary scaffolding to keep track of "descriptors" of every input/output to a (joint) FX graph produced by AOTAutograd. First read through descriptors.py to get a sense for what is available: for inputs, you can figure out if you have a plain input, tangent, parameter, or something more exotic like one of the fields of a subclass or view base. For outputs, you can determine if you have a plain output or grad, or something more exotic like the contents of a mutated input or an intermediate base of several views that were returned. There are two distinct parts of this patch: AOTInput tracking, and AOTOutput tracking. AOTInput tracking. The way this works is that AOTAutograd starts off with some Tensor `flat_args` that are the inputs to the graph being traced, and then updates these arguments as it modifies the input calling convention. Anywhere these `args` are passed around, we now add a new argument `args_descs` which is updated in synchrony with args. Add a new arg? Add a new AOTInput to `args_descs`. AOTOutput tracking. Originally, I wanted to also add an `outs_descs` analogous to `args_descs` tracking output metadata. However, it is often difficult to compute what the output will be until you're actually tracing the function for real (and are able to peek at the real outputs). So we only compute `outs_descs` when we actually trace. 
To do this, we change the calling convention of the function we trace to return not just outputs, but a tuple of `outs` and `outs_descs`. Before we bottom out at the `make_fx` invocation, we save `outs_descs` to a nonlocal and bottom out. To actually make use of this information in a useful way, see the next PR. Potentially the two PRs could be combined together but I think it's actually clearer for them to be separate. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158624 Approved by: https://github.com/xmfan	2025-07-25 13:22:32 +00:00
thenumberouscode	92e93bb580	[inductor][cpu] Stop lowering div to reciprocal multiplication to preserve precision when the divisor is a scalar and device is on cpu (#158231 ) ## Fixes https://github.com/pytorch/pytorch/issues/157959 ## mini repro from issue ```python import torch from torch import nn class Foo(nn.Module): def __init__( self, use_parameter: bool ) -> None: super().__init__() self.b = 101 if use_parameter: self.b = nn.Parameter(torch.Tensor([self.b]), requires_grad=False) def forward(self, x: torch.Tensor) -> torch.Tensor: # return x + self.b # return x - self.b return x / self.b # return x * self.b torch.manual_seed(42) x = torch.rand((5, 5)) expected = Foo(False)(x) models = [ Foo(False), Foo(True), torch.compile(Foo(False), fullgraph=True), torch.compile(Foo(True), fullgraph=True), ] for m in models: print((m(x) - expected).sum()) ``` All outputs equal zero except the result of `torch.compile(Foo(False), fullgraph=True)`. ## summary: When the divisor is a scalar, Inductor lowers div to a multiplication by the scalar's reciprocal. This can lose precision in the C++ kernel, but not in the Triton kernel. ## why: Generated C++ kernel; thanks to @xmfan for supplying the code. 
```c++ #include <torch/csrc/inductor/cpp_prefix.h> extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(25L); x0+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(16L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); auto tmp1 = static_cast<float>(0.009900990099009901); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 * tmp2; tmp3.store(out_ptr0 + static_cast<int64_t>(x0)); } if(C10_UNLIKELY(x0 >= static_cast<int64_t>(16L) && x0 < static_cast<int64_t>(25L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L)); auto tmp1 = static_cast<float>(0.009900990099009901); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 * tmp2; tmp3.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L)); } } } } } ``` The float type in C typically has 6 to 7 significant digits, while the double type has 15 to 16 significant digits. ```c++ #include <iostream> #include <iomanip> int main() { auto tmp1 = static_cast<float>(0.009900990099009901); auto tmp2 = static_cast<double>(0.009900990099009901); std::cout << std::setprecision(20) << "tmp1 = " << tmp1 << std::endl; std::cout << std::setprecision(20) << "tmp2 = " << tmp2 << std::endl; return 0; } ``` The output is ```bash tmp1 = 0.0099009899422526359558 tmp2 = 0.0099009900990099011103 ``` `auto tmp1 = static_cast<float>(0.009900990099009901);` This will cause tmp1 to become 0.0099009, resulting in a loss of precision, so the final result will not match the expected value. I also found that the bug occurred at that position `86d8af6a6c/torch/_inductor/lowering.py (L6238)` The commit states that the precision loss is expected in the CUDA implementation. 
original commit `03439d4c1c` cuda implementation `0636c11811/aten/src/ATen/native/cuda/BinaryDivTrueKernel.cu (L36-L38)` What is interesting is that the Triton kernel works correctly due to the precision of float type in python. ```python def triton_poi_fused_div_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 25 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask) tmp1 = 0.009900990099009901 tmp2 = tmp0 * tmp1 tl.store(out_ptr0 + (x0), tmp2, xmask) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158231 Approved by: https://github.com/eellison	2025-07-25 08:57:17 +00:00
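A minimal sketch of the precision effect described in #158231, assuming plain Python and the standard `struct` module to emulate float32 rounding. The helper names `to_f32` and `count_mismatches` are illustrative and do not come from the PR. It compares `x / 101` against `x * float32(1/101)`, both rounded to float32, which is roughly what the generated C++ kernel does after the lowering:

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


DIVISOR = 101.0
RECIP_F64 = 1.0 / DIVISOR      # reciprocal kept in double precision
RECIP_F32 = to_f32(RECIP_F64)  # reciprocal after the static_cast<float>


def count_mismatches(n: int = 1000) -> int:
    """Count float32 inputs where true division and reciprocal
    multiplication round to different float32 results."""
    mismatches = 0
    for i in range(1, n + 1):
        x = to_f32(i / n)
        by_div = to_f32(x / DIVISOR)    # float32 division
        by_mul = to_f32(x * RECIP_F32)  # float32 multiply by reciprocal
        if by_div != by_mul:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print(f"float32 reciprocal: {RECIP_F32:.22f}")
    print(f"float64 reciprocal: {RECIP_F64:.22f}")
    print(f"mismatching inputs out of 1000: {count_mismatches()}")
```

The sketch only illustrates why the two lowerings can disagree in the last bit; it makes no claim about how Inductor decides which lowering to use.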
cyy	a0bc865123	Remove tensorexpr tests (#158928 ) The tests are not maintained. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928 Approved by: https://github.com/albanD	2025-07-25 08:37:51 +00:00
Laith Sakka	aaa384b2d4	move view_meta to fake impl (#158406 ) The Python dispatcher is not always enabled in fake tensors and has to be enabled explicitly. While it should be, it requires some work to get all tests working. I have been running into several issues where I had to add enable_python_dispatcher (e.g. XLA, Helom, etc.). To avoid issues related to that for view specifically, I moved it to the fake tensor impl. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158406 Approved by: https://github.com/bobrenjc93	2025-07-25 08:21:27 +00:00
Jeff Daily	0fd5f1c294	[ROCm][CI] upgrade wheels to 6.4.2 patch release (#158886 ) Upgrade wheels to ROCm 6.4.2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158886 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-07-25 08:11:41 +00:00
Xu Han	e38a2b3d0f	[inductor] add missing ignore_errors parameter for Windows. (#159025 ) The original code comments: ```python # Let's not fail if we can't clean up the temp dir. Also note that for # Windows, we can't delete the loaded modules because the module binaries # are open. ``` But the `ignore_errors` parameter is missing for Windows. This PR adds it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159025 Approved by: https://github.com/jansel	2025-07-25 07:58:22 +00:00
Robert Burke	ae183d6092	Aten vector default constructors set to 0, add fnmadd and fnmsub (#158508 ) cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 jerryzh168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158508 Approved by: https://github.com/swolchok	2025-07-25 06:55:37 +00:00
Animesh Jain	659f8fb115	[dynamo][guards] Add some relational guard helpers (#159077 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159077 Approved by: https://github.com/jansel ghstack dependencies: #158995	2025-07-25 06:28:10 +00:00
Animesh Jain	05a748d287	[dynamo][guards] Expand is_immutable_object to have None (#158995 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158995 Approved by: https://github.com/Lucaskabela, https://github.com/jansel	2025-07-25 06:12:05 +00:00
Han, Chao1	02ca965560	Device agnostic for DCP (#158337 ) Enable device-agnostic implementation of DCP-related functionality, allowing the new DCP features to be supported on XPU as well. Rename use_cuda_non_blocking_copy to use_non_blocking_copy because non-blocking copy is supported by most GPUs and is not exclusive to CUDA devices. Test plan: test cases have not yet been updated to be fully device agnostic; this will be addressed in future work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158337 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/Saiteja64 Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-07-25 05:24:09 +00:00
Dylan Maloy	511d987378	only call re-plan if historic max's were updated. (#159016 ) Summary: wasteful. only update the plan if a new maximum has been found. Test Plan: ci Rollback Plan: Reviewed By: SherlockNoMad Differential Revision: D78859344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159016 Approved by: https://github.com/SherlockNoMad	2025-07-25 05:07:30 +00:00
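The idea in #159016 (re-plan only when a historic maximum changes) can be sketched as follows. This is an illustration under assumptions, not the code from the PR: `MaxTracker`, `observe`, and `_replan` are hypothetical names, and the "plan" is reduced to a counter.

```python
from typing import Dict


class MaxTracker:
    """Tracks the largest value seen per key and re-plans only on growth."""

    def __init__(self) -> None:
        self.historic_max: Dict[str, int] = {}
        self.replans = 0  # how many times the (hypothetical) planner ran

    def _replan(self) -> None:
        # Stand-in for an expensive planning step.
        self.replans += 1

    def observe(self, sizes: Dict[str, int]) -> bool:
        """Record observed sizes; return True if a new maximum was found."""
        updated = False
        for name, size in sizes.items():
            if name not in self.historic_max or size > self.historic_max[name]:
                self.historic_max[name] = size
                updated = True
        if updated:
            self._replan()  # skipped entirely when nothing grew
        return updated


if __name__ == "__main__":
    tracker = MaxTracker()
    for batch in ({"a": 4}, {"a": 3}, {"a": 5}):
        print(batch, "-> re-planned:", tracker.observe(batch))
```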
zeshengzong	9685fc36d4	Add missing optional for tensor ops (#159028 ) ## Test Result <img width="872" height="340" alt="image" src="https://github.com/user-attachments/assets/20c3f1a2-0160-4ea3-b9f3-14630b4ec06d" /> <img width="906" height="429" alt="image" src="https://github.com/user-attachments/assets/68f8d8da-0570-4ae8-8e45-573b2c64cae5" /> <img width="906" height="429" alt="image" src="https://github.com/user-attachments/assets/42d133f6-94eb-4a38-8b4b-5586f52bff88" /> <img width="878" height="285" alt="image" src="https://github.com/user-attachments/assets/d3ad8950-81fa-4c4c-a5b5-621b0d9df99b" /> <img width="889" height="430" alt="image" src="https://github.com/user-attachments/assets/9aabeaff-bb8f-4990-b253-1bb053e72aca" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159028 Approved by: https://github.com/Skylion007	2025-07-25 04:36:55 +00:00
PyTorch UpdateBot	9e5cfd3ee5	[audio hash update] update the pinned audio hash (#159108 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159108 Approved by: https://github.com/pytorchbot	2025-07-25 04:35:21 +00:00
Nikita Shulga	cdf8e9ec1a	[MPS] Add support for unsigned types (#159094 ) As both Metal and MPS support uint16/uint32 and uint64 Test plan: `python3 -c "import torch;print(torch.randint(55, 66, (16,), device='mps', dtype=torch.uint16)[10:])"` Fixes https://github.com/pytorch/pytorch/issues/159076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159094 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-07-25 04:31:42 +00:00
PyTorch UpdateBot	bcf34d24eb	[vllm hash update] update the pinned vllm hash (#159107 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159107 Approved by: https://github.com/pytorchbot	2025-07-25 04:03:39 +00:00
Jeff Daily	9b29166f57	[ROCm] add flag torch.backends.miopen.immediate (#158951 ) The MIOpen integration has changed over the years. In the past, the MIOpen default for benchmark was True and if it were set to False it would use MIOpen Immediate Mode. But with #145294 the MIOpen benchmark default changed to False and to activate immediate mode you would set the deterministic flag to True. This has proved too restrictive because benchmark and deterministic flags are independent from immediate mode. Thus, immediate mode needs its own flag. Though MIOpen still masquerades behind torch.backends.cudnn and its flags, it seemed inappropriate to add an miopen-exclusive flag to the set of cudnn flags. This PR adds the first miopen-only flag to control its immediate mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158951 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-07-25 04:01:51 +00:00
Jeff Daily	1fced0c7d5	[ROCm] enable hipblaslt on gfx908 for ROCm >= 6.3 (#159092 ) Fixes #159030. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159092 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-07-25 03:54:30 +00:00
Nichols A. Romero	16c0ccd669	[ROCm][CI] upgrade to 6.4.2 patch release (#158887 ) Upgrade to ROCm 6.4.2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158887 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-07-25 03:45:44 +00:00
Xuehai Pan	f5e2de928b	[BE] fix remaining flake8 v7 warnings (#159044 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159044 Approved by: https://github.com/Skylion007 ghstack dependencies: #159043	2025-07-25 02:56:34 +00:00
Xuehai Pan	f903bc475c	[BE] add noqa for flake8 rule B036: found `except BaseException` without re-raising (#159043 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159043 Approved by: https://github.com/Skylion007	2025-07-25 02:56:34 +00:00
FFFrog	4261e26a8b	[OpenReg] move fallback tests into test_openreg.py (#158441 ) ---- - move fallback tests into test_openreg - remove the test_cpp_extensions_open_device_registration.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/158441 Approved by: https://github.com/albanD ghstack dependencies: #158415, #158440	2025-07-25 02:39:41 +00:00
FFFrog	b635359e4c	[OpenReg] add pyproject.toml for openreg (#158440 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158440 Approved by: https://github.com/albanD ghstack dependencies: #158415	2025-07-25 02:39:41 +00:00
FFFrog	f1a1aa9490	[OpenReg] Improve README.md and optimize some codes for OpenReg (#158415 ) ---- - add description for DSO dependencies - remove unnecessary code Pull Request resolved: https://github.com/pytorch/pytorch/pull/158415 Approved by: https://github.com/albanD	2025-07-25 02:39:41 +00:00
FFFrog	6fc0ad22f0	Using the latest torch.library.register_fake API instead of torch.library.impl_abstract (#158839 ) As the title stated. `torch.library.impl_abstract` has been deprecated since PyTorch 2.4, so switch to the new API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158839 Approved by: https://github.com/jingsh, https://github.com/zou3519 ghstack dependencies: #158838	2025-07-25 02:37:30 +00:00
FFFrog	c60d382870	Add tests for torch.ops.load_library (#158838 ) According to this [comment](https://github.com/pytorch/pytorch/pull/157524#issuecomment-3097899129), adding a related test to keep BC. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158838 Approved by: https://github.com/zou3519	2025-07-25 02:37:30 +00:00
Chuan Jiang	64cb349b81	Extract a method that filters frames in the captured stack trace (#158266 ) Summary: The subclass can override the filtering logic to customize which frames to keep or drop. Test Plan: ``` buck run caffe2/test:test_export -- -r test_stack_trace buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:others -- -r test_constant_random buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_custom_obj_list_out buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r class_member_back_compat ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158266 Approved by: https://github.com/ezyang, https://github.com/yushangdi	2025-07-25 02:22:03 +00:00
PyTorch MergeBot	a53db90e21	Revert "[inductor] consolidate common GEMM triton param retrieval (#158015 )" This reverts commit 9faef3d17c2e422d5d62f62b266155e2deb52c40. Reverted https://github.com/pytorch/pytorch/pull/158015 on behalf of https://github.com/henrylhtsang due to breaking tests ([comment](https://github.com/pytorch/pytorch/pull/158015#issuecomment-3115384824))	2025-07-25 00:16:50 +00:00
Ke Wen	9c10760662	[SymmMem] Use host/nvshmem_api.h for backward compat (#159061 ) Resolves #159045 `nvshmem_host.h` was introduced in 3.3.9. Use `host/nvshmem_api.h` and `host/nvshmemx_api.h` for prior versions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159061 Approved by: https://github.com/ngimel, https://github.com/fduwjj, https://github.com/fegin	2025-07-24 22:56:26 +00:00
PyTorch MergeBot	8d2a1d6e18	Revert "Graph break with error message (#158800 )" This reverts commit cae4746952afbb6d26ecf7599cb7c6c449c69ef4. Reverted https://github.com/pytorch/pytorch/pull/158800 on behalf of https://github.com/clee2000 due to broke some tests on main inductor/test_distributed_patterns.py::DistributedPatternTests::test_nn_param_return4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/16507837934/job/46685704688) [HUD commit link](`cae4746952`), note to self: bad TD, but also dynamo/test_repros failed but didn't get skipped by TD so maybe a landrace, or I just blaming the wrong commit entirely.. ([comment](https://github.com/pytorch/pytorch/pull/158800#issuecomment-3115224608))	2025-07-24 22:45:58 +00:00
PyTorch MergeBot	751285cb22	Revert "Move some of vec into headeronly in preparation for Half.h (#158976 )" This reverts commit 5564f2ca2e0836d75c4ee45899b1b981582c3e2d. Reverted https://github.com/pytorch/pytorch/pull/158976 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D78924504 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158976#issuecomment-3115198443))	2025-07-24 22:31:49 +00:00
Lucas Kabela	efc810c7d0	[Bugfix] Fix circular import between export and dynamo from tensor fn map (#158931 ) Fixes #158120 The issue was caused by populating a builtin tensor fn map at import time; if torch.export.export was called before any dynamo imports with the `meta` device, this map would not be populated, and so would populate on import time which would try to call `torch.disable`, which would not yet be initialized Fix is to populate this map lazily ``` python test/dynamo/imports_non_circular_repro.py TestImports.test_circular_import_with_export_meta ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158931 Approved by: https://github.com/StrongerXi, https://github.com/mlazos, https://github.com/anijain2305	2025-07-24 22:24:57 +00:00
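The fix described in #158931 is to populate the map lazily instead of at import time. A generic sketch of that pattern, assuming nothing about the actual dynamo code: `get_fn_map` and `STATS` are hypothetical names, and the map contents are a stand-in built from `float` methods.

```python
import functools
from typing import Callable, Dict

# Counter used only to show when the map is actually built.
STATS = {"builds": 0}


@functools.lru_cache(maxsize=None)
def get_fn_map() -> Dict[str, Callable]:
    """Build the lookup table on first use rather than at import time.

    Deferring the work means importing this module never triggers code
    that may depend on other, not yet initialized, modules.
    """
    STATS["builds"] += 1
    return {name: getattr(float, name) for name in ("is_integer", "hex")}


if __name__ == "__main__":
    print("builds after import:", STATS["builds"])
    get_fn_map()
    get_fn_map()
    print("builds after two lookups:", STATS["builds"])
```

Because the cached function takes no arguments, every caller gets the same dictionary object, and the build cost is paid once.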
Xu Han	abb0bf45df	[AOTI] skip crashed case on Windows temporary. (#158929 ) Temporarily skip the crashing case on Windows. This case crashes the application: <img width="1053" height="275" alt="image" src="https://github.com/user-attachments/assets/3225e9c8-cbe7-4998-86da-f20fbb12ead2" /> Quick analysis: <img width="1400" height="261" alt="image" src="https://github.com/user-attachments/assets/9c21fefc-9ed8-40f2-84c5-edde2004777c" /> 1. It crashes in OpenMP. 2. The stack is damaged; we need to consider how to debug it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158929 Approved by: https://github.com/desertfire	2025-07-24 22:08:19 +00:00
PyTorch MergeBot	b533f12120	Revert "[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446 )" This reverts commit da94023b0205bf98c3da366f2f86e0a443f4db17. Reverted https://github.com/pytorch/pytorch/pull/155446 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @sraikund16 can you please help validate the fix? (See D78845227 for details). You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/155446#issuecomment-3115072504))	2025-07-24 21:46:00 +00:00
Aaron Orenstein	e20736bf1d	Dont't GC as often when collecting cudagraphs (#158193 ) TL;DR: Cuts vLLM cudagraph collection from 80s -> 24s Stop garbage collecting by default on every cudagraph recording. The old behavior can be re-enabled by setting `TORCH_CUDAGRAPH_GC=1` or the config `force_cudagraph_gc`. We were previously garbage collecting at the beginning of each cudagraph capture. vLLM collects 5427 graphs and most of those garbage collections weren't actually collecting any memory (CPU or GPU). This changes it to not collect more than every 10s so if we're capturing in a loop we don't burn all our cycles looking for garbage. (These number have a lot of variance from run to run but give the correct general scale) ``` \| calls \| total \| synchronize \| gcs \| collect \| empty cache \| sys freed \| cuda freed \| -------+-------+-------+-------------+------+---------+-------------+-----------+------------+ before \| 5427 \| 78s \| 1.48s \| 5427 \| 53.22s \| 1.21s \| 145855 \| 1539309568 \| -------+-------+-------+-------------+------+---------+-------------+-----------+------------+ after \| 5427 \| 24s \| 0s \| 3 \| 1.53s \| 0.84s \| 592 \| 1539309568 \| -------+-------+-------+-------------+------+---------+-------------+-----------+------------+ ``` total - this is the total time reported by vLLM's "Graph capturing finished" log. The rest of these are measured in torch.cuda.graphs.graph.__enter__(): calls - number of times torch.cuda.graphs.graph.__enter__ was called synchronize - this is the duration taken by the cuda.synchronize call gcs - number of times gc.collect was called collect - this is the duration taken by the gc.collect call empty cache - this is the duration taken by the torch.cuda.empty_cache call sys freed - the number of bytes reported freed by gc.collect cuda freed - the number of bytes reported freed by torch.cuda.memory_reserved So it seems like the heavy lifting is done by torch.cuda.empty_cache() which is fairly quick. 
Cudagraph results from the TorchInductor Performance DashBoard (this is from the original version using the GC clock so the real results will be slightly better than this): <img width="1494" height="382" alt="image" src="https://github.com/user-attachments/assets/69b705ef-47ce-4b6e-9733-1ec941cad93d" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158193 Approved by: https://github.com/ngimel	2025-07-24 21:37:11 +00:00
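The throttling described in #158193 ("not collect more than every 10s") can be sketched with the standard library alone. This is an assumption-laden illustration, not the PyTorch implementation: `ThrottledGC`, `maybe_collect`, and the `force` argument are hypothetical, with `force` loosely standing in for opting back into the old always-collect behavior.

```python
import gc
import time
from typing import Callable, Optional


class ThrottledGC:
    """Runs gc.collect() at most once per `min_interval_s` seconds."""

    def __init__(
        self,
        min_interval_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.min_interval_s = min_interval_s
        self._clock = clock  # injectable so the behavior can be tested
        self._last: Optional[float] = None
        self.collections = 0

    def maybe_collect(self, force: bool = False) -> bool:
        """Collect if forced, on first call, or if the interval elapsed."""
        now = self._clock()
        due = self._last is None or now - self._last >= self.min_interval_s
        if force or due:
            gc.collect()
            self._last = now
            self.collections += 1
            return True
        return False


if __name__ == "__main__":
    throttle = ThrottledGC()
    # Simulates many back-to-back captures; only the first one collects.
    ran = sum(throttle.maybe_collect() for _ in range(1000))
    print("collections in a tight loop of 1000 calls:", ran)
```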
Lucas Kabela	cae4746952	Graph break with error message (#158800 ) Fixes #157452 Test with ``` python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks ``` ### Release Notes Change to nn.Parameter Constructor Behavior in Dynamo Semantic change introduced in the nn.Parameter constructor; previously, if the constructor lacked a clean source, the system would attempt to infer arguments to construct a clone and lift this synthetic proxy in the computation graph. This approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source. Users are now suggested to manually move the constructor out of the graph in such cases. This change improves clarity and reduces complexity in graph construction and debugging. Users can escape hatch to old semantics with `torch.dynamo.config.graph_break_on_nn_param_ctor=False` if this cannot be done. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800 Approved by: https://github.com/anijain2305	2025-07-24 21:05:17 +00:00
Jithun Nair	4a13d4d7d0	[ROCm] Update jit_utils.cpp for compatibility with ROCm7.0 (#158868 ) Resolves error when running tests such as `test_nn.py::TestNN::test_L1Loss_no_reduce_complex_cuda` etc. on ROCm7.0: ``` /tmp/comgr-4cd8ad/input/CompileSourceU53Ndb:1016:7: error: no template named 'is_floating_point'; did you mean '__hip_internal::is_floating_point'? 1016 \| is_floating_point<_Tp>::value, \| ^~~~~~~~~~~~~~~~~ \| __hip_internal::is_floating_point /tmp/comgr-4cd8ad/include/hiprtc_runtime.h:1481:31: note: '__hip_internal::is_floating_point' declared here 1481 \| template<typename _Tp> struct is_floating_point : public false_type {}; \| ^ /tmp/comgr-4cd8ad/input/CompileSourceU53Ndb:1017:16: error: too few template arguments for class template '__libcpp_complex_overload_traits' 1017 \| typename __libcpp_complex_overload_traits<_Tp>::_ComplexType \| ^ /tmp/comgr-4cd8ad/input/CompileSourceU53Ndb:850:10: note: template is declared here 847 \| template <class _Tp, bool = is_integral<_Tp>::value, \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 848 \| bool = is_floating_point<_Tp>::value \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 849 \| > \| ~ 850 \| struct __libcpp_complex_overload_traits {}; \| ^ fatal error: too many errors emitted, stopping now [-ferror-limit=] 20 errors generated when compiling for gfx90a. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158868 Approved by: https://github.com/jeffdaily	2025-07-24 21:00:37 +00:00
Ti-Tai Wang	da35562bba	[ONNX] Filter out torchscript sentences (#158850 ) Fixes #157300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158850 Approved by: https://github.com/justinchuby, https://github.com/svekars	2025-07-24 20:59:06 +00:00
Guilherme Leobas	de85ee73ae	Update context in `unimplemented_v2` when exception bubbles up to the interpreter (#158924 ) Before: ``` .Observed exception Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region. Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled. Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues. Developer debug context: ``` After: ``` Observed exception Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region. Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled. Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues. Developer debug context: raised exception TypeError([ConstantVariable(str: "unhashable type: <class 'torch._dynamo.variables.dicts.SetVariable'>")]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158924 Approved by: https://github.com/williamwen42, https://github.com/zou3519	2025-07-24 20:50:22 +00:00
eqy	8573a2beda	[CUDA] Fix missing `__syncthreads` in MultiMarginLoss backward (#158994 ) Turns out issue in #158921 is detectable with a simple unit test and adding the missing sync fixes it Pull Request resolved: https://github.com/pytorch/pytorch/pull/158994 Approved by: https://github.com/malfet, https://github.com/Skylion007 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-07-24 20:47:29 +00:00
PyTorch MergeBot	13398dab79	Revert "Remove tensorexpr tests (#158928 )" This reverts commit a3f9f79f591102afa93145bb67dc7e34df44f9a4. Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/clee2000 due to Theres still some references to the things removed in this PR in test.sh, the jobs on this PR are failing because of that but log classifier is probably pointing to a wrong line, should be an easy fix tho ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3114873706))	2025-07-24 20:45:30 +00:00
Jane Xu	5564f2ca2e	Move some of vec into headeronly in preparation for Half.h (#158976 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158976 Approved by: https://github.com/albanD, https://github.com/desertfire	2025-07-24 20:32:33 +00:00
Sidharth	f3edcac23a	[dynamo] Added back weblink generation (#159011 ) Added back weblink generation for v2.9 development Note: It is fine to bring the weblink generation back since v2.9 isn't released for a while Pull Request resolved: https://github.com/pytorch/pytorch/pull/159011 Approved by: https://github.com/williamwen42	2025-07-24 20:27:11 +00:00
zhxchen17	90c241dedd	[precompile] Support user defined function calls from bytecode. (#158947 ) Previously precompile was implemented under the assumption that dynamo always inlines the user code and generates resume functions when a graph break is hit. In cases like nanogpt training, there is a nontrivial amount of code that causes dynamo to fail the speculation and stop inlining certain types of user functions. This results in more code objects being tracked by CompilePackage. Since these new code objects are user defined, we also need to serialize the location of this code so that we can load the precompile entries into these code objects in another process. With this fix, we are able to run nanogpt inference+training with precompile under torchbench. Differential Revision: [D78691422](https://our.internmc.facebook.com/intern/diff/D78691422/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158947 Approved by: https://github.com/jamesjwu	2025-07-24 20:10:57 +00:00
Luca Wehrstedt	5ab0eb28f7	Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037 ) cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack. The scaling format is still detected from the sizes of the scale tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037 Approved by: https://github.com/eqy, https://github.com/drisspg	2025-07-24 20:10:51 +00:00
Laith Sakka	0b2ef76e85	DDE-Free select with unbacked index. (#157605 ) When select has a data-dependent input, we can't tell whether the actual index should be index+size or index. To avoid throwing a DDE, we allocate a new unbacked symbol to represent the storage offset of the output view, and we compute its value dynamically at runtime when inductor is lowered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605 Approved by: https://github.com/ColinPeppler	2025-07-24 20:08:05 +00:00
Ruben Rodriguez Buchillon	9faef3d17c	[inductor] consolidate common GEMM triton param retrieval (#158015 ) \# Why - Make loop iteration simpler - Have a common spot for modifications that affect all the GEMM Triton templates, avoiding missed spots \# What - pull out the common logic of taking the BaseConfig objects and turning them into kwargs to feed into maybe_append_choice for Triton GEMM templates Differential Revision: [D78081314](https://our.internmc.facebook.com/intern/diff/D78081314) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158015 Approved by: https://github.com/PaulZhang12, https://github.com/jansel	2025-07-24 19:17:48 +00:00
namgyu-youn	aeaa20083f	[profiler] update CUDA runtime kernel identification logic (#157890 ) Update CUDA kernel detection to exclude memory API calls References: - https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html - https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/157890 Approved by: https://github.com/sraikund16	2025-07-24 19:14:08 +00:00
zpcore	5be7e187ba	Support `sort` and `scatter_add` strategy (#159022 ) Add `sort`, `scatter_add` strategy. I am reusing the strategy for `scatter` related ops for quick support. The strategy can potentially be improved after we fix index related strategies. Minor fix: fix `replicate_op_strategy` to support multiple output tensors, which is required by aten.sort. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159022 Approved by: https://github.com/XilunWu, https://github.com/wconstab	2025-07-24 18:33:18 +00:00
Nikita Shulga	347a97da66	[MPS] Enable dlpack integration (#158888 ) Though testing is a lie and dependent on https://github.com/pytorch/pytorch/pull/153835 Fixes https://github.com/pytorch/pytorch/issues/153789 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158888 Approved by: https://github.com/albanD ghstack dependencies: #158874	2025-07-24 18:05:41 +00:00
Conan Truong	78aa3bd6b6	Added Emscripten __assert_fail declaration to Macros.h (#158580 ) Summary: __assert_fail is declared slightly differently in the Emscripten stdlib. This may cause errors when compiling with Emscripten. Test Plan: N/A Rollback Plan: Differential Revision: D78500790 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158580 Approved by: https://github.com/JacobSzwejbka	2025-07-24 17:10:29 +00:00
Jeff Daily	ee97dbf2e7	[ROCm][CI] update HIP patch for 6.4.1, again (#159001 ) Another fix for hipGraph capture of MIOpen OCL kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159001 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-07-24 16:36:19 +00:00
zeshengzong	b7d41729e0	Add `zerotensor` design description in code (#158837 ) Fix `TODO: add a note explaining the design decisions` by adding design description in https://github.com/pytorch/pytorch/issues/69687 to codebase. Make it easier to get and read by other developers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158837 Approved by: https://github.com/soulitzer	2025-07-24 16:35:42 +00:00
Lucas Kabela	abcb24f4de	[Dynamo][Better Engineering] Add typing annotations to guard and source (#158397 ) As part of better engineering week, we would like to improve our type support to improve dev experience in dynamo. This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py` Running ``` mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Unannotated \| Lines Total \| % lines covered \| Funcs Unannotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 1227 \| 2208 \| 55.57% \| 207 \| 362 \| 57.18% \| \| This PR \| 2217 \| 2217 \| 100.00% \| 362 \| 362 \| 100.00% \| \| Delta \| +990 \| +9 \| +44.43% \| +155 \| 0 \| +42.82% \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/158397 Approved by: https://github.com/anijain2305	2025-07-24 15:55:18 +00:00
fduwjj	fd48681b6a	[DeviceMesh][ez] Make the logic within flatten simpler (#158999 ) While looking at the code of device mesh I find that this logic can be simplified. Also the naming needs to be correct. Because this mesh is not "flattened" yet, so we can just call it flatten. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158999 Approved by: https://github.com/wz337, https://github.com/wconstab ghstack dependencies: #158900	2025-07-24 15:40:13 +00:00
cyy	a3f9f79f59	Remove tensorexpr tests (#158928 ) The tests are not maintained. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928 Approved by: https://github.com/albanD	2025-07-24 15:38:36 +00:00
Ke Wen	2fc0b1605e	[a2av] Make test input more random (#157029 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): Use torch.randn to fill input buffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157029 Approved by: https://github.com/fegin, https://github.com/ngimel ghstack dependencies: #158234, #158235, #156743, #156881, #157026	2025-07-24 15:35:12 +00:00
PyTorch MergeBot	11ea3736dd	Revert "[CI][testing] Use 3 processes for testing on sm89 and sm90 jobs (#158691 )" This reverts commit 0c0fcb53ff5ee1eb5f0d1f535ed3726d01f8abb5. Reverted https://github.com/pytorch/pytorch/pull/158691 on behalf of https://github.com/ZainRizvi due to Sorry but these are causing jobs to fail with out of memory errors on trunk ([comment](https://github.com/pytorch/pytorch/pull/158691#issuecomment-3113922186))	2025-07-24 15:31:53 +00:00
Ke Wen	43d4ff6851	[a2av] Test dispatch-then-combine (#157026 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): Putting both the dispatch API and combine API in battlefield, one following the other, i.e. ``` all_to_all_vdev_2d(inp, out, inp_splits, out_splits_offsets, ...) all_to_all_vdev_2d_offset( input=out, out=combine_out, in_splits_offsets=out_splits_offsets, out_splits_offsets=combine_out_splits_offsets ) ``` Here the `out_splits_offsets` from dispatch perfectly serves as the `in_splits_offsets` argument for combine. Then we assert that the output of combine is exactly the same as the original input to shuffle, and combine's output splits are exactly the same as the original input splits. It works! Pull Request resolved: https://github.com/pytorch/pytorch/pull/157026 Approved by: https://github.com/Skylion007, https://github.com/ngimel ghstack dependencies: #158234, #158235, #156743, #156881	2025-07-24 15:21:02 +00:00
Ke Wen	83957d1c03	[a2av] Add token combine operator (#156881 ) Added `all_to_all_vdev_2d_offset`, which: Perform a 2D AllToAllv operation, with input split and offset information provided on device. The input offsets need not be the exact prefix sum of the input splits, i.e. paddings are allowed between the split chunks. The paddings, however, will not be transferred to peer ranks. In Mixture of Experts models, this operation can be used to combine tokens processed by experts on remote ranks. This operation can be viewed as a "reverse" operation to the `all_to_all_vdev_2d` operation (which shuffles tokens to experts). The change may seem a bit dense, sorry. But it is mainly two changes: 1. templating existing device functions (to use provided input offset or calculate it) 2. generalizing variable names, e.g. npes, ne --> minor_size, major_size, so that I can use the same alltoall function for matrix of (nranks, ne) as well as matrix of (ne, nranks). Pull Request resolved: https://github.com/pytorch/pytorch/pull/156881 Approved by: https://github.com/ngimel ghstack dependencies: #158234, #158235, #156743	2025-07-24 15:08:04 +00:00
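To illustrate the statement above that input offsets need not be the exact prefix sum of the input splits, here is a minimal host-side sketch in plain Python. It is not the device implementation of `all_to_all_vdev_2d_offset`; the helper names `exclusive_prefix_sum` and `gather_chunks` are hypothetical and only show how padding between chunks can be skipped.

```python
from itertools import accumulate


def exclusive_prefix_sum(splits):
    """Offsets for tightly packed chunks: [0, s0, s0 + s1, ...]."""
    return list(accumulate(splits, initial=0))[:-1]


def gather_chunks(buffer, splits, offsets):
    """Return only the real chunk elements; padding between chunks is skipped.

    Hypothetical helper for illustration, not part of any PyTorch API.
    """
    return [buffer[start:start + size] for size, start in zip(splits, offsets)]


splits = [2, 1, 3]

# Tightly packed layout: offsets are the exact prefix sum of the splits.
packed = ["a0", "a1", "b0", "c0", "c1", "c2"]
packed_offsets = exclusive_prefix_sum(splits)

# Padded layout: same splits, but each chunk starts at a multiple of 4.
padded = ["a0", "a1", None, None, "b0", None, None, None, "c0", "c1", "c2"]
padded_offsets = [0, 4, 8]

# Both layouts yield the same chunks; the padding is never read.
same = gather_chunks(packed, splits, packed_offsets) == gather_chunks(
    padded, splits, padded_offsets
)
```

In this sketch the splits carry the chunk sizes and the offsets carry only the starting positions, which is why the two layouts are interchangeable from the receiver's point of view.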
Pian Pawakapan	48fe4ff247	[export] set enable_gqa in export flash->math decomp (#158604 ) Differential Revision: D78524147 For `scaled_dot_product_attention(..., enable_gqa=True)`: - the Math backend passes the flag through, performing the extra [KV broadcast](`6e07d6a0ff/aten/src/ATen/native/transformers/attention.cpp (L902)`) if set to True - the Flash backend has no flag, and relies on correct indexing in the C++ kernel - Export used to default to Math for `enable_gqa=True`, but https://github.com/pytorch/pytorch/pull/157893 landed and enabled Flash. At the same time, there's an export-only [decomp](`6e07d6a0ff/torch/_decomp/decompositions.py (L4968)`) redirecting flash -> math, calling with `enable_gqa` unset, because that info isn't available. This led to https://fb.workplace.com/groups/1028545332188949/posts/1264609398582540 crashing, calling the Math non-GQA variant, with GQA inputs. This assumes GQA for seqlen mismatches in the export decomp, setting `enable_gqa = <q seqlen> != <kv seqlen>`, relying on prior backend checks to raise on invalid input shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158604 Approved by: https://github.com/angelayi, https://github.com/drisspg	2025-07-24 14:46:13 +00:00
James Wu	f55c5d085e	[Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847 ) This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks. The following bugfixes are in this PR to make all of this work: - Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes) - Return None from PrecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming. - log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file. 
## Test Plan After this PR, the following now works: ``` TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance --inference --backend inductor --caching-precompile --warm-start-latency ``` tlparse result (internal): Cold Start (6 seconds): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 Warm Start (~1 s): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847 Approved by: https://github.com/zhxchen17	2025-07-24 14:09:54 +00:00
Nick Westlake	a3025e17b2	Fix inductor non-stable argsort/sort test (#146622 ) - Prevent the inductor test for argsort/sort from wrongly failing when the argsort/sort output with stable=False differs from pytorch but is still a valid argsort output. - Add functionality to allow alternative assert_equal functions in inductor tests for future cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146622 Approved by: https://github.com/eellison Co-authored-by: George Wigley <georgewi@graphcore.ai>	2025-07-24 14:02:12 +00:00
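The idea that an argsort output with stable=False can differ from the reference and still be valid can be sketched in plain Python. This is not the assert function added to the inductor tests; `is_valid_argsort` is a hypothetical helper that accepts any permutation whose gathered values come out ordered.

```python
def is_valid_argsort(values, indices, descending=False):
    """Check that `indices` is a permutation that sorts `values`.

    Ties may appear in any order, so several index lists can be valid.
    Hypothetical helper for illustration only.
    """
    n = len(values)
    # Every position must be used exactly once.
    if sorted(indices) != list(range(n)):
        return False
    gathered = [values[i] for i in indices]
    pairs = zip(gathered, gathered[1:])
    if descending:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)


values = [3, 1, 1, 2]
stable_result = [1, 2, 3, 0]      # ties kept in original order
unstable_result = [2, 1, 3, 0]    # ties swapped, still a valid argsort
wrong_result = [1, 3, 2, 0]       # gathers [1, 2, 1, 3], which is not sorted
```

A direct equality check would reject `unstable_result`, while the validity check accepts it and still rejects `wrong_result`.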
atalman	afd6eb0d49	[docker release] Remove build layer as not used (#158988 ) [docker release] Remove build layer as not used in any of the : https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=Build%20Official Pull Request resolved: https://github.com/pytorch/pytorch/pull/158988 Approved by: https://github.com/oulgen, https://github.com/malfet	2025-07-24 12:22:55 +00:00
IvanKobzarev	3ced1079a4	[inductor] Fix collectives_reordering overwrite real_dep with fake_dep with the same name (#158960 ) Differential Revision: [D78839734](https://our.internmc.facebook.com/intern/diff/D78839734) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158960 Approved by: https://github.com/wconstab	2025-07-24 11:08:58 +00:00
Hameer Abbasi	3e954d3943	better testing for subclasses + compile (#158742 ) Fixes #114398 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158742 Approved by: https://github.com/ezyang	2025-07-24 10:28:44 +00:00
Sherlock Huang	fb067de550	[NativeRT] Remove device_ member from OpKernel base class (#158944 ) Summary: In general, device_ is not very useful in OpKernel. Remove it to avoid misuse. Also, the meaning of `device_` is ambiguous in the OpKernel. For StaticDispatch kernels, we always call the cpu kernel. For C10Kernel, we rely on the input tensor's device and the dispatcher to determine which device to run on. For ops involving multiple devices, e.g. aten._to_copy(device), the meaning of device is ill-defined. Test Plan: CI Rollback Plan: Reviewed By: henryoier, dolpm, kqfu, zhxchen17 Differential Revision: D78704840 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158944 Approved by: https://github.com/dolpm	2025-07-24 09:21:37 +00:00
Wei (Will) Feng	693197eed6	[doc] remove FSDP1 developer note (#158991 ) this resolve pytorch doc audit - we remove fsdp1 doc and promote fsdp2 https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/158991 Approved by: https://github.com/svekars, https://github.com/mori360 ghstack dependencies: #158989	2025-07-24 08:21:54 +00:00
cyy	65c1109ca2	Remove CUDA 11 CMake code (#156795 ) CUDA 11 is no longer supported. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156795 Approved by: https://github.com/atalman, https://github.com/malfet	2025-07-24 08:00:41 +00:00
Ke Wen	70fb5bb6fb	[CI] Add smoke test for NVSHMEM availability (#158938 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158938 Approved by: https://github.com/huydhn, https://github.com/atalman	2025-07-24 06:34:21 +00:00
zero000064	30bb7636da	removed zero dim cpu logic from fake_tensor.py (#147501 ) Fixes #144748 In #144748, the inconsistency between the eager mode and the inductor mode is reported as a bug. The root cause is that fake_tensor.py's find-common-device method, `0b0da81021/torch/_subclasses/fake_tensor.py (L833)`, takes zero dim cpu tensors into account but the device check in adaption.h doesn't. This fix adds a list of ops that bypass the zero-dim-cpu-tensor check to align with the eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147501 Approved by: https://github.com/ezyang	2025-07-24 06:19:46 +00:00
Wei (Will) Feng	68349118b5	[doc] add weifengpy to torch distributed pocs (#158989 ) <img width="415" height="355" alt="Screenshot 2025-07-23 at 16 02 12" src="https://github.com/user-attachments/assets/35b6bb45-d5ed-4d74-8369-e8e66aaa2618" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158989 Approved by: https://github.com/mori360	2025-07-24 04:42:33 +00:00
PyTorch UpdateBot	e09d80c545	[vllm hash update] update the pinned vllm hash (#158997 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158997 Approved by: https://github.com/pytorchbot	2025-07-24 04:04:17 +00:00
Nikita Shulga	07df6ba7f5	[BE] Remove unused `test_python_gloo_with_tls` (#158964 ) This was last modified in 2021 and has not been invoked at least since the 2.0 release Pull Request resolved: https://github.com/pytorch/pytorch/pull/158964 Approved by: https://github.com/Camyll, https://github.com/atalman ghstack dependencies: #158961, #158962, #158963	2025-07-24 02:34:27 +00:00
Nikita Shulga	d61153a300	Delete mobile merge rule (#158963 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158963 Approved by: https://github.com/atalman ghstack dependencies: #158961, #158962	2025-07-24 02:34:27 +00:00
Nikita Shulga	da9e120e3f	[BE] Remove unused `build-android` action (#158962 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158962 Approved by: https://github.com/Camyll, https://github.com/atalman ghstack dependencies: #158961	2025-07-24 02:34:27 +00:00
Nikita Shulga	611b61e758	[BE] Remove android build rules (#158961 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158961 Approved by: https://github.com/Camyll, https://github.com/atalman	2025-07-24 02:34:27 +00:00
cyy	d352c28dd1	[2/N] Remove FindPackageHandleStandardArgs.cmake (#156559 ) Following #157188, this PR removes FindPackageHandleStandardArgs.cmake Pull Request resolved: https://github.com/pytorch/pytorch/pull/156559 Approved by: https://github.com/albanD	2025-07-24 02:34:10 +00:00
Catherine Lee	0c0fcb53ff	[CI][testing] Use 3 processes for testing on sm89 and sm90 jobs (#158691 ) 3 procs were used for sm86, but we switched to sm89 and the check failed so it switched back to 2 sm90 is H100, but idk what unittests we have running there, but I assume they also have a lot of memory They use larger runners, which have more GPU memory, so its usually ok. I think it's ~22GB -> 10GB per proc if 2, 6GB per proc if 3 (cuda context maybe 1GB) I've applied skips to the ones that OOMed Time decreases from ~2.7hr per test job -> ~2hr Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691 Approved by: https://github.com/huydhn	2025-07-24 01:51:28 +00:00
Teja	febf3c475e	fix forced loglevel in pytorch oss code (#158820 ) Differential Revision: [D78715806](https://our.internmc.facebook.com/intern/diff/D78715806/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158820 Approved by: https://github.com/Skylion007, https://github.com/pradeepfn	2025-07-24 00:40:28 +00:00
Aditya Tewari	7001d6fbc9	Skip slow tests for aarch64-inductor-benchmarks (#158842 ) This PR suggests adding some models that are currently being run in TIMM and Torchbench to `cpu_skip_list`. The suggested models take a long time, which leads to the benchmark runs ending in `timeout`. [benchmark runs for aarch64](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-aarch64.yml) • The issue stems from unoptimized groupwise convolution (BF16 /F16 dtype) kernels for aarch64 platforms, which significantly slow down execution, leading to the timeout. Action: • An optimized BF16 groupwise convolution kernel is currently being developed in oneDNN, targeted for release in Q4 2025. To maintain dashboard consistency and signal clarity, I’ve skipped the affected tests in: * timm benchmarks * torchbench benchmarks As suggested, the skip is applied at the CPU - arch level, explicitly branching for aarch64 and adding the models which need to be skipped. This keeps the logic clean, but: • An alternative considered was increasing shard counts for aarch64 runners, but given the known performance bottleneck, skipping avoids wasted compute cycles. Suggestions around this will be appreciated. Benchmark does not timeout after the suggested change: https://github.com/pytorch/pytorch/actions/runs/16447200138 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158842 Approved by: https://github.com/malfet	2025-07-24 00:21:38 +00:00
Bin Bao	0118931e27	[Inductor] Fix a user-defined Triton kernel bool param codegen issue (#158845 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/158778. When handling a boolean type parameter to a user-defined Triton kernel, we need to treat it differently from integer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158845 Approved by: https://github.com/davidberard98, https://github.com/eellison	2025-07-24 00:19:27 +00:00
atalman	ebb032a202	[docker release] Fix push nightly tag (#158984 ) This is a typo. I see that this step is not executing in nightly builds: https://github.com/pytorch/pytorch/actions/runs/16464544564/job/46538759844 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158984 Approved by: https://github.com/oulgen	2025-07-23 23:39:49 +00:00
Ke Wen	60ac3414eb	[a2av] Split in_out_splits into in_splits and out_splits_offsets (#156743 ) So that it would be easier if user would like to feed `out_splits_offsets` as input to a combining a2av (coming next). An example is in #157029. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156743 Approved by: https://github.com/ngimel ghstack dependencies: #158234, #158235	2025-07-23 23:34:48 +00:00
PyTorch MergeBot	d34cee4cf3	Revert "[Torch Native] Add test for packaging weight (#158750 )" This reverts commit 85ee2fb8c5c57b513526b0cc968ba13012167572. Reverted https://github.com/pytorch/pytorch/pull/158750 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing on trunk: inductor/test_aot_inductor_package.py::TestAOTInductorPackageCpp_cuda::test_compile_with_exporter_weights [GH job link](https://github.com/pytorch/pytorch/actions/runs/16478978095/job/46590552109) [HUD commit link](`85ee2fb8c5`) ([comment](https://github.com/pytorch/pytorch/pull/158750#issuecomment-3111188266))	2025-07-23 23:24:55 +00:00
Anshul Sinha	5cdb3d896e	[FSDP][Replicate] added replicate function that uses FSDP instead of DDP (#158207 ) Summary Users would like to use Replicate with TP. Currently, the replicate function uses DDP, which has not been maintained, resulting in a lack of integration options. Since users can use FSDP with TP, we will make the replicate function use FSDP so that users can use replicate with FSDP. To that end I have created a replicate function that uses FSDP instead of DDP. One blocker that I ran into is that the replicate function has a contract which assigns a module "replicate" attribute in the registry. This would mean that fully_shard's is_composable requirement would not be satisfied, making it impossible to apply fully_shard to a replicate module. The solution to this was to copy the fully_shard function and state and modify it for replicate. In the future, making the replicate_state inherit from FSDP_state should be explored to get rid of code duplication. I have attached below the profile tracing of a replicated Net Module. https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/anshulsi_270fcc36-194a-42f5-9841-cace984c2132_devgpu263.prn2.facebook.com_1792146.1753232748025155780.pt.trace.json Test Case 1. pytest test/distributed/_composable/test_replicate_with_fsdp.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/158207 Approved by: https://github.com/weifengpy Co-authored-by: Anshul Sinha <50644008+sinhaanshul@users.noreply.github.com>	2025-07-23 22:53:06 +00:00
Guilherme Leobas	0204099762	Raise exception in Dynamo if op fails in the interpreter (#158661 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158661 Approved by: https://github.com/williamwen42 ghstack dependencies: #158660	2025-07-23 22:31:51 +00:00
Guilherme Leobas	b67f97c166	Correctly handle `OP_CONTAINS` (#158660 ) CPython can fall back to `__iter__` if the object doesn't implement `__contains__` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158660 Approved by: https://github.com/zou3519	2025-07-23 22:31:51 +00:00
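The CPython behavior referenced above can be seen in plain Python, independent of Dynamo. In this minimal sketch the class names `OnlyIter` and `WithContains` are made up for illustration: the `in` operator uses `__contains__` when it exists and otherwise iterates via `__iter__`, comparing each element.

```python
class OnlyIter:
    """Defines __iter__ but not __contains__ (hypothetical example class)."""

    def __init__(self, items):
        self._items = list(items)
        self.iter_calls = 0

    def __iter__(self):
        self.iter_calls += 1
        return iter(self._items)


class WithContains:
    """Defines __contains__, so `in` never needs to iterate."""

    def __contains__(self, item):
        return item == "special"


box = OnlyIter([1, 2, 3])
found = 2 in box        # falls back to iterating over box
missing = 5 in box      # iterates to exhaustion, then gives False

direct = "special" in WithContains()
```

A tracer that models `in` only through `__contains__` would miss the first case, which appears to be the gap this commit closes.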
Mikayla Gawarecki	7f649ed4f8	Add basic torch.hash_tensor op (#154149 ) Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor. - The hash is always uint64. - Integers will be cast to uint64 before performing the xor_sum reduction - Floats will be upcast to double and then bitcast to uint64 before performing the xor_sum reduction Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149 Approved by: https://github.com/albanD	2025-07-23 22:28:03 +00:00
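The reduction described above can be approximated in plain Python to make the casting rules concrete. This is a rough sketch, not the `torch.hash_tensor` implementation, and it is not guaranteed to match it bit for bit; `to_uint64_bits` and `xor_hash_sketch` are hypothetical names, and integers are assumed to wrap to 64 bits in two's complement.

```python
import struct
from functools import reduce

MASK64 = 0xFFFFFFFFFFFFFFFF


def to_uint64_bits(x):
    """Map one element to a 64-bit unsigned pattern (illustrative only)."""
    if isinstance(x, float):
        # Reinterpret the IEEE 754 double's bytes as an unsigned 64-bit integer.
        return struct.unpack("<Q", struct.pack("<d", x))[0]
    # Assumption: integers wrap to uint64 in two's complement.
    return int(x) & MASK64


def xor_hash_sketch(values):
    """XOR all 64-bit patterns together; the result always fits in uint64."""
    return reduce(lambda acc, x: acc ^ to_uint64_bits(x), values, 0)


int_hash = xor_hash_sketch([1, 2, 3])
float_hash = xor_hash_sketch([1.0])
```

Because xor is commutative and self-inverse, a reduction of this kind does not depend on element order, and equal elements cancel in pairs.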
Max Ren	86df3ff1f1	fix xnnpack build on mac (#158881 ) Summary: Fix a bug for not getting the correct sources Test Plan: CI on my mac: ``` buck2 build @//fbobjc/mode/profile --show-full-output //xplat/executorch/examples/portable/executor_runner:executor_runner_opt File changed: fbsource//xplat/caffe2/third_party/xnnpack.buck.bzl Buck UI: https://www.internalfb.com/buck2/67b59179-4de8-462a-9202-0b9c34a35aef Network: Up: 2.4MiB Down: 1.3KiB (reSessionID-f687a7cd-5961-4851-bc67-b07043baa52a) Loading targets. Remaining 0/1 504 targets declared Analyzing targets. Remaining 0/42 1960 actions, 2424 artifacts declared Executing actions. Remaining 0/975 37.2s exec time total Command: build. Finished 40 local Time elapsed: 7.7s BUILD SUCCEEDED fbsource//xplat/executorch/examples/portable/executor_runner:executor_runner_opt /Users/maxren/fbsource/buck-out/v2/gen/fbsource/267ffdee31edf15e/xplat/executorch/examples/portable/executor_runner/__executor_runner_opt__/executor_runner_opt ``` Rollback Plan: Reviewed By: swolchok Differential Revision: D78771697 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158881 Approved by: https://github.com/digantdesai	2025-07-23 22:06:27 +00:00
fduwjj	82f8e04f27	Update distributed maintainers (#158900 ) I maintain a couple of components of distributed, like devicemesh, c10d, PGNCCL, gloo, etc. Can I be marked as not emeritus? Thanks! Pull Request resolved: https://github.com/pytorch/pytorch/pull/158900 Approved by: https://github.com/albanD	2025-07-23 21:53:27 +00:00
saienduri	5619bf9971	Enable MI355X PyTorch CI testing. (#158889 ) This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes. - Rework aotriton cmake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION` as aotriton depends on hip. Hip loosely tracks the rocm major version, but the two are not actually synchronized as observed in the ROCm 7 alpha build. - Bump composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](`df6023e305`) for mi350 compatibility. - Extend the change docker permissions step to the MI355x runners as well. This step is included to apply the required permission change to the test folder for a successful upload of artifacts in k8s docker. - Create new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST. - Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: `ca7d5fae11 (rocm-mi300)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158889 Approved by: https://github.com/jeffdaily	2025-07-23 21:50:31 +00:00
zpcore	d8425e9c75	[1/N] support of replication fallback strategy (#158046 ) #### 1. Provide a default fallback strategy that can apply to an arbitrary operator whose output is a single tensor. We can call register_op_strategy to register using the `fallback_op_strategy`: - For an op without List[Tensor] as input, call: ``` register_op_strategy(op_overload)(replicate_op_strategy) ``` - For an op that contains List[Tensor] as input, call: ``` register_op_strategy(op_overload, schema_info=RuntimeSchemaInfo(needs_pytree=True))(replicate_op_strategy) ``` The strategy will force all inputs and outputs to be replicated with the corresponding redistribute_cost. #### 2. Add a test function as a necessary condition for the strategy function. ``` detect_exists_identical_opspec(*args, op, mesh, strategy_function) ``` This function detects if identical strategies will be produced given the sample `args`. It will iterate over all combinations of placements for each arg and produce the output strategy from the registered `strategy_function`. #### 3. Provide a context manager `op_strategy_context` to easily register/unregister strategies for testing. E.g., ``` with op_strategy_context(test_op.default, replicate_op_strategy): ... ``` #### 4. Fix a bug where TupleStrategy never gets flattened as expected: `9df0176408/torch/distributed/tensor/_op_schema.py (L286)` Basically we need to 1) register_pytree_node for TupleStrategy, 2) propagate the schema_info to `strategy_schema` after `strategy_schema = _wrap_with_op_strategy(op_schema)`. This is the first implementation. Plan to add support to enable sharding on the batch dim as the output strategy next. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158046 Approved by: https://github.com/wanchaol, https://github.com/wconstab	2025-07-23 21:14:20 +00:00
fduwjj	633d5faf3f	[DeviceMesh] Enable slicing a submesh with warnings (#158899 ) We don't create new PGs when slicing in DeviceMesh, so it is relatively safe to relax the requirement that one can only slice from the root mesh. But this does come with a caveat when the slicing is asymmetric, for example when only some ranks have the sliced-out submesh. So aside from removing the requirement we also add a warning here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158899 Approved by: https://github.com/wz337	2025-07-23 21:13:41 +00:00
Sidharth	4d5d56a30e	[dynamo] lintrunner for gb_registry adds/updates (#158460 ) This PR adds automation to adding/updating the JSON registry through the lintrunner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158460 Approved by: https://github.com/williamwen42	2025-07-23 21:02:54 +00:00
Xuehai Pan	64e8d7d66b	[BE] bump test dependency `z3-solver` to drop using deprecated `pkg_resources` (#158905 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158905 Approved by: https://github.com/albanD, https://github.com/ezyang ghstack dependencies: #158904	2025-07-23 21:01:02 +00:00
Xuehai Pan	b935ad17d5	[BE][Easy] add missing Python 3.14 PyPI classifier (#158904 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158904 Approved by: https://github.com/albanD	2025-07-23 21:01:02 +00:00
henrylhtsang	f7f550649f	[cutlass backend] Change default inst level mm config number (#158901 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158901 Approved by: https://github.com/ColinPeppler, https://github.com/jingsh, https://github.com/Skylion007	2025-07-23 20:53:22 +00:00
PaliC	255c0545e7	[BE] Modify PyObjectSlot to assume only a single interpreter is in use (#158407 ) This PR makes some less risky changes to PyObjectSlot as there is a lot of stuff we do not need since there is only one interpreter. Specifically, `check_interpreter` and `has_pyobj_nonhermetic` are removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407 Approved by: https://github.com/albanD ghstack dependencies: #158288, #158290, #158291	2025-07-23 20:27:28 +00:00
PaliC	9c68c4d08f	[BE] Remove __reduce_deploy__ (#158291 ) This PR removes the integration point torch.fx had with torch::deploy (and another minor change). Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291 Approved by: https://github.com/albanD ghstack dependencies: #158288, #158290	2025-07-23 20:27:28 +00:00
PaliC	6ed2cb6ccd	[BE] Remove torch deploy \| remove torch deploy specific files (#158290 ) This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290 Approved by: https://github.com/albanD ghstack dependencies: #158288	2025-07-23 20:27:28 +00:00
PaliC	ab26d4fbeb	[BE] remove torch deploy - conditionals (#158288 ) This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started. 1. Remove test_deploy_interaction as we no longer need to worry about this 2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1) 3. Remove `USE_DEPLOY` and switch to the default path always Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288 Approved by: https://github.com/albanD	2025-07-23 20:27:28 +00:00
Denghui Dong	da94023b02	[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446 ) Hi team, Please help review this patch. This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable. I found the root cause is not that we cannot get C function frames by `PyFrame_GetBack` when PythonTracer is filling start frames, but the c call event loss problem bug on Python 3.12.0-3.12.4. And that problem was fixed by `257c413cd1` on 3.12.5. So I think the https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, this patch reverts the change of it. There are solutions to fix the problem correctly, such as we can add a new monitoring callback to compensate call events of methods with C function or we can override the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain. ~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446 Approved by: https://github.com/sraikund16, https://github.com/cyyever	2025-07-23 20:03:52 +00:00
Nichols A. Romero	c996aff6ed	[ROCm] UT verifies a runtime error is raised if tensor.item() is captured in a cudagraph (#158878 ) Unit test for this PR: https://github.com/pytorch/pytorch/pull/158165 This unit test verifies that a runtime error is raised when tensor.item() operation is captured in a cudagraph. Equally valid for ROCm and CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158878 Approved by: https://github.com/jeffdaily, https://github.com/ngimel	2025-07-23 20:01:50 +00:00
drisspg	691736ae07	Add kernel options to flex docs (#158875 ) Fixes https://github.com/pytorch/pytorch/issues/158741 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158875 Approved by: https://github.com/BoyuanFeng, https://github.com/albanD	2025-07-23 19:05:19 +00:00
PaulZhang12	fe8f556006	Fix Triton GEMM templates with k=1 (#158650 ) Thanks to @davidberard98 for much of the analysis here. For GEMMs of K=1, the hints `tl.multiple_of` and `tl.max_contiguous` apply completely, as the indices to the loads are only dependent on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, because for all BLOCK_M and BLOCK_N sizes the last tile is not a contiguous load. In the K > 1 case, the hint is not as strict given the dependency on the k indices for the load as well. In the K=1 case, only `offs_m` and `offs_n` are used and broadcast to the index shape. One can say these hints are "wrong", but in various cases where the hints are wrong, such as with the shape `9999x4, 4x9999`, there is a substantial performance improvement with the hint. For nice shapes with K=1, where M and N are multiples of 8 so that these hints are fine and there is no misaligned address, there is no performance regression observed on H100: <img width="547" height="402" alt="Screenshot 2025-07-18 at 5 05 47 PM" src="https://github.com/user-attachments/assets/fee2bbaa-784c-422e-bb8c-43c6c2607ad2" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158650 Approved by: https://github.com/davidberard98	2025-07-23 18:45:51 +00:00
Shangdi Yu	85ee2fb8c5	[Torch Native] Add test for packaging weight (#158750 ) Add a test that requires weights to be packaged for torch native. For now, we need `package_weights_in_so=True` for compile standalone. The constants are in a `.o` file and will be added as a source to the CMakeLists.txt of the model. After we add weight deduping, we should be able to let this config be False. ``` python test/inductor/test_aot_inductor_package.py -k test_compile_with_exporter_weights ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158750 Approved by: https://github.com/desertfire	2025-07-23 18:36:10 +00:00
Mikayla Gawarecki	fef236da69	Add zero_() and empty_like(t) to torch/csrc/stable/ops.h (#158866 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158866 Approved by: https://github.com/janeyx99	2025-07-23 18:31:05 +00:00
PyTorch MergeBot	76be282e3a	Revert "[Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847 )" This reverts commit d898d0d437bfdc0719e6c69d5005606c5e64fca8. Reverted https://github.com/pytorch/pytorch/pull/158847 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI jobs on MI200 and MI300 ([comment](https://github.com/pytorch/pytorch/pull/158847#issuecomment-3109664713))	2025-07-23 18:25:46 +00:00
PaulZhang12	9905ed616a	[Inductor] Expose decomposeK knobs as envvars (#158745 ) Fix up decomposeK autotuning, by removing condition to return more than `k_splits_limit` and setting default to 10 instead of 5. Allow `k_splits_limit` to be configurable to the user via `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS` and also allow user to configure threshold in which to use decompose_k via `TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158745 Approved by: https://github.com/eellison	2025-07-23 18:23:44 +00:00
PyTorch MergeBot	30b0ad5c68	Revert "Fix decorators skipping NCCL tests (#158846 )" This reverts commit 57024913c409764f129d6a7792625f5b05462e31. Reverted https://github.com/pytorch/pytorch/pull/158846 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking trunk. See distributed/_composable/fsdp/test_fully_shard_logging.py::LoggingTests::test_fsdp_logging [GH job link](https://github.com/pytorch/pytorch/actions/runs/16472103496/job/46564570609) [HUD commit link](`57024913c4`) ([comment](https://github.com/pytorch/pytorch/pull/158846#issuecomment-3109553414))	2025-07-23 17:47:35 +00:00
PyTorch MergeBot	41b6cdaf76	Revert "Fix Triton GEMM templates with k=1 (#158650 )" This reverts commit 9df0f565972a8a034fd77d65aff2c53e6e9856d1. Reverted https://github.com/pytorch/pytorch/pull/158650 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78805560 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158650#issuecomment-3109538827))	2025-07-23 17:42:10 +00:00
Animesh Jain	1b456c580d	[dynamo][guards] Add type info of the guarded value in guard managers (#158765 ) tlparse looks like this <img width="1165" height="226" alt="image" src="https://github.com/user-attachments/assets/04c4e6b1-34a3-4d9d-8304-6eb6d9a94980" /> This will aid in reading guards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158765 Approved by: https://github.com/Lucaskabela, https://github.com/StrongerXi	2025-07-23 16:59:15 +00:00
Xu Han	5e386eec94	[AOTI] enable aot inductor on Windows (#158915 ) With many PRs landed, we can run the first aot inductor example on Windows. <img width="640" height="427" alt="image" src="https://github.com/user-attachments/assets/131db159-ce17-4857-a3d5-a4b03638f01d" /> Let's remove the Windows check on `AotCodeCompiler`. CC: @angelayi , @desertfire , @jansel Pull Request resolved: https://github.com/pytorch/pytorch/pull/158915 Approved by: https://github.com/desertfire	2025-07-23 16:29:15 +00:00
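
Commit 9905ed616a in the list above makes the decomposeK limits configurable through `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS` and `TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD`. A minimal sketch of how an integer knob of this kind can be read with a fallback default follows. `read_int_env` is a hypothetical helper, not Inductor's actual config code; only the variable name and the default of 10 come from the commit message.

```python
import os


def read_int_env(name: str, default: int) -> int:
    """Return the integer value of environment variable `name`, or `default` if unset."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)


# The default of 10 is taken from the commit message; the helper itself is illustrative.
num_decompose_k_splits = read_int_env("TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS", 10)
print(num_decompose_k_splits)
```

Setting the variable in the shell before launching the process is then enough to override the default without editing any Python config.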

1506 changed files with 60667 additions and 67981 deletions

									
										16

.ci/aarch64_linux/build_aarch64_wheel.py
									
												
				@ -438,9 +438,7 @@ def build_torchvision(

				        )

				        build_vars += f"BUILD_VERSION={version}.dev{build_date}"

				    elif build_version is not None:

				        build_vars += (

				            f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"

				        )

				        build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"

				    if host.using_docker():

				        build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

				@ -495,9 +493,7 @@ def build_torchdata(

				        )

				        build_vars += f"BUILD_VERSION={version}.dev{build_date}"

				    elif build_version is not None:

				        build_vars += (

				            f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"

				        )

				        build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"

				    if host.using_docker():

				        build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

				@ -553,9 +549,7 @@ def build_torchtext(

				        )

				        build_vars += f"BUILD_VERSION={version}.dev{build_date}"

				    elif build_version is not None:

				        build_vars += (

				            f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"

				        )

				        build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"

				    if host.using_docker():

				        build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

				@ -613,9 +607,7 @@ def build_torchaudio(

				        )

				        build_vars += f"BUILD_VERSION={version}.dev{build_date}"

				    elif build_version is not None:

				        build_vars += (

				            f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"

				        )

				        build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"

				    if host.using_docker():

				        build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

									
										1

.ci/docker/README.md
									
												
				@ -104,7 +104,6 @@ If your new Docker image needs a library installed from a specific pinned commit

				   ```bash

				   pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-new1)

				     CUDA_VERSION=12.8.1

				     CUDNN_VERSION=9

				     ANACONDA_PYTHON_VERSION=3.12

				     GCC_VERSION=11

				     VISION=yes

									
										78

.ci/docker/build.sh
									
												
				@ -93,7 +93,6 @@ tag=$(echo $image | awk -F':' '{print $2}')

				case "$tag" in

				  pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11)

				    CUDA_VERSION=12.4

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=11

				    VISION=yes

				@ -104,7 +103,6 @@ case "$tag" in

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11)

				    CUDA_VERSION=12.8.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=11

				    VISION=yes

				@ -115,7 +113,6 @@ case "$tag" in

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.8.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    VISION=yes

				@ -127,7 +124,6 @@ case "$tag" in

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.8.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    VISION=yes

				@ -139,7 +135,6 @@ case "$tag" in

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.8.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.13

				    GCC_VERSION=9

				    VISION=yes

				@ -149,20 +144,8 @@ case "$tag" in

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.6.3

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm)

				    CUDA_VERSION=12.8.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=11

				    VISION=yes

				@ -171,45 +154,8 @@ case "$tag" in

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.6

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.6

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.6

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.13

				    GCC_VERSION=9

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.8.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    VISION=yes

				@ -230,19 +176,7 @@ case "$tag" in

				    VISION=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-py3.11-clang12)

				    ANACONDA_PYTHON_VERSION=3.11

				    CLANG_VERSION=12

				    VISION=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-py3.9-gcc9)

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=9

				    VISION=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-noble-rocm-n-py3)

				  pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-jammy-rocm-n-py3-benchmarks | pytorch-linux-noble-rocm-n-py3)

				    if [[ $tag =~ "jammy" ]]; then

				      ANACONDA_PYTHON_VERSION=3.10

				    else

				@ -256,7 +190,9 @@ case "$tag" in

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    INDUCTOR_BENCHMARKS=yes

				    if [[ $tag =~ "benchmarks" ]]; then

				      INDUCTOR_BENCHMARKS=yes

				    fi

				    ;;

				  pytorch-linux-noble-rocm-alpha-py3)

				    ANACONDA_PYTHON_VERSION=3.12

				@ -268,7 +204,6 @@ case "$tag" in

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    INDUCTOR_BENCHMARKS=yes

				    PYTORCH_ROCM_ARCH="gfx90a;gfx942;gfx950"

				    ;;

				  pytorch-linux-jammy-xpu-2025.0-py3)

				@ -299,7 +234,6 @@ case "$tag" in

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12)

				    ANACONDA_PYTHON_VERSION=3.9

				    CUDA_VERSION=12.8.1

				    CUDNN_VERSION=9

				    CLANG_VERSION=12

				    VISION=yes

				    TRITON=yes

				@ -378,7 +312,6 @@ case "$tag" in

				    fi

				    if [[ "$image" == *cuda* ]]; then

				      extract_version_from_image_name cuda CUDA_VERSION

				      extract_version_from_image_name cudnn CUDNN_VERSION

				    fi

				    if [[ "$image" == *rocm* ]]; then

				      extract_version_from_image_name rocm ROCM_VERSION

				@ -430,9 +363,6 @@ docker build \

				       --build-arg "PYTHON_VERSION=${PYTHON_VERSION}" \

				       --build-arg "GCC_VERSION=${GCC_VERSION}" \

				       --build-arg "CUDA_VERSION=${CUDA_VERSION}" \

				       --build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \

				       --build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \

				       --build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \

				       --build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \

				       --build-arg "KATEX=${KATEX:-}" \

				       --build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \

2

.ci/docker/ci_commit_pins/huggingface.txt


 @ -1 +1 @@
 e186efbf7fb93328dd6b34927a4e8c8f24395
 v4.54.0

0

.github/ci_commit_pins/torchbench.txt → .ci/docker/ci_commit_pins/torchbench.txt


2

.ci/docker/ci_commit_pins/triton.txt


 @ -1 +1 @@
 ec6354315768a85da41032535e3b7b99c5f706
 f7888497a1eb9e98d4c07537f0d0bcfe180d1363

									
										5

.ci/docker/common/install_cpython.sh
									
												
				@ -66,8 +66,9 @@ function do_cpython_build {

				        ln -s pip3 ${prefix}/bin/pip

				    fi

				    # install setuptools since python 3.12 is required to use distutils

				    ${prefix}/bin/pip install wheel==0.45.1 setuptools==80.9.0

				    local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")

				    # packaging is needed to create symlink since wheel no longer provides needed information

				    ${prefix}/bin/pip install packaging==25.0 wheel==0.45.1 setuptools==80.9.0

				    local abi_tag=$(${prefix}/bin/python -c "from packaging.tags import interpreter_name, interpreter_version; import sysconfig ; from sysconfig import get_config_var; print('{0}{1}-{0}{1}{2}'.format(interpreter_name(), interpreter_version(), 't' if sysconfig.get_config_var('Py_GIL_DISABLED') else ''))")

				    ln -sf ${prefix} /opt/python/${abi_tag}

				}
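
The `packaging`-based one-liner in the hunk above builds a tag of the form `<impl><ver>-<impl><ver>[t]`, where the trailing `t` marks a free-threaded build. The sketch below reproduces that shape with the standard library only, so it swaps `packaging.tags.interpreter_name()` and `interpreter_version()` for `sys.implementation` and `sys.version_info`; the function name `abi_symlink_tag` is hypothetical.

```python
import sys
import sysconfig


def abi_symlink_tag() -> str:
    """Build a tag such as 'cp312-cp312', or 'cp313-cp313t' on free-threaded builds."""
    # Assumption: CPython maps to 'cp'; other interpreters fall back to their full name.
    name = "cp" if sys.implementation.name == "cpython" else sys.implementation.name
    version = f"{sys.version_info.major}{sys.version_info.minor}"
    # Py_GIL_DISABLED is truthy only on free-threaded builds (0 or None otherwise).
    suffix = "t" if sysconfig.get_config_var("Py_GIL_DISABLED") else ""
    return f"{name}{version}-{name}{version}{suffix}"


print(abi_symlink_tag())
```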

									
										4

.ci/docker/common/install_cuda.sh
									
												
				@ -68,8 +68,8 @@ function install_nvshmem {

				  # download, unpack, install

				  wget -q "${url}"

				  tar xf "${filename}.tar.gz"

				  cp -a "libnvshmem/include/"* /usr/local/include/

				  cp -a "libnvshmem/lib/"*     /usr/local/lib/

				  cp -a "libnvshmem/include/"* /usr/local/cuda/include/

				  cp -a "libnvshmem/lib/"*     /usr/local/cuda/lib64/

				  # cleanup

				  cd ..

									
										26

.ci/docker/common/install_cudnn.sh
									
											
				@ -1,26 +0,0 @@

				#!/bin/bash

				if [[ -n "${CUDNN_VERSION}" ]]; then

				    # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				    mkdir tmp_cudnn

				    pushd tmp_cudnn

				    if [[ ${CUDA_VERSION:0:4} == "12.9" || ${CUDA_VERSION:0:4} == "12.8" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"

				    elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"

				    elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"

				    elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"

				    else

				        print "Unsupported CUDA version ${CUDA_VERSION}"

				        exit 1

				    fi

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz

				    tar xf ${CUDNN_NAME}.tar.xz

				    cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/

				    cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/

				    popd

				    rm -rf tmp_cudnn

				    ldconfig

				fi

									
										30

.ci/docker/common/install_inductor_benchmark_deps.sh
									
												
				@ -15,11 +15,37 @@ function install_timm() {

				  commit=$(get_pinned_commit timm)

				  pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"

				  # Clean up

				  conda_run pip uninstall -y torch torchvision triton

				}

				function install_torchbench() {

				  local commit

				  commit=$(get_pinned_commit torchbench)

				  git clone https://github.com/pytorch/benchmark torchbench

				  pushd torchbench

				  git checkout "$commit"

				  python install.py --continue_on_fail

				  # soxr comes from https://github.com/huggingface/transformers/pull/39429

				  pip install transformers==4.54.0 soxr==0.5.0

				  echo "Print all dependencies after TorchBench is installed"

				  python -mpip freeze

				  popd

				  chown -R jenkins torchbench

				  chown -R jenkins /opt/conda

				}

				# Pango is needed for weasyprint which is needed for doctr

				conda_install pango

				# Stable packages are ok here, just to satisfy TorchBench check

				pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

				install_torchbench

				install_huggingface

				install_timm

				# Clean up

				conda_run pip uninstall -y torch torchvision torchaudio triton torchao

									
										15

.ci/docker/common/install_rocm.sh
									
												
				@ -30,7 +30,7 @@ EOF

				    # we want the patch version of 6.4 instead

				    if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then

				        ROCM_VERSION="${ROCM_VERSION}.1"

				        ROCM_VERSION="${ROCM_VERSION}.2"

				    fi

				    # Default url values

				@ -85,16 +85,19 @@ EOF

				    # CI no longer builds for ROCm 6.3, but

				    # ROCm 6.4 did not yet fix the regression, also HIP branch names are different

				    if [[ $(ver $ROCM_VERSION) -ge $(ver 6.4) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then

				        if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then

				            HIP_BRANCH=release/rocm-rel-6.4

				            CLR_HASH=606bc820b4b1f315d135da02a1f0b176ca50a92c  # branch release/rocm-rel-6.4.1-statco-hotfix

				        if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.2) ]]; then

				            HIP_TAG=rocm-6.4.2

				            CLR_HASH=74d78ba3ac4bac235d02bcb48511c30b5cfdd457  # branch release/rocm-rel-6.4.2-statco-hotfix

				        elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then

				            HIP_TAG=rocm-6.4.1

				            CLR_HASH=efe6c35790b9206923bfeed1209902feff37f386  # branch release/rocm-rel-6.4.1-statco-hotfix

				        elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then

				            HIP_BRANCH=release/rocm-rel-6.4

				            HIP_TAG=rocm-6.4.0

				            CLR_HASH=600f5b0d2baed94d5121e2174a9de0851b040b0c  # branch release/rocm-rel-6.4-statco-hotfix

				        fi

				        # clr build needs CppHeaderParser but can only find it using conda's python

				        python -m pip install CppHeaderParser

				        git clone https://github.com/ROCm/HIP -b $HIP_BRANCH

				        git clone https://github.com/ROCm/HIP -b $HIP_TAG

				        HIP_COMMON_DIR=$(readlink -f HIP)

				        git clone https://github.com/jeffdaily/clr

				        pushd clr

									
										41

.ci/docker/common/install_xpu.sh
									
												
				@ -34,18 +34,27 @@ function install_ubuntu() {

				    # The xpu-smi packages

				    apt-get install -y flex bison xpu-smi

				    # Compute and Media Runtimes

				    apt-get install -y \

				        intel-opencl-icd intel-level-zero-gpu level-zero \

				        intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \

				        libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \

				        libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \

				        mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo

				    if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then

				        apt-get install -y intel-ocloc

				    if [[ "${XPU_DRIVER_TYPE,,}" == "lts" ]]; then

				        # Compute and Media Runtimes

				        apt-get install -y \

				            intel-opencl-icd intel-level-zero-gpu level-zero \

				            intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \

				            libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \

				            libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \

				            mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo

				        # Development Packages

				        apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev

				    else # rolling driver

				        apt-get install -y \

				            intel-opencl-icd libze-intel-gpu1 libze1 \

				            intel-media-va-driver-non-free libmfx-gen1 libvpl2 \

				            libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \

				            libglapi-mesa libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \

				            mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo intel-ocloc

				        apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev libze-dev

				    fi

				    # Development Packages

				    apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev

				    # Install Intel Support Packages

				    apt-get install -y ${XPU_PACKAGES}

				@ -130,11 +139,11 @@ function install_sles() {

				}

				# Default use GPU driver LTS releases

				XPU_DRIVER_VERSION="/lts/2350"

				if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then

				    # Use GPU driver rolling releases

				    XPU_DRIVER_VERSION=""

				# Default use GPU driver rolling releases

				XPU_DRIVER_VERSION=""

				if [[ "${XPU_DRIVER_TYPE,,}" == "lts" ]]; then

				    # Use GPU driver LTS releases

				    XPU_DRIVER_VERSION="/lts/2350"

				fi

				# Default use Intel® oneAPI Deep Learning Essentials 2025.0

									
										2

.ci/docker/libtorch/build.sh
									
												
				@ -41,7 +41,7 @@ case ${DOCKER_TAG_PREFIX} in

				    rocm*)

				        # we want the patch version of 6.4 instead

				        if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then

				            GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.1"

				            GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"

				        fi

				        BASE_TARGET=rocm

				        GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete

									
										2

.ci/docker/manywheel/build.sh
									
												
				@ -77,7 +77,7 @@ case ${image} in

				    manylinux2_28-builder:rocm*)

				        # we want the patch version of 6.4 instead

				        if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then

				            GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.1"

				            GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"

				        fi

				        TARGET=rocm_final

				        MANY_LINUX_VERSION="2_28"

25

.ci/docker/requirements-ci.txt


 @ -50,7 +50,7 @@ flatbuffers==24.12.23
 hypothesis==5.35.1
 # Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
 #Description: advanced library for generating parametrized tests
 #Pinned versions: 3.44.6, 4.53.2
 #Pinned versions: 5.35.1
 #test that import: test_xnnpack_integration.py, test_pruning_op.py, test_nn.py
 junitparser==2.1.1
 @ -63,11 +63,12 @@ lark==0.12.0
 #Pinned versions: 0.12.0
 #test that import:
 librosa>=0.6.2 ; python_version < "3.11"
 librosa==0.10.2 ; python_version == "3.12"
 librosa>=0.6.2 ; python_version < "3.11" and platform_machine != "s390x"
 librosa==0.10.2 ; python_version == "3.12" and platform_machine != "s390x"
 #Description: A python package for music and audio analysis
 #Pinned versions: >=0.6.2
 #test that import: test_spectral_ops.py
 #librosa depends on numba; disable it for s390x while numba is disabled too
 #mkl #this breaks linux-bionic-rocm4.5-py3.7
 #Description: Intel oneAPI Math Kernel Library
 @ -110,14 +111,15 @@ ninja==1.11.1.3
 #Pinned versions: 1.11.1.3
 #test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
 numba==0.49.0 ; python_version < "3.9"
 numba==0.55.2 ; python_version == "3.9"
 numba==0.55.2 ; python_version == "3.10"
 numba==0.60.0 ; python_version == "3.12"
 numba==0.49.0 ; python_version < "3.9" and platform_machine != "s390x"
 numba==0.55.2 ; python_version == "3.9" and platform_machine != "s390x"
 numba==0.55.2 ; python_version == "3.10" and platform_machine != "s390x"
 numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
 #Description: Just-In-Time Compiler for Numerical Functions
 #Pinned versions: 0.54.1, 0.49.0, <=0.49.1
 #test that import: test_numba_integration.py
 #For numba issue see https://github.com/pytorch/pytorch/issues/51511
 #Need release > 0.61.2 for s390x due to https://github.com/numba/numba/pull/10073
 #numpy
 #Description: Provides N-dimensional arrays and linear algebra
 @ -221,9 +223,9 @@ pygments==2.15.0
 #Pinned versions: 2.12.0
 #test that import: the doctests
 #PyYAML
 #pyyaml
 #Description: data serialization format
 #Pinned versions:
 #Pinned versions: 6.0.2
 #test that import:
 #requests
 @ -233,7 +235,7 @@ pygments==2.15.0
 #rich
 #Description: rich text and beautiful formatting in the terminal
 #Pinned versions: 10.9.0
 #Pinned versions: 14.1.0
 #test that import:
 scikit-image==0.19.3 ; python_version < "3.10"
 @ -307,7 +309,7 @@ pytest-cpp==2.3.0
 #Pinned versions: 2.3.0
 #test that import:
 z3-solver==4.12.6.0
 z3-solver==4.15.1.0 ; platform_machine != "s390x"
 #Description: The Z3 Theorem Prover Project
 #Pinned versions:
 #test that import:
 @ -361,7 +363,6 @@ pwlf==2.2.1
 #Pinned versions: 2.2.1
 #test that import: test_sac_estimator.py
 # To build PyTorch itself
 pyyaml
 pyzstd

7

.ci/docker/requirements-docs.txt


 @ -1,7 +1,7 @@
 sphinx==5.3.0
 #Description: This is used to generate PyTorch docs
 #Pinned versions: 5.3.0
 -e git+https://github.com/pytorch/pytorch_sphinx_theme.git@pytorch_sphinx_theme2#egg=pytorch_sphinx_theme2
 -e git+https://github.com/pytorch/pytorch_sphinx_theme.git@722b7e6f9ca512fcc526ad07d62b3d28c50bb6cd#egg=pytorch_sphinx_theme2
 # TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
 # but it doesn't seem to work and hangs around idly. The initial thought that it is probably
 @ -50,8 +50,8 @@ IPython==8.12.0
 #Pinned versions: 8.12.0
 myst-nb==0.17.2
 #Description: This is used to generate PyTorch functorch docs
 #Pinned versions: 0.13.2
 #Description: This is used to generate PyTorch functorch and torch.compile docs.
 #Pinned versions: 0.17.2
 # The following are required to build torch.distributed.elastic.rendezvous.etcd* docs
 python-etcd==0.4.5
 @ -59,4 +59,3 @@ sphinx-copybutton==0.5.0
 sphinx-design==0.4.0
 sphinxcontrib-mermaid==1.0.0
 myst-parser==0.18.1
 myst-nb

									
										3

.ci/docker/ubuntu-rocm/Dockerfile
									
												
				@ -98,8 +98,9 @@ COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/huggingface.txt huggingface.txt

				COPY ci_commit_pins/timm.txt timm.txt

				COPY ci_commit_pins/torchbench.txt torchbench.txt

				RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt torchbench.txt

				# (optional) Install non-default Ninja version

				ARG NINJA_VERSION

									
										3

.ci/docker/ubuntu/Dockerfile
									
												
				@ -98,8 +98,9 @@ COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/huggingface.txt huggingface.txt

				COPY ci_commit_pins/timm.txt timm.txt

				COPY ci_commit_pins/torchbench.txt torchbench.txt

				RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt torchbench.txt

				ARG TRITON

				ARG TRITON_CPU

									
										33

.ci/manywheel/build_common.sh
									
												
				@ -138,28 +138,11 @@ fi

				echo "Calling setup.py bdist at $(date)"

				if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"

				    time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \

				    BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 \

				time CMAKE_ARGS=${CMAKE_ARGS[@]} \

				    EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \

				    BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \

				    USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \

				    python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR

				    echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"

				    echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"

				    time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \

				    BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 \

				    BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \

				    USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \

				    CMAKE_FRESH=1 python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR

				    echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"

				else

				    time CMAKE_ARGS=${CMAKE_ARGS[@]} \

				        EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \

				        BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \

				        USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \

				        python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR

				fi

				echo "Finished setup.py bdist at $(date)"

				# Build libtorch packages

				@ -272,10 +255,6 @@ ls /tmp/$WHEELHOUSE_DIR

				mkdir -p "/$WHEELHOUSE_DIR"

				mv /tmp/$WHEELHOUSE_DIR/torch*linux*.whl /$WHEELHOUSE_DIR/

				if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    mv /tmp/$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/ || true

				fi

				if [[ -n "$BUILD_PYTHONLESS" ]]; then

				    mkdir -p /$LIBTORCH_HOUSE_DIR

				    mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR

				@ -452,16 +431,8 @@ if [[ -z "$BUILD_PYTHONLESS" ]]; then

				  pushd $PYTORCH_ROOT/test

				  # Install the wheel for this Python version

				  if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    pip uninstall -y "$TORCH_NO_PYTHON_PACKAGE_NAME" || true

				  fi

				  pip uninstall -y "$TORCH_PACKAGE_NAME"

				  if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    pip install "$TORCH_NO_PYTHON_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v

				  fi

				  pip install "$TORCH_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v

				  # Print info on the libraries installed in this wheel

									
										2

.ci/manywheel/build_rocm.sh
									
												
				@ -194,7 +194,7 @@ ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library

				ROCBLAS_LIB_DST=lib/rocblas/library

				ROCBLAS_ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)

				ROCBLAS_OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)

				ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $OTHER_FILES)

				ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $ROCBLAS_OTHER_FILES)

				# hipblaslt library files

				HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library

									
										34

.ci/pytorch/build-mobile.sh
									
											
				@ -1,34 +0,0 @@

				#!/usr/bin/env bash

				# DO NOT ADD 'set -x' not to reveal CircleCI secret context environment variables

				set -eu -o pipefail

				# This script uses linux host toolchain + mobile build options in order to

				# build & test mobile libtorch without having to setup Android/iOS

				# toolchain/simulator.

				# shellcheck source=./common.sh

				source "$(dirname "${BASH_SOURCE[0]}")/common.sh"

				# shellcheck source=./common-build.sh

				source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"

				# Install torch & torchvision - used to download & trace test model.

				# Ideally we should use the libtorch built on the PR so that backward

				# incompatible changes won't break this script - but it will significantly slow

				# down mobile CI jobs.

				# Here we install nightly instead of stable so that we have an option to

				# temporarily skip mobile CI jobs on BC-breaking PRs until they are in nightly.

				retry pip install --pre torch torchvision \

				  -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html \

				  --progress-bar off

				# Run end-to-end process of building mobile library, linking into the predictor

				# binary, and running forward pass with a real model.

				if [[ "$BUILD_ENVIRONMENT" == *-mobile-custom-build-static* ]]; then

				  TEST_CUSTOM_BUILD_STATIC=1 test/mobile/custom_build/build.sh

				elif [[ "$BUILD_ENVIRONMENT" == *-mobile-lightweight-dispatch* ]]; then

				  test/mobile/lightweight_dispatch/build.sh

				else

				  TEST_DEFAULT_BUILD=1 test/mobile/custom_build/build.sh

				fi

				print_sccache_stats

									
										45

.ci/pytorch/build.sh
									
												
				@ -11,10 +11,6 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"

				# shellcheck source=./common-build.sh

				source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"

				if [[ "$BUILD_ENVIRONMENT" == *-mobile-*build* ]]; then

				  exec "$(dirname "${BASH_SOURCE[0]}")/build-mobile.sh" "$@"

				fi

				echo "Python version:"

				python --version

				@ -54,9 +50,6 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then

				  export ATEN_THREADING=NATIVE

				fi

				# Enable LLVM dependency for TensorExpr testing

				export USE_LLVM=/opt/llvm

				export LLVM_DIR=/opt/llvm/lib/cmake/llvm

				if ! which conda; then

				  # In ROCm CIs, we are doing cross compilation on build machines with

				@ -124,26 +117,8 @@ if [[ "$BUILD_ENVIRONMENT" == *libtorch* ]]; then

				fi

				# Use special scripts for Android builds

				if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then

				  export ANDROID_NDK=/opt/ndk

				  build_args=()

				  if [[ "${BUILD_ENVIRONMENT}" == *-arm-v7a* ]]; then

				    build_args+=("-DANDROID_ABI=armeabi-v7a")

				  elif [[ "${BUILD_ENVIRONMENT}" == *-arm-v8a* ]]; then

				    build_args+=("-DANDROID_ABI=arm64-v8a")

				  elif [[ "${BUILD_ENVIRONMENT}" == *-x86_32* ]]; then

				    build_args+=("-DANDROID_ABI=x86")

				  elif [[ "${BUILD_ENVIRONMENT}" == *-x86_64* ]]; then

				    build_args+=("-DANDROID_ABI=x86_64")

				  fi

				  if [[ "${BUILD_ENVIRONMENT}" == *vulkan* ]]; then

				    build_args+=("-DUSE_VULKAN=ON")

				  fi

				  build_args+=("-DUSE_LITE_INTERPRETER_PROFILER=OFF")

				  exec ./scripts/build_android.sh "${build_args[@]}" "$@"

				fi

				if [[ "$BUILD_ENVIRONMENT" != *android* && "$BUILD_ENVIRONMENT" == *vulkan* ]]; then

				if [[ "$BUILD_ENVIRONMENT" == *vulkan* ]]; then

				  export USE_VULKAN=1

				  # shellcheck disable=SC1091

				  source /var/lib/jenkins/vulkansdk/setup-env.sh

				@ -198,7 +173,7 @@ fi

				# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of

				# memory to build and will OOM

				if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]]; then

				if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && echo "${TORCH_CUDA_ARCH_LIST}" | tr ' ' '\n' | sed 's/$/>= 8.0/' | bc | grep -q 1; then

				  export BUILD_CUSTOM_STEP="ninja -C build flash_attention -j 2"

				fi

				@ -214,7 +189,6 @@ if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then

				  export USE_ASAN=1

				  export REL_WITH_DEB_INFO=1

				  export UBSAN_FLAGS="-fno-sanitize-recover=all"

				  unset USE_LLVM

				fi

				if [[ "${BUILD_ENVIRONMENT}" == *no-ops* ]]; then

				@ -225,7 +199,7 @@ if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then

				    export USE_PRECOMPILED_HEADERS=1

				fi

				if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then

				if [[ "${BUILD_ENVIRONMENT}" != *cuda* ]]; then

				  export BUILD_STATIC_RUNTIME_BENCHMARK=ON

				fi

				@ -287,22 +261,13 @@ else

				      WERROR=1 python setup.py clean

				      if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				        python3 tools/packaging/split_wheel.py bdist_wheel

				      else

				        WERROR=1 python setup.py bdist_wheel

				      fi

				      WERROR=1 python setup.py bdist_wheel

				    else

				      python setup.py clean

				      if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then

				        source .ci/pytorch/install_cache_xla.sh

				      fi

				      if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				        echo "USE_SPLIT_BUILD cannot be used with xla or rocm"

				        exit 1

				      else

				        python setup.py bdist_wheel

				      fi

				      python setup.py bdist_wheel

				    fi

				    pip_install_whl "$(echo dist/*.whl)"
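
The FlashAttention hunk in this file replaces a check that passed the whole `TORCH_CUDA_ARCH_LIST` string to `bc` with one that splits the list on spaces and tests each entry against 8.0. The old form would likely not evaluate as intended once the list holds more than one architecture. A minimal Python sketch of the same "any entry is at least 8.0" logic follows. The function name `needs_flash_attention_step` is hypothetical, and the tolerance for suffixes such as `+PTX` or a trailing `a` is an assumption that the shell version does not make.

```python
import re


def needs_flash_attention_step(arch_list: str, minimum: float = 8.0) -> bool:
    """Return True if any space-separated entry in arch_list is >= minimum.

    Mirrors the per-entry comparison done by the shell pipeline in the diff.
    Assumption (not in the shell version): suffixes such as "+PTX" or a
    trailing "a" are ignored by reading only the leading number.
    """
    for entry in arch_list.split():
        match = re.match(r"\d+(?:\.\d+)?", entry)
        if match is None:
            continue  # skip entries with no leading compute capability
        if float(match.group(0)) >= minimum:
            return True
    return False


multi_arch = needs_flash_attention_step("7.5 8.0 9.0")
old_only = needs_flash_attention_step("7.0 7.5")
```

Comparing capabilities as floats is adequate here only because the minor version is a single digit. A tuple of integers would be the safer choice if that ever changes.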

									
										28

.ci/pytorch/common_utils.sh
									
												
				@ -229,7 +229,6 @@ function install_torchrec_and_fbgemm() {

				    pip_install tabulate  # needed for newer fbgemm

				    pip_install patchelf  # needed for rocm fbgemm

				    pushd /tmp

				    local wheel_dir=dist/fbgemm_gpu

				    local found_whl=0

				@ -245,7 +244,7 @@ function install_torchrec_and_fbgemm() {

				    if [ "${found_whl}" == "0" ]; then

				      git clone --recursive https://github.com/pytorch/fbgemm

				      pushd fbgemm/fbgemm_gpu

				      git checkout "${fbgemm_commit}"

				      git checkout "${fbgemm_commit}" --recurse-submodules

				      python setup.py bdist_wheel \

				        --build-variant=rocm \

				        -DHIP_ROOT_DIR="${ROCM_PATH}" \

				@ -264,7 +263,6 @@ function install_torchrec_and_fbgemm() {

				    done

				    rm -rf fbgemm

				    popd

				  else

				    pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec

				    pip_build_and_install "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#subdirectory=fbgemm_gpu" dist/fbgemm_gpu

				@ -283,30 +281,6 @@ function clone_pytorch_xla() {

				  fi

				}

				function checkout_install_torchbench() {

				  local commit

				  commit=$(get_pinned_commit torchbench)

				  git clone https://github.com/pytorch/benchmark torchbench

				  pushd torchbench

				  git checkout "$commit"

				  if [ "$1" ]; then

				    python install.py --continue_on_fail models "$@"

				  else

				    # Occasionally the installation may fail on one model but it is ok to continue

				    # to install and test other models

				    python install.py --continue_on_fail

				  fi

				  # TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488

				  # is regressing speedup metric. This needs to be investigated further

				  pip install transformers==4.38.1

				  echo "Print all dependencies after TorchBench is installed"

				  python -mpip freeze

				  popd

				}

				function install_torchao() {

				  local commit

				  commit=$(get_pinned_commit torchao)

									
										123

.ci/pytorch/create_test_cert.py
									
											
				@ -1,123 +0,0 @@

				from datetime import datetime, timedelta, timezone

				from tempfile import mkdtemp

				from cryptography import x509

				from cryptography.hazmat.primitives import hashes, serialization

				from cryptography.hazmat.primitives.asymmetric import rsa

				from cryptography.x509.oid import NameOID

				temp_dir = mkdtemp()

				print(temp_dir)

				def genrsa(path):

				    key = rsa.generate_private_key(

				        public_exponent=65537,

				        key_size=2048,

				    )

				    with open(path, "wb") as f:

				        f.write(

				            key.private_bytes(

				                encoding=serialization.Encoding.PEM,

				                format=serialization.PrivateFormat.TraditionalOpenSSL,

				                encryption_algorithm=serialization.NoEncryption(),

				            )

				        )

				    return key

				def create_cert(path, C, ST, L, O, key):

				    subject = issuer = x509.Name(

				        [

				            x509.NameAttribute(NameOID.COUNTRY_NAME, C),

				            x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),

				            x509.NameAttribute(NameOID.LOCALITY_NAME, L),

				            x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),

				        ]

				    )

				    cert = (

				        x509.CertificateBuilder()

				        .subject_name(subject)

				        .issuer_name(issuer)

				        .public_key(key.public_key())

				        .serial_number(x509.random_serial_number())

				        .not_valid_before(datetime.now(timezone.utc))

				        .not_valid_after(

				            # Our certificate will be valid for 10 days

				            datetime.now(timezone.utc) + timedelta(days=10)

				        )

				        .add_extension(

				            x509.BasicConstraints(ca=True, path_length=None),

				            critical=True,

				        )

				        .sign(key, hashes.SHA256())

				    )

				    # Write our certificate out to disk.

				    with open(path, "wb") as f:

				        f.write(cert.public_bytes(serialization.Encoding.PEM))

				    return cert

				def create_req(path, C, ST, L, O, key):

				    csr = (

				        x509.CertificateSigningRequestBuilder()

				        .subject_name(

				            x509.Name(

				                [

				                    # Provide various details about who we are.

				                    x509.NameAttribute(NameOID.COUNTRY_NAME, C),

				                    x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),

				                    x509.NameAttribute(NameOID.LOCALITY_NAME, L),

				                    x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),

				                ]

				            )

				        )

				        .sign(key, hashes.SHA256())

				    )

				    with open(path, "wb") as f:

				        f.write(csr.public_bytes(serialization.Encoding.PEM))

				    return csr

				def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):

				    cert = (

				        x509.CertificateBuilder()

				        .subject_name(csr_cert.subject)

				        .issuer_name(ca_cert.subject)

				        .public_key(csr_cert.public_key())

				        .serial_number(x509.random_serial_number())

				        .not_valid_before(datetime.now(timezone.utc))

				        .not_valid_after(

				            # Our certificate will be valid for 10 days

				            datetime.now(timezone.utc) + timedelta(days=10)

				            # Sign our certificate with our private key

				        )

				        .sign(private_ca_key, hashes.SHA256())

				    )

				    with open(path, "wb") as f:

				        f.write(cert.public_bytes(serialization.Encoding.PEM))

				    return cert

				ca_key = genrsa(temp_dir + "/ca.key")

				ca_cert = create_cert(

				    temp_dir + "/ca.pem",

				    "US",

				    "New York",

				    "New York",

				    "Gloo Certificate Authority",

				    ca_key,

				)

				pkey = genrsa(temp_dir + "/pkey.key")

				csr = create_req(

				    temp_dir + "/csr.csr",

				    "US",

				    "California",

				    "San Francisco",

				    "Gloo Testing Company",

				    pkey,

				)

				cert = sign_certificate_request(temp_dir + "/cert.pem", csr, ca_cert, ca_key)

									
										28

.ci/pytorch/macos-test.sh
									
												
				@ -157,6 +157,32 @@ test_jit_hooks() {

				  assert_git_not_dirty

				}

				# Shellcheck doesn't like it when you pass no arguments to a function

				# that can take args. See https://www.shellcheck.net/wiki/SC2120

				# shellcheck disable=SC2120

				checkout_install_torchbench() {

				  local commit

				  commit=$(cat .ci/docker/ci_commit_pins/torchbench.txt)

				  git clone https://github.com/pytorch/benchmark torchbench

				  pushd torchbench

				  git checkout "$commit"

				  if [ "$1" ]; then

				    python install.py --continue_on_fail models "$@"

				  else

				    # Occasionally the installation may fail on one model but it is ok to continue

				    # to install and test other models

				    python install.py --continue_on_fail

				  fi

				  # soxr comes from https://github.com/huggingface/transformers/pull/39429

				  pip install transformers==4.54.0 soxr==0.5.0

				  echo "Print all dependencies after TorchBench is installed"

				  python -mpip freeze

				  popd

				}

				torchbench_setup_macos() {

				  git clone --recursive https://github.com/pytorch/vision torchvision

				  git clone --recursive https://github.com/pytorch/audio torchaudio

				@ -179,8 +205,6 @@ torchbench_setup_macos() {

				  USE_OPENMP=0 python setup.py develop

				  popd

				  # Shellcheck doesn't like it when you pass no arguments to a function that can take args. See https://www.shellcheck.net/wiki/SC2120

				  # shellcheck disable=SC2119,SC2120

				  checkout_install_torchbench

				}

									
										18

.ci/pytorch/run_glootls_test.sh
									
											
				@ -1,18 +0,0 @@

				#!/bin/bash

				CREATE_TEST_CERT="$(dirname "${BASH_SOURCE[0]}")/create_test_cert.py"

				TMP_CERT_DIR=$(python "$CREATE_TEST_CERT")

				openssl verify -CAfile "${TMP_CERT_DIR}/ca.pem" "${TMP_CERT_DIR}/cert.pem"

				export GLOO_DEVICE_TRANSPORT=TCP_TLS

				export GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=${TMP_CERT_DIR}/pkey.key

				export GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=${TMP_CERT_DIR}/cert.pem

				export GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=${TMP_CERT_DIR}/ca.pem

				time python test/run_test.py --include distributed/test_c10d_gloo --verbose -- ProcessGroupGlooTest

				unset GLOO_DEVICE_TRANSPORT

				unset GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY

				unset GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT

				unset GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE

									
										25

.ci/pytorch/smoke_test/smoke_test.py
									
												
				@ -385,6 +385,29 @@ def smoke_test_compile(device: str = "cpu") -> None:

				    x_pt2 = torch.compile(model, mode="max-autotune")(x)

				def smoke_test_nvshmem() -> None:

				    if not torch.cuda.is_available():

				        print("CUDA is not available, skipping NVSHMEM test")

				        return

				    # Check if NVSHMEM is compiled in current build

				    try:

				        from torch._C._distributed_c10d import _is_nvshmem_available

				    except ImportError:

				        # Not built with NVSHMEM support.

				        # torch is not compiled with NVSHMEM prior to 2.9

				        if torch.__version__ < "2.9":

				            return

				        else:

				            # After 2.9: NVSHMEM is expected to be compiled in current build

				            raise RuntimeError("torch not compiled with NVSHMEM") from None

				    print("torch compiled with NVSHMEM")

				    # Check if NVSHMEM is available on current system.

				    print(f"NVSHMEM available at run time: {_is_nvshmem_available()}")

				def smoke_test_modules():

				    cwd = os.getcwd()

				    for module in MODULES:

				@ -479,6 +502,8 @@ def main() -> None:

				        options.pypi_pkg_check,

				    )

				    smoke_test_nvshmem()

				if __name__ == "__main__":

				    main()
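
The `smoke_test_nvshmem` hunk above compares `torch.__version__` against `"2.9"`. To my knowledge `torch.__version__` is a `TorchVersion` object whose comparison operators are version-aware, so that check should behave numerically. A plain `str` would not. The sketch below uses only the standard library to illustrate the difference. `version_tuple` and `is_before` are hypothetical helper names, not part of PyTorch.

```python
import re


def version_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric release components of a version string.

    For example "2.10.0+cu128" gives (2, 10, 0). Pre-release and local
    suffixes are dropped, which is a simplification of real version rules.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release prefix in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_before(version: str, minimum: str) -> bool:
    """Numeric comparison of two version strings by release components."""
    return version_tuple(version) < version_tuple(minimum)


# Plain string comparison is lexicographic, so "2.10.0" sorts before "2.9".
plain_str_says_older = "2.10.0" < "2.9"
# Comparing integer tuples orders the same versions numerically.
numeric_says_older = is_before("2.10.0+cu128", "2.9")
```

The tuple form is only a rough stand-in for a real version parser: it ignores pre-release ordering, so it suits a coarse "older than 2.9" gate rather than exact release comparisons.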

									
										65

.ci/pytorch/test.sh
									
												
				@ -365,7 +365,6 @@ test_dynamo_wrapped_shard() {

				    exit 1

				  fi

				  python tools/dynamo/verify_dynamo.py

				  python tools/dynamo/gb_id_mapping.py verify

				  # PLEASE DO NOT ADD ADDITIONAL EXCLUDES HERE.

				  # Instead, use @skipIfTorchDynamo on your tests.

				  time python test/run_test.py --dynamo \

				@ -463,7 +462,7 @@ test_inductor_aoti() {

				  # rebuild with the build cache with `BUILD_AOT_INDUCTOR_TEST` enabled

				  /usr/bin/env CMAKE_FRESH=1 BUILD_AOT_INDUCTOR_TEST=1 "${BUILD_COMMAND[@]}"

				  /usr/bin/env "${TEST_ENVS[@]}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference -dist=loadfile

				  /usr/bin/env "${TEST_ENVS[@]}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference cpp/test_vec_half_AVX2 -dist=loadfile

				}

				test_inductor_cpp_wrapper_shard() {

				@ -628,6 +627,8 @@ test_perf_for_dashboard() {

				    device=cuda_a10g

				  elif [[ "${TEST_CONFIG}" == *h100* ]]; then

				    device=cuda_h100

				  elif [[ "${TEST_CONFIG}" == *b200* ]]; then

				    device=cuda_b200

				  elif [[ "${TEST_CONFIG}" == *rocm* ]]; then

				    device=rocm

				  fi

				@ -802,6 +803,16 @@ test_dynamo_benchmark() {

				  if [[ "${TEST_CONFIG}" == *perf_compare* ]]; then

				    test_single_dynamo_benchmark "training" "$suite" "$shard_id" --training --amp "$@"

				  elif [[ "${TEST_CONFIG}" == *perf* ]]; then

				    # TODO (huydhn): Just smoke test some sample models

				    if [[ "${TEST_CONFIG}" == *b200* ]]; then

				      if [[ "${suite}" == "huggingface" ]]; then

				        export TORCHBENCH_ONLY_MODELS="DistillGPT2"

				      elif [[ "${suite}" == "timm_models" ]]; then

				        export TORCHBENCH_ONLY_MODELS="inception_v3"

				      elif [[ "${suite}" == "torchbench" ]]; then

				        export TORCHBENCH_ONLY_MODELS="hf_Bert"

				      fi

				    fi

				    test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"

				  else

				    if [[ "${TEST_CONFIG}" == *cpu* ]]; then

				@ -929,12 +940,6 @@ test_torchbench_gcp_smoketest(){

				  popd

				}

				test_python_gloo_with_tls() {

				  source "$(dirname "${BASH_SOURCE[0]}")/run_glootls_test.sh"

				  assert_git_not_dirty

				}

				test_aten() {

				  # Test ATen

				  # The following test(s) of ATen have already been skipped by caffe2 in rocm environment:

				@ -981,6 +986,8 @@ test_without_numpy() {

				  if [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then

				    python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch;torch.compile(lambda x:print(x))('Hello World')"

				  fi

				  # Regression test for https://github.com/pytorch/pytorch/pull/157734 (torch.onnx should be importable without numpy)

				  python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch; import torch.onnx"

				  popd

				}

				@ -1044,20 +1051,10 @@ test_libtorch_api() {

				    mkdir -p $TEST_REPORTS_DIR

				    OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" "$TORCH_BIN_DIR"/test_api --gtest_filter='-IMethodTest.*' --gtest_output=xml:$TEST_REPORTS_DIR/test_api.xml

				    "$TORCH_BIN_DIR"/test_tensorexpr --gtest_output=xml:$TEST_REPORTS_DIR/test_tensorexpr.xml

				  else

				    # Exclude IMethodTest that relies on torch::deploy, which will instead be ran in test_deploy

				    OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_api -k "not IMethodTest"

				    # On s390x, pytorch is built without llvm.

				    # Even if it would be built with llvm, llvm currently doesn't support used features on s390x and

				    # test fails with errors like:

				    # JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer

				    # unknown file: Failure

				    # C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }

				    if [[ "${BUILD_ENVIRONMENT}" != *s390x* ]]; then

				      python test/run_test.py --cpp --verbose -i cpp/test_tensorexpr

				    fi

				  fi

				  # quantization is not fully supported on s390x yet

				@ -1325,10 +1322,13 @@ EOF

				  # Step 2. Make sure that the public API test "test_correct_module_names" fails when an existing

				  # file is modified to introduce an invalid public API function.

				  EXISTING_FILEPATH="${TORCH_INSTALL_DIR}/nn/parameter.py"

				  # The filepath here must not have __all__ defined in it, otherwise the test will pass.

				  # If your PR introduces __all__ to torch/cuda/streams.py please point this to another file

				  # that does not have __all__ defined.

				  EXISTING_FILEPATH="${TORCH_INSTALL_DIR}/cuda/streams.py"

				  cp -v "${EXISTING_FILEPATH}" "${EXISTING_FILEPATH}.orig"

				  echo "${BAD_PUBLIC_FUNC}" >> "${EXISTING_FILEPATH}"

				  invalid_api="torch.nn.parameter.new_public_func"

				  invalid_api="torch.cuda.streams.new_public_func"

				  echo "Appended an invalid public API function to existing file ${EXISTING_FILEPATH}..."

				  check_public_api_test_fails \

				@ -1562,7 +1562,7 @@ test_executorch() {

				test_linux_aarch64() {

				  python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \

				        test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \

				        test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops test_cpp_extensions_open_device_registration \

				        test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops \

				        --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose

				  # Dynamo tests

				@ -1674,43 +1674,34 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then

				elif [[ "${TEST_CONFIG}" == cachebench ]]; then

				  install_torchaudio

				  install_torchvision

				  checkout_install_torchbench nanogpt BERT_pytorch resnet50 hf_T5 llama moco

				  PYTHONPATH=$(pwd)/torchbench test_cachebench

				  PYTHONPATH=/torchbench test_cachebench

				elif [[ "${TEST_CONFIG}" == verify_cachebench ]]; then

				  install_torchaudio

				  install_torchvision

				  checkout_install_torchbench nanogpt

				  PYTHONPATH=$(pwd)/torchbench test_verify_cachebench

				  PYTHONPATH=/torchbench test_verify_cachebench

				elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				  install_torchaudio

				  install_torchvision

				  install_torchao

				  id=$((SHARD_NUMBER-1))

				  # https://github.com/opencv/opencv-python/issues/885

				  pip_install opencv-python==4.8.0.74

				  if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then

				    checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer

				    PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf

				    PYTHONPATH=/torchbench test_inductor_torchbench_smoketest_perf

				  elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then

				    checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \

				      llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \

				      functorch_maml_omniglot yolov3 mobilenet_v2 resnext50_32x4d densenet121 mnasnet1_0

				    PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf

				    PYTHONPATH=/torchbench test_inductor_torchbench_cpu_smoketest_perf

				  elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then

				    checkout_install_torchbench

				    TORCHBENCHPATH=$(pwd)/torchbench test_torchbench_gcp_smoketest

				    TORCHBENCHPATH=/torchbench test_torchbench_gcp_smoketest

				  else

				    checkout_install_torchbench

				    # Do this after checkout_install_torchbench to ensure we clobber any

				    # nightlies that torchbench may pull in

				    if [[ "${TEST_CONFIG}" != *cpu* ]]; then

				      install_torchrec_and_fbgemm

				    fi

				    PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"

				    PYTHONPATH=/torchbench test_dynamo_benchmark torchbench "$id"

				  fi

				elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then

				  install_torchvision

				  PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"

				  PYTHONPATH=/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"

				  if [[ "$SHARD_NUMBER" -eq "1" ]]; then

				    test_inductor_aoti

				  fi

									
										7

.ci/pytorch/win-test-helpers/build_pytorch.bat
									
												
				@ -61,9 +61,10 @@ if "%USE_XPU%"=="1" (

				  call "C:\Program Files (x86)\Intel\oneAPI\compiler\latest\env\vars.bat"

				  call "C:\Program Files (x86)\Intel\oneAPI\ocloc\latest\env\vars.bat"

				  if errorlevel 1 exit /b 1

				  :: Reduce build time. Only have MTL self-hosted runner now

				  SET TORCH_XPU_ARCH_LIST=xe-lpg

				  SET USE_KINETO=0

				  :: Reduce build time

				  SET TORCH_XPU_ARCH_LIST=bmg

				  :: Re-setup python env for build

				  call pip install -r requirements.txt

				)

				@echo on

									
										2

.ci/pytorch/win-test.sh
									
												
				@ -41,7 +41,7 @@ fi

				python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1

				# Install Z3 optional dependency for Windows builds.

				python -m pip install z3-solver==4.12.2.0

				python -m pip install z3-solver==4.15.1.0

				# Install tlparse for test\dynamo\test_structured_trace.py UTs.

				python -m pip install tlparse==0.3.30

									
										14

.ci/wheel/build_wheel.sh
									
												
				@ -192,9 +192,6 @@ retry brew install libomp

				# For USE_DISTRIBUTED=1 on macOS, need libuv, which is build as part of tensorpipe submodule

				export USE_DISTRIBUTED=1

				if [[ -n "$CROSS_COMPILE_ARM64" ]]; then

				    export CMAKE_OSX_ARCHITECTURES=arm64

				fi

				export USE_MKLDNN=OFF

				export USE_QNNPACK=OFF

				export BUILD_TEST=OFF

				@ -202,16 +199,7 @@ export BUILD_TEST=OFF

				pushd "$pytorch_rootdir"

				echo "Calling setup.py bdist_wheel at $(date)"

				if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"

				    BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel -d "$whl_tmp_dir"

				    echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"

				    echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"

				    BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 CMAKE_FRESH=1 python setup.py bdist_wheel -d "$whl_tmp_dir"

				    echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"

				else

				    python setup.py bdist_wheel -d "$whl_tmp_dir"

				fi

				python setup.py bdist_wheel -d "$whl_tmp_dir"

				echo "Finished setup.py bdist_wheel at $(date)"

									
										12

.circleci/scripts/binary_linux_test.sh
									
												
				@ -65,16 +65,8 @@ fi

				if [[ "$PACKAGE_TYPE" != libtorch ]]; then

				  if [[ "\$BUILD_ENVIRONMENT" != *s390x* ]]; then

				    if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				      pkg_no_python="$(ls -1 /final_pkgs/torch_no_python* | sort |tail -1)"

				      pkg_torch="$(ls -1 /final_pkgs/torch-* | sort |tail -1)"

				      # todo: after folder is populated use the pypi_pkg channel instead

				      pip install "\$pkg_no_python" "\$pkg_torch" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}_pypi_pkg"

				      retry pip install -q numpy protobuf typing-extensions

				    else

				      pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"

				      retry pip install -q numpy protobuf typing-extensions

				    fi

				    pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"

				    retry pip install -q numpy protobuf typing-extensions

				  else

				    pip install "\$pkg"

				    retry pip install -q numpy protobuf typing-extensions

									
										1

.circleci/scripts/binary_populate_env.sh
									
												
				@ -134,7 +134,6 @@ export DESIRED_PYTHON="${DESIRED_PYTHON:-}"

				export DESIRED_CUDA="$DESIRED_CUDA"

				export LIBTORCH_VARIANT="${LIBTORCH_VARIANT:-}"

				export BUILD_PYTHONLESS="${BUILD_PYTHONLESS:-}"

				export USE_SPLIT_BUILD="${USE_SPLIT_BUILD:-}"

				if [[ "${OSTYPE}" == "msys" ]]; then

				  export LIBTORCH_CONFIG="${LIBTORCH_CONFIG:-}"

				  if [[ "${LIBTORCH_CONFIG:-}" == 'debug' ]]; then

									
										4

.circleci/scripts/binary_upload.sh
									
												
				@ -23,10 +23,6 @@ if [[ "${DRY_RUN}" = "disabled" ]]; then

				  AWS_S3_CP="aws s3 cp"

				fi

				if [[ "${USE_SPLIT_BUILD:-false}" == "true" ]]; then

				  UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_pypi_pkg"

				fi

				# this is special build with all dependencies packaged

				if [[ ${BUILD_NAME} == *-full* ]]; then

				  UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_full"

4

.flake8


 @ -7,12 +7,12 @@ max-line-length = 120
 # C408 ignored because we like the dict keyword argument syntax
 # E501 is not flexible enough, we're using B950 instead
 ignore =
     E203,E305,E402,E501,E704,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
     E203,E305,E402,E501,E704,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,F824,
     # shebang has extra meaning in fbcode lints, so I think it's not worth trying
     # to line this up with executable bit
     EXE001,
     # these ignores are from flake8-bugbear; please fix!
     B007,B008,B017,B019,B023,B028,B903,B904,B905,B906,B907
     B007,B008,B017,B019,B023,B028,B903,B904,B905,B906,B907,B908,B910
     # these ignores are from flake8-comprehensions; please fix!
     C407,
     # these ignores are from flake8-logging-format; please fix!

									
										10

.github/actionlint.yaml
									
										vendored
									
												
				@ -53,16 +53,12 @@ self-hosted-runner:

				    - linux.rocm.gpu.mi250

				    - linux.rocm.gpu.2

				    - linux.rocm.gpu.4

				    # MI300 runners

				    - linux.rocm.gpu.mi300.2

				    - linux.rocm.gpu.mi300.4

				    # gfx942 runners

				    - linux.rocm.gpu.gfx942.2

				    - linux.rocm.gpu.gfx942.4

				    - rocm-docker

				    # Repo-specific Apple hosted  runners

				    - macos-m1-ultra

				    - macos-m2-14

				    # Org wise AWS `mac2.metal` runners (2020 Mac mini hardware powered by Apple silicon M1 processors)

				    - macos-m1-stable

				    - macos-m1-13

				    - macos-m1-14

				    # GitHub-hosted MacOS runners

				    - macos-latest-xlarge

									
										78

.github/actions/build-android/action.yml
									
										vendored
									
											
				@ -1,78 +0,0 @@

				name: build android

				description: build android for a specific arch

				inputs:

				  arch:

				    description: arch to build

				    required: true

				  arch-for-build-env:

				    description: |

				      arch to pass to build environment.

				      This is currently different than the arch name we use elsewhere, which

				      should be fixed.

				    required: true

				  github-secret:

				    description: github token

				    required: true

				  build-environment:

				    required: true

				    description: Top-level label for what's being built/tested.

				  docker-image:

				    required: true

				    description: Name of the base docker image to build with.

				  branch:

				    required: true

				    description: What branch we are building on.

				outputs:

				  container_id:

				    description: Docker container identifier used to build the artifacts

				    value: ${{ steps.build.outputs.container_id }}

				runs:

				  using: composite

				  steps:

				    - name: Build-${{ inputs.arch }}

				      id: build

				      shell: bash

				      env:

				        BRANCH: ${{ inputs.branch }}

				        BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-${{ inputs.arch-for-build-env }}-build"

				        AWS_DEFAULT_REGION: us-east-1

				        PR_NUMBER: ${{ github.event.pull_request.number }}

				        SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				        SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2

				        SCCACHE_REGION: us-east-1

				        DOCKER_IMAGE: ${{ inputs.docker-image  }}

				        MATRIX_ARCH: ${{ inputs.arch }}

				      run: |

				        # detached container should get cleaned up by teardown_ec2_linux

				        set -exo pipefail

				        export container_name

				        container_name=$(docker run \

				          -e BUILD_ENVIRONMENT \

				          -e MAX_JOBS="$(nproc --ignore=2)" \

				          -e AWS_DEFAULT_REGION \

				          -e PR_NUMBER \

				          -e SHA1 \

				          -e BRANCH \

				          -e SCCACHE_BUCKET \

				          -e SCCACHE_REGION \

				          -e SKIP_SCCACHE_INITIALIZATION=1 \

				          --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \

				          --security-opt seccomp=unconfined \

				          --cap-add=SYS_PTRACE \

				          --tty \

				          --detach \

				          --user jenkins \

				          -w /var/lib/jenkins/workspace \

				          "${DOCKER_IMAGE}"

				        )

				        git submodule sync && git submodule update -q --init --recursive --depth 1

				        docker cp "${GITHUB_WORKSPACE}/." "${container_name}:/var/lib/jenkins/workspace"

				        (echo "sudo chown -R jenkins . && .ci/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete" | docker exec -u jenkins -i "${container_name}" bash) 2>&1

				        # Copy install binaries back

				        mkdir -p "${GITHUB_WORKSPACE}/build_android_install_${MATRIX_ARCH}"

				        docker cp "${container_name}:/var/lib/jenkins/workspace/build_android/install" "${GITHUB_WORKSPACE}/build_android_install_${MATRIX_ARCH}"

				        echo "container_id=${container_name}" >> "${GITHUB_OUTPUT}"

									
										2

.github/actions/filter-test-configs/action.yml
									
										vendored
									
												
				@ -70,7 +70,7 @@ runs:

				          set -eux

				          # PyYAML 6.0 doesn't work with MacOS x86 anymore

				          # This must run on Python-3.7 (AmazonLinux2) so can't use request=3.32.2

				          python3 -m pip install requests==2.27.1 pyyaml==6.0.1

				          python3 -m pip install requests==2.27.1 pyyaml==6.0.2

				    - name: Parse ref

				      id: parse-ref

									
										1

.github/actions/test-pytorch-binary/action.yml
									
										vendored
									
												
				@ -24,7 +24,6 @@ runs:

				          -e PYTORCH_FINAL_PACKAGE_DIR \

				          -e PYTORCH_ROOT \

				          -e SKIP_ALL_TESTS \

				          -e USE_SPLIT_BUILD \

				          --tty \

				          --detach \

				          -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \

2

.github/ci_commit_pins/audio.txt vendored


 @ -1 +1 @@
 b6a3368a45aaafe05f1a6a9f10c68adc5e944d9e
 bdb88e1d66f272cad72156c90ac8428ca61a601c

2

.github/ci_commit_pins/vllm.txt vendored


 @ -1 +1 @@
 b77c7d327f2a463bb9ef8be36f30e920bc066502
 e74eb907f96069e6d8a4f3c9f457001fef2ea

2

.github/ci_commit_pins/xla.txt vendored


 @ -1 +1 @@
 c00dea2c9adb2137903c86b4191e8c247f8fda9
 faec1e7b6cc47220181e74ae9cde2605f9b00

									
										19

.github/merge_rules.yaml
									
										vendored
									
												
				@ -131,21 +131,6 @@

				  - Lint

				  - pull

				- name: Mobile

				  patterns:

				  - ios/**

				  - android/**

				  - test/mobile/**

				  approved_by:

				  - linbinyu

				  - IvanKobzarev

				  - dreiss

				  - raziel

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint

				  - pull

				- name: PrimTorch

				  patterns:

				  - torch/_meta_registrations.py

				@ -503,6 +488,10 @@

				  - torch/_dynamo/**

				  - torch/csrc/dynamo/**

				  - test/dynamo/**

				  - test/dynamo_expected_failures/**

				  - test/dynamo_skips/**

				  - test/inductor_expected_failures/**

				  - test/inductor_skips/**

				  approved_by:

				  - guilhermeleobas

				  mandatory_checks_name:

6

.github/requirements-gha-cache.txt vendored


 @ -7,9 +7,9 @@
 #   .ci/docker/requirements-ci.txt
 boto3==1.35.42
 jinja2==3.1.6
 lintrunner==0.10.7
 lintrunner==0.12.7
 ninja==1.10.0.post1
 nvidia-ml-py==11.525.84
 pyyaml==6.0
 pyyaml==6.0.2
 requests==2.32.4
 rich==10.9.0
 rich==14.1.0

4

.github/requirements/pip-requirements-macOS.txt vendored


 @ -2,7 +2,7 @@ boto3==1.35.42
 cmake==3.27.*
 expecttest==0.3.0
 fbscribelogger==0.1.7
 filelock==3.6.0
 filelock==3.18.0
 hypothesis==6.56.4
 librosa>=0.6.2
 mpmath==1.3.0
 @ -33,4 +33,4 @@ tensorboard==2.13.0
 typing-extensions==4.12.2
 unittest-xml-reporting<=3.2.0,>=2.0.0
 xdoctest==1.1.0
 z3-solver==4.12.2.0
 z3-solver==4.15.1.0

									
										18

.github/scripts/generate_binary_build_matrix.py
									
										vendored
									
												
				@ -193,7 +193,7 @@ LIBTORCH_CONTAINER_IMAGES: dict[str, str] = {

				    "cpu": "libtorch-cxx11-builder:cpu",

				}

				FULL_PYTHON_VERSIONS = ["3.9", "3.10", "3.11", "3.12", "3.13", "3.13t"]

				FULL_PYTHON_VERSIONS = ["3.9", "3.10", "3.11", "3.12", "3.13", "3.13t", "3.14", "3.14t"]

				def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:

				@ -273,7 +273,6 @@ def generate_wheels_matrix(

				    os: str,

				    arches: Optional[list[str]] = None,

				    python_versions: Optional[list[str]] = None,

				    use_split_build: bool = False,

				) -> list[dict[str, str]]:

				    package_type = "wheel"

				    if os == "linux" or os == "linux-aarch64" or os == "linux-s390x":

				@ -315,15 +314,11 @@ def generate_wheels_matrix(

				            # TODO: Enable python 3.13t on cpu-s390x

				            if gpu_arch_type == "cpu-s390x" and python_version == "3.13t":

				                continue

				            if use_split_build and (

				                arch_version not in ["12.6", "12.8", "12.9", "cpu"] or os != "linux"

				            # TODO: Enable python 3.14 on non linux OSes

				            if os != "linux" and (

				                python_version == "3.14" or python_version == "3.14t"

				            ):

				                raise RuntimeError(

				                    "Split build is only supported on linux with cuda 12* and cpu.\n"

				                    f"Currently attempting to build on arch version {arch_version} and os {os}.\n"

				                    "Please modify the matrix generation to exclude this combination."

				                )

				                continue

				            # cuda linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install

				@ -339,7 +334,6 @@ def generate_wheels_matrix(

				                        "gpu_arch_type": gpu_arch_type,

				                        "gpu_arch_version": gpu_arch_version,

				                        "desired_cuda": desired_cuda,

				                        "use_split_build": "True" if use_split_build else "False",

				                        "container_image": WHEEL_CONTAINER_IMAGES[arch_version].split(

				                            ":"

				                        )[0],

				@ -372,7 +366,6 @@ def generate_wheels_matrix(

				                            "desired_cuda": translate_desired_cuda(

				                                gpu_arch_type, gpu_arch_version

				                            ),

				                            "use_split_build": "True" if use_split_build else "False",

				                            "container_image": WHEEL_CONTAINER_IMAGES[

				                                arch_version

				                            ].split(":")[0],

				@ -395,7 +388,6 @@ def generate_wheels_matrix(

				                        "desired_cuda": translate_desired_cuda(

				                            gpu_arch_type, gpu_arch_version

				                        ),

				                        "use_split_build": "True" if use_split_build else "False",

				                        "container_image": WHEEL_CONTAINER_IMAGES[arch_version].split(

				                            ":"

				                        )[0],
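Review note: the hunks above drop the `use_split_build` plumbing and add a skip for Python 3.14 and 3.14t on anything other than `os == "linux"`. A minimal sketch of just the version filtering, assuming a hypothetical helper name `select_python_versions` (the real `generate_wheels_matrix` does much more than this):

```python
# Version list as it appears on the new side of the diff.
FULL_PYTHON_VERSIONS = ["3.9", "3.10", "3.11", "3.12", "3.13", "3.13t", "3.14", "3.14t"]


def select_python_versions(os: str, gpu_arch_type: str) -> list[str]:
    """Hypothetical helper: return the Python versions kept for one OS/arch pair."""
    selected = []
    for python_version in FULL_PYTHON_VERSIONS:
        # Existing rule from the diff: 3.13t is not enabled on cpu-s390x.
        if gpu_arch_type == "cpu-s390x" and python_version == "3.13t":
            continue
        # New rule from the diff: 3.14 and 3.14t are only built when os == "linux".
        if os != "linux" and python_version in ("3.14", "3.14t"):
            continue
        selected.append(python_version)
    return selected


print(select_python_versions("linux", "cuda"))
print(select_python_versions("windows", "cpu"))
```

Because the comparison is against the exact string `"linux"`, values such as `"linux-aarch64"` and `"linux-s390x"` also skip 3.14 under this rule.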

									
										42

.github/scripts/generate_ci_workflows.py
									
										vendored
									
												
				@ -59,9 +59,7 @@ class BinaryBuildWorkflow:

				    is_scheduled: str = ""

				    branches: str = "nightly"

				    # Mainly for macos

				    cross_compile_arm64: bool = False

				    macos_runner: str = "macos-14-xlarge"

				    use_split_build: bool = False

				    # Mainly used for libtorch builds

				    build_variant: str = ""

				@ -72,9 +70,6 @@ class BinaryBuildWorkflow:

				                for item in [self.os, "binary", self.package_type, self.build_variant]

				                if item != ""

				            )

				        if self.use_split_build:

				            # added to distinguish concurrency groups

				            self.build_environment += "-split"

				    def generate_workflow_file(self, workflow_template: jinja2.Template) -> None:

				        output_file_path = (

				@ -117,21 +112,6 @@ LINUX_BINARY_BUILD_WORFKLOWS = [

				            isolated_workflow=True,

				        ),

				    ),

				    # See https://github.com/pytorch/pytorch/issues/138750

				    #   BinaryBuildWorkflow(

				    #     os=OperatingSystem.LINUX,

				    #     package_type="manywheel",

				    #     build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				    #         OperatingSystem.LINUX,

				    #         use_split_build=True,

				    #         arches=["11.8", "12.1", "12.4", "cpu"],

				    #     ),

				    #     ciflow_config=CIFlowConfig(

				    #         labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},

				    #         isolated_workflow=True,

				    #     ),

				    #     use_split_build=True,

				    # ),

				    BinaryBuildWorkflow(

				        os=OperatingSystem.LINUX,

				        package_type="libtorch",

				@ -175,27 +155,11 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [

				        package_type="manywheel",

				        build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				            OperatingSystem.LINUX,

				            arches=["12.6", "12.8", "12.9"],

				            python_versions=["3.9"],

				            arches=["12.8"],

				            python_versions=["3.12"],

				        ),

				        branches="main",

				    ),

				    # See https://github.com/pytorch/pytorch/issues/138750

				    # BinaryBuildWorkflow(

				    #     os=OperatingSystem.LINUX,

				    #     package_type="manywheel",

				    #     build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				    #         OperatingSystem.LINUX,

				    #         arches=["11.8", "12.1", "12.4"],

				    #         python_versions=["3.9"],

				    #         use_split_build=True,

				    #     ),

				    #     ciflow_config=CIFlowConfig(

				    #         labels={LABEL_CIFLOW_PERIODIC},

				    #     ),

				    #     branches="main",

				    #     use_split_build=True,

				    # ),

				    BinaryBuildWorkflow(

				        os=OperatingSystem.LINUX,

				        package_type="libtorch",

				@ -338,7 +302,6 @@ MACOS_BINARY_BUILD_WORKFLOWS = [

				            generate_binary_build_matrix.RELEASE,

				            libtorch_variants=["shared-with-deps"],

				        ),

				        cross_compile_arm64=False,

				        macos_runner="macos-14-xlarge",

				        ciflow_config=CIFlowConfig(

				            labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},

				@ -351,7 +314,6 @@ MACOS_BINARY_BUILD_WORKFLOWS = [

				        build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				            OperatingSystem.MACOS_ARM64

				        ),

				        cross_compile_arm64=False,

				        macos_runner="macos-14-xlarge",

				        ciflow_config=CIFlowConfig(

				            labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},

									
										2

.github/scripts/lintrunner.sh
									
										vendored
									
												
				@ -2,7 +2,7 @@

				set -ex

				# Use uv to speed up lintrunner init

				python3 -m pip install uv==0.1.45 setuptools

				python3 -m pip install -U uv==0.8.* setuptools

				CACHE_DIRECTORY="/tmp/.lintbin"

				# Try to recover the cached binaries

									
										7

.github/scripts/runner_determinator.py
									
										vendored
									
												
				@ -262,7 +262,12 @@ def is_exception_branch(branch: str) -> bool:

				    """

				    Branches that get opted out of experiments by default, until they're explicitly enabled.

				    """

				    return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}

				    return branch.split("/", maxsplit=1)[0] in {

				        "main",

				        "nightly",

				        "release",

				        "landchecks",

				    }

				def load_yaml(yaml_text: str) -> Any:
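Review note: the `maxsplit=1` change does not alter which branches match, since only the first path segment is inspected either way. A small self-contained check, copying the new function body from the hunk above (the branch names used below are made-up examples):

```python
def is_exception_branch(branch: str) -> bool:
    """
    Branches that get opted out of experiments by default, until they're explicitly enabled.
    """
    # maxsplit=1 stops at the first "/" because only the prefix is needed.
    return branch.split("/", maxsplit=1)[0] in {
        "main",
        "nightly",
        "release",
        "landchecks",
    }


# Illustrative branch names, not taken from the repository.
for name in ["main", "release/2.8", "landchecks/123/extra", "feature/main"]:
    # The old and new expressions select the same prefix.
    assert name.split("/")[0] == name.split("/", maxsplit=1)[0]
    print(name, is_exception_branch(name))
```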

									
										4

.github/scripts/trymerge.py
									
										vendored
									
												
				@ -1891,7 +1891,9 @@ def validate_revert(

				        else pr.get_comment_by_id(comment_id)

				    )

				    if comment.editor_login is not None:

				        raise PostCommentError("Don't want to revert based on edited command")

				        raise PostCommentError(

				            "Halting the revert as the revert comment has been edited."

				        )

				    author_association = comment.author_association

				    author_login = comment.author_login

				    allowed_reverters = ["COLLABORATOR", "MEMBER", "OWNER"]
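Review note: the hunk above only rewords the error raised when a revert comment has been edited. A minimal sketch of that guard in isolation, assuming stand-in definitions for `PostCommentError` and a `Comment` record, and a hypothetical helper name `ensure_unedited` (the real `validate_revert` in `trymerge.py` fetches the comment from the PR and performs further checks):

```python
from dataclasses import dataclass
from typing import Optional


class PostCommentError(Exception):
    """Stand-in for the exception type used in the diff."""


@dataclass
class Comment:
    """Stand-in record holding only the fields the guard reads."""

    author_login: str
    author_association: str
    editor_login: Optional[str] = None


def ensure_unedited(comment: Comment) -> None:
    """Hypothetical helper: refuse to act on a comment that was edited after posting."""
    if comment.editor_login is not None:
        raise PostCommentError(
            "Halting the revert as the revert comment has been edited."
        )


ensure_unedited(Comment("alice", "MEMBER"))  # unedited comment passes silently
try:
    ensure_unedited(Comment("alice", "MEMBER", editor_login="bob"))
except PostCommentError as err:
    print(err)
```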

3

.github/templates/macos_binary_build_workflow.yml.j2 vendored


 @ -47,9 +47,6 @@ env:
   GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
   PR_NUMBER: ${{ github.event.pull_request.number }}
   SKIP_ALL_TESTS: 0
 {%- if cross_compile_arm64 %}
   CROSS_COMPILE_ARM64: 1
 {% endif %}
 !{{ common.concurrency(build_environment) }}
 jobs:

5

.github/templates/upload.yml.j2 vendored


 @ -25,11 +25,6 @@
       DOCKER_IMAGE: !{{ config["container_image"] }}
       DOCKER_IMAGE_TAG_PREFIX: !{{ config["container_image_tag_prefix"] }}
 {%- endif %}
 {%- if config["package_type"] == "manywheel" %}
   {%- if config.use_split_build is defined %}
       use_split_build: !{{ config["use_split_build"] }}
   {%- endif %}
 {%- endif %}
 {%- if config["package_type"] == "libtorch" %}
   {%- if config["libtorch_config"] %}
       LIBTORCH_CONFIG: !{{ config["libtorch_config"] }}

									
										10

.github/workflows/_binary-build-linux.yml
									
										vendored
									
												
				@ -26,13 +26,6 @@ on:

				        default: 240

				        type: number

				        description: timeout for the job

				      use_split_build:

				        description: |

				          [Experimental] Build a libtorch only wheel and build pytorch such that

				          are built from the libtorch wheel.

				        required: false

				        type: boolean

				        default: false

				      ALPINE_IMAGE:

				        required: false

				        type: string

				@ -117,7 +110,6 @@ jobs:

				      PR_NUMBER: ${{ github.event.pull_request.number }}

				      PYTORCH_FINAL_PACKAGE_DIR: /artifacts

				      SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				      USE_SPLIT_BUILD: ${{ inputs.use_split_build }}

				    steps:

				      - name: Make the env permanent during this workflow (but not the secrets)

				        shell: bash

				@ -142,7 +134,6 @@ jobs:

				            echo "PR_NUMBER=${{ env.PR_NUMBER }}"

				            echo "PYTORCH_FINAL_PACKAGE_DIR=${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"

				            echo "SHA1=${{ env.SHA1 }}"

				            echo "USE_SPLIT_BUILD=${{ env.use_split_build }}"

				          } >> "${GITHUB_ENV} }}"

				      - name: List the env

				@ -261,7 +252,6 @@ jobs:

				            -e PYTORCH_ROOT \

				            -e SKIP_ALL_TESTS \

				            -e PYTORCH_EXTRA_INSTALL_REQUIREMENTS \

				            -e USE_SPLIT_BUILD \

				            --tty \

				            --detach \

				            -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \

									
										9

.github/workflows/_binary-test-linux.yml
									
										vendored
									
												
				@ -64,13 +64,6 @@ on:

				        required: true

				        type: string

				        description: Hardware to run this job on. Valid values are linux.4xlarge, linux.4xlarge.nvidia.gpu, linux.arm64.2xlarge, and linux.rocm.gpu

				      use_split_build:

				        description: |

				          [Experimental] Build a libtorch only wheel and build pytorch such that

				          are built from the libtorch wheel.

				        required: false

				        type: boolean

				        default: false

				    secrets:

				      github-token:

				        required: true

				@ -104,7 +97,6 @@ jobs:

				      PR_NUMBER: ${{ github.event.pull_request.number }}

				      PYTORCH_FINAL_PACKAGE_DIR: /artifacts

				      SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				      USE_SPLIT_BUILD: ${{ inputs.use_split_build }}

				    steps:

				      - name: Make the env permanent during this workflow (but not the secrets)

				        shell: bash

				@ -129,7 +121,6 @@ jobs:

				            echo "PR_NUMBER=${{ env.PR_NUMBER }}"

				            echo "PYTORCH_FINAL_PACKAGE_DIR=${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"

				            echo "SHA1=${{ env.SHA1 }}"

				            echo "USE_SPLIT_BUILD=${{ env.USE_SPLIT_BUILD }}"

				          } >> "${GITHUB_ENV} }}"

				      - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"

									
										8

.github/workflows/_binary-upload.yml
									
										vendored
									
												
				@ -51,13 +51,6 @@ on:

				        required: false

				        type: string

				        description: Desired python version

				      use_split_build:

				        description: |

				          [Experimental] Build a libtorch only wheel and build pytorch such that

				          are built from the libtorch wheel.

				        required: false

				        type: boolean

				        default: false

				    secrets:

				      github-token:

				        required: true

				@ -86,7 +79,6 @@ jobs:

				      PR_NUMBER: ${{ github.event.pull_request.number }}

				      PYTORCH_FINAL_PACKAGE_DIR: /artifacts

				      SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				      USE_SPLIT_BUILD: ${{ inputs.use_split_build }}

				    steps:

				      - name: Checkout PyTorch

				        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

									
										1

.github/workflows/_linux-build.yml
									
										vendored
									
												
				@ -306,7 +306,6 @@ jobs:

				            -e OUR_GITHUB_JOB_ID \

				            -e HUGGING_FACE_HUB_TOKEN \

				            -e SCRIBE_GRAPHQL_ACCESS_TOKEN \

				            -e USE_SPLIT_BUILD \

				            -e BUILD_ADDITIONAL_PACKAGES \

				            --memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \

				            --memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \

									
										20

.github/workflows/_linux-test.yml
									
										vendored
									
												
				@ -96,7 +96,7 @@ jobs:

				    steps:

				      - name: Setup SSH (Click me for login details)

				        uses: pytorch/test-infra/.github/actions/setup-ssh@main

				        if: ${{ matrix.runner != 'B200' && inputs.build-environment != 'linux-s390x-binary-manywheel' }}

				        if: ${{ !contains(matrix.runner, 'b200') && inputs.build-environment != 'linux-s390x-binary-manywheel' }}

				        with:

				          github-secret: ${{ secrets.GITHUB_TOKEN }}

				          instructions: |

				@ -109,7 +109,7 @@ jobs:

				          no-sudo: true

				      - name: Setup Python

				        if: matrix.runner == 'B200'

				        if: contains(matrix.runner, 'b200')

				        uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0

				        with:

				          python-version: '3.12'

				@ -117,7 +117,7 @@ jobs:

				      - name: Setup Linux

				        uses: ./.github/actions/setup-linux

				        if: inputs.build-environment != 'linux-s390x-binary-manywheel' && matrix.runner != 'B200'

				        if: inputs.build-environment != 'linux-s390x-binary-manywheel' && !contains(matrix.runner, 'b200')

				      - name: configure aws credentials

				        if: ${{ inputs.aws-role-to-assume != '' && inputs.build-environment != 'linux-s390x-binary-manywheel' }}

				@ -128,7 +128,7 @@ jobs:

				          aws-region: us-east-1

				      - name: Login to Amazon ECR

				        if: ${{ inputs.aws-role-to-assume != '' && matrix.runner == 'B200' }}

				        if: ${{ inputs.aws-role-to-assume != '' && contains(matrix.runner, 'b200') }}

				        id: login-ecr

				        continue-on-error: true

				        uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1

				@ -166,17 +166,17 @@ jobs:

				        uses: pytorch/test-infra/.github/actions/setup-nvidia@main

				        with:

				          driver-version: ${{ matrix.config == 'legacy_nvidia_driver' && '525.105.17' || '570.133.07' }}

				        if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' && matrix.runner != 'B200' }}

				        if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' && !contains(matrix.runner, 'b200') }}

				      - name: Setup GPU_FLAG for docker run

				        id: setup-gpu-flag

				        run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

				        if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && (steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' || matrix.runner == 'B200') }}

				        if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && (steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' || contains(matrix.runner, 'b200')) }}

				      - name: Setup SCCACHE_SERVER_PORT environment for docker run when on container

				        id: setup-sscache-port-flag

				        run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"

				        if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' && matrix.runner != 'B200' }}

				        if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' && !contains(matrix.runner, 'b200') }}

				      - name: Lock NVIDIA A100 40GB Frequency

				        run: |

				@ -277,8 +277,8 @@ jobs:

				          NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}

				          TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}

				          # Do not set SCCACHE_S3_KEY_PREFIX to share the cache between all build jobs

				          SCCACHE_BUCKET: ${{ matrix.runner != 'B200' && 'ossci-compiler-cache-circleci-v2' || '' }}

				          SCCACHE_REGION: ${{ matrix.runner != 'B200' && 'us-east-1' || '' }}

				          SCCACHE_BUCKET: ${{ !contains(matrix.runner, 'b200') && 'ossci-compiler-cache-circleci-v2' || '' }}

				          SCCACHE_REGION: ${{ !contains(matrix.runner, 'b200') && 'us-east-1' || '' }}

				          SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}

				          DOCKER_IMAGE: ${{ inputs.docker-image }}

				          XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}

				@ -403,7 +403,7 @@ jobs:

				          job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}

				      - name: Authenticate with AWS

				        if: ${{ matrix.runner == 'B200' }}

				        if: ${{ contains(matrix.runner, 'b200') }}

				        uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0

				        with:

				          role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_upload-benchmark-results
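Review note: throughout this file the condition changes from an exact comparison against `'B200'` to `contains(matrix.runner, 'b200')`. As far as the GitHub Actions expression documentation describes it, both `==` on strings and `contains` ignore case, so the practical difference is exact match versus substring match. A rough Python approximation of the two predicates, assuming a made-up runner label `linux.dgx.b200` for illustration:

```python
def matches_old_condition(runner: str) -> bool:
    """Approximates `matrix.runner == 'B200'` (string equality, ignoring case)."""
    return runner.lower() == "b200"


def matches_new_condition(runner: str) -> bool:
    """Approximates `contains(matrix.runner, 'b200')` (substring, ignoring case)."""
    return "b200" in runner.lower()


# "linux.dgx.b200" is a hypothetical label, not taken from the diff.
for runner in ["B200", "linux.dgx.b200", "linux.4xlarge"]:
    print(runner, matches_old_condition(runner), matches_new_condition(runner))
```

Under this reading, a label that merely embeds `b200` is picked up by the new condition but not by the old one.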

									
										4

.github/workflows/_rocm-test.yml
									
										vendored
									
												
				@ -269,8 +269,8 @@ jobs:

				          # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct

				          docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test"

				      - name: Change permissions (only needed for MI300 runners for now)

				        if: ${{ always() && steps.test.conclusion && contains(matrix.runner, 'mi300') }}

				      - name: Change permissions (only needed for kubernetes runners for now)

				        if: ${{ always() && steps.test.conclusion && (contains(matrix.runner, 'gfx942') || contains(matrix.runner, 'mi355')) }}

				        run: |

				          docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "sudo chown -R 1001:1001 test"

									
										8

.github/workflows/build-triton-wheel.yml
									
										vendored
									
												
				@ -50,7 +50,7 @@ jobs:

				    strategy:

				      fail-fast: false

				      matrix:

				        py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13", "3.13t" ]

				        py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13", "3.13t", "3.14", "3.14t" ]

				        device: ["cuda", "rocm", "xpu", "aarch64"]

				        docker-image: ["pytorch/manylinux2_28-builder:cpu"]

				        include:

				@ -126,6 +126,12 @@ jobs:

				          3.13t)

				            PYTHON_EXECUTABLE=/opt/python/cp313-cp313t/bin/python

				            ;;

				          3.14)

				            PYTHON_EXECUTABLE=/opt/python/cp314-cp314/bin/python

				            ;;

				          3.14t)

				            PYTHON_EXECUTABLE=/opt/python/cp314-cp314t/bin/python

				            ;;

				          *)

				            echo "Unsupported python version ${PY_VERS}"

				            exit 1

									
.github/workflows/check-labels.yml (3 changes, vendored)

				@ -34,7 +34,8 @@ jobs:

				      contents: read

				      pull-requests: write

				    name: Check labels

				    if: github.repository_owner == 'pytorch'

				    # Disabling the job until https://github.com/pytorch/pytorch/issues/159825 is resolved

				    if: github.repository_owner == 'pytorch' && false

				    runs-on: linux.24_04.4x

				    steps:

				      - name: Checkout PyTorch

									
.github/workflows/check_mergeability_ghstack.yml (5 changes, vendored)

				@ -7,7 +7,8 @@ on:

				jobs:

				  ghstack-mergeability-check:

				    if: github.repository_owner == 'pytorch'

				    # Disabling the job until https://github.com/pytorch/pytorch/issues/159825 is resolved

				    if: github.repository_owner == 'pytorch' && false

				    runs-on: ubuntu-latest

				    steps:

				      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

				@ -56,7 +57,7 @@ jobs:

				          cache: pip

				          architecture: x64

				      - run: pip install pyyaml==6.0

				      - run: pip install pyyaml==6.0.2

				        shell: bash

				      - name: Verify mergeability

									
.github/workflows/cherry-pick.yml (2 changes, vendored)

				@ -26,7 +26,7 @@ jobs:

				          cache: pip

				      # Not the direct dependencies but the script uses trymerge

				      - run: pip install pyyaml==6.0

				      - run: pip install pyyaml==6.0.2

				      - name: Setup committer id

				        run: |

									
.github/workflows/docker-builds.yml (9 changes, vendored)

				@ -51,21 +51,17 @@ jobs:

				        docker-image-name: [

				          pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11,

				          pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm,

				          pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks,

				          pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks,

				          pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks,

				          pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks,

				          pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks,

				          pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks,

				          pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9,

				          pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11,

				          pytorch-linux-jammy-py3.9-clang12,

				          pytorch-linux-jammy-py3.11-clang12,

				          pytorch-linux-jammy-py3.12-clang12,

				          pytorch-linux-jammy-py3.13-clang12,

				          pytorch-linux-jammy-rocm-n-py3,

				          pytorch-linux-noble-rocm-n-py3,

				          pytorch-linux-noble-rocm-alpha-py3,

				          pytorch-linux-jammy-rocm-n-py3-benchmarks,

				          pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12,

				          pytorch-linux-jammy-py3.9-gcc11,

				          pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks,

				@ -76,7 +72,8 @@ jobs:

				          pytorch-linux-jammy-py3-clang12-onnx,

				          pytorch-linux-jammy-linter,

				          pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter,

				          pytorch-linux-jammy-py3-clang12-executorch,

				          # Executorch pin needs update

				          # pytorch-linux-jammy-py3-clang12-executorch,

				          pytorch-linux-jammy-py3.12-triton-cpu

				        ]

				        include:

									
.github/workflows/docker-release.yml (2 changes, vendored)

				@ -144,7 +144,7 @@ jobs:

				        run: |

				          make -f docker.Makefile "${BUILD_IMAGE_TYPE}-image"

				      - name: Push nightly tags

				        if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' && matrix.build_platforms == 'linux/amd4' }}

				        if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' && matrix.platform == 'linux/amd4' }}

				        run: |

				          PYTORCH_DOCKER_TAG="${PYTORCH_VERSION}-cuda${CUDA_VERSION_SHORT}-cudnn${CUDNN_VERSION}-runtime"

				          CUDA_SUFFIX="-cu${CUDA_VERSION}"

									
.github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml (30 changes, generated, vendored)

				@ -60,7 +60,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -84,7 +83,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      build_name: manywheel-py3_9-cpu-aarch64

				      build_environment: linux-aarch64-binary-manywheel

				@ -108,7 +106,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      build_name: manywheel-py3_9-cpu-aarch64

				    secrets:

				@ -129,7 +126,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -156,7 +152,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      build_name: manywheel-py3_9-cuda-aarch64-12_9

				    secrets:

				@ -176,7 +171,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.10"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -200,7 +194,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.10"

				      build_name: manywheel-py3_10-cpu-aarch64

				      build_environment: linux-aarch64-binary-manywheel

				@ -224,7 +217,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.10"

				      build_name: manywheel-py3_10-cpu-aarch64

				    secrets:

				@ -245,7 +237,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.10"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -272,7 +263,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.10"

				      build_name: manywheel-py3_10-cuda-aarch64-12_9

				    secrets:

				@ -292,7 +282,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.11"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -316,7 +305,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.11"

				      build_name: manywheel-py3_11-cpu-aarch64

				      build_environment: linux-aarch64-binary-manywheel

				@ -340,7 +328,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.11"

				      build_name: manywheel-py3_11-cpu-aarch64

				    secrets:

				@ -361,7 +348,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.11"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -388,7 +374,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.11"

				      build_name: manywheel-py3_11-cuda-aarch64-12_9

				    secrets:

				@ -408,7 +393,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.12"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -432,7 +416,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.12"

				      build_name: manywheel-py3_12-cpu-aarch64

				      build_environment: linux-aarch64-binary-manywheel

				@ -456,7 +439,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.12"

				      build_name: manywheel-py3_12-cpu-aarch64

				    secrets:

				@ -477,7 +459,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.12"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -504,7 +485,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.12"

				      build_name: manywheel-py3_12-cuda-aarch64-12_9

				    secrets:

				@ -524,7 +504,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.13"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -548,7 +527,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.13"

				      build_name: manywheel-py3_13-cpu-aarch64

				      build_environment: linux-aarch64-binary-manywheel

				@ -572,7 +550,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.13"

				      build_name: manywheel-py3_13-cpu-aarch64

				    secrets:

				@ -593,7 +570,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.13"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -620,7 +596,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.13"

				      build_name: manywheel-py3_13-cuda-aarch64-12_9

				    secrets:

				@ -640,7 +615,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.13t"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -664,7 +638,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.13t"

				      build_name: manywheel-py3_13t-cpu-aarch64

				      build_environment: linux-aarch64-binary-manywheel

				@ -688,7 +661,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-aarch64

				      DOCKER_IMAGE: manylinux2_28_aarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64

				      use_split_build: False

				      DESIRED_PYTHON: "3.13t"

				      build_name: manywheel-py3_13t-cpu-aarch64

				    secrets:

				@ -709,7 +681,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.13t"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.arm64.m7g.4xlarge.ephemeral

				@ -736,7 +707,6 @@ jobs:

				      GPU_ARCH_TYPE: cuda-aarch64

				      DOCKER_IMAGE: manylinuxaarch64-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.13t"

				      build_name: manywheel-py3_13t-cuda-aarch64-12_9

				    secrets:

									
.github/workflows/generated-linux-binary-manywheel-main.yml (110 changes, generated, vendored)

				@ -42,54 +42,7 @@ jobs:

				      issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}

				      curr_branch: ${{ github.head_ref || github.ref_name }}

				      curr_ref_type: ${{ github.ref_type }}

				  manywheel-py3_9-cuda12_6-build:

				    if: ${{ github.repository_owner == 'pytorch' }}

				    uses: ./.github/workflows/_binary-build-linux.yml

				    needs: get-label-type

				    with:

				      PYTORCH_ROOT: /pytorch

				      PACKAGE_TYPE: manywheel

				      # TODO: This is a legacy variable that we eventually want to get rid of in

				      #       favor of GPU_ARCH_VERSION

				      DESIRED_CUDA: cu126

				      GPU_ARCH_VERSION: 12.6

				      GPU_ARCH_TYPE: cuda

				      DOCKER_IMAGE: manylinux2_28-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.6

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build_name: manywheel-py3_9-cuda12_6

				      build_environment: linux-binary-manywheel

				      PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'

				    secrets:

				      github-token: ${{ secrets.GITHUB_TOKEN }}

				  manywheel-py3_9-cuda12_6-test:  # Testing

				    if: ${{ github.repository_owner == 'pytorch' }}

				    needs:

				      - manywheel-py3_9-cuda12_6-build

				      - get-label-type

				    uses: ./.github/workflows/_binary-test-linux.yml

				    with:

				      PYTORCH_ROOT: /pytorch

				      PACKAGE_TYPE: manywheel

				      # TODO: This is a legacy variable that we eventually want to get rid of in

				      #       favor of GPU_ARCH_VERSION

				      DESIRED_CUDA: cu126

				      GPU_ARCH_VERSION: 12.6

				      GPU_ARCH_TYPE: cuda

				      DOCKER_IMAGE: manylinux2_28-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.6

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      build_name: manywheel-py3_9-cuda12_6

				      build_environment: linux-binary-manywheel

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.4xlarge.nvidia.gpu # for other cuda versions, we use 4xlarge runner

				    secrets:

				      github-token: ${{ secrets.GITHUB_TOKEN }}

				  manywheel-py3_9-cuda12_8-build:

				  manywheel-py3_12-cuda12_8-build:

				    if: ${{ github.repository_owner == 'pytorch' }}

				    uses: ./.github/workflows/_binary-build-linux.yml

				    needs: get-label-type

				@ -103,18 +56,17 @@ jobs:

				      GPU_ARCH_TYPE: cuda

				      DOCKER_IMAGE: manylinux2_28-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.8

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      DESIRED_PYTHON: "3.12"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build_name: manywheel-py3_9-cuda12_8

				      build_name: manywheel-py3_12-cuda12_8

				      build_environment: linux-binary-manywheel

				      PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'

				    secrets:

				      github-token: ${{ secrets.GITHUB_TOKEN }}

				  manywheel-py3_9-cuda12_8-test:  # Testing

				  manywheel-py3_12-cuda12_8-test:  # Testing

				    if: ${{ github.repository_owner == 'pytorch' }}

				    needs:

				      - manywheel-py3_9-cuda12_8-build

				      - manywheel-py3_12-cuda12_8-build

				      - get-label-type

				    uses: ./.github/workflows/_binary-test-linux.yml

				    with:

				@ -127,56 +79,8 @@ jobs:

				      GPU_ARCH_TYPE: cuda

				      DOCKER_IMAGE: manylinux2_28-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.8

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      build_name: manywheel-py3_9-cuda12_8

				      build_environment: linux-binary-manywheel

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.g4dn.4xlarge.nvidia.gpu  # 12.8 and 12.9 build need sm_70+ runner

				    secrets:

				      github-token: ${{ secrets.GITHUB_TOKEN }}

				  manywheel-py3_9-cuda12_9-build:

				    if: ${{ github.repository_owner == 'pytorch' }}

				    uses: ./.github/workflows/_binary-build-linux.yml

				    needs: get-label-type

				    with:

				      PYTORCH_ROOT: /pytorch

				      PACKAGE_TYPE: manywheel

				      # TODO: This is a legacy variable that we eventually want to get rid of in

				      #       favor of GPU_ARCH_VERSION

				      DESIRED_CUDA: cu129

				      GPU_ARCH_VERSION: 12.9

				      GPU_ARCH_TYPE: cuda

				      DOCKER_IMAGE: manylinux2_28-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build_name: manywheel-py3_9-cuda12_9

				      build_environment: linux-binary-manywheel

				      PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'

				    secrets:

				      github-token: ${{ secrets.GITHUB_TOKEN }}

				  manywheel-py3_9-cuda12_9-test:  # Testing

				    if: ${{ github.repository_owner == 'pytorch' }}

				    needs:

				      - manywheel-py3_9-cuda12_9-build

				      - get-label-type

				    uses: ./.github/workflows/_binary-test-linux.yml

				    with:

				      PYTORCH_ROOT: /pytorch

				      PACKAGE_TYPE: manywheel

				      # TODO: This is a legacy variable that we eventually want to get rid of in

				      #       favor of GPU_ARCH_VERSION

				      DESIRED_CUDA: cu129

				      GPU_ARCH_VERSION: 12.9

				      GPU_ARCH_TYPE: cuda

				      DOCKER_IMAGE: manylinux2_28-builder

				      DOCKER_IMAGE_TAG_PREFIX: cuda12.9

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      build_name: manywheel-py3_9-cuda12_9

				      DESIRED_PYTHON: "3.12"

				      build_name: manywheel-py3_12-cuda12_8

				      build_environment: linux-binary-manywheel

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      runs_on: linux.g4dn.4xlarge.nvidia.gpu  # 12.8 and 12.9 build need sm_70+ runner

.github/workflows/generated-linux-binary-manywheel-nightly.yml (1313 changes, generated, vendored)

File diff suppressed because it is too large.

									
.github/workflows/generated-linux-binary-manywheel-rocm-main.yml (2 changes, generated, vendored)

				@ -58,7 +58,6 @@ jobs:

				      GPU_ARCH_TYPE: rocm

				      DOCKER_IMAGE: manylinux2_28-builder

				      DOCKER_IMAGE_TAG_PREFIX: rocm6.4

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build_name: manywheel-py3_9-rocm6_4

				@ -83,7 +82,6 @@ jobs:

				      SKIP_ALL_TESTS: 1

				      DOCKER_IMAGE: manylinux2_28-builder

				      DOCKER_IMAGE_TAG_PREFIX: rocm6.4

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				    steps:

				      - name: Setup ROCm

									
.github/workflows/generated-linux-s390x-binary-manywheel-nightly.yml (15 changes, generated, vendored)

				@ -60,7 +60,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      runs_on: linux.s390x

				      ALPINE_IMAGE: "docker.io/s390x/alpine"

				@ -84,7 +83,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      build_name: manywheel-py3_9-cpu-s390x

				      build_environment: linux-s390x-binary-manywheel

				@ -107,7 +105,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.9"

				      build_name: manywheel-py3_9-cpu-s390x

				    secrets:

				@ -127,7 +124,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.10"

				      runs_on: linux.s390x

				      ALPINE_IMAGE: "docker.io/s390x/alpine"

				@ -151,7 +147,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.10"

				      build_name: manywheel-py3_10-cpu-s390x

				      build_environment: linux-s390x-binary-manywheel

				@ -174,7 +169,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.10"

				      build_name: manywheel-py3_10-cpu-s390x

				    secrets:

				@ -194,7 +188,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.11"

				      runs_on: linux.s390x

				      ALPINE_IMAGE: "docker.io/s390x/alpine"

				@ -218,7 +211,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.11"

				      build_name: manywheel-py3_11-cpu-s390x

				      build_environment: linux-s390x-binary-manywheel

				@ -241,7 +233,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.11"

				      build_name: manywheel-py3_11-cpu-s390x

				    secrets:

				@ -261,7 +252,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.12"

				      runs_on: linux.s390x

				      ALPINE_IMAGE: "docker.io/s390x/alpine"

				@ -285,7 +275,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.12"

				      build_name: manywheel-py3_12-cpu-s390x

				      build_environment: linux-s390x-binary-manywheel

				@ -308,7 +297,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.12"

				      build_name: manywheel-py3_12-cpu-s390x

				    secrets:

				@ -328,7 +316,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.13"

				      runs_on: linux.s390x

				      ALPINE_IMAGE: "docker.io/s390x/alpine"

				@ -352,7 +339,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.13"

				      build_name: manywheel-py3_13-cpu-s390x

				      build_environment: linux-s390x-binary-manywheel

				@ -375,7 +361,6 @@ jobs:

				      GPU_ARCH_TYPE: cpu-s390x

				      DOCKER_IMAGE: pytorch/manylinuxs390x-builder

				      DOCKER_IMAGE_TAG_PREFIX: cpu-s390x

				      use_split_build: False

				      DESIRED_PYTHON: "3.13"

				      build_name: manywheel-py3_13-cpu-s390x

				    secrets:

									
										154

.github/workflows/inductor-perf-test-b200.yml
									
										Normal file
									
				@ -0,0 +1,154 @@

				name: inductor-perf-b200

				on:

				  schedule:

				    - cron: 0 7 * * 1-6

				    - cron: 0 7 * * 0

				  # NB: GitHub has an upper limit of 10 inputs here, so before we can sort it

				  # out, let try to run torchao cudagraphs_low_precision as part of cudagraphs

				  workflow_dispatch:

				    inputs:

				      training:

				        description: Run training (on by default)?

				        required: false

				        type: boolean

				        default: true

				      inference:

				        description: Run inference (on by default)?

				        required: false

				        type: boolean

				        default: true

				      default:

				        description: Run inductor_default?

				        required: false

				        type: boolean

				        default: false

				      dynamic:

				        description: Run inductor_dynamic_shapes?

				        required: false

				        type: boolean

				        default: false

				      cppwrapper:

				        description: Run inductor_cpp_wrapper?

				        required: false

				        type: boolean

				        default: false

				      cudagraphs:

				        description: Run inductor_cudagraphs?

				        required: false

				        type: boolean

				        default: true

				      freezing_cudagraphs:

				        description: Run inductor_cudagraphs with freezing for inference?

				        required: false

				        type: boolean

				        default: false

				      aotinductor:

				        description: Run aot_inductor for inference?

				        required: false

				        type: boolean

				        default: false

				      maxautotune:

				        description: Run inductor_max_autotune?

				        required: false

				        type: boolean

				        default: false

				      benchmark_configs:

				        description: The list of configs used the benchmark

				        required: false

				        type: string

				        default: inductor_huggingface_perf_cuda_b200,inductor_timm_perf_cuda_b200,inductor_torchbench_perf_cuda_b200

				concurrency:

				  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}

				  cancel-in-progress: true

				permissions:

				  id-token: write

				  contents: read

				jobs:

				  get-label-type:

				    name: get-label-type

				    uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main

				    if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}

				    with:

				      triggering_actor: ${{ github.triggering_actor }}

				      issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}

				      curr_branch: ${{ github.head_ref || github.ref_name }}

				      curr_ref_type: ${{ github.ref_type }}

				      opt_out_experiments: lf

				  build:

				    name: cuda12.8-py3.10-gcc9-sm100

				    uses: ./.github/workflows/_linux-build.yml

				    needs: get-label-type

				    with:

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      # Use a bigger runner here because CUDA_ARCH 9.0 is only built for H100

				      # or newer GPUs, so it doesn't benefit much from existing compiler cache

				      # from trunk. Also use a memory-intensive runner here because memory is

				      # usually the bottleneck

				      runner: linux.12xlarge.memory

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100

				      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks

				      cuda-arch-list: '10.0'

				      test-matrix: |

				        { include: [

				          { config: "inductor_huggingface_perf_cuda_b200", shard: 1, num_shards: 1, runner: "linux.dgx.b200" },

				          { config: "inductor_timm_perf_cuda_b200", shard: 1, num_shards: 1, runner: "linux.dgx.b200" },

				          { config: "inductor_torchbench_perf_cuda_b200", shard: 1, num_shards: 1, runner: "linux.dgx.b200" },

				        ]}

				      selected-test-configs: ${{ inputs.benchmark_configs }}

				      build-additional-packages: "vision audio fbgemm torchao"

				    secrets: inherit

				  test-periodically:

				    name: cuda12.8-py3.10-gcc9-sm100

				    uses: ./.github/workflows/_linux-test.yml

				    needs: build

				    if: github.event.schedule == '0 7 * * 1-6'

				    with:

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100

				      dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-cudagraphs_low_precision-true

				      docker-image: ${{ needs.build.outputs.docker-image }}

				      test-matrix: ${{ needs.build.outputs.test-matrix }}

				      aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only

				      timeout-minutes: 720

				      disable-monitor: false

				      monitor-log-interval: 15

				      monitor-data-collect-interval: 4

				    secrets: inherit

				  test-weekly:

				    name: cuda12.8-py3.10-gcc9-sm100

				    uses: ./.github/workflows/_linux-test.yml

				    needs: build

				    if: github.event.schedule == '0 7 * * 0'

				    with:

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100

				      dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-maxautotune-true-freeze_autotune_cudagraphs-true-cudagraphs_low_precision-true

				      docker-image: ${{ needs.build.outputs.docker-image }}

				      test-matrix: ${{ needs.build.outputs.test-matrix }}

				      timeout-minutes: 1440

				      aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only

				      disable-monitor: false

				      monitor-log-interval: 15

				      monitor-data-collect-interval: 4

				    secrets: inherit

				  test:

				    name: cuda12.8-py3.10-gcc9-sm100

				    uses: ./.github/workflows/_linux-test.yml

				    needs: build

				    with:

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100

				      dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cudagraphs-${{ inputs.cudagraphs }}-cppwrapper-${{ inputs.cppwrapper }}-aotinductor-${{ inputs.aotinductor }}-maxautotune-${{ inputs.maxautotune }}-freezing_cudagraphs-${{ inputs.freezing_cudagraphs }}-cudagraphs_low_precision-${{ inputs.cudagraphs }}

				      docker-image: ${{ needs.build.outputs.docker-image }}

				      test-matrix: ${{ needs.build.outputs.test-matrix }}

				      aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only

				      timeout-minutes: 720

				      disable-monitor: false

				      monitor-log-interval: 15

				      monitor-data-collect-interval: 4

				    secrets: inherit
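
The three test jobs in the workflow above are gated on the exact cron string carried in `github.event.schedule`. The following is a minimal sketch of that selection logic as it appears in this hunk, assuming the upstream `build` job succeeded. The function name `jobs_selected` and the constants are hypothetical, introduced only for illustration; the job names and cron strings are copied from the diff.

```python
# Sketch only: models the `if:` conditions visible in the hunk above.
# It is not part of the workflow and ignores `needs:` failures/skips.
from typing import List, Optional

PERIODIC_CRON = "0 7 * * 1-6"  # cron string compared in `test-periodically`
WEEKLY_CRON = "0 7 * * 0"      # cron string compared in `test-weekly`


def jobs_selected(schedule: Optional[str]) -> List[str]:
    """Return the test jobs whose `if:` condition (as shown) holds.

    `schedule` stands in for `github.event.schedule`; pass None for a
    non-schedule trigger such as `workflow_dispatch`.
    """
    selected = []
    # The comparison is a plain string equality, so the cron entry under
    # `on.schedule` and the string in `if:` have to match character for character.
    if schedule == PERIODIC_CRON:
        selected.append("test-periodically")
    if schedule == WEEKLY_CRON:
        selected.append("test-weekly")
    # The `test` job carries no `if:` in this hunk, so nothing filters it here.
    selected.append("test")
    return selected


if __name__ == "__main__":
    for trigger in (PERIODIC_CRON, WEEKLY_CRON, None):
        print(trigger, "->", jobs_selected(trigger))
```

Because the match is on the literal string, a change to a cron entry needs the same change in the corresponding `if:` line, which is the pattern the `inductor-perf-test-nightly-h100.yml` hunk follows.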

									
										4

.github/workflows/inductor-perf-test-nightly-h100.yml
									
												
				@ -2,7 +2,7 @@ name: inductor-perf-nightly-h100

				on:

				  schedule:

				    - cron: 15 0,4,8,12,16,20 * * 1-6

				    - cron: 15 0,12 * * 1-6

				    - cron: 0 7 * * 0

				  # NB: GitHub has an upper limit of 10 inputs here, so before we can sort it

				  # out, let try to run torchao cudagraphs_low_precision as part of cudagraphs

				@ -126,7 +126,7 @@ jobs:

				    name: cuda12.8-py3.10-gcc9-sm90

				    uses: ./.github/workflows/_linux-test.yml

				    needs: build

				    if: github.event.schedule == '15 0,4,8,12,16,20 * * 1-6'

				    if: github.event.schedule == '15 0,12 * * 1-6'

				    with:

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm90

				      dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-cudagraphs_low_precision-true

									
										36

.github/workflows/inductor-perf-test-nightly-rocm.yml
									
												
				@ -85,26 +85,26 @@ jobs:

				    uses: ./.github/workflows/_linux-build.yml

				    with:

				      build-environment: linux-jammy-rocm-py3_10

				      docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3

				      docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3-benchmarks

				      test-matrix: |

				        { include: [

				          { config: "inductor_huggingface_perf_rocm", shard: 1, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_huggingface_perf_rocm", shard: 2, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_huggingface_perf_rocm", shard: 3, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_huggingface_perf_rocm", shard: 4, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_timm_perf_rocm", shard: 1, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_timm_perf_rocm", shard: 2, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_timm_perf_rocm", shard: 3, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_timm_perf_rocm", shard: 4, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_timm_perf_rocm", shard: 5, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 1, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 2, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 3, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 4, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 5, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 6, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 7, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 8, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor_huggingface_perf_rocm", shard: 1, num_shards: 4, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_huggingface_perf_rocm", shard: 2, num_shards: 4, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_huggingface_perf_rocm", shard: 3, num_shards: 4, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_huggingface_perf_rocm", shard: 4, num_shards: 4, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_timm_perf_rocm", shard: 1, num_shards: 5, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_timm_perf_rocm", shard: 2, num_shards: 5, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_timm_perf_rocm", shard: 3, num_shards: 5, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_timm_perf_rocm", shard: 4, num_shards: 5, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_timm_perf_rocm", shard: 5, num_shards: 5, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 1, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 2, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 3, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 4, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 5, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 6, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 7, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor_torchbench_perf_rocm", shard: 8, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },

				        ]}

				    secrets: inherit

									
										32

.github/workflows/inductor-periodic.yml
									
												
				@ -77,25 +77,25 @@ jobs:

				    uses: ./.github/workflows/_linux-build.yml

				    with:

				      build-environment: linux-jammy-rocm-py3_10

				      docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3

				      docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3-benchmarks

				      sync-tag: rocm-build

				      test-matrix: |

				        { include: [

				          { config: "dynamo_eager_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "dynamo_eager_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "dynamo_eager_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "dynamo_eager_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "dynamo_eager_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "aot_eager_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "aot_eager_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "aot_eager_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "aot_eager_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "aot_eager_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "dynamic_aot_eager_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "dynamic_aot_eager_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "dynamic_aot_eager_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "dynamic_aot_eager_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "dynamic_aot_eager_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "dynamo_eager_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "dynamo_eager_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "dynamo_eager_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "dynamo_eager_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "dynamo_eager_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "aot_eager_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "aot_eager_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "aot_eager_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "aot_eager_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "aot_eager_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "dynamic_aot_eager_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "dynamic_aot_eager_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "dynamic_aot_eager_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "dynamic_aot_eager_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "dynamic_aot_eager_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				        ]}

				    secrets: inherit

									
										4

.github/workflows/inductor-rocm-mi300.yml
									
												
				@ -47,8 +47,8 @@ jobs:

				      docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3

				      test-matrix: |

				        { include: [

				          { config: "inductor", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "inductor", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "inductor", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },

				        ]}

				    secrets: inherit

									
										1

.github/workflows/mac-mps.yml
									
												
				@ -28,7 +28,6 @@ jobs:

				      # than our AWS macos-m1-14 runners

				      test-matrix: |

				        { include: [

				          { config: "test_mps", shard: 1, num_shards: 1, runner: "macos-m1-13" },

				          { config: "test_mps", shard: 1, num_shards: 1, runner: "macos-m1-14" },

				          { config: "test_mps", shard: 1, num_shards: 1, runner: "macos-m2-15" },

				        ]}

									
										9

.github/workflows/nightly.yml
									
												
				@ -75,10 +75,11 @@ jobs:

				            repo-owner: pytorch

				            branch: main

				            pin-folder: .github/ci_commit_pins

				          - repo-name: executorch

				            repo-owner: pytorch

				            branch: main

				            pin-folder: .ci/docker/ci_commit_pins

				          # executorch jobs are disabled since it needs some manual work for the hash update

				          # - repo-name: executorch

				          #   repo-owner: pytorch

				          #   branch: main

				          #   pin-folder: .ci/docker/ci_commit_pins

				          - repo-name: triton

				            repo-owner: triton-lang

				            branch: main

									
										6

.github/workflows/periodic-rocm-mi300.yml
									
												
				@ -59,9 +59,9 @@ jobs:

				      docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3

				      test-matrix: |

				        { include: [

				          { config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.mi300.4", owners: ["module:rocm", "oncall:distributed"] },

				          { config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.mi300.4", owners: ["module:rocm", "oncall:distributed"] },

				          { config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.mi300.4", owners: ["module:rocm", "oncall:distributed"] },

				          { config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.gfx942.4", owners: ["module:rocm", "oncall:distributed"] },

				          { config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.gfx942.4", owners: ["module:rocm", "oncall:distributed"] },

				          { config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.gfx942.4", owners: ["module:rocm", "oncall:distributed"] },

				        ]}

				    secrets: inherit

									
										31

.github/workflows/periodic.yml
									
												
				@ -51,37 +51,6 @@ jobs:

				      curr_branch: ${{ github.head_ref || github.ref_name }}

				      curr_ref_type: ${{ github.ref_type }}

				  linux-jammy-cuda12_4-py3_10-gcc11-sm89-build:

				    name: linux-jammy-cuda12.4-py3.10-gcc11-sm89

				    uses: ./.github/workflows/_linux-build.yml

				    needs: get-label-type

				    with:

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build-environment: linux-jammy-cuda12.4-py3.10-gcc11-sm89

				      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11

				      cuda-arch-list: 8.9

				      test-matrix: |

				        { include: [

				          { config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				        ]}

				    secrets: inherit

				  linux-jammy-cuda12_4-py3_10-gcc11-sm89-test:

				    name: linux-jammy-cuda12.4-py3.10-gcc11-sm89

				    uses: ./.github/workflows/_linux-test.yml

				    needs:

				      - linux-jammy-cuda12_4-py3_10-gcc11-sm89-build

				      - target-determination

				    with:

				      build-environment: linux-jammy-cuda12.4-py3.10-gcc11-sm89

				      docker-image: ${{ needs.linux-jammy-cuda12_4-py3_10-gcc11-sm89-build.outputs.docker-image }}

				      test-matrix: ${{ needs.linux-jammy-cuda12_4-py3_10-gcc11-sm89-build.outputs.test-matrix }}

				    secrets: inherit

				  linux-jammy-cuda12_4-py3_10-gcc11-build:

				    name: linux-jammy-cuda12.4-py3.10-gcc11

				    uses: ./.github/workflows/_linux-build.yml

									
										142

.github/workflows/pull.yml
									
												
				@ -254,67 +254,6 @@ jobs:

				      timeout-minutes: 600

				    secrets: inherit

				  linux-jammy-cuda12_8-py3_10-gcc11-build-distributed:

				    name: linux-jammy-cuda12.8-py3.10-gcc11-build-distributed

				    uses: ./.github/workflows/_linux-build.yml

				    needs: get-label-type

				    with:

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc11-distributed

				      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11

				      cuda-arch-list: '7.5'

				      test-matrix: |

				        { include: [

				          { config: "distributed", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.12xlarge.nvidia.gpu" },

				          { config: "distributed", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.12xlarge.nvidia.gpu" },

				          { config: "distributed", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.12xlarge.nvidia.gpu" },

				        ]}

				    secrets: inherit

				  linux-jammy-cuda12_8-py3_10-gcc11-test-distributed:

				    name: linux-jammy-cuda12.8-py3.10-gcc11-test

				    uses: ./.github/workflows/_linux-test.yml

				    needs:

				      - linux-jammy-cuda12_8-py3_10-gcc11-build-distributed

				      - target-determination

				    with:

				      timeout-minutes: 360

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc11-distributed

				      docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build-distributed.outputs.docker-image }}

				      test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build-distributed.outputs.test-matrix }}

				    secrets: inherit

				  linux-jammy-cuda12_8-py3_10-gcc11-build:

				    name: linux-jammy-cuda12.8-py3.10-gcc11

				    uses: ./.github/workflows/_linux-build.yml

				    needs: get-label-type

				    with:

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc11

				      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11

				      test-matrix: |

				        { include: [

				          { config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },

				          { config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },

				          { config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },

				          { config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },

				          { config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },

				        ]}

				    secrets: inherit

				  linux-jammy-cuda12_8-py3_10-gcc11-test:

				    name: linux-jammy-cuda12.8-py3.10-gcc11

				    uses: ./.github/workflows/_linux-test.yml

				    needs:

				      - linux-jammy-cuda12_8-py3_10-gcc11-build

				      - target-determination

				    with:

				      timeout-minutes: 360

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc11

				      docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build.outputs.docker-image }}

				      test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build.outputs.test-matrix }}

				    secrets: inherit

				  linux-jammy-cuda12_8-cudnn9-py3_9-clang12-build:

				    name: linux-jammy-cuda12.8-cudnn9-py3.9-clang12

				    uses: ./.github/workflows/_linux-build.yml

				@ -329,30 +268,6 @@ jobs:

				        ]}

				    secrets: inherit

				  linux-jammy-py3_9-clang9-xla-build:

				    name: linux-jammy-py3_9-clang9-xla

				    uses: ./.github/workflows/_linux-build.yml

				    needs: get-label-type

				    with:

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build-environment: linux-jammy-py3.9-clang9-xla

				      docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base:v1.3-lite

				      test-matrix: |

				        { include: [

				          { config: "xla", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },

				        ]}

				    secrets: inherit

				  linux-jammy-py3_9-clang9-xla-test:

				    name: linux-jammy-py3_9-clang9-xla

				    uses: ./.github/workflows/_linux-test.yml

				    needs: linux-jammy-py3_9-clang9-xla-build

				    with:

				      build-environment: linux-jammy-py3.9-clang9-xla

				      docker-image: ${{ needs.linux-jammy-py3_9-clang9-xla-build.outputs.docker-image }}

				      test-matrix: ${{ needs.linux-jammy-py3_9-clang9-xla-build.outputs.test-matrix }}

				    secrets: inherit

				  linux-jammy-cpu-py3_10-gcc11-bazel-test:

				    name: linux-jammy-cpu-py3.10-gcc11-bazel-test

				    uses: ./.github/workflows/_bazel-build-test.yml

				@ -402,38 +317,8 @@ jobs:

				        ]}

				    secrets: inherit

				  linux-jammy-cuda12_8-py3_10-gcc11-sm89-build:

				    name: linux-jammy-cuda12.8-py3.10-gcc11-sm89

				    uses: ./.github/workflows/_linux-build.yml

				    needs: get-label-type

				    with:

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm89

				      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11

				      cuda-arch-list: 8.9

				      test-matrix: |

				        { include: [

				          { config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				        ]}

				    secrets: inherit

				  linux-jammy-cuda12_8-py3_10-gcc11-sm89-test:

				    name: linux-jammy-cuda12.8-py3.10-gcc11-sm89

				    uses: ./.github/workflows/_linux-test.yml

				    needs:

				      - linux-jammy-cuda12_8-py3_10-gcc11-sm89-build

				      - target-determination

				    with:

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm89

				      docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm89-build.outputs.docker-image }}

				      test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm89-build.outputs.test-matrix }}

				    secrets: inherit

				  linux-jammy-py3-clang12-executorch-build:

				    if: false  # Docker build needs pin update

				    name: linux-jammy-py3-clang12-executorch

				    uses: ./.github/workflows/_linux-build.yml

				    needs: get-label-type

				@ -458,31 +343,6 @@ jobs:

				      test-matrix: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.test-matrix }}

				    secrets: inherit

				  linux-jammy-cuda12_8-py3_10-gcc9-inductor-build:

				    name: cuda12.8-py3.10-gcc9-sm75

				    uses: ./.github/workflows/_linux-build.yml

				    needs: get-label-type

				    with:

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm75

				      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks

				      cuda-arch-list: '7.5'

				      test-matrix: |

				        { include: [

				          { config: "pr_time_benchmarks", shard: 1, num_shards: 1, runner: "linux.g4dn.metal.nvidia.gpu" },

				        ]}

				    secrets: inherit

				  linux-jammy-cuda12_8-py3_10-gcc9-inductor-test:

				    name: cuda12.8-py3.10-gcc9-sm75

				    uses: ./.github/workflows/_linux-test.yml

				    needs: linux-jammy-cuda12_8-py3_10-gcc9-inductor-build

				    with:

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm75

				      docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-build.outputs.docker-image }}

				      test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-build.outputs.test-matrix }}

				    secrets: inherit

				  linux-jammy-xpu-2025_1-py3_9-build:

				    name: linux-jammy-xpu-2025.1-py3.9

				    uses: ./.github/workflows/_linux-build.yml

									
										2

.github/workflows/revert.yml
									
												
				@ -26,7 +26,7 @@ jobs:

				          architecture: x64

				          check-latest: false

				          cache: pip

				      - run: pip install pyyaml==6.0

				      - run: pip install pyyaml==6.0.2

				      - name: Setup committer id

				        run: |

									
										12

.github/workflows/rocm-mi300.yml
									
												
				@ -48,12 +48,12 @@ jobs:

				      sync-tag: rocm-build

				      test-matrix: |

				        { include: [

				          { config: "default", shard: 1, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "default", shard: 2, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "default", shard: 3, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "default", shard: 4, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "default", shard: 5, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "default", shard: 6, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },

				          { config: "default", shard: 1, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "default", shard: 2, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "default", shard: 3, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "default", shard: 4, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "default", shard: 5, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },

				          { config: "default", shard: 6, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },

				        ]}

				    secrets: inherit

									
										68

.github/workflows/rocm-mi355.yml
									
										vendored
									
										Normal file
									
												
				@ -0,0 +1,68 @@

				name: rocm-mi355

				on:

				  workflow_dispatch:

				  schedule:

				    - cron: 30 11,1 * * *  # about 4:30am PDT and 6:30pm PDT

				concurrency:

				  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}

				  cancel-in-progress: true

				permissions: read-all

				jobs:

				  target-determination:

				    if: github.repository_owner == 'pytorch'

				    name: before-test

				    uses: ./.github/workflows/target_determination.yml

				    permissions:

				      id-token: write

				      contents: read

				  get-label-type:

				    name: get-label-type

				    uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main

				    if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}

				    with:

				      triggering_actor: ${{ github.triggering_actor }}

				      issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}

				      curr_branch: ${{ github.head_ref || github.ref_name }}

				      curr_ref_type: ${{ github.ref_type }}

				  linux-noble-rocm-py3_12-build:

				    if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}

				    name: linux-noble-rocm-py3.12-mi355

				    uses: ./.github/workflows/_linux-build.yml

				    needs: get-label-type

				    with:

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build-environment: linux-noble-rocm-py3.12-mi355

				      docker-image-name: ci-image:pytorch-linux-noble-rocm-alpha-py3

				      sync-tag: rocm-build

				      test-matrix: |

				        { include: [

				          { config: "default", shard: 1, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },

				          { config: "default", shard: 2, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },

				          { config: "default", shard: 3, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },

				          { config: "default", shard: 4, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },

				          { config: "default", shard: 5, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },

				          { config: "default", shard: 6, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },

				        ]}

				    secrets: inherit

				  linux-noble-rocm-py3_12-test:

				    permissions:

				      id-token: write

				      contents: read

				    name: linux-noble-rocm-py3.12-mi355

				    uses: ./.github/workflows/_rocm-test.yml

				    needs:

				      - linux-noble-rocm-py3_12-build

				      - target-determination

				    with:

				      build-environment: linux-noble-rocm-py3.12-mi355

				      docker-image: ${{ needs.linux-noble-rocm-py3_12-build.outputs.docker-image }}

				      test-matrix: ${{ needs.linux-noble-rocm-py3_12-build.outputs.test-matrix }}

				      tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor"

				    secrets: inherit
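
The test-matrix in the new rocm-mi355.yml above repeats one entry per shard. A minimal sketch of how such a matrix could be generated, assuming a hypothetical helper named build_test_matrix (it is not part of the PyTorch CI scripts; only the config, shard count, and runner label are taken from the diff):

```python
import json


def build_test_matrix(config, num_shards, runner):
    """Return a GitHub Actions style matrix with one entry per shard.

    Hypothetical helper for illustration; shards are numbered from 1,
    matching the entries in the workflow file.
    """
    return {
        "include": [
            {
                "config": config,
                "shard": shard,
                "num_shards": num_shards,
                "runner": runner,
            }
            for shard in range(1, num_shards + 1)
        ]
    }


# Values copied from the rocm-mi355 workflow: 6 shards on the mi355 runner.
matrix = build_test_matrix("default", 6, "linux.rocm.gpu.mi355.2")
print(json.dumps(matrix, indent=2))
```

The workflow itself spells the entries out by hand, so this only illustrates the structure the build job passes on to the test job.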

									
										4

.github/workflows/torchbench.yml
									
										vendored
									
												
				@ -10,6 +10,10 @@ concurrency:

				  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}

				  cancel-in-progress: true

				permissions:

				  id-token: write

				  contents: read

				jobs:

				  get-default-label-prefix:

				    if: github.repository_owner == 'pytorch'

									
										40

.github/workflows/trunk.yml
									
										vendored
									
												
				@ -63,6 +63,43 @@ jobs:

				        ]}

				    secrets: inherit

				  linux-jammy-cuda12_8-py3_10-gcc11-build:

				    name: linux-jammy-cuda12.8-py3.10-gcc11

				    uses: ./.github/workflows/_linux-build.yml

				    needs: get-label-type

				    with:

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc11

				      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11

				      cuda-arch-list: '7.5 8.9'

				      test-matrix: |

				        { include: [

				          { config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },

				          { config: "distributed", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.12xlarge.nvidia.gpu" },

				          { config: "distributed", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.12xlarge.nvidia.gpu" },

				          { config: "distributed", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.12xlarge.nvidia.gpu" },

				          { config: "pr_time_benchmarks", shard: 1, num_shards: 1, runner: "linux.g4dn.metal.nvidia.gpu" },

				        ]}

				    secrets: inherit

				  linux-jammy-cuda12_8-py3_10-gcc11-test:

				    name: linux-jammy-cuda12.8-py3.10-gcc11

				    uses: ./.github/workflows/_linux-test.yml

				    needs:

				      - linux-jammy-cuda12_8-py3_10-gcc11-build

				      - target-determination

				    with:

				      timeout-minutes: 360

				      build-environment: linux-jammy-cuda12.8-py3.10-gcc11

				      docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build.outputs.docker-image }}

				      test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build.outputs.test-matrix }}

				    secrets: inherit

				  # no-ops builds test USE_PER_OPERATOR_HEADERS=0 where ATen/ops is not generated

				  linux-jammy-cuda12_8-py3_10-gcc11-no-ops-build:

				    name: linux-jammy-cuda12.8-py3.10-gcc11-no-ops

				@ -94,7 +131,6 @@ jobs:

				          { config: "default", shard: 1, num_shards: 3, runner: "macos-m1-stable" },

				          { config: "default", shard: 2, num_shards: 3, runner: "macos-m1-stable" },

				          { config: "default", shard: 3, num_shards: 3, runner: "macos-m1-stable" },

				          { config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-13" },

				          { config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-14" },

				          { config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-15" },

				        ]}

				@ -206,7 +242,7 @@ jobs:

				    with:

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build-environment: linux-jammy-py3.9-gcc11

				      docker-image-name: ci-image:pytorch-linux-jammy-py3.9-gcc11

				      docker-image-name: ci-image:pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks

				      test-matrix: |

				        { include: [

				          { config: "verify_cachebench", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },

									
										2

.github/workflows/trymerge.yml
									
										vendored
									
												
				@ -28,7 +28,7 @@ jobs:

				          check-latest: false

				          cache: pip

				          architecture: x64

				      - run: pip install pyyaml==6.0

				      - run: pip install pyyaml==6.0.2

				      - name: Setup committer id

				        run: |

									
										2

.github/workflows/tryrebase.yml
									
										vendored
									
												
				@ -25,7 +25,7 @@ jobs:

				          architecture: x64

				          check-latest: false

				          cache: pip

				      - run: pip install pyyaml==6.0

				      - run: pip install pyyaml==6.0.2

				      - name: Setup committer id

				        run: |

									
										28

.github/workflows/unstable.yml
									
										vendored
									
												
				@ -12,7 +12,9 @@ concurrency:

				  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}

				  cancel-in-progress: true

				permissions: read-all

				permissions:

				  id-token: write

				  contents: read

				jobs:

				  # There must be at least one job here to satisfy GitHub action workflow syntax

				@ -51,3 +53,27 @@ jobs:

				      issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}

				      curr_branch: ${{ github.head_ref || github.ref_name }}

				      curr_ref_type: ${{ github.ref_type }}

				  linux-jammy-py3_9-clang9-xla-build:

				    name: linux-jammy-py3_9-clang9-xla

				    uses: ./.github/workflows/_linux-build.yml

				    needs: get-label-type

				    with:

				      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

				      build-environment: linux-jammy-py3.9-clang9-xla

				      docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base:v1.3-lite

				      test-matrix: |

				        { include: [

				          { config: "xla", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },

				        ]}

				    secrets: inherit

				  linux-jammy-py3_9-clang9-xla-test:

				    name: linux-jammy-py3_9-clang9-xla

				    uses: ./.github/workflows/_linux-test.yml

				    needs: linux-jammy-py3_9-clang9-xla-build

				    with:

				      build-environment: linux-jammy-py3.9-clang9-xla

				      docker-image: ${{ needs.linux-jammy-py3_9-clang9-xla-build.outputs.docker-image }}

				      test-matrix: ${{ needs.linux-jammy-py3_9-clang9-xla-build.outputs.test-matrix }}

				    secrets: inherit

									
										2

.github/workflows/update-viablestrict.yml
									
										vendored
									
												
				@ -23,7 +23,7 @@ jobs:

				        with:

				          repository: pytorch/pytorch

				          stable-branch: viable/strict

				          requires: '[\"pull\", \"trunk\", \"lint\", \"linux-binary\"]'

				          requires: '[\"pull\", \"trunk\", \"lint\", \"linux-binary\", \"linux-aarch64\"]'

				          secret-bot-token: ${{ secrets.MERGEBOT_TOKEN }}

				          clickhouse-url: ${{ secrets.CLICKHOUSE_URL }}

				          clickhouse-username: ${{ secrets.CLICKHOUSE_VIABLESTRICT_USERNAME }}

									
										1

.github/workflows/upload-test-stats.yml
									
										vendored
									
												
				@ -14,6 +14,7 @@ on:

				      - inductor-periodic

				      - rocm

				      - rocm-mi300

				      - rocm-mi355

				      - inductor-micro-benchmark

				      - inductor-micro-benchmark-x86

				      - inductor-cu124

3

.gitignore vendored


 @ -146,6 +146,9 @@ merge_record.json
 torchgen/packaged/*
 !torchgen/packaged/README.md
 # This file is injected by ROCm build scripts to bootstrap in torch/__init__.py.
 torch/_rocm_init.py
 # IPython notebook checkpoints
 .ipynb_checkpoints

									
										39

.lintrunner.toml
									
												
				@ -39,16 +39,16 @@ init_command = [

				    'python3',

				    'tools/linter/adapters/pip_init.py',

				    '--dry-run={{DRYRUN}}',

				    'flake8==6.1.0',

				    'flake8-bugbear==23.3.23',

				    'flake8-comprehensions==3.15.0',

				    'flake8==7.3.0',

				    'flake8-bugbear==24.12.12',

				    'flake8-comprehensions==3.16.0',

				    'flake8-executable==2.1.3',

				    'flake8-logging-format==0.9.0',

				    'flake8-pyi==23.3.1',

				    'flake8-simplify==0.19.3',

				    'flake8-logging-format==2024.24.12',

				    'flake8-pyi==25.5.0',

				    'flake8-simplify==0.22.0',

				    'mccabe==0.7.0',

				    'pycodestyle==2.11.1',

				    'pyflakes==3.1.0',

				    'pycodestyle==2.14.0',

				    'pyflakes==3.4.0',

				    'torchfix==0.4.0 ; python_version >= "3.9" and python_version < "3.13"',

				]

				@ -158,16 +158,16 @@ init_command = [

				    'mypy==1.16.0',

				    'sympy==1.13.3',

				    'types-requests==2.27.25',

				    'types-pyyaml==6.0.1',

				    'types-pyyaml==6.0.2',

				    'types-tabulate==0.8.8',

				    'types-protobuf==5.29.1.20250403',

				    'types-setuptools==79.0.0.20250422',

				    'types-jinja2==2.11.9',

				    'types-colorama==0.4.6',

				    'filelock==3.13.1',

				    'filelock==3.18.0',

				    'junitparser==2.1.1',

				    'rich==10.9.0',

				    'pyyaml==6.0.1',

				    'rich==14.1.0',

				    'pyyaml==6.0.2',

				    'optree==0.13.0',

				    'dataclasses-json==0.6.7',

				    'pandas==2.2.3',

				@ -1111,7 +1111,7 @@ init_command = [

				    'python3',

				    'tools/linter/adapters/pip_init.py',

				    '--dry-run={{DRYRUN}}',

				    'PyYAML==6.0.1',

				    'pyyaml==6.0.2',

				]

				[[linter]]

				@ -1133,7 +1133,7 @@ init_command = [

				    'python3',

				    'tools/linter/adapters/pip_init.py',

				    '--dry-run={{DRYRUN}}',

				    'PyYAML==6.0.1',

				    'pyyaml==6.0.2',

				]

				[[linter]]

				@ -1452,8 +1452,6 @@ init_command = [

				    'python3',

				    'tools/linter/adapters/pip_init.py',

				    '--dry-run={{DRYRUN}}',

				    '--no-black-binary',

				    'black==23.12.1',

				    'usort==1.0.8.post1',

				    'isort==6.0.1',

				    'ruff==0.12.2',  # sync with RUFF

				@ -1794,3 +1792,12 @@ include_patterns = [

				    'torch/header_only_apis.txt',

				]

				is_formatter = false

				[[linter]]

				code = "GB_REGISTRY"

				include_patterns = ["torch/_dynamo/**/*.py"]

				command = [

				  "python3",

				  "tools/linter/adapters/gb_registry_linter.py",

				]

									
										12

.pre-commit-config.yaml
									
											
				@ -1,12 +0,0 @@

				repos:

				  - repo: local

				    hooks:

				      - id: lintrunner

				        name: Run Lintrunner in an isolated venv before every push. The first run may be slow...

				        entry: python scripts/run_lintrunner.py   # wrapper below

				        language: python                          # pre‑commit manages venv for the wrapper

				        additional_dependencies: []               # wrapper handles lintrunner install

				        always_run: true

				        stages: [pre-push]                        # fire only on pre‑push

				        pass_filenames: false                     # Lintrunner gets no per‑file args

				        verbose: true                             # stream output as it is produced...allegedly anyways

									
										16

AGENTS.md
									
												
				@ -1 +1,17 @@

				- This is the only AGENTS.md, there are no recursive AGENTS.md

				- When you are working on a bug, first create a standalone file that

				  reproduces the bug and verify it fails in the expected way.  Use this to

				  test if your changes work.  Once the change is passing, find an appropriate

				  test file to add the test to and make sure to follow local conventions on

				  the test file.

				- If you are running the real test suite, DO NOT run the entire test suite.

				  Instead run only a single test case, e.g., 'python test/test_torch.py TestTorch.test_dir'

				- Do NOT run setup.py, you do not have a working build environment

				- Do NOT run pre-commit, it is not setup

				- To run lint, run 'lintrunner -a' (which will autoapply changes)

				- Do NOT attempt to install dependencies, you do not have Internet access

				- When you are ready to make a PR, do exactly these steps:

				  - git stash -u

				  - git reset --hard $(cat /tmp/orig_work.txt) # NB: reset to the LOCAL branch, do NOT fetch

				  - git stash pop

				  - Resolve conflicts if necessary

									
										1

BUILD.bazel
									
												
				@ -679,6 +679,7 @@ cc_library(

				        [

				            "torch/*.h",

				            "torch/csrc/**/*.h",

				            "torch/nativert/**/*.h",

				            "torch/csrc/distributed/c10d/**/*.hpp",

				            "torch/lib/libshm/*.h",

				        ],

									
										33

CMakeLists.txt
									
												
				@ -239,7 +239,9 @@ option(USE_XPU "Use XPU" ON)

				cmake_dependent_option(

				  BUILD_LAZY_CUDA_LINALG "Build cuda linalg ops as separate library" ON

				  "USE_CUDA AND LINUX AND BUILD_PYTHON" OFF)

				cmake_dependent_option(USE_ROCM "Use ROCm" ON "LINUX" OFF)

				cmake_dependent_option(USE_ROCM "Use ROCm" ON "LINUX OR WIN32" OFF)

				cmake_dependent_option(USE_ROCM_CK_GEMM "Use ROCm Composable Kernel for GEMMs" ON "USE_ROCM;NOT WIN32" OFF)

				option(USE_ROCM_CK_SDPA "Use ROCm Composable Kernel for SDPA" OFF)

				option(CAFFE2_STATIC_LINK_CUDA "Statically link CUDA libraries" OFF)

				cmake_dependent_option(USE_CUDNN "Use cuDNN" ON "USE_CUDA" OFF)

				cmake_dependent_option(USE_STATIC_CUDNN "Use cuDNN static libraries" OFF

				@ -251,7 +253,6 @@ cmake_dependent_option(USE_CUFILE "Use cuFile" ON "USE_CUDA AND NOT WIN32" OFF)

				option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" ON)

				option(USE_KINETO "Use Kineto profiling library" ON)

				option(USE_CUPTI_SO "Use CUPTI as a shared library" ON)

				option(USE_FAKELOWP "Use FakeLowp operators" OFF)

				option(USE_GFLAGS "Use GFLAGS" OFF)

				option(USE_GLOG "Use GLOG" OFF)

				option(USE_LITE_PROTO "Use lite protobuf instead of full." OFF)

				@ -260,11 +261,13 @@ option(USE_PYTORCH_METAL "Use Metal for PyTorch iOS build" OFF)

				option(USE_PYTORCH_METAL_EXPORT "Export Metal models on MacOSX desktop" OFF)

				option(USE_NATIVE_ARCH "Use -march=native" OFF)

				cmake_dependent_option(USE_MPS "Use MPS for macOS build" ON "MPS_FOUND" OFF)

				option(USE_DISTRIBUTED "Use distributed" ON)

				cmake_dependent_option(USE_NCCL "Use NCCL" ON

				                       "USE_CUDA OR USE_ROCM;UNIX;NOT APPLE" OFF)

				                       "USE_DISTRIBUTED;USE_CUDA OR USE_ROCM;UNIX;NOT APPLE" OFF)

				cmake_dependent_option(USE_XCCL "Use XCCL" ON

				                       "USE_XPU;UNIX;NOT APPLE" OFF)

				cmake_dependent_option(USE_RCCL "Use RCCL" ON USE_NCCL OFF)

				cmake_dependent_option(USE_RCCL "Use RCCL" ON "USE_NCCL;NOT WIN32" OFF)

				cmake_dependent_option(USE_STATIC_NCCL "Use static NCCL" OFF "USE_NCCL" OFF)

				cmake_dependent_option(USE_SYSTEM_NCCL "Use system-wide NCCL" OFF "USE_NCCL"

				                       OFF)

				@ -322,7 +325,6 @@ set(MKLDNN_ENABLE_CONCURRENT_EXEC ${USE_MKLDNN})

				cmake_dependent_option(USE_MKLDNN_CBLAS "Use CBLAS in MKLDNN" OFF "USE_MKLDNN"

				                       OFF)

				option(USE_STATIC_MKL "Prefer to link with MKL statically (Unix only)" OFF)

				option(USE_DISTRIBUTED "Use distributed" ON)

				cmake_dependent_option(

				  USE_MPI "Use MPI for Caffe2. Only available if USE_DISTRIBUTED is on." ON

				  "USE_DISTRIBUTED" OFF)

				@ -564,7 +566,7 @@ if(MSVC)

				  set(CMAKE_NINJA_CMCLDEPS_RC OFF)

				  if(MSVC_Z7_OVERRIDE)

				    # CMake set debug flags to use /Z7

				    set(CMAKE_MSVC_DEBUG_INFORMATION_FORMAT Embedded)

				    set(CMAKE_MSVC_DEBUG_INFORMATION_FORMAT "$<$<CONFIG:Debug,RelWithDebInfo>:Embedded>")

				  endif()

				  foreach(

				    flag_var

				@ -834,10 +836,11 @@ include(ExternalProject)

				# ---[ Dependencies ---[ FBGEMM doesn't work on x86 32bit and

				# CMAKE_SYSTEM_PROCESSOR thinks its 64bit

				if(USE_FBGEMM

				   AND((CMAKE_SYSTEM_PROCESSOR STREQUAL "x86_64" AND CMAKE_SIZEOF_VOID_P EQUAL

				                                                      4)

				        OR CMAKE_SYSTEM_PROCESSOR STREQUAL "x86"))

				if(USE_FBGEMM AND NOT CMAKE_SYSTEM_PROCESSOR STREQUAL "x86_64")

				  message(WARNING

				    "x64 operating system is required for FBGEMM. "

				    "Not compiling with FBGEMM. "

				    "Turn this warning off by USE_FBGEMM=OFF.")

				  set(USE_FBGEMM OFF)

				endif()

				@ -872,6 +875,14 @@ cmake_dependent_option(

				  "USE_CUDA OR USE_ROCM;NOT MSVC"

				  OFF)

				cmake_dependent_option(

				  USE_FBGEMM_GENAI

				  "Whether to build FBGEMM GenAI quantized GEMM kernels.\

				  Will be disabled if not supported by the platform"

				  OFF

				  "USE_CUDA OR USE_ROCM"

				  OFF)

				# CAVEAT: Again, Flash Attention2 will error while building for sm52 while Mem

				# Eff Attention won't

				cmake_dependent_option(

				@ -905,6 +916,10 @@ if(USE_FBGEMM)

				  string(APPEND CMAKE_CXX_FLAGS " -DUSE_FBGEMM")

				endif()

				if(USE_FBGEMM_GENAI)

				  string(APPEND CMAKE_CXX_FLAGS " -DUSE_FBGEMM_GENAI")

				endif()

				if(USE_PYTORCH_QNNPACK)

				  string(APPEND CMAKE_CXX_FLAGS " -DUSE_PYTORCH_QNNPACK")

				endif()
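
The CMakeLists.txt hunks above tighten several cmake_dependent_option conditions (USE_NCCL now also requires USE_DISTRIBUTED, USE_RCCL now also requires NOT WIN32). A simplified Python model of how those two options resolve, assuming the documented behavior of cmake_dependent_option (the option takes its default when every condition holds and is forced to the fallback otherwise). The function names are hypothetical and CMake cache handling is ignored:

```python
def dependent_option(default, conditions, force, user_value=None):
    """Simplified model of CMake's cmake_dependent_option.

    If every condition holds, the option is user-settable and falls back to
    `default`; otherwise it is forced to `force` regardless of user input.
    """
    if all(conditions):
        return default if user_value is None else user_value
    return force


def resolve(use_distributed, use_cuda, use_rocm, unix, apple, win32):
    """Resolve USE_NCCL and USE_RCCL using the new conditions from the diff."""
    # "USE_DISTRIBUTED;USE_CUDA OR USE_ROCM;UNIX;NOT APPLE"
    use_nccl = dependent_option(
        True, [use_distributed, use_cuda or use_rocm, unix, not apple], False
    )
    # "USE_NCCL;NOT WIN32"
    use_rccl = dependent_option(True, [use_nccl, not win32], False)
    return {"USE_NCCL": use_nccl, "USE_RCCL": use_rccl}


# Linux CUDA build with distributed enabled.
linux_cuda = resolve(True, True, False, True, False, False)
# Same platform with USE_DISTRIBUTED=OFF: NCCL is now forced off as well.
no_dist = resolve(False, True, False, True, False, False)
print(linux_cuda, no_dist)
```

Under this model, turning USE_DISTRIBUTED off cascades to USE_NCCL and then to USE_RCCL, which is consistent with moving option(USE_DISTRIBUTED ...) ahead of the USE_NCCL definition in the same diff.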

19

CODEOWNERS


 @ -14,7 +14,6 @@
 /torch/csrc/autograd/ @albanD @soulitzer
 /torch/autograd/ @albanD @soulitzer
 /tools/autograd/ @albanD @soulitzer
 /torch/header_only_apis.txt @janeyx99
 /torch/nn/ @albanD @jbschlosser @mikaylagawarecki
 /torch/optim/ @albanD @janeyx99
 /test/test_public_bindings.py @albanD
 @ -51,12 +50,12 @@ nn/qat/ @jerryzh168
 /torch/csrc/distributed/c10d/Ops.* @kwen2501
 # ONNX Export
 /torch/_dynamo/backends/onnxrt.py @wschin
 /torch/csrc/jit/passes/onnx.h @titaiwangms @shubhambhokare1
 /torch/csrc/jit/passes/onnx.cpp @titaiwangms @shubhambhokare1
 /torch/csrc/jit/passes/onnx/ @titaiwangms @shubhambhokare1
 /torch/onnx/ @titaiwangms @shubhambhokare1 @justinchuby @wschin
 /test/onnx/  @titaiwangms @shubhambhokare1 @justinchuby @wschin
 /torch/_dynamo/backends/onnxrt.py @titaiwangms @xadupre @justinchuby
 /torch/csrc/jit/passes/onnx.h @titaiwangms @xadupre
 /torch/csrc/jit/passes/onnx.cpp @titaiwangms @xadupre
 /torch/csrc/jit/passes/onnx/ @titaiwangms @xadupre
 /torch/onnx/ @titaiwangms @xadupre @justinchuby
 /test/onnx/  @titaiwangms @xadupre @justinchuby
 # CI
 /.ci  @pytorch/pytorch-dev-infra
 @ -165,6 +164,7 @@ caffe2/utils/hip @jeffdaily @jithunnair-amd
 # torch.export
 /torch/export/ @avikchaudhuri @tugsbayasgalan @zhxchen17 @ydwu4 @angelayi
 /torch/_export/ @avikchaudhuri @tugsbayasgalan @zhxchen17 @ydwu4 @angelayi
 /torch/_export/serde/schema.py @SherlockNoMad @zhxchen17
 # Dynamic Shapes
 /torch/fx/experimental/symbolic_shapes.py @bobrenjc93 @laithsakka
 @ -196,3 +196,8 @@ torch/backends/cudnn/ @eqy @syed-ahmed
 /torch/utils/_cxx_pytree.py @XuehaiPan
 /torch/utils/pytree/ @XuehaiPan
 /torch/_dynamo/polyfills/pytree.py @XuehaiPan
 # Relating to libtorch ABI
 /torch/csrc/stable/ @janeyx99 @mikaylagawarecki
 /torch/headeronly/ @janeyx99
 /torch/header_only_apis.txt @janeyx99

									
										15

Dockerfile
									
												
				@ -47,18 +47,6 @@ WORKDIR /opt/pytorch

				COPY . .

				RUN git submodule update --init --recursive

				FROM conda as build

				ARG CMAKE_VARS

				WORKDIR /opt/pytorch

				COPY --from=conda /opt/conda /opt/conda

				COPY --from=submodule-update /opt/pytorch /opt/pytorch

				RUN make triton

				RUN --mount=type=cache,target=/opt/ccache \

				    export eval ${CMAKE_VARS} && \

				    TORCH_CUDA_ARCH_LIST="7.0 7.2 7.5 8.0 8.6 8.7 8.9 9.0 9.0a" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \

				    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \

				    python -m pip install --no-build-isolation -v .

				FROM conda as conda-installs

				ARG PYTHON_VERSION=3.11

				ARG CUDA_PATH=cu121

				@ -109,4 +97,5 @@ WORKDIR /workspace

				FROM official as dev

				# Should override the already installed version from the official-image stage

				COPY --from=build /opt/conda /opt/conda

				COPY --from=conda /opt/conda /opt/conda

				COPY --from=submodule-update /opt/pytorch /opt/pytorch

									
										12

README.md
									
												
				@ -1,4 +1,4 @@

				![PyTorch Logo](https://github.com/pytorch/pytorch/raw/main/docs/source/_static/img/pytorch-logo-dark.png)

				![PyTorch Logo](https://github.com/pytorch/pytorch/blob/9708fcf92db88b80b9010c68662d634434da3106/docs/source/_static/img/pytorch-logo-dark.png)

				--------------------------------------------------------------------------------

				@ -72,7 +72,7 @@ Elaborating Further:

				If you use NumPy, then you have used Tensors (a.k.a. ndarray).

				![Tensor illustration](./docs/source/_static/img/tensor_illustration.png)

				![Tensor illustration](https://github.com/pytorch/pytorch/blob/9708fcf92db88b80b9010c68662d634434da3106/docs/source/_static/img/tensor_illustration.png)

				PyTorch provides Tensors that can live either on the CPU or the GPU and accelerates the

				computation by a huge amount.

				@ -99,7 +99,7 @@ from several research papers on this topic, as well as current and past work suc

				While this technique is not unique to PyTorch, it's one of the fastest implementations of it to date.

				You get the best of speed and flexibility for your crazy research.

				![Dynamic graph](https://github.com/pytorch/pytorch/raw/main/docs/source/_static/img/dynamic_graph.gif)

				![Dynamic graph](https://github.com/pytorch/pytorch/blob/9708fcf92db88b80b9010c68662d634434da3106/docs/source/_static/img/dynamic_graph.gif)

				### Python First

				@ -243,7 +243,7 @@ git submodule update --init --recursive

				```bash

				conda install cmake ninja

				# Run this command from the PyTorch directory after cloning the source code using the “Get the PyTorch Source“ section below

				# Run this command from the PyTorch directory after cloning the source code using the “Get the PyTorch Source“ section above

				pip install -r requirements.txt

				```

				@ -276,7 +276,7 @@ conda install pkg-config libuv

				pip install mkl-static mkl-include

				# Add these packages if torch.distributed is needed.

				# Distributed package support on Windows is a prototype feature and is subject to changes.

				conda install -c conda-forge libuv=1.39

				conda install -c conda-forge libuv

				```

				#### Install PyTorch

				@ -560,7 +560,7 @@ To learn more about making a contribution to Pytorch, please see our [Contributi

				PyTorch is a community-driven project with several skillful engineers and researchers contributing to it.

				PyTorch is currently maintained by [Soumith Chintala](http://soumith.ch), [Gregory Chanan](https://github.com/gchanan), [Dmytro Dzhulgakov](https://github.com/dzhulgakov), [Edward Yang](https://github.com/ezyang), and [Nikita Shulga](https://github.com/malfet) with major contributions coming from hundreds of talented individuals in various forms and means.

				PyTorch is currently maintained by [Soumith Chintala](http://soumith.ch), [Gregory Chanan](https://github.com/gchanan), [Dmytro Dzhulgakov](https://github.com/dzhulgakov), [Edward Yang](https://github.com/ezyang), [Alban Desmaison](https://github.com/albanD), [Piotr Bialecki](https://github.com/ptrblck) and [Nikita Shulga](https://github.com/malfet) with major contributions coming from hundreds of talented individuals in various forms and means.

				A non-exhaustive but growing list needs to mention: [Trevor Killeen](https://github.com/killeent), [Sasank Chilamkurthy](https://github.com/chsasank), [Sergey Zagoruyko](https://github.com/szagoruyko), [Adam Lerer](https://github.com/adamlerer), [Francisco Massa](https://github.com/fmassa), [Alykhan Tejani](https://github.com/alykhantejani), [Luca Antiga](https://github.com/lantiga), [Alban Desmaison](https://github.com/albanD), [Andreas Koepf](https://github.com/andreaskoepf), [James Bradbury](https://github.com/jekbradbury), [Zeming Lin](https://github.com/ebetica), [Yuandong Tian](https://github.com/yuandong-tian), [Guillaume Lample](https://github.com/glample), [Marat Dukhan](https://github.com/Maratyszcza), [Natalia Gimelshein](https://github.com/ngimel), [Christian Sarofeen](https://github.com/csarofeen), [Martin Raison](https://github.com/martinraison), [Edward Yang](https://github.com/ezyang), [Zachary Devito](https://github.com/zdevito). <!-- codespell:ignore -->

				Note: This project is unrelated to [hughperkins/pytorch](https://github.com/hughperkins/pytorch) with the same name. Hugh is a valuable contributor to the Torch community and has helped with many things Torch and PyTorch.

									
										186

aten/src/ATen/CMakeLists.txt
									
												
				@ -119,6 +119,8 @@ file(GLOB_RECURSE native_mps_cpp "native/mps/*.cpp")

				file(GLOB_RECURSE native_mps_mm "native/mps/*.mm")

				file(GLOB_RECURSE native_mps_metal "native/mps/*.metal")

				file(GLOB_RECURSE native_mps_h "native/mps/*.h")

				file(GLOB_RECURSE native_sparse_mps_mm "native/sparse/mps/*.mm")

				file(GLOB_RECURSE native_mps_sparse_metal "native/sparse/mps/*.metal")

				file(GLOB native_sparse_cpp "native/sparse/*.cpp")

				file(GLOB native_quantized_cpp

				@ -178,26 +180,27 @@ file(GLOB native_flash_attn_api_cpp "native/transformers/cuda/flash_attn/flash_a

				file(GLOB flash_attention_hip_hip "native/transformers/hip/flash_attn/*.hip")

				# if USE_FLASH_ATTENTION is set, ensure CK instances get generated

				if(USE_FLASH_ATTENTION)

				  if(DEFINED ENV{USE_CK_FLASH_ATTENTION})

				    set(USE_CK_FLASH_ATTENTION $ENV{USE_CK_FLASH_ATTENTION})

				      if(USE_CK_FLASH_ATTENTION STREQUAL "1")

				        if(DEFINED ENV{PYTORCH_ROCM_ARCH})

				          list(LENGTH PYTORCH_ROCM_ARCH NUM_ARCHS)

				          if(NUM_ARCHS GREATER 1)

				            message(WARNING "Building CK for multiple archs can increase build time considerably!

				            Consider setting PYTORCH_ROCM_ARCH env var value as the gfx arch you need to build for")

				          endif()

				        endif()

				        message(STATUS "USE_CK_FLASH_ATTENTION is set; building PyTorch with CK Flash Attention enabled")

				        message(STATUS "Generating CK kernel instances...")

				        add_subdirectory(native/transformers/hip/flash_attn/ck)

				        file(GLOB flash_attention_hip_ck_hip "native/transformers/hip/flash_attn/ck/*.hip")

				        list(APPEND native_transformers_hip_hip ${flash_attention_hip_ck_hip})

				        # FAv3 Generation

				        add_subdirectory(native/transformers/hip/flash_attn/ck/fav_v3)

				        file(GLOB flash_attention_v3_hip "native/transformers/hip/flash_attn/ck/fav_v3/*.hip")

				        list(APPEND native_transformers_hip_hip ${flash_attention_v3_hip})

				  if("$ENV{USE_CK_FLASH_ATTENTION}" STREQUAL "1")

				    message(STATUS "USE_CK_FLASH_ATTENTION is being deprecated. Please use USE_ROCM_CK_SDPA instead")

				    caffe2_update_option(USE_ROCM_CK_SDPA ON)

				  endif()

				  if(USE_ROCM_CK_SDPA)

				    if(DEFINED ENV{PYTORCH_ROCM_ARCH})

				      list(LENGTH PYTORCH_ROCM_ARCH NUM_ARCHS)

				      if(NUM_ARCHS GREATER 1)

				        message(WARNING "Building CK for multiple archs can increase build time considerably!

				        Consider setting PYTORCH_ROCM_ARCH env var value as the gfx arch you need to build for")

				      endif()

				    endif()

				    message(STATUS "USE_ROCM_CK_SDPA is set; building PyTorch with CK SDPA enabled")

				    message(STATUS "Generating CK kernel instances...")

				    add_subdirectory(native/transformers/hip/flash_attn/ck)

				    file(GLOB flash_attention_hip_ck_hip "native/transformers/hip/flash_attn/ck/*.hip")

				    list(APPEND native_transformers_hip_hip ${flash_attention_hip_ck_hip})

				    # FAv3 Generation

				    add_subdirectory(native/transformers/hip/flash_attn/ck/fav_v3)

				    file(GLOB flash_attention_v3_hip "native/transformers/hip/flash_attn/ck/fav_v3/*.hip")

				    list(APPEND native_transformers_hip_hip ${flash_attention_v3_hip})

				  endif()

				  file(GLOB flash_attention_hip_aot_hip "native/transformers/hip/flash_attn/aot/*.hip")

				  file(GLOB flash_attention_src_hip_hip "native/transformers/hip/flash_attn/src/*.hip")

				@@ -247,6 +250,50 @@ if(USE_MEM_EFF_ATTENTION)

				  list(APPEND ATen_ATTENTION_KERNEL_SRCS ${mem_eff_attention_cuda_kernels_cu})

				endif()

				IF(USE_FBGEMM_GENAI AND USE_ROCM AND NOT "gfx942" IN_LIST PYTORCH_ROCM_ARCH)

				  message(WARNING "Unsupported ROCM arch for FBGEMM GenAI, will set USE_FBGEMM_GENAI to OFF")

				  set(USE_FBGEMM_GENAI off)

				endif()

				# FBGEMM GenAI

				IF(USE_FBGEMM_GENAI)

				  set(FBGEMM_THIRD_PARTY ${PROJECT_SOURCE_DIR}/third_party/fbgemm/external/)

				  set(FBGEMM_GENAI_DIR ${PROJECT_SOURCE_DIR}/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize)

				  if(USE_ROCM)

				    # Only include the kernels we want to build to avoid increasing binary size.

				    file(GLOB_RECURSE fbgemm_genai_native_rocm_hip

				      "${FBGEMM_GENAI_DIR}/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped*.hip"

				      "${FBGEMM_GENAI_DIR}/ck_extensions/fp8_rowwise_grouped/fp8_rowwise_grouped_gemm.hip")

				    set_source_files_properties(${fbgemm_genai_native_rocm_hip} PROPERTIES HIP_SOURCE_PROPERTY_FORMAT 1)

				    # Add additional HIPCC compiler flags for performance

				    set(FBGEMM_GENAI_EXTRA_HIPCC_FLAGS

				      -mllvm

				      -amdgpu-coerce-illegal-types=1

				      -mllvm

				      -enable-post-misched=0

				      -mllvm

				      -greedy-reverse-local-assignment=1

				      -fhip-new-launch-api)

				    hip_add_library(

				      fbgemm_genai STATIC

				      ${fbgemm_genai_native_rocm_hip}

				      HIPCC_OPTIONS ${HIP_HCC_FLAGS} ${FBGEMM_GENAI_EXTRA_HIPCC_FLAGS})

				    set_target_properties(fbgemm_genai PROPERTIES POSITION_INDEPENDENT_CODE ON)

				    target_compile_definitions(fbgemm_genai PRIVATE FBGEMM_GENAI_NO_EXTENDED_SHAPES)

				    target_include_directories(fbgemm_genai PUBLIC

				      # FBGEMM version of Composable Kernel is used due to some customizations

				      ${FBGEMM_THIRD_PARTY}/composable_kernel/include

				      ${FBGEMM_THIRD_PARTY}/composable_kernel/library/include

				      ${FBGEMM_GENAI_DIR}/include/

				      ${FBGEMM_GENAI_DIR}/common/include/

				    )

				  endif()

				endif()

				# XNNPACK

				file(GLOB native_xnnpack "native/xnnpack/*.cpp")

				@@ -372,39 +419,42 @@ if(USE_CUDA)

				endif()

				if(USE_ROCM)

				  # NOTE: The PyTorch build does not actually add_subdirectory

				  # third_party/composable_kernel or use it as a CMake library. What is used

				  # is header only, so this should be ok, except that the CMake build generates

				  # a ck/config.h. We just do that part here. Without this, the ck.h from the

				  # ROCM SDK may get accidentally used instead.

				  function(_pytorch_rocm_generate_ck_conf)

				    set(CK_ENABLE_INT8 "ON")

				    set(CK_ENABLE_FP16 "ON")

				    set(CK_ENABLE_FP32 "ON")

				    set(CK_ENABLE_FP64 "ON")

				    set(CK_ENABLE_BF16 "ON")

				    set(CK_ENABLE_FP8 "ON")

				    set(CK_ENABLE_BF8 "ON")

				    set(CK_USE_XDL "ON")

				    set(CK_USE_WMMA "ON")

				    configure_file(

				      "${Torch_SOURCE_DIR}/third_party/composable_kernel/include/ck/config.h.in"

				      "${CMAKE_CURRENT_BINARY_DIR}/composable_kernel/ck/config.h"

				      )

				  endfunction()

				  list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/hip)

				  list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/composable_kernel/include)

				  list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/composable_kernel/library/include)

				  list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_BINARY_DIR}/composable_kernel)

				  list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/aiter/csrc/include)

				  _pytorch_rocm_generate_ck_conf()

				  if((USE_FLASH_ATTENTION AND USE_ROCM_CK_SDPA) OR USE_ROCM_CK_GEMM)

				    # NOTE: The PyTorch build does not actually add_subdirectory

				    # third_party/composable_kernel or use it as a CMake library. What is used

				    # is header only, so this should be ok, except that the CMake build generates

				    # a ck/config.h. We just do that part here. Without this, the ck.h from the

				    # ROCM SDK may get accidentally used instead.

				    function(_pytorch_rocm_generate_ck_conf)

				      set(CK_ENABLE_INT8 "ON")

				      set(CK_ENABLE_FP16 "ON")

				      set(CK_ENABLE_FP32 "ON")

				      set(CK_ENABLE_FP64 "ON")

				      set(CK_ENABLE_BF16 "ON")

				      set(CK_ENABLE_FP8 "ON")

				      set(CK_ENABLE_BF8 "ON")

				      set(CK_USE_XDL "ON")

				      set(CK_USE_WMMA "ON")

				      configure_file(

				        "${Torch_SOURCE_DIR}/third_party/composable_kernel/include/ck/config.h.in"

				        "${CMAKE_CURRENT_BINARY_DIR}/composable_kernel/ck/config.h"

				        )

				    endfunction()

				    list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/hip)

				    list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/composable_kernel/include)

				    list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/composable_kernel/library/include)

				    list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/composable_kernel/example/ck_tile/01_fmha)

				    list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_BINARY_DIR}/composable_kernel)

				    list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/aiter/csrc/include)

				    _pytorch_rocm_generate_ck_conf()

				  endif()

				  # Next two lines are needed because TunableOp uses third-party/fmt

				  list(APPEND ATen_HIP_INCLUDE $<TARGET_PROPERTY:fmt::fmt-header-only,INTERFACE_INCLUDE_DIRECTORIES>)

				  list(APPEND ATen_HIP_DEPENDENCY_LIBS fmt::fmt-header-only)

				if(USE_FLASH_ATTENTION)

				  list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/native/transformers/hip/flash_attn/ck)

				endif()

				  if(USE_FLASH_ATTENTION AND USE_ROCM_CK_SDPA)

				    list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/native/transformers/hip/flash_attn/ck)

				  endif()

				  list(APPEND ATen_HIP_SRCS

				    ${ATen_HIP_SRCS}

				    ${hip_hip}

				@@ -414,12 +464,17 @@ endif()

				    ${native_quantized_hip_hip}

				    ${native_transformers_hip_hip} ${native_transformers_src_hip_hip}

				  )

				  if(WIN32) # Windows doesn't support Composable Kernels

				  if(NOT USE_ROCM_CK_GEMM)

				    file(GLOB native_hip_bgemm "native/hip/bgemm_kernels/*.hip")

				    file(GLOB native_hip_ck "native/hip/ck*.hip")

				    exclude(ATen_HIP_SRCS "${ATen_HIP_SRCS}"

				      ${native_hip_bgemm} ${native_hip_ck})

				  endif()

				  if(WIN32) # Windows doesn't support Composable Kernels and Triton

				    exclude(ATen_HIP_SRCS "${ATen_HIP_SRCS}"

				      ${native_transformers_hip_hip} ${native_transformers_hip_cpp})

				  endif()

				  # TODO: Codegen separate files for HIP and use those (s/cuda_generated_sources/hip_generated_sources)

				  list(APPEND all_hip_cpp

				    ${native_nested_hip_cpp}

				@@ -586,17 +641,10 @@ if(USE_CUDA AND NOT USE_ROCM)

				      CUDA::cufft_static_nocallback

				    )

				   if(NOT BUILD_LAZY_CUDA_LINALG)

				     if(CUDA_VERSION_MAJOR LESS_EQUAL 11)

				       list(APPEND ATen_CUDA_DEPENDENCY_LIBS

				         CUDA::cusolver_static

				         ${CUDAToolkit_LIBRARY_DIR}/liblapack_static.a     # needed for libcusolver_static

				       )

				     elseif(CUDA_VERSION_MAJOR GREATER_EQUAL 12)

				       list(APPEND ATen_CUDA_DEPENDENCY_LIBS

				         CUDA::cusolver_static

				         ${CUDAToolkit_LIBRARY_DIR}/libcusolver_lapack_static.a     # needed for libcusolver_static

				       )

				     endif()

				     list(APPEND ATen_CUDA_DEPENDENCY_LIBS

				       CUDA::cusolver_static

				       ${CUDAToolkit_LIBRARY_DIR}/libcusolver_lapack_static.a     # needed for libcusolver_static

				     )

				   endif()

				  else()

				    list(APPEND ATen_CUDA_DEPENDENCY_LIBS

				@@ -661,29 +709,25 @@ endif()

				if(USE_MPS)

				    include(../../../cmake/Metal.cmake)

				    set(ATen_MPS_SRCS ${ATen_MPS_SRCS} ${mps_cpp} ${mps_mm} ${mps_h} ${native_mps_cpp} ${native_mps_mm} ${native_mps_h})

				    set(ATen_MPS_SRCS ${ATen_MPS_SRCS} ${mps_cpp} ${mps_mm} ${mps_h} ${native_mps_cpp} ${native_mps_mm} ${native_mps_h} ${native_sparse_mps_mm})

				    if(CAN_COMPILE_METAL)

				        foreach(SHADER ${native_mps_metal})

				        foreach(SHADER ${native_mps_metal} ${native_mps_sparse_metal})

				            cmake_path(GET SHADER STEM TGT_STEM)

				            string(CONCAT TGT_BASIC ${TGT_STEM} "_30.air")

				            string(CONCAT TGT_BFLOAT ${TGT_STEM} "_31.air")

				            string(CONCAT TGT_BASIC ${TGT_STEM} "_31.air")

				            list(APPEND AIR_BASIC ${TGT_BASIC})

				            list(APPEND AIR_BFLOAT ${TGT_BFLOAT})

				            metal_to_air(${SHADER} ${TGT_BASIC} "-std=metal3.0")

				            metal_to_air(${SHADER} ${TGT_BFLOAT} "-std=metal3.1")

				            metal_to_air(${SHADER} ${TGT_BASIC} "-std=metal3.1")

				        endforeach()

				        air_to_metallib(kernels_basic.metallib ${AIR_BASIC})

				        air_to_metallib(kernels_bfloat.metallib ${AIR_BFLOAT})

				        add_custom_command(

				                          COMMAND echo "// $$(date)" > metallib_dummy.cpp

				                          DEPENDS kernels_basic.metallib kernels_bfloat.metallib

				                          DEPENDS kernels_basic.metallib

				                          OUTPUT metallib_dummy.cpp

				                          COMMENT "Updating metallibs timestamp")

				        add_custom_target(metallibs DEPENDS kernels_basic.metallib kernels_bfloat.metallib metallib_dummy.cpp)

				        add_custom_target(metallibs DEPENDS kernels_basic.metallib metallib_dummy.cpp)

				    else()

				        file(MAKE_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/native/mps")

				        foreach(SHADER ${native_mps_metal})

				        foreach(SHADER ${native_mps_metal} ${native_mps_sparse_metal})

				            cmake_path(GET SHADER STEM TGT_STEM)

				            string(CONCAT SHADER_HDR_NAME  "${CMAKE_CURRENT_BINARY_DIR}" /native/mps/ ${TGT_STEM} "_metallib.h")

				            metal_to_metallib_h(${SHADER} ${SHADER_HDR_NAME})
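
The ROCm Composable Kernel (CK) option handling in the CMakeLists.txt hunks above can be summarized as a small decision function. This is a rough Python model, not part of the build. It assumes the hunks are read as replacing the `USE_CK_FLASH_ATTENTION` branch with `USE_ROCM_CK_SDPA` (the +/- markers are not preserved in this listing), and `resolve_ck_options` and its dictionary keys are hypothetical names chosen only for illustration.

```python
def resolve_ck_options(env, use_flash_attention,
                       use_rocm_ck_sdpa=False, use_rocm_ck_gemm=False):
    """Hypothetical model of the CK-related CMake conditions shown above."""
    notes = []

    # Inside if(USE_FLASH_ATTENTION): the deprecated environment variable
    # is still honored, but only by switching USE_ROCM_CK_SDPA on.
    if use_flash_attention and env.get("USE_CK_FLASH_ATTENTION") == "1":
        notes.append("USE_CK_FLASH_ATTENTION is being deprecated. "
                     "Please use USE_ROCM_CK_SDPA instead")
        use_rocm_ck_sdpa = True

    # CK SDPA kernel instances are generated only when both options are on.
    build_ck_sdpa = bool(use_flash_attention and use_rocm_ck_sdpa)

    return {
        "USE_ROCM_CK_SDPA": bool(use_rocm_ck_sdpa),
        "build_ck_sdpa_kernels": build_ck_sdpa,
        # if((USE_FLASH_ATTENTION AND USE_ROCM_CK_SDPA) OR USE_ROCM_CK_GEMM)
        "generate_ck_config": build_ck_sdpa or bool(use_rocm_ck_gemm),
        # if(NOT USE_ROCM_CK_GEMM): bgemm_kernels/*.hip and ck*.hip are excluded
        "keep_ck_gemm_sources": bool(use_rocm_ck_gemm),
        "notes": notes,
    }


# Example: a build that still sets the deprecated environment variable.
opts = resolve_ck_options({"USE_CK_FLASH_ATTENTION": "1"}, use_flash_attention=True)
print(opts["USE_ROCM_CK_SDPA"], opts["generate_ck_config"], opts["notes"])
```

Under this reading, a GEMM-only build (`use_rocm_ck_gemm=True` with flash attention off) still generates `ck/config.h` but builds no CK SDPA kernel instances.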

									
98 aten/src/ATen/Context.cpp
												
				@@ -334,6 +334,14 @@ void Context::setBenchmarkLimitCuDNN(int b) {

				  benchmark_limit_cudnn = b;

				}

				bool Context::immediateMiopen() const {

				  return immediate_miopen;

				}

				void Context::setImmediateMiopen(bool b) {

				  immediate_miopen = b;

				}

				bool Context::allowTF32CuBLAS() const {

				#ifdef USE_ROCM

				    const auto allow_tf32 = c10::utils::check_env(hipblaslt_allow_tf32);

				@@ -472,6 +480,9 @@ at::BlasBackend Context::blasPreferredBackend() {

				  // call site for blasPreferredBackend(), we set it to an actual value.

				  if (blas_preferred_backend == at::BlasBackend::Default) {

				    blas_preferred_backend = at::BlasBackend::Cublas;

				    // This logic sits in the getter because it needs to validate

				    // values set via env vars such as TORCH_BLAS_PREFER_CUBLASLT

				    // which initialize the backend without calling the setter

				#ifdef USE_ROCM

				    // AMD Instinct targets prefer hipblaslt

				    static const bool hipblaslt_preferred = []() {

				@@ -501,10 +512,14 @@ at::BlasBackend Context::blasPreferredBackend() {

				  // hipblaslt support for all archs is not as complete as hipblas

				  if (blas_preferred_backend == at::BlasBackend::Cublaslt) {

				    static const bool hipblaslt_unsupported = []() {

				      if(!hasCuBLASLt())

				      {

				          return true;

				      }

				      static const std::vector<std::string> archs = {

				          "gfx90a", "gfx942",

				#if ROCM_VERSION >= 60300

				          "gfx1100", "gfx1101", "gfx1200", "gfx1201",

				          "gfx1100", "gfx1101", "gfx1200", "gfx1201", "gfx908",

				#endif

				#if ROCM_VERSION >= 60500

				          "gfx950"

				@@ -526,6 +541,24 @@ at::BlasBackend Context::blasPreferredBackend() {

				  return blas_preferred_backend;

				}

				bool Context::ckSupported() {

				#ifdef USE_ROCM

				  static const std::vector<std::string> supported_archs = {

				    "gfx90a", "gfx942", "gfx950"

				  };

				  for (auto index : c10::irange(detail::getCUDAHooks().deviceCount())) {

				    if(!detail::getCUDAHooks().isGPUArch(supported_archs, index)) {

				      TORCH_WARN_ONCE(

				        "Attempting to use CK on an unsupported architecture! Cannot set backend to CK");

				      return false;

				    }

				  }

				  return true;

				#else

				  return false;

				#endif

				}

				void Context::setBlasPreferredBackend(at::BlasBackend b) {

				#ifdef _MSC_VER

				  TORCH_WARN_ONCE(

				@@ -535,8 +568,14 @@ void Context::setBlasPreferredBackend(at::BlasBackend b) {

				#else

				  TORCH_CHECK((b != at::BlasBackend::Cublaslt) || hasCuBLASLt(),

				      "Cannot set preferred backend to cuBLASLt if PyTorch has not been compiled with cuBLASLt.");

				  TORCH_CHECK((b != at::BlasBackend::Ck) || hasROCM(),

				      "Cannot set preferred backend to Ck if PyTorch has not been compiled for ROCm.");

				#ifdef USE_ROCM

				  static const bool ckSupportedFlag = ckSupported();

				  static const bool hasCKGEMMFlag = hasCKGEMM();

				  TORCH_CHECK((b != at::BlasBackend::Ck) || (ckSupportedFlag && hasCKGEMMFlag),

				      "Cannot set preferred blas backend to CK since following conditions are not true: ",

				      "architecture supported for CK: ", ckSupportedFlag,

				      ", PyTorch built with CK GEMM support: ", hasCKGEMMFlag);

				#endif

				  if (b != at::BlasBackend::Default && b != at::BlasBackend::Cublas) {

				    TORCH_WARN_ONCE(

				      "torch.backends.cuda.preferred_blas_library is an experimental feature. "

				@@ -548,35 +587,40 @@ void Context::setBlasPreferredBackend(at::BlasBackend b) {

				#endif

				}

				at::ROCmFABackend Context::getROCmFAPreferredBackend() const {

				at::ROCmFABackend Context::getROCmFAPreferredBackend() {

				#ifdef USE_ROCM

				  // Set potential "Default" value so we don't have to interpret at call sites.

				  // We use aotriton backend as the default, for now.

				  if(rocm_fa_preferred_backend == at::ROCmFABackend::Default) {

				    rocm_fa_preferred_backend = at::ROCmFABackend::AOTriton;

				  } else if (rocm_fa_preferred_backend == at::ROCmFABackend::Ck) {

				    // This logic sits in the getter because it needs to validate

				    // values set via env vars such as TORCH_ROCM_FA_PREFER_CK

				    // which initialize the backend without calling the setter

				    // Perform validity checking

				    static const bool hasCKSDPAFlag = hasCKSDPA();

				    static const bool ckSupportedFlag = ckSupported();

				    if(!(hasCKSDPAFlag && ckSupportedFlag)){

				      TORCH_WARN_ONCE(

				        "Cannot set preferred SDPA backend to CK since following conditions are not true: ",

				        "architecture supported for CK: ", ckSupportedFlag,

				        ", PyTorch built with CK SDPA support: ", hasCKSDPAFlag);

				      rocm_fa_preferred_backend = at::ROCmFABackend::AOTriton;

				    }

				  }

				#endif

				  return rocm_fa_preferred_backend;

				}

				void Context::setROCmFAPreferredBackend(at::ROCmFABackend b) {

				  // TODO: add plumbing for hasCK for validity checking

				  TORCH_CHECK((b != at::ROCmFABackend::Ck) || hasROCM(),

				      "Cannot set preferred flash attention backend to Ck if PyTorch has not been compiled for ROCm.");

				#ifdef USE_ROCM

				  if(b == at::ROCmFABackend::Ck) {

				    static const bool ck_unsupported = []() {

				      static const std::vector<std::string> archs = {

				          "gfx90a",  "gfx942"

				      };

				      for (auto index: c10::irange(detail::getCUDAHooks().deviceCount())) {

				        if (!detail::getCUDAHooks().isGPUArch(archs, index)) {

				          TORCH_WARN_ONCE(

				            "Attempting to use CK on an unsupported architecture! Cannot set backend to CK");

				          return true;

				        }

				      }

				      return false;

				    }();

				    if(!ck_unsupported) rocm_fa_preferred_backend = b;

				  }

				  else {

				     rocm_fa_preferred_backend = b;

				  }

				  static const bool hasCKSDPAFlag = hasCKSDPA();

				  static const bool ckSupportedFlag = ckSupported();

				  TORCH_CHECK((b != at::ROCmFABackend::Ck) || (hasCKSDPAFlag && ckSupportedFlag),

				      "Cannot set preferred SDPA backend to CK since following conditions are not true: ",

				      "architecture supported for CK: ", ckSupportedFlag,

				      ", PyTorch built with CK SDPA support: ", hasCKSDPAFlag);

				#endif

				  rocm_fa_preferred_backend = b;

				}
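
The Context.cpp changes above validate a CK request in two places: the setter rejects it outright, while the getter quietly falls back to AOTriton because a value seeded from an environment variable never passes through the setter. The sketch below models that pattern in plain Python. It is not PyTorch API: `ToyContext`, `ck_supported`, and the `initial` parameter are hypothetical names, and the architecture list is copied from `Context::ckSupported()` in the diff.

```python
from enum import Enum


class ROCmFABackend(Enum):
    DEFAULT = "default"
    AOTRITON = "aotriton"
    CK = "ck"


# Architectures listed in Context::ckSupported() in the diff above.
CK_SUPPORTED_ARCHS = ("gfx90a", "gfx942", "gfx950")


def ck_supported(device_archs):
    """True only if every visible device has a CK-capable architecture."""
    return all(arch in CK_SUPPORTED_ARCHS for arch in device_archs)


class ToyContext:
    """Hypothetical stand-in for at::Context; not a PyTorch class."""

    def __init__(self, device_archs, built_with_ck_sdpa,
                 initial=ROCmFABackend.DEFAULT):
        self.device_archs = list(device_archs)
        self.built_with_ck_sdpa = built_with_ck_sdpa
        # `initial` models a value seeded from an environment variable,
        # which bypasses the setter and is therefore unvalidated.
        self._backend = initial

    def _ck_usable(self):
        return self.built_with_ck_sdpa and ck_supported(self.device_archs)

    def get_backend(self):
        if self._backend is ROCmFABackend.DEFAULT:
            # Resolve "Default" once so call sites never have to interpret it.
            self._backend = ROCmFABackend.AOTRITON
        elif self._backend is ROCmFABackend.CK and not self._ck_usable():
            # The C++ getter warns once here; this model just falls back.
            self._backend = ROCmFABackend.AOTRITON
        return self._backend

    def set_backend(self, backend):
        # The setter is strict: an unusable CK request is an error.
        if backend is ROCmFABackend.CK and not self._ck_usable():
            raise RuntimeError("Cannot set preferred SDPA backend to CK")
        self._backend = backend


# One unsupported device is enough to disqualify CK for the whole process.
mixed = ToyContext(["gfx942", "gfx1100"], built_with_ck_sdpa=True,
                   initial=ROCmFABackend.CK)
print(mixed.get_backend())
```

Keeping the fallback in the getter means a bad environment setting degrades to the default backend instead of failing at import time, while explicit calls to the setter still fail loudly.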

Compare commits

792 Commits test-myst- ... replace-py

16 .ci/aarch64_linux/build_aarch64_wheel.py
1 .ci/docker/README.md
78 .ci/docker/build.sh
2 .ci/docker/ci_commit_pins/huggingface.txt
0 .github/ci_commit_pins/torchbench.txt → .ci/docker/ci_commit_pins/torchbench.txt
2 .ci/docker/ci_commit_pins/triton.txt
5 .ci/docker/common/install_cpython.sh
4 .ci/docker/common/install_cuda.sh
26 .ci/docker/common/install_cudnn.sh
30 .ci/docker/common/install_inductor_benchmark_deps.sh
15 .ci/docker/common/install_rocm.sh
41 .ci/docker/common/install_xpu.sh
2 .ci/docker/libtorch/build.sh
2 .ci/docker/manywheel/build.sh
25 .ci/docker/requirements-ci.txt
7 .ci/docker/requirements-docs.txt
3 .ci/docker/ubuntu-rocm/Dockerfile
3 .ci/docker/ubuntu/Dockerfile
33 .ci/manywheel/build_common.sh
2 .ci/manywheel/build_rocm.sh
34 .ci/pytorch/build-mobile.sh
45 .ci/pytorch/build.sh
28 .ci/pytorch/common_utils.sh
123 .ci/pytorch/create_test_cert.py
28 .ci/pytorch/macos-test.sh
18 .ci/pytorch/run_glootls_test.sh
25 .ci/pytorch/smoke_test/smoke_test.py
65 .ci/pytorch/test.sh
7 .ci/pytorch/win-test-helpers/build_pytorch.bat
2 .ci/pytorch/win-test.sh
14 .ci/wheel/build_wheel.sh
12 .circleci/scripts/binary_linux_test.sh
1 .circleci/scripts/binary_populate_env.sh
4 .circleci/scripts/binary_upload.sh
4 .flake8
10 .github/actionlint.yaml vendored
78 .github/actions/build-android/action.yml vendored
2 .github/actions/filter-test-configs/action.yml vendored
1 .github/actions/test-pytorch-binary/action.yml vendored
2 .github/ci_commit_pins/audio.txt vendored
2 .github/ci_commit_pins/vllm.txt vendored
2 .github/ci_commit_pins/xla.txt vendored
19 .github/merge_rules.yaml vendored
6 .github/requirements-gha-cache.txt vendored
4 .github/requirements/pip-requirements-macOS.txt vendored
18 .github/scripts/generate_binary_build_matrix.py vendored
42 .github/scripts/generate_ci_workflows.py vendored
2 .github/scripts/lintrunner.sh vendored
7 .github/scripts/runner_determinator.py vendored
4 .github/scripts/trymerge.py vendored
3 .github/templates/macos_binary_build_workflow.yml.j2 vendored
5 .github/templates/upload.yml.j2 vendored
10 .github/workflows/_binary-build-linux.yml vendored
9 .github/workflows/_binary-test-linux.yml vendored
8 .github/workflows/_binary-upload.yml vendored
1 .github/workflows/_linux-build.yml vendored
20 .github/workflows/_linux-test.yml vendored
4 .github/workflows/_rocm-test.yml vendored
8 .github/workflows/build-triton-wheel.yml vendored
3 .github/workflows/check-labels.yml vendored
5 .github/workflows/check_mergeability_ghstack.yml vendored
2 .github/workflows/cherry-pick.yml vendored
9 .github/workflows/docker-builds.yml vendored
2 .github/workflows/docker-release.yml vendored
30 .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml generated vendored
110 .github/workflows/generated-linux-binary-manywheel-main.yml generated vendored
1313 .github/workflows/generated-linux-binary-manywheel-nightly.yml generated vendored
2 .github/workflows/generated-linux-binary-manywheel-rocm-main.yml generated vendored
15 .github/workflows/generated-linux-s390x-binary-manywheel-nightly.yml generated vendored
154 .github/workflows/inductor-perf-test-b200.yml vendored Normal file
4 .github/workflows/inductor-perf-test-nightly-h100.yml vendored
36 .github/workflows/inductor-perf-test-nightly-rocm.yml vendored
32 .github/workflows/inductor-periodic.yml vendored
4 .github/workflows/inductor-rocm-mi300.yml vendored
1 .github/workflows/mac-mps.yml vendored
9 .github/workflows/nightly.yml vendored
6 .github/workflows/periodic-rocm-mi300.yml vendored
31 .github/workflows/periodic.yml vendored

142 .github/workflows/pull.yml vendored