[MPS] Switch Cholesky decomp to column wise (#157014)
Everything should go through generalized kernels, and Metal kernels should work with the same sizes and strides as the CPU or CUDA backends, to avoid problems with `torch.compile`, which relies on the meta kernels to tell what the output is going to look like.
To avoid returning tensors with a different layout depending on whether the `upper` parameter is true or false, templatize `factorDiagonalBlock`, `applyTRSM` and `applySYRK` to take upper/lower (actually row-wise vs column-wise) as a template argument, and call the appropriate templates from the host.
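As a hedged illustration (not code from this PR), the layout consistency that `torch.compile`'s meta kernels rely on can be checked roughly like this, assuming a build that includes this change:
```python
import torch

a = torch.randn(4, 4)
spd = a @ a.mT + 4 * torch.eye(4)  # symmetric positive-definite input

for upper in (False, True):
    cpu_out = torch.linalg.cholesky(spd, upper=upper)
    mps_out = torch.linalg.cholesky(spd.to("mps"), upper=upper)
    # With this change, the MPS result should have the same sizes and strides
    # as the CPU result for both values of `upper`.
    assert mps_out.shape == cpu_out.shape
    assert mps_out.stride() == cpu_out.stride(), (mps_out.stride(), cpu_out.stride())
```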
TODOs:
- Rename upper parameter to something more sensible and add comments
- Use simd_groupsize instead of hardcoded 32 everywhere
Fixes https://github.com/pytorch/pytorch/issues/156658
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157014
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157179
(cherry picked from commit 1c8844d9e7b2d72fb80b67ed51df4f6a1295b3b5)
Co-authored-by: Nikita Shulga <nshulga@meta.com>
Fix the typos in the right nav by pulling the latest theme (#158746)
This will fix broken links in the right nav.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158746
Approved by: https://github.com/malfet
(cherry picked from commit 2bb684304d26804ab87103ada05b6ba63e309b59)
(cherry picked from commit 0462fd4707374b28600bb6dd654ce94db57f8950)
Fix AArch64 grid sampler segfaults by disabling strict-aliasing gcc optimization
See https://github.com/pytorch/pytorch/issues/157626 for more context
(cherry picked from commit b62f4d03c7e76bfa7e25ad898232fade0111510b)
[async-TP] Turn asserts back into silent skips (#158572)
https://github.com/pytorch/pytorch/pull/149946 modified some checks that verify whether async-TP is "applicable" to a given collective operation in a graph. Before, the pattern-matching + replacement would just be skipped, but now these are asserts that fail and raise.
This is causing concrete issues in some graphs where 2-dimensional device meshes are being used (e.g., TP + CP) but only one dimension has symm-mem enabled. See #158569.
This PR is turning these asserts back into harmless early-exits. Note that this only needed to be done for reduce-scatters, as it was already the case for all-gathers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158572
Approved by: https://github.com/danielvegamyhre, https://github.com/atalman
(cherry picked from commit fac0be7b9c80f20bbff1e813225dcbced7ff4d31)
Co-authored-by: Luca Wehrstedt <lcw@meta.com>
* [Docker builds] Move from Miniconda to Miniforge (#158370)
This is related to: https://www.anaconda.com/legal/terms/terms-of-service
Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799
ROCm and XPU builds are not affected since they already use Miniforge.
```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
> [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93 • https://repo.anaconda.com/pkgs/main
11.93 • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93 ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence the possible solutions are:
1. Using ``conda tos accept --override-channels --channel defaults``
2. Using Miniforge instead of Miniconda.
We are using solution 2.
Solutions tried that don't work:
1. Using ``CONDA_ALWAYS_YES=true``
2. Using an older version of Miniconda:
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
* Remove tos
---------
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
* [inductor][triton] Update HAS_WARP_SPEC to check triton.Config params. Update Triton Hash to top of release/3.4.x stack (#158459)
Update triton commit hash to `11ec6354315768a85da41032535e3b7b99c5f706`, which is the new release/3.4.x branch in triton-lang/triton.
Also, update `HAS_WARP_SPEC` handling: in Triton 3.4, warp spec will have a different interface: `num_consumer_groups` will be determined automatically by the compiler. This breaks the current Inductor integration, so for now, update `HAS_WARP_SPEC` to check whether `triton.Config` takes `num_consumer_groups` and `num_buffers_warp_spec` as parameters.
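A minimal sketch of that kind of capability check (assumed details, not the exact Inductor code):
```python
import inspect

try:
    import triton

    _config_params = inspect.signature(triton.Config.__init__).parameters
    # Warp spec is considered available only if triton.Config still accepts
    # these two parameters explicitly.
    HAS_WARP_SPEC = (
        "num_consumer_groups" in _config_params
        and "num_buffers_warp_spec" in _config_params
    )
except ImportError:
    HAS_WARP_SPEC = False
```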
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158459
Approved by: https://github.com/atalman
* dont_upde_hash
* Revert "dont_upde_hash"
This reverts commit 5fffb12a3adead5c1ac9f6d9d1f505cbc74f3421.
* fix_docker_builds
---------
Co-authored-by: David Berard <dberard@fb.com>
Add warning about removed sm50 and sm60 arches (#158301)
Related to https://github.com/pytorch/pytorch/issues/157517
Detect when users are executing a torch build with CUDA 12.8/12.9 while running on Maxwell or Pascal architectures.
We would like to include a reference to the issue https://github.com/pytorch/pytorch/issues/157517 and ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.
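Roughly, the detection amounts to comparing the device's compute capability against the arch list the build was compiled for; a hedged sketch (not the exact code added to torch/cuda/__init__.py):
```python
import warnings
import torch

def warn_if_maxwell_or_pascal(device: int = 0) -> None:
    major, minor = torch.cuda.get_device_capability(device)
    supported = torch.cuda.get_arch_list()  # e.g. ['sm_70', ..., 'compute_120']
    if major in (5, 6) and f"sm_{major}{minor}" not in supported:
        warnings.warn(
            "Support for Maxwell and Pascal architectures is removed for "
            "CUDA 12.8+ builds. Please see "
            "https://github.com/pytorch/pytorch/issues/157517 and install "
            "CUDA 12.6 builds if you require Maxwell or Pascal support."
        )
```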
Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
Found <GPU Name> which is of cuda capability 5.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is 7.0.
warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
Please see https://github.com/pytorch/pytorch/issues/157517
Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158301
Approved by: https://github.com/nWEIdia, https://github.com/albanD
(cherry picked from commit fb731fe371cb1b5bf95de84b19c213590526acb2)
Co-authored-by: atalman <atalman@fb.com>
Add flag to fx.passes.split_module to normalize input names (#157733)
This is useful for vLLM, which runs AOTAutograd directly on graphs after
they have been split.
I created a new flag for this instead of reusing
`keep_original_node_name` (please let me know if you think I should reuse this).
The reasoning is:
- The names of the placeholder nodes are different from the targets of the placeholder nodes. The targets are the actual input names.
- Backwards compatibility: this API has been out for ~4 years, it
looks public, and it has extensive public use. For example, this change
would actually be BC-breaking to vLLM (they rely on the subgraph input
names being different at the moment).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157733
Approved by: https://github.com/ezyang
(cherry picked from commit b9afdd9bcc738697c6eefc90899508ab783bf6ab)
Co-authored-by: rzou <zou3519@gmail.com>
[MPS] Fix `index_kernel` for large tensors (#158064)
Move the `MetalShaderLibrary::bind_tensors` private method to OperatorUtils.h and extract an `iter_tensor_offset` method that returns the offset from the start of the storage associated with a given tensor inside the iterator.
Migrated `index` and `index_put[_accumulate][_serial]` to the new paradigm that requires neither an additional tensor for indices nor special handling for 32- vs 64-bit offsets, which resulted in an almost 2x perf gain for a 2000x2000 tensor; see results below, before
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 383.5 | 379.8 | 470.9 | 1232.9 | 4410.3
__getitem__ (torch.float16, torch.int64) | 379.6 | 354.5 | 533.2 | 1290.3 | 4442.2
__getitem__ (torch.float32, torch.int64) | 360.8 | 338.6 | 478.6 | 1348.9 | 4870.4
Times are in microseconds (us).
```
and after
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 349.8 | 330.5 | 432.6 | 764.5 | 1961.2
__getitem__ (torch.float16, torch.int64) | 342.5 | 330.7 | 434.7 | 741.0 | 1969.4
__getitem__ (torch.float32, torch.int64) | 332.2 | 326.1 | 445.4 | 751.3 | 1972.6
Times are in microseconds (us).
```
While migrating, also fixed `index_put_accumulate` for boolean types by using a compare-and-exchange trick over `uint`.
Fixes https://github.com/pytorch/pytorch/issues/153560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064
Approved by: https://github.com/dcci
(cherry picked from commit beed033b6e6ac57c0b4a1f47eb436e115a52e41b)
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Fix #156261 _foreach_copy indexing (#156719)
Fixes #156261
Thanks to @ngimel's fast eyes
For testing, I had experimented with a broader test case change but found that creating a tensor of 2**31+1 size was too expensive to do more than just a few times. Note that while the test case does not run in CI, I did run it locally to ensure it passes with new changes and fails without.
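For reference, a sketch of the kind of expensive check described (assumed shapes and dtype; the real test lives in the PR):
```python
import torch

# A tensor with more than 2**31 elements, where 32-bit indexing would overflow.
n = 2**31 + 1
src = torch.ones(n, dtype=torch.int8, device="cuda")
dst = torch.zeros(n, dtype=torch.int8, device="cuda")

torch._foreach_copy_([dst], [src])

# Elements past the 2**31 boundary must have been copied as well.
assert dst[-1].item() == 1
```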
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156719
Approved by: https://github.com/albanD
(cherry picked from commit 4ee4863232b9e07728d85254768bcba3aadc9b9a)
Co-authored-by: Jane Xu <janeyx@meta.com>
docs: add get_default_backend_for_device to distributed documentation (#156783)
`torch.distributed.get_default_backend_for_device()` API was added to torch 2.6, but is still missing in distributed documentation. This commit addresses the gap.
CC: @guangyey, @EikanWang
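For example (return values depend on how the build was compiled):
```python
import torch.distributed as dist

# Maps a device type to the default process-group backend for that device.
print(dist.get_default_backend_for_device("cpu"))   # typically 'gloo'
print(dist.get_default_backend_for_device("cuda"))  # typically 'nccl'
```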
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156783
Approved by: https://github.com/guangyey, https://github.com/malfet
(cherry picked from commit b146ca74f01df3cf711fd0f855e05805e490156c)
Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
don't error out in empty_cache under mempool context (#158152)
Now, instead of erroring out on an `empty_cache` call during graph capture or under a mempool context, we just silently do nothing. This already used to be the behavior for mempools; cudagraphs used to error out, but it's fine to just ignore the call.
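A small illustration of the new behavior, assuming a build that includes this change (previously the `empty_cache()` call below would raise during capture):
```python
import torch

x = torch.zeros(8, device="cuda")
_ = x + 1  # warm-up so no lazy initialization happens inside the capture

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x + 1
    torch.cuda.empty_cache()  # now a silent no-op during capture instead of an error
```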
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158152
Approved by: https://github.com/zou3519, https://github.com/eqy
(cherry picked from commit 9056279f8159b052599a31b591a78da1acc4224c)
Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
[autograd] Avoid creating and recording event when unnecessary (#157503)
Today, we always create and record an event in two places:
1) Upon seeing the first producer, we record an event on the producer, and we wait for this event in two places: (1) when the engine goes to run the consumer, the consumer stream waits for this event. (2) prior to doing accumulation, the accumulation stream waits for this event.
2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer.
We do not actually need to record the event in the cases where the 1st producer stream is the same as the consumer and as the accumulation stream, and where the accumulation stream is the same as the consumer stream.
Removing this unnecessary create + record event should save a few us for each instance avoided.
Fixes https://github.com/pytorch/pytorch/issues/157407
----
Manual test plan:
- [x] @eqy to confirm perf is restored
- [x] Running the repro originally reported before/after the patch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503
Approved by: https://github.com/eqy
ghstack dependencies: #155715
(cherry picked from commit 8bda95228fbefa6ce204bf4da8b632d1516431bb)
Co-authored-by: soulitzer <soulitzer@gmail.com>
[user triton] AOT inductor support for device-side TMA (#155896)
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`
Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.
To support this in AOTI, this PR:
* records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
* allocates global scratch, if needed (cuda/device_op_overrides.py)
* plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device
* updates tests to verify this works for dynamically shaped inputs
This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works)
Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space).
For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896
Approved by: https://github.com/desertfire
(cherry picked from commit b6c00dfe249a7bfc1d61a322d5bc30f164353abf)
Co-authored-by: David Berard <dberard@fb.com>
* Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 35e44067c4d9cc9be2652c0b9098885c5a321029.
* Revert "Add DeviceAllocator as the base device allocator (#138222)"
This reverts commit 92409b6c89fbfbd3caa79c81b1e3d9e7917d3bc7.
[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006
Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
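To see the failure mode in isolation (a standalone sketch, not the actual Inductor codegen or its fix):
```python
# Source of a user kernel that itself contains a triple-quoted docstring.
kernel_src = 'def k(x_ptr):\n    """My kernel docstring."""\n    return x_ptr\n'

# Naively embedding it in a triple-quoted block produces invalid Python,
# because the docstring's quotes terminate the outer string early.
broken = 'src = """\n' + kernel_src + '"""\n'

# One possible sanitization (not necessarily what the PR does): escape the inner quotes.
sanitized = 'src = """\n' + kernel_src.replace('"""', '\\"\\"\\"') + '"""\n'

try:
    compile(broken, "<generated>", "exec")
except SyntaxError as e:
    print("broken codegen:", e)
compile(sanitized, "<generated>", "exec")  # parses fine
```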
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
(cherry picked from commit 82eefaedd98b63de8a87e34275af781f8eb177e1)
Co-authored-by: David Berard <dberard@fb.com>
[PowerPC] Fixed build issue for vsx vec256 complexfloat and scaled_mm_out_cpu (#155255)
The PyTorch build is failing on Power systems since commit ec24f8f58a74502c5a2488f5d9e85a817616dda0
***Build Failure Logs***
**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
302 | if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
| ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
303 | (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
| ^~~~~~~~~~~~~~~~~~~~~~~~
```
**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
19 | class Vectorized<ComplexFlt> {
| ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
633 | auto abs_a = a.abs_2_();
| ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
634 | auto abs_b = b.abs_2_();
| ^~~~~~
/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
666 | vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
673 | vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
| ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
680 | vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```
***With this changes build logs***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
-- USE_TENSORPIPE=True
-- USE_GLOO=True
-- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc
```
This patch fixes the PyTorch build issue on Power, and I am able to build successfully.
Hi @malfet @albanD,
please review this PR for the PyTorch build issue that we are observing on Power.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet
(cherry picked from commit 5e18bc333144473f1f10bc8a5ba05dba7950fb8a)
Co-authored-by: Avanish Tiwari <avanish@linux.ibm.com>
[aarch64] Add back NCCL lib to cuda arm wheel (#156888)
We discovered that the latest 12.9 ARM nightly wheel is missing the NCCL lib when imported. With USE_SYSTEM_NCCL=1, we need to copy libnccl.so into our big-wheel environment so that it can be dynamically linked at runtime.
https://github.com/pytorch/pytorch/pull/152835 enabled USE_SYSTEM_NCCL=1, which uses the system NCCL by default and no longer uses the one built into libtorch_cuda.so. With this PR, we add back libnccl.so to be used at runtime. In this way, we also provide the flexibility to use a different version of NCCL from the one the original PyTorch build came with.
related - https://github.com/pytorch/pytorch/issues/144768
```
Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 417, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
```
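After the fix, a quick sanity check that the bundled NCCL resolves at runtime (illustrative; the exact version tuple depends on the wheel):
```python
import torch

# Importing torch succeeds once libnccl.so.2 is found, and the runtime NCCL
# version can be queried directly.
print(torch.cuda.nccl.version())  # e.g. (2, 26, 5)
```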
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156888
Approved by: https://github.com/atalman
(cherry picked from commit de45c5f673ce261e9a82c54280beeda36cff640e)
Co-authored-by: Ting Lu <tingl@nvidia.com>
[ROCm] Bump AOTriton to 0.10b (#156499)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:
* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
+ Without this optimization the binary size of `libaotriton.so` could be
over 100MiB due to 2x more supported architectures compared with 0.9b.
Now it is only about 11MiB.
* Support sliding window attention (SWA) in
`_flash_attention_forward/backward`. Should fix #154582
See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.
Notable changes to SDPA backend:
* `std::optional<int64_t>` `window_size_left/right` are directly passed to
ROCM's SDPA backend, because the default value `-1` is meaningful to
AOTriton's backend and bottom-right aligned causal mask is implemented with
negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
(cherry picked from commit d9577df312d477e8fa5b9d7bc61fb1f2c07b8e48)
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Fix environment and push env var for docker image builds for binary builds (#156910)
Changes WITH_PUSH and the environment check to allow giving credentials to push to docker.io if the workflow is on the main branch, a tag starting with v, or the release branch.
Credentials for pushing to docker.io are in the environment, so without the environment you can't push to docker.io. You also don't do the push unless WITH_PUSH is true.
Binary builds on the release branch were failing because they pull from docker.io, but the docker build wasn't pushing to docker.io because it was either on the release branch (didn't have credentials: https://github.com/pytorch/pytorch/actions/runs/15888166271/job/44813180986) or on the tag (doesn't have WITH_PUSH).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156910
Approved by: https://github.com/atalman
(cherry picked from commit 78ee2ee90eed957aec3dc80423b108b16938a8ae)
Co-authored-by: Catherine Lee <csl@fb.com>
By leaking the resource_tracker destructor (introduced by https://github.com/python/cpython/issues/88887 ) at exit, as at this point the handle to the child process might no longer be valid.
Also, switch CI from using `setup-miniconda` to `setup-python` as an integration test for the fix, as all data loader tests would hang otherwise:
- Remove the `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure what other action is messing with the PATH environment variable)
Fixes https://github.com/pytorch/pytorch/issues/153050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
Users may want CUDAGraph for certain sizes and fallback for other sizes.
As discussed in Issue #121968, we would like to use cudagraph for [batch size [1,2,3,...,16]](https://github.com/pytorch/pytorch/issues/121968#issuecomment-2259942345) and fallback for others.
Another use case is [vllm](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/cuda_piecewise_backend.py#L114-L119), where 67 batch sizes (i.e., [1,2,4,8,16,24,32,...,512]) are captured and all other sizes fallback.
This PR implements the feature with `torch._inductor.config.triton.cudagraph_capture_sizes`. When it is specified, we only capture cudagraph for these shapes. When it is None (by default), we capture cudagraph for all shapes.
Example:
```python
import torch

torch._inductor.config.triton.cudagraph_capture_sizes = [(2,3), (4,5), (6, 2), (7,3)]

def f(x):
    return x + 1

f = torch.compile(f, mode="reduce-overhead", dynamic=False)

def run(batch_size, seq_len, d):
    x = torch.randn((batch_size, seq_len, d), device="cuda")
    # Need to mark the dimension as dynamic. Automated-dynamic
    # may have some ux issues on matching `cudagraph_capture_sizes`
    # with the actual dynamic shapes, since there are specialization and
    # multiple dynamo graphs.
    torch._dynamo.mark_dynamic(x, 0)
    torch._dynamo.mark_dynamic(x, 1)
    for _ in range(3):
        f(x)

for i in range(2, 10):
    for j in range(2, 10):
        run(i, j, 8)

num_cudagraph = torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id()
assert num_cudagraph.id == 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156551
Approved by: https://github.com/bobrenjc93
Don't call `sum()` on a tensor that is default-constructed.
Previously we could call `sum()` on a tensor that was default-constructed. That would lead to an error like this:
```
Traceback (most recent call last):
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
yield
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
self._callTestMethod(testMethod)
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
if method() is not None:
^^^^^^^^
File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
method(*args, **kwargs)
File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
ln_out_cuda.backward(grad_output_cuda)
File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
torch.autograd.backward(
File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
_engine_run_backward(
File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::TensorBase::options() const from :0
#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0
#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0
#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0
#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0
```
Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156600
Approved by: https://github.com/eqy, https://github.com/ngimel
At the end of the scope where std::async is launched, a wait is called, which can make the code blocking; this is not expected for the monitoring thread. Instead, let's use a vector to hold the futures, so no blocking happens there. At the end of the loop, wait will still be called, but that is fine since all the checks or dumps have already finished.
Differential Revision: [D77190380](https://our.internmc.facebook.com/intern/diff/D77190380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156653
Approved by: https://github.com/kwen2501
I had to create a new PR for this because of @atalman's request to temporarily revert the previous PR to restore diff-train sync. Nothing has changed between this PR and the original one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156639
Approved by: https://github.com/atalman
Differential Revision: D76843916
Exhaustive autotuning is meant to autotune GEMM configs across the entire search space of possible configs. Some of these configs can cause extremely long compilation times and OOMs, especially configs of the following nature:
* Excessive register spillage
* Using much larger amounts of shared memory than is available on the hardware
This diff prunes out those configs to make exhaustive autotuning more viable, along with supporting exhaustive autotuning for the persistent+TMA template and decompose_k. Previously, exhaustive autotuning would hang; now we are able to tune shapes in ~5 minutes. Below is a sample log for autotuning with exhaustive:
```
AUTOTUNE mm(1152x21504, 21504x1024)
strides: [21504, 1], [1, 21504]
dtypes: torch.bfloat16, torch.bfloat16
mm 0.1167 ms 100.0%
triton_mm_6270 0.1172 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6522 0.1183 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7482 0.1190 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7483 0.1195 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6523 0.1274 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6267 0.1285 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6519 0.1287 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7480 0.1298 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7312 0.1302 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
SingleProcess AUTOTUNE benchmarking takes 298.7185 seconds and 21.2569 seconds precompiling for 2210 choices
INFO:tritonbench.utils.triton_op:Took 333894.46ms to get benchmark function for pt2_matmul_maxautotune
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156610
Approved by: https://github.com/jansel
Summary: Fixes https://github.com/pytorch/pytorch/issues/149311
Test Plan:
Just changes string output
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 60.993us 0.97% 60.993us 1.848us 0 B 0 B 33
...
```
Rollback Plan:
Differential Revision: D76857251
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156250
Approved by: https://github.com/sanrise
Original issue: https://github.com/pytorch/pytorch/issues/154820
The issue happens when there is a mutation of the same input in forward AND in backward.
AOTD emitted copy_ after joint_function tracing. This made that fx node correspond to the side effects of both mutations (in forward and in backward).
After that, the partitioner can put it either in forward or in backward.
The fix:
1/ Introduce joint_function.handle, which allows setting a "post_forward" callback so we can check the state of the inputs after forward.
We do not want to apply the mutation after the joint graph if we already applied it in forward. For that we need a "mutation_counter" and to memorize the version of the mutation that we applied for the forward mutation.
2/ Expose mutation_counter to Python.
We want to keep the invariant that copy_ exists only at the end of the joint graph.
3/ Memorize mutation_counter and the state of the inputs after forward, using the post_forward handle.
Emit post_forward mutations after the joint graph is fully traced.
Add a "must_be_in_forward" tag for post_forward mutations (similar to the existing "must_be_in_backward") to keep them in forward.
4/ Ban recompute of the source of a mutation. Recompute can apply the same op (e.g. add) in forward and backward.
For this, set MUST_SAVE for the source of the mutation in forward.
proxy_tensor changes:
By default, proxy tensor updates the tensor_tracker; in that case the applied mutations would be chained.
But we want this copy_ to be independent and applied just to primals.
For this, introduce a context manager that can disable updating the tensor_tracker when adding forward mutations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
Differential Revision: D76514984
Fix subgraph as a choice for when a symbolic shape is passed in as an expression, e.g. 256 * s0, which typically happens in the backwards pass. The current logic assumes that all symbolic shapes are single inputs, i.e. a standalone s0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156185
Approved by: https://github.com/masnesral
When timing is enabled, the ROCR runtime used to sleep for a small amount, which ensured that the application saw the correct state. However, for perf reasons this sleep was removed, and now the state is not guaranteed to be "started". That's why the test state check was updated to accept either "started" or "scheduled".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153545
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
The existing torchbench `Makefile` installs all models from torchbench, which can easily take 30 minutes, even if a developer only wants to run 1 model.
This PR adds a config to only install torchbench models we want to run.
Example usage:
```
# Install 1 torchbench model
make build-deps TORCHBENCH_MODELS="alexnet"
# Install 3 torchbench models
make build-deps TORCHBENCH_MODELS="alexnet basic_gnn_gcn BERT_pytorch"
# Install all models
make build-deps
# Install all models
make build-deps TORCHBENCH_MODELS=""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156465
Approved by: https://github.com/ezyang
In practice `bool(...)` is either constant folded by Dynamo or used for
branching (so most of its emulation logic lived in
`InstructionTranslator.generic_jump`).
This patch adds a dedicated `bool` handler (only for symbolic
bool/int/float for now), and fixes #136075.
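An illustrative case that exercises `bool(...)` on a symbolic value (hedged: not necessarily the exact repro from the issue):
```python
import torch

@torch.compile(backend="eager", dynamic=True)
def f(x):
    # bool() over a symbolic int: previously handled mainly via the jump logic,
    # now covered by the dedicated handler (a guard is installed on the result).
    flag = bool(x.shape[0] % 2)
    return x + 1 if flag else x - 1

print(f(torch.randn(5)))
```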
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155863
Approved by: https://github.com/williamwen42
**Summary**
Add a configuration option to enable a smaller dequantization buffer for WOQ INT4 CPP GEMM template. This can improve the performance of the WOQ INT4 GEMM template in cases where M is small. In such scenarios, matrix B cannot be effectively reused across matrix A, and we found that reducing the Kc block size can lead to better performance.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_with_small_buffer_config
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156395
Approved by: https://github.com/jansel
ghstack dependencies: #156407, #156387
Summary:
For the 16-GPU use case, NVSHMEM can drive only up to 49 GB/s with 8 thread blocks per peer for the all-to-all-v use case. Increasing that to 16 threads per block is able to max out the perf.
Test Plan:
Verify on two hosts
Host1:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip} --node_rank=0 comms.py -- master-ip ${master_ip} --b 4 --e 256M --n 500 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda
Host2:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip} --node_rank=1 comms.py -- master-ip ${master_ip} --b 4 --e 256M --n 100 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda
Rollback Plan:
Differential Revision: D76937048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156389
Approved by: https://github.com/kwen2501
Summary:
Cloned https://github.com/pytorch/pytorch/pull/153558 from benjaminglass1 and fixed internal typing errors.
Fixes longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.
Decisions made along the way:
1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.
The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.
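The practical effect, illustratively, is that a direct reference to an aten overload is seen by type checkers as a concrete object rather than an untyped attribute:
```python
import torch

# Previously type checkers treated attribute access on torch.ops.aten as untyped;
# with attributes set consistently, `op` resolves to a single concrete type.
op = torch.ops.aten.add.Tensor
out = op(torch.ones(2), torch.ones(2))
print(out)
```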
Test Plan: CI
Differential Revision: D75497142
Co-authored-by: Benjamin Glass <bglass@quansight.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154555
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/zou3519, https://github.com/benjaminglass1
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.
One TODO to point out: in this diff, if there are multiple backends that would have the same AOTAutogradCache key (the traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on the call to `serialize()`, I think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
This PR refers to the issue: https://github.com/pytorch/pytorch/issues/155352
This PR uses torch._dynamo.utils.warn_once so that this warning is only emitted once, clarifies in the warning that silent incorrectness is potential, not observed, and doesn't warn for functions that come from torch.*.
As of right now, with this code change the terminal outputs:
if the code came from torch.*:
Nothing, as we shouldn't warn for functions that come from torch.*
else:
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
If the user runs the command 'TORCH_LOGS="+dynamo" python foo4.py', the debug logs show (this log below is based on chillee's repro):
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] call to a lru_cache` wrapped function from user code at: /data/users/ssubbarao8/pytorch/foo4.py:9
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] File "/data/users/ssubbarao8/pytorch/foo4.py", line 9, in <module>
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] torch.compile(foo, backend="eager")(torch.randn(4))
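For reference, a minimal script along the lines of the foo4.py referenced in the log above (assumed content; the original repro is not included here):
```python
import functools
import torch

@functools.lru_cache
def foo(x):
    return x + 1

# Dynamo warns once that the lru_cache wrapper is ignored and the wrapped
# function is traced directly.
torch.compile(foo, backend="eager")(torch.randn(4))
```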
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156463
Approved by: https://github.com/williamwen42
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
This PR introduces device-side NVSHMEM completion guarantees via the quiet API in Triton, enabling GPU kernels to ensure all pending remote memory operations are fully complete before proceeding with subsequent operations.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_quiet` in `nvshmem_triton.py`
- Implemented `test_triton_quiet` in `test/distributed/test_nvshmem.py`, including:
- A Triton kernel that performs `putmem_block` followed by `quiet()` to ensure completion
- Flag-based signaling only after `quiet()` completes, guaranteeing data delivery
- Consumer validation that when the completion flag arrives, all data transfers are guaranteed complete
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_quiet`
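A hedged sketch of the kernel pattern this enables, mirroring the test description; here `nvshmem` refers to the Triton extern wrappers from `nvshmem_triton.py` (import path omitted), and the `quiet()` name is assumed from the `nvshmem_quiet` wrapper described above:
```python
import triton
import triton.language as tl

@triton.jit
def put_then_flag_kernel(dst_ptr, src_ptr, flag_dst_ptr, flag_src_ptr,
                         numel: tl.constexpr, peer: tl.constexpr):
    # Push the payload to the peer.
    nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)
    # quiet(): block until all previously issued puts from this PE are complete.
    nvshmem.quiet()
    # Only now publish the completion flag, so the consumer sees complete data.
    nvshmem.putmem_block(flag_dst_ptr, flag_src_ptr, 1, peer)
```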
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156475
Approved by: https://github.com/kwen2501
ghstack dependencies: #156472, #156473, #156474
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)`, such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```
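A standalone FX sketch of the reordering idea (not the actual partitioner change):
```python
import operator
import torch.fx as fx

def hoist_getitems_to_producer(gm: fx.GraphModule) -> fx.GraphModule:
    # Move every getitem node to sit immediately after the node it indexes into,
    # so earlier outputs of a multi-output op don't stay alive longer than needed.
    for node in list(gm.graph.nodes):
        if node.op == "call_function" and node.target is operator.getitem:
            producer = node.args[0]
            if isinstance(producer, fx.Node):
                producer.append(node)  # re-inserts `node` right after `producer`
    gm.graph.lint()
    gm.recompile()
    return gm
```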
**Results:**
For instance, for the `res2net101_26w_4s` model in the pytorch benchmark, when running with the `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory is:
* baseline: 7.73GiB
* with the change: 6.45GiB
As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.
cc and credit to @ShatianWang for noticing this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
This PR introduces device-side NVSHMEM memory ordering via the fence API in Triton, enabling GPU kernels to enforce completion and ordering of remote memory operations before subsequent operations proceed.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_fence` in `nvshmem_triton.py`
- Implemented `test_triton_fence` in `test/distributed/test_nvshmem.py`, including:
- A Triton kernel that performs two ordered `putmem_block` operations separated by `fence()` calls
- Final fence before flag update to ensure all data transfers complete before signaling
- Consumer validation that both buffers contain expected values when flag arrives, proving ordering guarantees
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_fence`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156474
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156472, #156473
The CA initial trace just proxies nodes without dispatching any ops; we should hide it from ambient TorchDispatchModes.
In terms of differences with eager autograd engine:
- For function mode, CA additionally disables/re-enables `_set_multithreading_enabled`
- For dispatch mode:
- accumulate grad doesn't go down the stealing path (inaccurate compile-time refcount) so the grad `detach` ops are `copy_` instead
- Since we always initial trace with dynamic shapes, and we filter out sizes, there's 1 aten.empty.memory_format for each mark_dynamic'd scalar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156516
Approved by: https://github.com/jansel
ghstack dependencies: #156374, #156509
Example:
```python
File "/home/xmfan/core/a/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: TorchDispatchMode not yet implemented for compiled autograd.
You can disable compiled autograd for this operation by:
1. Relocating the unsupported autograd call outside the compiled region.
2. Wrapping the unsupported autograd call within a scope that disables compiled autograd.
3. Configuring the specific compilation unit to disable compiled autograd.
4. Globally disabling compiled autograd at the application's initialization.
```
No duplicate error messages for python side trace-time errors
```python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xmfan/core/a/pytorch/torch/_dynamo/compiled_autograd.py", line 344, in begin_capture
raise NotImplementedError(
NotImplementedError: Found tensor of type <class 'torch.nn.utils._expanded_weights.expanded_weights_impl.ExpandedWeight'>, which is not supported by FakeTensorMode. You can turn off compiled autograd by either:
1. Moving the unsupported autograd call outside of the torch.compile'd region.
2. Wrapping the unsupported autograd call in the torch._dynamo.compiled_autograd._disable() context manager.
3. Setting torch._dynamo.config.compiled_autograd=False for the torch.compile call containing the unsupported autograd call.
4. Setting torch._dynamo.config.compiled_autograd=False at the start of the program.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156509
Approved by: https://github.com/jansel
ghstack dependencies: #156374
This PR introduces device-side NVSHMEM signal synchronization via the signal_wait_until API in Triton, enabling GPU kernels to block until a signal variable meets a specified condition. This replaces previous barrier-based synchronization patterns with more efficient signal-based coordination between PEs.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_signal_wait_until` in `nvshmem_triton.py`
- Updated existing `test_triton_put_signal` and `test_triton_put_signal_add` tests to use `signal_wait_until` instead of `dist.barrier()` for proper device-side synchronization ([per feedback](https://github.com/pytorch/pytorch/pull/156211#discussion_r2153035675))
- Implemented `test_triton_signal_wait_until` with:
- Producer-consumer pattern where Rank 0 puts data and signals completion via `putmem_signal_block`
- Consumer (Rank 1) uses `signal_wait_until` to block until the signal variable reaches the expected value
- End-to-end validation of both data transfer and signal synchronization
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_signal_wait_until`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156473
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
ghstack dependencies: #156472
### Summary
int8 WoQ GEMM concat linear optimization pertaining to the same activation applied to 3 sets of weights of the same shape.
### Perf data
GPT-J 128 input tokens, 128 output tokens.
32 physical cores of one socket of Intel(R) Xeon(R) 6972P (Xeon Gen 5). tcmalloc & Intel OpenMP were preloaded.
| May 8 nightly first token latency | First token latency with this implementation | Rest token latency with May 8 nightly | Rest token latency with this implementation combined with #149373 |
|---|---|---|---|
|202 ms | 190 ms | 33 ms | 30 ms|
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153004
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel
Co-authored-by: Anthony Shoumikhin <anthony@shoumikh.in>
This PR introduces device-side NVSHMEM synchronization via the wait_until API in Triton, enabling GPU kernels to block until a remote flag reaches a specified value. It also adds a corresponding end-to-end test to validate correct behavior across PEs.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_longlong_wait_until` in `nvshmem_triton.py`.
- Implemented `test_triton_wait_until` in `test/distributed/test_nvshmem.py`, including:
- A simple Triton kernel that calls `nvshmem.wait_until` on a symmetric memory flag.
- Coordination logic where Rank 0 blocks until Rank 1 atomically sets the flag and transfers data.
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_wait_until`
```python
@triton.jit
def put_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)
@triton.jit
def wait_until_kernel(ivar_ptr, cmp_op: tl.constexpr, cmp_val: tl.constexpr):
nvshmem.wait_until(ivar_ptr, cmp_op, cmp_val)
...
if rank == 0:
print(f"[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21")
wait_until_kernel[(1, 1, 1)](ivar_ptr, cmp_op=NVSHMEM_CMP_EQ, cmp_val=flag_val, extern_libs=nvshmem_lib)
print(f"[RANK 0] WAIT IS OVER! Flag was set, checking data now...")
print(f"[RANK 0] Current out buffer contents: {out.tolist()}")
torch.testing.assert_close(out, val * torch.ones(numel, dtype=dtype, device=self.device))
print(f"[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.")
if rank == 1:
print(f"[RANK 1] About to PUT 8 elements of value 13 to rank 0")
put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)
print(f"[RANK 1] About to PUT flag value 21 to wake up rank 0")
put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=1, peer=peer, extern_libs=nvshmem_lib)
print(f"[RANK 1] FLAG PUT complete! Rank 0 should wake up now.")
...
```
Output:
```
[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21
[RANK 1] About to PUT 8 elements of value 13 to rank 0
[RANK 1] About to PUT flag value 21 to wake up rank 0
[RANK 1] FLAG PUT complete! Rank 0 should wake up now.
[RANK 0] WAIT IS OVER! Flag was set, checking data now...
[RANK 0] Current out buffer contents: [13, 13, 13, 13, 13, 13, 13, 13]
[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.
[RANK 0] Test completed successfully! 🎉
[RANK 1] Test completed successfully! 🎉
...
----------------------------------------------------------------------
Ran 1 test in 18.773s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156472
Approved by: https://github.com/kwen2501
Summary:
In D75617963, we started logging dynamic whitelist suggestions to PT2 Compile Events. The whitelists were aggregated across all frames, intending to avoid manual work for the user (e.g. if frame 0/1 saw L['x'] turn dynamic, and later 1/1 saw L['y'], we'd log "L['x'],L['y']" on frame 1/1).
This switches to frame-specific whitelists, as attributing dynamism changes to certain frames was difficult, and suggestions are sometimes polluted by problematic frames (e.g. optimizer states).
The globally aggregated whitelist is still available in tlparse, by looking at the final `put_local_code_state_*` entry.
Test Plan:
loggercli codegen GeneratedPt2CompileEventsLoggerConfig
Rollback Plan:
Differential Revision: D76628834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155959
Approved by: https://github.com/bobrenjc93
dynamo: Don't crash when someone tries to access a non-existent list member
Test added which reproduces the failure. Note that I'm using the new
unimplemented_v2 API. Let me know if people have a strong preference that I use
something else.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156335
Approved by: https://github.com/jansel
Since it's now actually used within async_compile.multi_kernel
```python
def multi_kernel(self, *args, **kwargs) -> Any:
from torch._inductor.codegen.multi_kernel import MultiKernelCall
# no need to call this in parallel since the sub-kernels are already parallel tasks
return MultiKernelCall(*args, **kwargs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156158
Approved by: https://github.com/jansel, https://github.com/shunting314
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels
Copied from original auto_functionalize Diff Summary D53776805:
This is a non-functional kernel implementation for auto_functionalize
In AutoFunctionalizeKernel, I directly call the underlying target without making a clone of mutating inputs.
This mutates the input tensors in place, which is unsafe in general.
However, Sigmoid is not doing any graph optimization or node reordering at the moment, so it's OK to take this shortcut.
In the proper functional implementation, it will:
- make a clone of the mutating input tensors
- return these new tensor instances as the AutoFunctionalizeKernel output
If the original exported program has some "bufferMutation" or "userInputMutation" fields, it will also need to honor such mutations in Sigmoid.
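A rough sketch of the two behaviors described above (hypothetical function names, not the actual Sigmoid kernel code):
```python
# Non-functional shortcut: call the underlying target directly, mutating the
# inputs in place (acceptable only because Sigmoid does no reordering/optimization).
def auto_functionalize_inplace(target, mutating_inputs, other_args):
    target(*mutating_inputs, *other_args)
    return list(mutating_inputs)  # same, now-mutated tensors

# Proper functional version: clone the mutating inputs first and return the
# clones as the kernel outputs, leaving the originals untouched.
def auto_functionalize_functional(target, mutating_inputs, other_args):
    cloned = [t.clone() for t in mutating_inputs]
    target(*cloned, *other_args)
    return cloned
```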
Test Plan: See internal for test plan
Differential Revision: D76926383
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156454
Approved by: https://github.com/zhxchen17
In preparation for adding integer addmm, move the matmul computation part into a matmul_inner function.
Change the callstack from group_id, thread_id_in_group to thread_id, thread_id_in_group, which eliminates the need to calculate the index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
Eigen's version is controlled by `eigen_pin.txt`, and it will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen, and GitLab can be down when GitHub is up, which has caused spurious CI failures in the past.
Remove the eigen submodule and replace it with eigen_pin.txt.
Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
ghstack dependencies: #155947, #155954
### MOTIVATION
To generalize Distributed test cases for non-CUDA devices
### CHANGES
- test/distributed/optim/test_zero_redundancy_optimizer.py
- test/distributed/test_c10d_logger.py
- test/distributed/test_compute_comm_reordering.py
Replaced hard-coded device names with get_devtype from torch.testing._internal.common_fsdp (see the sketch after this list).
DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions.
- torch/testing/_internal/common_distributed.py
Extended common utility functions.
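A rough sketch of the pattern, assuming the helpers named in the list above (`get_devtype`, `DistributedTestBase`); the test body itself is illustrative:
```python
import torch
from torch.testing._internal.common_fsdp import get_devtype
from torch.testing._internal.common_distributed import DistributedTestBase

# Derive the device type from the shared helper instead of hard-coding "cuda".
device_type = str(get_devtype())  # e.g. "cuda", "xpu", "hpu" depending on the build

class TestC10dLoggerGeneric(DistributedTestBase):
    def test_something_distributed(self):
        # previously: torch.device("cuda", self.rank)
        device = torch.device(device_type, self.rank)
        ...
```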
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471
Approved by: https://github.com/d4l3k
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.
Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings were strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.
Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
Implements https://github.com/pytorch/pytorch/issues/144908.
Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.
One TODO to point out: in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
The backwards pass simply iterates over all 8 points the current point contributed to, and back-propagates them with the respective weights.
TODO: Benchmark the performance of a similar loop for the forward pass (i.e. the compiler should be able to do loop unrolling, so there is no point in unrolling it by hand)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156373
Approved by: https://github.com/dcci
ghstack dependencies: #156375
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This accuracy issue is exposed by [PR #156174](https://github.com/pytorch/pytorch/pull/156174), which changes `block_N` from 64 to 32. This change increases the likelihood of `Nc_block` being greater than 1, making it more likely to trigger the issue. This PR will fix this accuracy issue.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156407
Approved by: https://github.com/CaoE
There is a memory layout mismatch between the XPU `fft_r2c` implementation and Inductor's meta deduction.
The original `fft_r2c` Inductor meta deduction for the XPU backend was aligned with the CPU (fallback) implementation. This PR corrects the Inductor meta deduction and updates the torch-xpu-ops commit to [intel/torch-xpu-ops@`3a9419c`](3a9419c8bb).
The XPU implementation first performs the R2C transform on the last dimension, followed by iterative C2C transforms on the remaining dimensions.
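For intuition, a small sketch (plain torch.fft calls, not the XPU kernel itself) of the decomposition described above: an R2C transform on the last dimension followed by C2C transforms on the remaining dimensions is equivalent to `rfftn`.
```python
import torch

x = torch.randn(4, 6, 8)
step = torch.fft.rfft(x, dim=-1)   # R2C on the last dimension
step = torch.fft.fft(step, dim=1)  # C2C on the remaining dimensions
step = torch.fft.fft(step, dim=0)
torch.testing.assert_close(step, torch.fft.rfftn(x), rtol=1e-4, atol=1e-4)
```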
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156048
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel
Add some on-exit logging to the async compile workers. When you use `TORCH_LOGS=async_compile` (or `all`) it will now report how many workers were enqueued & dequeued (should be the same) as well as queuing time (how long workers sat on the queue before starting to run) and maximum depth (how many workers were waiting to start).
Tested manually by running a larger internal model and then lowering the number of available workers to see the time and depth get longer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155820
Approved by: https://github.com/masnesral
Summary:
Moves GraphExecutorBase class to PyTorch core.
GraphExecutorBase is a lightweight abstraction to execute a graph with execution frames without actually owning the graph nor the weights. This is introduced to decouple the state management of the top level runtime from the kernel executions so that sub graphs from higher order ops can be supported.
Torch Native Runtime RFC: pytorch/rfcs#72
Test Plan:
CI
Rollback Plan:
Differential Revision: D76830436
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156196
Approved by: https://github.com/zhxchen17
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:
* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
+ Without this optimization the binary size of `libaotriton.so` could be
over 100MiB due to 2x more supported architectures compared with 0.9b.
Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582
See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.
Notable changes to SDPA backend:
* `std::optional<int64_t>` `window_size_left/right` are directly passed to
ROCM's SDPA backend, because the default value `-1` is meaningful to
AOTriton's backend and bottom-right aligned causal mask is implemented with
negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
This commit fixes AOT compilation in the SYCL C++ extension, which got accidentally dropped in aca2c99a652 (a silent fallback to JIT compilation had happened). The commit also fixes the override logic for default SYCL targets, allowing the flexibility to specify targets externally. Further, it extends test coverage to cover such a case and fixes an issue in the test where subsequent tests executed the same (first) compiled extension due to name conflicts.
Fixes: #156249
Fixes: aca2c99a652 ("xpu: get xpu arch flags at runtime in cpp_extensions (#152192)")
CC: @pengxin99, @guangyey
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156364
Approved by: https://github.com/ezyang
There are 32 H100 `linux.aws.h100` runners and they are still not fully utilized, with more than half staying idle, so we can add more shards to finish the whole suite within 4 hours. I added 1 more shard for `TIMM` and 3 more for `TorchBench`, using the durations from a sample run https://github.com/pytorch/pytorch/actions/runs/15753185459/job/44411825090
With this computing power, we could also run the whole suite every 4 hours now. I could run this less frequently later if I see queueing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156429
Approved by: https://github.com/atalman
This was not caught by CI beforehand, as all 3D examples right now are symmetric, so an uneven shape is added to `sample_inputs_interpolate`.
Though it is indirectly tested by the `test_upsample_nearest3d` inductor test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375
Approved by: https://github.com/atalman
The recursive function collect_temp_source in PyCodegen uses a closure, which caused a reference-cycle issue when torch.compile is used.
This issue may cause large tensors to not be freed in a timely manner even when there are no user references to these tensors.
We saw OOM issues because of this problem in many cases, including training and inference using torch.compile.
The fix is to replace the recursive implementation with an iterative one.
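A minimal sketch of the shape of the fix (hypothetical names, not the actual PyCodegen code): an explicit worklist replaces the self-referencing recursive closure, so no closure cell keeps the surrounding objects alive in a reference cycle.
```python
def collect_temp_sources(root_source, get_children):
    # Iterative traversal with an explicit stack instead of a recursive closure.
    collected = []
    stack = [root_source]
    while stack:
        source = stack.pop()
        collected.append(source)
        stack.extend(get_children(source))
    return collected
```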
Fixes #155778
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155791
Approved by: https://github.com/ezyang
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/cpp/nativert:c10_kernel_test
```
Differential Revision: D76825830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156208
Approved by: https://github.com/zhxchen17
Enable more parallelism for multi-dimensional reductions. In the case of multi-dimensional reductions, the grid often starts with a single active block. In such cases, we need to allow the parallelism to be extended along the y-direction of the grid to avoid having a single block running.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155806
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponding to which weight name.
Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storage()`).
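A minimal sketch of the dedup-by-storage idea (hypothetical helper and file-naming scheme, not the actual packaging code):
```python
def dedup_weights(named_tensors):
    # Tensors that view the same untyped storage are written to disk once and
    # share the same entry in the per-model weights_config.json.
    storage_to_file = {}
    weights_config = {}
    for name, tensor in named_tensors.items():
        key = tensor.untyped_storage().data_ptr()
        if key not in storage_to_file:
            storage_to_file[key] = f"weight_{len(storage_to_file)}.bin"
        weights_config[name] = storage_to_file[key]
    return weights_config
```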
- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.
Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```
Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7
Rollback Plan:
Differential Revision: D74747190
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:
- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
- Fallback to old implementation if no `max_version` or if version
lower than 1.0
- Check that the to-be-consumed capsule is of version up to 1.X
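A short sketch of the user-facing behavior listed above (hedged: exact keyword handling follows the `max_version` parameter this PR adds):
```python
import torch

x = torch.arange(4)

# Producer: requesting max_version=(1, 0) yields a DLManagedTensorVersioned
# capsule; omitting it (or asking for < 1.0) falls back to the legacy
# DLManagedTensor path.
capsule = x.__dlpack__(max_version=(1, 0))

# Consumer: torch.from_dlpack accepts either capsule flavor.
y = torch.from_dlpack(capsule)
```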
In order to accommodate these new specifications, this PR adds the
following main changes:
- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
  from a DLPack capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
Uses the new aoti_torch_call_dispatcher interface to call runtime fallback ops without calling back into Python. This supports a limited subset of input and output datatypes, but a significant majority of remaining fallback ATen ops are covered.
Fixes #150988, fixes #153478
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154142
Approved by: https://github.com/desertfire
Summary: We noticed `std::future_error: Broken promise` errors in logging, so let's disable it for now and investigate more.
Test Plan:
CI
Rollback Plan:
Differential Revision: D76929722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156362
Approved by: https://github.com/fegin
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash).
The bw_module's generated code is then serialized, and at compiled autograd runtime, it is restored via symbolic_trace. This also means that presence of tensor constructors will be lifted as constants. Something we will address separately.
Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #156120
Summary:
This implements staging in a way that doesn't mess up checkpointing semantics. We want to stay close to torch.save/load semantics; when async checkpointing is used today, it messes up shared storages and doesn't handle custom objects or tensors well. E.g.: a user passes a state_dict with a CUDA tensor in it; this is deep-cloned, causing the staging tensor to be created on the GPU. This can cause OOMs that are hard to debug.
This diff hooks into deepcopy of storages to move them to CPU using the cached storages created for async checkpoint staging. This allows reusing storages created for staging to avoid recreating them on each checkpoint, while also being flexible enough to handle any changes - cleaning up old storages or creating new ones as needed.
The lifetime of staging storages is tied to the original storage object. When the original storage object is gc-ed, we delete the corresponding staging storage from the cache, possibly causing it to be gc-ed if there are no other references. I am using the data_ptr of the storage to keep track of this. Please share thoughts on this.
The alternative is to use FQNs instead of storage_id and verify the underlying storage object has the same shape/size, etc. to make the caching logic work. The current implementation is much simpler and cleaner.
The API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)
# do this on every checkpoint:
with staging_context(stager):
cpu_state_dict = copy.deepcopy(state_dict)
```
Also adds support for pinned memory.
One problem this implementation does not address is that we lose the original device.
The only alternative here is to pickle synchronously like torch.save but with special handling for storages. It is valuable to keep the state_dict throughout the checkpointing process so users can manipulate and debug it as needed, so we need to unpickle in the background process. I think this is flexible, not performant, and not very different from the current solution, but it needs more code. One idea, if we really want to address this, is to stick the original device in a variable on the storage and then use it to recover on the load side. I think we do not need this for now and can be explicit about losing the device type for async checkpointing.
Update:
Note: Due to reservations on hooking into deepcopy to customize it, the PR is now updated to use deepcopy like logic to clone the state_dict. There are some caveats to this solution:
1. Duplicated deepcopy code to hook into for tensors. There is a risk of this code getting outdated with Python version changes. This is needed to handle several different types like NamedTuples, frozen dataclasses, and nested dataclasses. The deepcopy logic relies on reduce_ex to get a function with which these can be constructed.
2. Since we are bypassing deepcopy and adding custom logic to clone a tensor, we are missing some of the functionality that exists in deepcopy for torch.Tensor, like _clear_non_serializable_cached_data() or other logic. We would like thoughts on which logic, or whether everything, should be copied.
3. If any object implements deepcopy, we will not be able to handle any tensors in its attrs with this logic, because it likely just calls copy.deepcopy on the attrs instead of this deepcopy logic. We are taking care of subclasses of torch.Tensor to work around this.
The new API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)
# do this on every checkpoint:
cpu_state_dict = copy.stage(state_dict)
```
Test Plan:
unit tests
Differential Revision: D75993324
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155192
Approved by: https://github.com/mikaylagawarecki, https://github.com/pradeepfn
Context on torch.cuda.memory._record_memory_history buffer behavior
## Description
Answer questions:
- Can I keep _record_memory_history() always enabled with the default max_entries=sys.maxsize (9223372036854775807)? Will it consume a significant amount of CPU RAM?
- If I set max_entries to a lower value, e.g. 2000, will it keep the first 2000 entries and then stop recording or will it keep the most recent 2000 entries before each snapshot (fifo-style)?
- What is the expected size on disk of the snapshots? Some KBs, MBs?
Fixes #129674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155889
Approved by: https://github.com/ngimel
Summary:
Added a unit test case for a corner case of combo kernel where all below are true:
1. more than 1 dimensions are dynamic size
2. no_x_dim persistent reduce op
Test Plan:
```
buck2 test mode/opt caffe2/test/inductor:combo_kernels -- test_dynamic_shapes_persistent_reduction_no_x_dim_2
```
Rollback Plan:
Differential Revision: D76699002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156035
Approved by: https://github.com/mlazos
Summary:
### Diff Context
1. Sometimes, a tensor might have non-zero size and 0 numel. In this case, pinning memory will fail
so we take a best guess at how to replicate the tensor below to maintain symmetry in the returned
state dict.
2. ShardedTensor copying was not handled originally in PyTorch state_dict copy APIs, handled in this diff.
Test Plan: CI
Differential Revision: D75553096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156092
Approved by: https://github.com/pradeepfn
Changed the way we compute shapes for unpacked float4. Previously we always added a last dimension of [2] to the existing shape, but this doesn't really make sense because it prevents us from being able to represent any shape other than those with a last dim of [2]. I updated the logic to be `[*shape[:-1], shape[-1]*2]`, which doubles the last dimension. This is more in line with what we see in practice when people are using 4-bit types, and it allows us to represent any shape with an even dimension at the end, which is much more reasonable in my opinion.
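For illustration, the new rule in isolation (hypothetical helper name):
```python
def unpacked_float4_shape(shape):
    # Double the last dimension instead of appending a trailing [2].
    *leading, last = shape
    return [*leading, last * 2]

assert unpacked_float4_shape([8, 16]) == [8, 32]  # new behavior
# the previous behavior would have produced [8, 16, 2]
```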
Also clarified in https://github.com/pytorch/pytorch/pull/148791#discussion_r2155395647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156353
Approved by: https://github.com/titaiwangms
# Context
MTIA supports CPU fallback, and people can set it using env vars. By migrating aten backend to in-tree, we also need to provide this support.
# This diff
Suggested by Alban(pytorch core), instead of skipping registration, this diff achieves CPU fallback by doing additional registration and override.
The benefits of this approach:
1. The previous solution has problem handling ops that have default dispatch key(e.g. CompositeImplicitAutograd), and can't really achieve CPU fallback.
2. The CPU fallback related logic can be aggregated in aten_mtia_cpu_fallback.cpp.
----------------
p.s. D76314740 also tried reusing the yaml parsing logic in mtia's python script, but realized that the env vars are only available at runtime but not at compile/codegen time
Differential Revision: [D76376644](https://our.internmc.facebook.com/intern/diff/D76376644/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155634
Approved by: https://github.com/nautsimon, https://github.com/albanD
Summary:
When we encounter an unbacked symint during autotuning, we try to reuse existing
symbols from user-provided inputs, then fall back.
Test Plan:
python test/inductor/test_aot_inductor.py -k test_triton_dynamic_launcher_grid
Rollback Plan:
Differential Revision: D76769711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156133
Approved by: https://github.com/jingsh
Summary:
In PT2 with GPU with AOTI, weight names are like
```merge.submod_0._run_on_acc_0.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```
but when publishing delta snapshots, lowering is skipped so weights are like
```merge.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```
so when loading delta weights in original model runner, we need to:
1. Redo tensorName -> weight idx look up, because the weight ordering may be different.
2. use trimmed tensorName to find the correct weight path.
Note that with this diff, delta snapshot loading still does NOT use xl weights. This should be fine for now as we are still publishing full model with non-xl weights.
Test Plan:
Merge only:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODULE=merge
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13
CUDA_VISIBLE_DEVICES=2,3 buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=DenseOnly --baseNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE} --moduleName=${MODULE} --predictor_hardware_type 1 --submodToDevice "" --deltaNetFile /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/delta_${DENSE_DELTA_SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}
```
Local replayer:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13
USE_SERVABLE=0 HARDWARE_TYPE=0 DENSE_DELTA_IDS=${DENSE_DELTA_SNAPSHOT_ID} ENABLE_REALTIME_UPDATE=1 CUDA_VISIBLE_DEVICES=6,7 sh ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 7455
USE_SERVABLE=0 sh sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 10 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_mtml_ctr_instagram_model_500 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} true 7455
```
Rollback Plan:
Differential Revision: D76520301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155872
Approved by: https://github.com/SherlockNoMad
Fixes #156309
Instead of any kind of locking or busy-waiting, we leave room for multiple script downloads to happen; only one `rename` will succeed and the others will silently fail, removing any temporary files created during this process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156310
Approved by: https://github.com/malfet
Co-authored-by: Alexander Zhipa <azzhipa@amazon.com>
NCCL zero-copy support only works for SUM reductions. FSDP2, by default, was preferring AVG reductions or, when using `set_reduce_scatter_divide_factor`, PreMulSum reductions.
Moreover, PreMulSum reductions had a few bugs, such as #155903 and #155904.
This PR adds a flag to always use SUM reductions, potentially requiring separate pre-/post-scaling kernels, and reworks the `set_reduce_scatter_divide_factor` logic to make it safer (and renaming it to avoid confusion).
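A minimal sketch of the idea (assumed function name): keep the collective itself a SUM and fold the averaging into a local pre-scaling, instead of relying on AVG or PreMulSum reductions.
```python
import torch.distributed as dist

def reduce_scatter_mean_via_sum(output, local_shard, group=None):
    # Pre-scale locally so that a SUM reduction yields the mean; the collective
    # stays a SUM, which is what NCCL zero-copy supports.
    world_size = dist.get_world_size(group)
    dist.reduce_scatter_tensor(
        output,
        local_shard / world_size,
        op=dist.ReduceOp.SUM,
        group=group,
    )
```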
Differential Revision: [D76895058](https://our.internmc.facebook.com/intern/diff/D76895058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155915
Approved by: https://github.com/xunnanxu
## Summary
Adds the missing batching rule for `torch.matrix_exp` to enable efficient `vmap` support.
Previously, using `vmap` with `matrix_exp` would trigger a performance warning and fall back to a slow loop-based implementation, even though `matrix_exp` natively supports batched inputs.
Fixes #115992
## Details
`torch.matrix_exp` is an alias for `torch.linalg.matrix_exp`. This PR adds vmap support by registering `matrix_exp` with `OP_DECOMPOSE`, which reuses the existing CompositeImplicitAutograd decomposition to automatically generate batching behavior from the operation's simpler component operations.
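A small usage sketch of what this enables:
```python
import torch

batch = torch.randn(8, 3, 3)

# With the decomposition registered, vmap no longer emits a performance
# warning or falls back to a per-sample loop.
out = torch.vmap(torch.matrix_exp)(batch)

# matrix_exp natively handles batched input, so the results should agree.
torch.testing.assert_close(out, torch.matrix_exp(batch))
```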
## Testing
The existing test suite for vmap and matrix_exp should cover this change. The fix enables:
- No performance warning when using `vmap(torch.matrix_exp)`
- Efficient native batched execution instead of loop-based fallback
**Edit:** Updated Details section to accurately reflect the implementation approach (decomposition rather than batch rule registration)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155202
Approved by: https://github.com/zou3519
Summary: After this diff stack lands, we are pretty much done with the training IR migration. So there is no need to run extensive legacy export test.
Test Plan:
CI
Rollback Plan:
Differential Revision: D76734378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156093
Approved by: https://github.com/desertfire
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)`, such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```
**Results:**
For instance, for the `res2net101_26w_4s` model in the pytorch benchmark, when running with the `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory is:
* baseline: 7.73GiB
* with the change: 6.45GiB
As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.
cc and credit to @ShatianWang for noticing this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
ghstack dependencies: #155943
**Summary**
We found that using `block_n=32` brings better performance for A16W4 than `block_n=64` because cache locality is better and parallelism is better if N is small and more cores are used.
For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, `block_n=32` is faster by >10% E2E for both first and next token.
**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156174
Approved by: https://github.com/leslie-fang-intel
This adds support for the host-side TMA api (TensorDescriptor.from_tensor) for AOTI. Note: this should support all the same features as the old (experimental) TMA api, but not some new features of the new TMA, like mxfp4 support.
Note: one complexity with the new TMA api is that a single TMA descriptor passed to the python kernel turns into 1 + 2 * N args in the cubin function signature, for a rank-N tensor.
What this PR contains:
1) device_op_overrides.py: add a rough copy of fillTMADescriptor from https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L283. However, the fillTMADescriptor implementation in Triton is significantly modified, so that much of the computation (about swizzling and data types) is done before the time of the TMA construction. For simplicity, I've moved the computation into the cuda helper kernel (as was the previous strategy with fill2DTMADescriptor); but long term we might want to unify our implementation with the upstream implementation
2) device_op_overrides.py: introduces a struct "StableTMADescriptor" which stores some of the 1 + 2 * N args for the cubin signature (along with the global shape, which is not strictly needed, but this cleans up the call to the triton kernel)
3) plumbing through cpp_wrapper_gpu.py. The main thing to note is: the code generated by cpp_wrapper_gpu.py generally refers to the StableTMADescriptor object when it passes around a "tma descriptor" variable. At the very end (in generate_args_decl), the StableTMADescriptor is unwrapped and the individual arguments are passed into the cubin.
Tests: test_aot_inductor.py's test_triton_kernel_tma_descriptor_{N}d_dynamic_{D}_tma_version_{V}_cuda: for N in {1, 2} and D in {True, False}, and V = {new, old}, this test passes (or is skipped, if the appropriate TMA API is not available). Tested on H100 for Triton 3.3 and Triton 3.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155879
Approved by: https://github.com/desertfire
Summary: We want to build a logging table for tracking the collective time spent on GPU for all internal workloads. Since we have a cudaEventQuery for both the start and end of a collective (We rolled out ECudaEventStart (enableTiming) fully already), we plan to add this logging table inside the watchdog of PyTorch ProcessGroupNCCL so that we get to know the duration of collectives.
Test Plan:
CI + dry run.
Rollback Plan:
Differential Revision: D76552340
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156008
Approved by: https://github.com/fegin, https://github.com/eqy
This is enough to get @XilunWu 's stack in a state where his flex_attention DTensor implementations worked E2E for me. It also required these changes on the DTensor side, to properly add a DTensor rule for flex backward: P1789852198
There are two problems:
(1) in the normal dispatcher, we have a precedence ordering between modes and subclasses. Modes are dispatched to first, but modes are allowed to return NotImplemented, giving subclasses a chance to run.
This normally happens automatically in `FakeTensorMode.__torch_dispatch__` and `FunctionalTensorMode.__torch_dispatch__`. However, since HOPs implement these two modes themselves, HOPs do not get this benefit. For now, I ended up hardcoding this `NotImplemented` logic directly into the functional/fake rules for flex attention.
Having to do this for every HOP seems a bit painful. If we could plumb every HOP through `Fake[|Functional]TensorMode.__torch_dispatch__` then we would get this support. Another option could be to just assume that most HOP <> mode implementations want the same treatment by default, and hardcode this `NotImplemented` logic into `torch/_ops.py`. I'm not sure if we'd need a way for the HOP to opt out of this though.
(2) We were hardcoding a call to flex attention's fake implementation in dynamo to run fake prop. This is technically wrong for subclasses, because it doesn't give subclasses the chance to interpose on the op and desugar it before fake prop runs. I tweaked dynamo's logic to call the op, and let the dispatcher handle invoking the fake implementation.
**Testing** Xilun is adding some DTensor tests in his PR that will end up testing this logic. If folks would prefer, though, I can try to add a test that uses another subclass instead that is maybe more basic.
This is the tlparse that his DTensor test generated for me: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/0196c1d3-a9a2-46ea-a46d-aa21618aa060/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151719
Approved by: https://github.com/ydwu4
Co-authored-by: drisspg <drisspguessous@gmail.com>
Summary:
# Why
speed up cutlass kernel generation and retrieval
# What
Using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. Only register the handler internally.
This is the OSS-only part of the change, to facilitate integration.
Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```
## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```
## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```
## prove that we can download errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```
## showing timing information
```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```
Reviewed By:
henrylhtsang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
There are a few considerations here:
1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying
https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404
2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine.
3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications.
4. There are two foot guns:
a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()).
b. The following sequence won't work as intended:
```
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.replay()
```
This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive.
I think these two foot guns are probably okay given that this is a bit of an experts' API.
Fixes #155106
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164
Approved by: https://github.com/ngimel
Followup on https://github.com/pytorch/pytorch/pull/125971
`self.register_buffer` will always be a bound method on the instance (`self`) while `torch.nn.Module.register_buffer` is an unbound class method. `is`-ing these two things will never yield `True`. Instead, let's check the [original function object](https://docs.python.org/3/reference/datamodel.html#method.__func__). Note that the current logic doesn't break anything because the `else` branch will still do the "right thing" in the case `register_buffer` hasn't been overridden, but it does mean we do less work!
Example demonstration:
```python
class Base:
def register_buffer(self, buffer):
pass
class InheritedOk(Base):
pass
class InheritedOverride(Base):
def register_buffer(self, buffer):
pass
b = Base()
ok = InheritedOk()
override = InheritedOverride()
print(f"b.register_buffer is Base.register_buffer: {b.register_buffer is Base.register_buffer}") # False
print(f"ok.register_buffer is Base.register_buffer: {ok.register_buffer is Base.register_buffer}") # False
print(f"override.register_buffer is Base.register_buffer: {override.register_buffer is Base.register_buffer}") # False
print(f"b.register_buffer.__func__ is Base.register_buffer: {b.register_buffer.__func__ is Base.register_buffer}") # True
print(f"ok.register_buffer.__func__ is Base.register_buffer: {ok.register_buffer.__func__ is Base.register_buffer}") # True
print(f"override.register_buffer.__func__ is Base.register_buffer: {override.register_buffer.__func__ is Base.register_buffer}") # False
```
(I can make an associated issue if needed, but didnt see it required [in the contributing guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#merging-your-change))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155963
Approved by: https://github.com/mikaylagawarecki
Fixes #134840
Added documentation to clarify padding size constraints for all padding modes in nn.modules.padding:
- Circular padding: size must be less than or equal to the corresponding input dimension
- Reflection padding: size must be less than the corresponding input dimension
- Replication padding: output dimensions must remain positive
These changes help prevent runtime errors when users attempt to use large padding values.
## PR Checklist
- [x] The PR title and message follow our [commit guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#commit-message-format)
- [x] The PR is made against the correct branch
- [x] The PR is labeled with `docathon`
- [x] The PR is labeled with `module: nn`
- [x] The PR is labeled with `documentation`
- [x] The PR description includes a reference to the issue being fixed
- [x] The PR includes tests if applicable
- [x] The PR includes documentation changes
- [x] The PR has been tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155618
Approved by: https://github.com/AlannaBurke, https://github.com/malfet
I realize I was passing stable::Tensors by value (thus making a copy every time) which is not what I want from the `from` function that converts Ts to StableIValues. `from` should not mutate the input and should be read-only.
I asked an LLM whether this is API BC breaking (with an intuition that it shouldn't be), and it said no, cuz:
1. "Passing by const reference is more permissive than passing by value. e.g., if T is a type that has a deleted or inaccessible copy constructor (e.g., std::unique_ptr), the original code would have been invalid, while the new code would be valid." Nice. We are good with additive.
2. We didn't modify the original input before (cuz we took a copy) and we don't now (cuz we promise const).
Update: The LLM failed to mention primitives, which we should not pass by reference, so we are only changing the signatures of std::optional<T> and stable::Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156126
Approved by: https://github.com/swolchok
ghstack dependencies: #155367, #155977
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.
We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().
If external=False, the cudaEventRecord and cudaStreamWaitEvent API's
have a different meaning described here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events
In short, they will be used to express fork and join operations in
the graph if external=False.
External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
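A minimal usage sketch of the timing use case described above, assuming the `external=True` constructor flag this PR adds (a sketch, not the exact test code):
```python
import torch

x = torch.randn(1024, device="cuda")
g = torch.cuda.CUDAGraph()

# external=True maps to cudaEventRecordExternal / cudaEventWaitExternal, so
# these records become explicit event nodes rather than fork/join edges.
start = torch.cuda.Event(enable_timing=True, external=True)
end = torch.cuda.Event(enable_timing=True, external=True)

with torch.cuda.graph(g):
    start.record()
    y = x * 2  # the sub-region of the graph we want to time
    end.record()

g.replay()
torch.cuda.synchronize()
print(start.elapsed_time(end))  # milliseconds spent in the bracketed region only
```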
Finishes #146145
I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
Summary: We ran into an interesting case where we saw a lot of mismatches, while many of those mismatches turned out to be full matches. The reason is that we use the dump ranks (which go from 0 to 79) to compare against the local PG ranks (0 to 7), which leads to false-positive mismatches. We can just check whether the dump ranks contain all expected ranks; that should be sufficient.
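A minimal sketch (hypothetical function name) of the relaxed check described above:
```python
def dump_covers_expected(dump_ranks, expected_ranks):
    # Instead of comparing the dumped ranks (0..79) against the local PG ranks
    # (0..7) directly, only require that the dump contains all expected ranks.
    return set(expected_ranks).issubset(dump_ranks)

assert dump_covers_expected(range(80), range(8))      # no false mismatch
assert not dump_covers_expected([0, 1, 2], range(8))  # genuinely missing ranks
```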
Test Plan:
Test with the failed case with the script and we now see the correct behavior + new unit test case.
Rollback Plan:
Differential Revision: D76775373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156
Approved by: https://github.com/VieEeEw
Differential Revision: [D76642615](https://our.internmc.facebook.com/intern/diff/D76642615/)
What do we expect to see when we run two identical matmul back to back? We expect to see the second one spending no time in precompilation, autotuning and prescreening.
However, the introduction of prescreening brings some non-determinism. Basically, we have:
1. prescreening of first matmul chooses a set of kernels to advance to autotuning
2. autotuning re-does the autotuning of the winners, potentially changing their timings a bit
3. second prescreening results in a slightly different set of kernels
4. since not all timings are present, an autotune is re-done.
With this diff:
```
SingleProcess AUTOTUNE benchmarking takes 3.8633 seconds and 134.7364 seconds precompiling for 32 choices and 24.4472 seconds prescreening
SingleProcess AUTOTUNE benchmarking takes 0.0003 seconds and 0.0027 seconds precompiling for 32 choices and 0.0006 seconds prescreening
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156144
Approved by: https://github.com/mlazos
This PR implements `Batched Eigen Decomposition` (syevD_batched) on ROCm by calling rocSolver directly.
cuSolver doesn't support syevD_batched and neither does hipSolver. Direct call to rocSolver is required.
`syevD_batched` will be used on ROCm if all the following conditions are met:
- `rocSolver version >= 3.26`
- input data type is `float` or `double`
- batch size >= 2
Otherwise, non-batched `syevD` will be used on ROCm (complex data types, batch size==1, rocSolver <3.26)
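A small sketch (hypothetical helper, mirroring the conditions listed above) of the dispatch decision:
```python
import torch

def use_syevd_batched(rocsolver_version, dtype, batch_size):
    # Batched rocSOLVER path only for new-enough rocSolver, real float types,
    # and an actual batch; everything else falls back to non-batched syevD.
    return (
        rocsolver_version >= (3, 26)
        and dtype in (torch.float32, torch.float64)
        and batch_size >= 2
    )

assert use_syevd_batched((3, 26), torch.float32, 4)
assert not use_syevd_batched((3, 25), torch.float64, 4)    # rocSolver too old
assert not use_syevd_batched((3, 26), torch.complex64, 4)  # complex -> fallback
```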
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154525
Approved by: https://github.com/Mellonta
## Update using Cutlass 3.x (2025/06/15)
Following @alexsamardzic's advice, I tried out the Cutlass 3.x API and it's impressive (the rated spec is 419 TFLOPS)
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91
This uses the same SM100 template. The only difference is
- Cluster size is fixed to `<1,1,1>` since sm120 does not have multicast feature
- ~~Tile size is fixed to `<128,128,128>` due to default kernel schedule does not support `<64,128,128>`. I will work a bit on improve perf for small M.~~ Fixed. Use `KernelTmaWarpSpecializedPingpong` when TileShape.M == 64
Perf for small M is still bad since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.
## Original using SM89
I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).
Simple benchmark script
```python
import torch
from torch._inductor.utils import do_bench_using_profiling
N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
    A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
    scale_A = torch.ones(M, 1).cuda()
    scale_B = torch.ones(1, N).cuda()
    out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
    out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
    torch.testing.assert_close(out, out_ref)
    latency_us = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
    tflops = (2 * M * N * K) / latency_us / 1e9
    print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```
M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73 TFLOPS
64 | 4096 | 4096 | 71.84 TFLOPS
256 | 4096 | 4096 | 86.40 TFLOPS
1024 | 4096 | 4096 | 112.12 TFLOPS
4096 | 4096 | 4096 | 121.24 TFLOPS
According to the [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is 419 TFLOPS. So the result is quite bad here...
However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13 TFLOPS
64|4096|4096|84.84 TFLOPS
256|4096|4096|96.75 TFLOPS
1024|4096|4096|110.21 TFLOPS
4096|4096|4096|122.98 TFLOPS
Small M slightly improves, but large M is still bad.
If I further change `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, which is taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73
This is much closer to the hardware limit, and it means this kernel is sufficient to get the most perf out of sm120; it only needs better-tuned configs.
To make sure this high perf is only obtainable with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also try using `ThreadblockSwizzleStreamK` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86
A bit better than current configs, but still very far away from hardware limit.
@alexsamardzic I noticed you chose these configs in #149978. Do you have any numbers on how the current configs perform on sm89?
Update: Using triton codegen-ed from inductor `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53
Better than the default configs, but still far from the config above for compute-bound shapes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
Sometimes a test file reports success according to pytest, but fails afterwards, and the rerun logic doesn't handle that correctly.
The name of the last-run test is saved in order to do more efficient reruns (target the last-run test for a rerun without rerunning the entire file). This is usually correct: if a test fails and pytest catches it, lastrun is the test that failed; if a test segfaults (which pytest doesn't catch), lastrun is the test that segfaulted. But sometimes pytest reports a success while the process exits with a non-zero exit code; the two cases I know of are hangs and double frees at exit. In this case it's unclear which test caused the failure, so lastrun is set to the first test that ran in that session, so that the next session starts from the beginning in an attempt to reproduce the error (an alternative would be to just fail and not rerun, which might be the better option). But then it reruns with runsingle, which prevents lastrun from being reset (not sure why; I'm pretty sure there's no difference between resetting and not resetting normally), so lastrun becomes the last test that ran, and it's not always true that lastrun is the one that caused the failure. Then on the next run, it starts from the last test and the process exits cleanly.
Short-term solution here: ensure lastrun is always set to the initial value if the session succeeds. This is correct even in the normal path, because the initial value shouldn't change in that case.
Things that still need to be fixed:
* log says "running single test" which is not true
* no xml reports get generated here
* also no xml reports get generated on segfault
* docs for this
I think I have a PR that fixes the above but its old so I need to take another look
Testing:
This is from when I was based on a commit that had a hang for Macs, before I added the skips in inductor array ref:
cc862d2c14
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155853
Approved by: https://github.com/malfet
Beforehand, shader recompilation updated `caffe2/aten/src/ATen/metallib_dummy.cpp`, but `torch_cpu` was dependent on `aten/src/ATen/metallib_dummy.cpp`.
Test plan: Run `python3 ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal` and observe that torch_cpu is being relinked
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156193
Approved by: https://github.com/manuelcandales
```
// The torch::stable::Tensor class is a highlevel C++ header-only wrapper around
// the C shim Tensor APIs. We've modeled this class after TensorBase, as custom
// op kernels only really need to interact with Tensor metadata (think sizes,
// strides, device, dtype). Other functions on Tensor (like empty_like) should
// live like the ATen op that they are and exist outside of this struct.
//
// There are several goals of this class over AtenTensorHandle and
// RAIIAtenTensorHandle:
// 1. torch::stable::Tensor is a nicer UX much closer to torch::Tensor than the
// C APIs with AtenTensorHandle. Under the hood we still call to these C shim
// APIs to preserve stability.
// 2. RAIIAtenTensorHandle boils down to a uniq_ptr that forces the user to pass
// around ownership. This makes it difficult to pass one input into 2
// different functions, e.g., doing something like c = a(t) + b(t) for
// stable::Tensor t. Thus, we use a shared_ptr here.
```
This PR:
- exemplifies the above
- adds test cases in libtorch_agnostic to make sure the file actually works
- includes the results of a battle with template specialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155367
Approved by: https://github.com/albanD
Summary:
This PR adds support for torch.cuda.FloatTensor and friends in Dynamo.
These are indeed legacy APIs, but that doesn't stop us from adding
support for them in torch.compile.
I add support for these in the same way that we support torch.Tensor:
these APIs can be safely put into the Dynamo graph.
Fixes #130722
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156107
Approved by: https://github.com/williamwen42
Triton's PR 7054 modifies the builtins to take _semantic as a kwarg instead of _builder.
To handle this, this PR checks the signature of tl.core.view (to see if it takes _builder or _semantic), and adds a wrapper converting _semantic to _builder if the new _semantic kwarg is being used.
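A minimal sketch of that signature-check-plus-wrapper idea, under stated assumptions: `make_compat_wrapper` is an illustrative name, not the actual inductor helper, and only `inspect.signature` is relied on.
```python
import inspect

def make_compat_wrapper(builtin):
    # If the Triton builtin still takes the old `_builder` kwarg, wrap it so
    # callers can pass the new `_semantic` kwarg and have it forwarded as
    # `_builder`; otherwise return the builtin unchanged.
    params = inspect.signature(builtin).parameters
    if "_builder" not in params:
        return builtin

    def wrapped(*args, _semantic=None, **kwargs):
        return builtin(*args, _builder=_semantic, **kwargs)

    return wrapped
```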
(Previously-)failing test: `python test/inductor/test_cooperative_reductions.py -k test_welford_non_power_of_2_rsplit_persistent_True_x_9_r_8000_rsplit_37`
Differential Revision: [D76801240](https://our.internmc.facebook.com/intern/diff/D76801240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156031
Approved by: https://github.com/NikhilAPatel
In recent versions NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with all the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers and thus accelerate communication and reduce resource footprint of NCCL's kernels (which reduces contention).
This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool to allow caching and reusing.
FSDP2 is also particularly suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it anyways needs to (de)interleave the parameters to/from these private buffers.
This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/syed-ahmed
The rank-to-global-rank exchange is a major overhead in `NVSHMEMSymmetricMemory` creation.
We should cache its result on a per-group basis.
Before this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 18
```
After this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 1
```
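A minimal sketch of the per-group caching, with illustrative names (this is not the actual `NVSHMEMSymmetricMemory` code):
```python
# Cache keyed by process-group name so the expensive exchange runs once per group.
_exchange_cache: dict[str, list[int]] = {}

def cached_rank_exchange(group_name: str, do_exchange):
    # do_exchange() performs the rank-to-global-rank exchange for this group.
    if group_name not in _exchange_cache:
        _exchange_cache[group_name] = do_exchange()
    return _exchange_cache[group_name]
```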
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156116
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971, #155975
Currently, we do this a bit hackily: we look at all the .o files we have from this session and add them all to AOTI. This doesn't work, for example, if we do multiple AOTI compilations in one session without clearing the inductor cache.
I also want to change how the cutlass .so files are compiled, hence this change.
This change is broken down since @coconutruben is trying to make a change to the same files too.
Differential Revision: [D76563003](https://our.internmc.facebook.com/intern/diff/D76563003/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155875
Approved by: https://github.com/ColinPeppler
Summary:
Moves OpKernel base class to PyTorch core. It is an abstract interface representing a kernel, which is responsible for executing a single Node in the graph.
Torch Native Runtime RFC: pytorch/rfcs#72
Test Plan:
buck2 run mode/dev-nosan caffe2/test/cpp/nativert:op_kernel_test
Rollback Plan:
Differential Revision: D76525939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156011
Approved by: https://github.com/zhxchen17
Summary:
When tensor size changes are detected under `dynamic=False`, this overwrites the PGO state with the newest static shapes to reflect the latest frame state, instead of updating automatic dynamic.
A longer term solution, if we move to shared PGO state between multiple jobs, would be to update automatic dynamic, but avoid suggesting/logging the whitelist (compiling with `dynamic=False` should already override any dynamic PGO that's read, so we're fine there). This way if any particular job runs with `dynamic=False`, it won't statically overwrite the entire PGO state if it's shared with many other jobs.
Test Plan:
test/dynamo/test_pgo.py
Rollback Plan:
Differential Revision: D76630499
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155961
Approved by: https://github.com/bobrenjc93
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:
1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires a recompile, force everyone to recompile (see the sketch after this list)
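A minimal sketch of that collective check, assuming an initialized process group (device/backend handling and the real dynamo integration are omitted):
```python
import torch
import torch.distributed as dist

def anyone_needs_recompile(found_compiled_code: bool, group=None) -> bool:
    # MAX-reduce a 0/1 flag: the result is 1 iff at least one rank missed the
    # compiled-code lookup, in which case every rank recompiles.
    flag = torch.tensor([0 if found_compiled_code else 1])
    dist.all_reduce(flag, op=dist.ReduceOp.MAX, group=group)
    return bool(flag.item())
```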
One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.
I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
Fixes #128796
This PR adds documentation about the behavior of division by zero operations in PyTorch's autograd system. The documentation explains:
1. How division by zero produces `inf` values following IEEE-754 floating point arithmetic
2. How autograd handles these cases and why masking after division can lead to `nan` gradients
3. Provides concrete examples showing the issue
4. Recommends two solutions:
- Masking before division
- Using MaskedTensor (experimental API)
The documentation is added to the autograd notes section, making it easily discoverable for users who encounter this common issue.
This addresses the original issue #128796 which requested better documentation of this behavior to help users avoid common pitfalls when dealing with division by zero in their models.
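A small self-contained illustration of the pitfall and the recommended fix (this mirrors the behavior the note describes, not the exact snippet added to the docs):
```python
import torch

# Masking after the division: 1/0 -> inf enters the graph, and the gradient is nan.
x = torch.tensor([0.0, 1.0], requires_grad=True)
y = 1.0 / x
y = torch.where(x == 0, torch.zeros_like(y), y)
y.sum().backward()
print(x.grad)  # tensor([nan, -1.])

# Masking before the division: the zero denominator never participates.
x2 = torch.tensor([0.0, 1.0], requires_grad=True)
safe = torch.where(x2 == 0, torch.ones_like(x2), x2)
y2 = torch.where(x2 == 0, torch.zeros_like(x2), 1.0 / safe)
y2.sum().backward()
print(x2.grad)  # tensor([0., -1.])
```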
Additional changes:
- Fixed formatting consistency by replacing curly apostrophes with straight apostrophes in the existing documentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155987
Approved by: https://github.com/soulitzer
Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
This PR changes compiled autograd's handling of gradient accumulation, by proxying it as a `call_accumulate_grad`, which does the .grad mutation in python bytecode for dynamo to see. For eager, the only change is the leaf invariant check was moved up.
Before:
- Compiled Autograd Engine: proxies call to inductor accumulate_grad op
- Dynamo: polyfills the inductor accumulate_grad op (not respecting all of the accumulateGrad implementation e.g. sparse, gradient layout contract)
```python
new_grad_strided: "f32[s21]" = torch.empty_like(getitem_1); getitem_1 = None
copy_: "f32[s21]" = new_grad_strided.copy_(aot3_tangents_1); copy_ = None
```
- AOTAutograd: functionalizes the copy_
After:
- Compiled Autograd Engine: proxies call to `call_accumulate_grad`, which calls `torch._dynamo.compiled_autograd.ops.AccumulateGrad`/`AccumulateGrad_apply_functional_no_hooks_ivalue`, similar to other functional autograd implementations, but also sets .grad from python. Hooks are still handled separately from this call.
- Dynamo: `torch._dynamo.compiled_autograd.ops.AccumulateGrad` was allow_in_graph'd
- AOTAutograd: traces into the op, with FunctionalTensors.
While functionalizing the tensors, we insert an autograd Error node to ensure that we don't use the autograd meta from tracing. This clashes with the "leaf variable has been moved into the graph interior" error check, I could not find a way to identify a FunctionalTensor subclass from C++, so I bypass that for Error nodes in the compiled case.
In the CI PR, this fixes 19 tests relating to sparse tensors, and more are hidden by an earlier failure in dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155521
Approved by: https://github.com/jansel
Delays code generation for arguments to fallback ops. This is inspired by #155642, and likely fixes similar memory leaks.
Additionally, prepare for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:
1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.
In jit tests:
- Add and use a common raise_on_run_directly method for when a user directly runs a test file that should not be run that way; it prints the file the user should have run instead (see the sketch after this list).
- Raise a RuntimeError on tests which have been disabled (not run)
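A hedged sketch of what such a helper could look like; its real location (likely under torch.testing._internal) and exact message are assumptions here:
```python
def raise_on_run_directly(file_to_run: str) -> None:
    # Called from a test file's `if __name__ == "__main__":` block when the
    # file is only meant to be run through another entry point.
    raise RuntimeError(
        "This test file is not meant to be run directly; "
        f"run {file_to_run} instead."
    )

# Typical use at the bottom of a JIT test file:
# if __name__ == "__main__":
#     raise_on_run_directly("test/test_jit.py")
```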
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725
Approved by: https://github.com/clee2000
Tests added:
```
python test/inductor/test_triton_kernels.py -k test_on_device_tma
python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma
python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma
```
These pass on Triton 3.3 but not yet on Triton 3.4 (note: to support tests for both Triton versions, there's two triton kernels - one for old api and one for new api - and a given version of the test will only run if that version of the API is available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827
Approved by: https://github.com/FindHao
ghstack dependencies: #155777, #155814
This PR adds JIT inductor support for user-defined triton kernels using the new host-side TMA api.
* handle TensorDescriptor.from_tensor in ir.py
* codegen TensorDescriptor.from_tensor in wrapper.py
* generate the right signature for functions that take TensorDescriptor arguments (i.e. in the @triton_heuristics.user_autotune decorator)
AOTI support is not implemented yet.
Tests: ran test_triton_kernels.py w/ both Triton 3.3 and 3.4 and there were no failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155814
Approved by: https://github.com/aakhundov
ghstack dependencies: #155777
This adds support for user-defined triton kernels using TensorDescriptor.from_tensor into triton_kernel_wrap: i.e. storing metadata about the TMA descriptors and doing mutation analysis.
Major changes:
* TMADescriptorMetadata has changed: previously it was a dict[str, tuple[list[int], list[int], int]]. But now there are two metadata formats: one for experimental API and one for stable API. Now the metadata format is dict[str, tuple[str, tuple[...]]], where tuple[...] is tuple[list[int], list[int], int] for experimental and tuple[list[int],] for stable API. And then most handling of the metadata has to be branched based on whether the metadata represents a stable or experimental TMA descriptor
* mutation analysis: unlike experimental TMA (where the mutation analysis / ttir analysis pretends that the TMA descriptor is actually just a tensor), we need to construct an actual TMA descriptor before getting the Triton frontend to create the TTIR (otherwise assertions fail). A TensorDescriptor (i.e. stable TMA API descriptor) passed into a python triton kernel actually turns into 1 + 2*N parameters in the TTIR (for a rank-N tensor), so the arg list also needs to be patched for this reason (in generate_ttir)
* mutation analysis: now we also need to pass tma_descriptor_metadata into the mutation analysis, in order to create the TMA descriptors that are passed into the frontend code (ie. the previous point). This is why all the mutation tests are modified with an extra return value (the tma_descriptor_metadata)
Inductor is not modified (Inductor just errors out if you use a stable API tma descriptor). This will be the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155777
Approved by: https://github.com/aakhundov
Fixes #155048
The behavior of `min` and `max` were changed in #43519. The note about gradient behavior in torch.amin and torch.amax docs are updated to reflect this change:
New note:
`amax, amin, max(dim), min(dim) evenly distributes gradient between equal values
when there are multiple input elements with the same minimum or maximum value.`
cc - @spzala @svekars @soulitzer @sekyondaMeta @AlannaBurke @ezyang @gqchen @nikitaved @Varal7 @xmfan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155071
Approved by: https://github.com/soulitzer
Fixes #155688
Root Cause:
in [`torch/_inductor/index_propagation.py`](f151b20123/torch/_inductor/index_propagation.py (L57-L68))
When creating a `TypedExpr` from an `Identity` (a `torch.utils._sympy.functions.Identity`, not a `sympy.matrices.expressions.Identity`) and the inner value of the identity, `Identity.args[0]`, is any torch int type, the `TypedExpr.__post_init__` method tries to cast the Identity object to a Python `int`. This is where the `TypeError` from the issue was raised, because Identity does not know how to cast itself to an `int`.
Fix:
Define `__int__` method for `torch.utils._sympy.functions.Identity`.
The same applies to `float`.
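A standalone sketch of the shape of the fix (the real class lives in `torch.utils._sympy.functions`; this toy version only shows the added casts):
```python
import sympy

class Identity(sympy.Function):
    """Wrapper that keeps its argument from being simplified away."""

    def __int__(self) -> int:
        # Cast through the wrapped value, e.g. int(Identity(5)) == 5.
        return int(self.args[0])

    def __float__(self) -> float:
        return float(self.args[0])
```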
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155873
Approved by: https://github.com/williamwen42
While looking into a case where an FR dump (the actual dump, not the monitoring thread) takes 30 minutes, I realized that our global write lock is grabbed too early, so the second attempt to dump FR without stack traces fails with a deadlock because the global write lock is still held. We should only grab the lock when we are ready to write, so that we are less likely to hold it forever. I also audited the locks within FR and found one more place where the critical section can be shrunk.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155949
Approved by: https://github.com/Skylion007
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.
This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically,
- it introduces size_vars.expect_true and size_vars.check.
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals
I am also seeing a couple of wrong usages that I will fix in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155776
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154774
Reland of #153153, which was incidentally closed.
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as CUDA::nvperf_host, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (whose currently shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
Summary: #buildall
Test Plan: CI
Differential Revision: D74582970
When we decompose to inference IR, aten.to can sometimes disappear. As a result, the export module call graph tree will start containing dead nodes because the previous provenance tracking is insufficient. This PR fixes that. The caveat is that this won't work in general for tensor subclass inputs to a submodule whose signature the user wants to preserve, because we always desugar the tensor subclass into its constituent tensors in inference IR, making it impossible to preserve the original calling convention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153972
Approved by: https://github.com/avikchaudhuri
Fixes #154328
**Summary**
Fail reason:
The input value is infinity in float, and converting it to int64_t is undefined behavior. On x86, it is converted to the minimum value of int64_t, which is not expected.
Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
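A hedged Python sketch of that ordering (the real kernel is C++ in ATen); clamping first means an infinite input collapses to `quant_min`/`quant_max` before the integer conversion:
```python
def fake_quantize_to_int(x: float, inv_scale: float, zero_point: float,
                         quant_min: int, quant_max: int) -> int:
    v = x * inv_scale + zero_point          # may be +/-inf when x is infinite
    v = min(max(v, quant_min), quant_max)   # clamp before the conversion
    return int(round(v))                    # now always within [quant_min, quant_max]
```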
**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Added a new configuration option `cutlass_enabled_ops` that allows users to control which operations use CUTLASS lowerings. By default, CUTLASS is enabled for all operations (maintaining backward compatibility), but users can now selectively enable it only for specific operations to optimize compilation time.
**Fixes #155718**
## Usage Examples
```bash
# Enable CUTLASS for all operations (default behavior)
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="ALL"
# Enable CUTLASS only for matrix multiplication operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="mm,addmm"
# Enable CUTLASS only for batch operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="bmm,baddbmm"
# Disable CUTLASS for all operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS=""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155770
Approved by: https://github.com/henrylhtsang
Changes:
- Remove old invalid settings and replace with new settings.
- Add commonly used VS Code extensions to support `cmake`, `ruff`, `mypy`, `flake8`, `editorconfig`, and spell checker. Also, add corresponding settings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152760
Approved by: https://github.com/drisspg
Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488. Ahead of this, we are **replacing the experimental TMA API usage with the stable TMA API** in flex attention. This means that **flex attention TMA will stop working with Triton 3.2 or Triton 3.3/3.3.1** for now (but it should work for Triton 3.4 in the PyTorch 2.8 release, and Meta-internal triton 3.3.1fb, which have the new TMA API).
This PR does the following:
* replace the experimental TMA APIs with the stable TMA APIs
* remove the workspace args.
Testing: I ran test/inductor/test_flex_attention.py on a H100 with @mandroid6's PR #153662 patched in to turn on TMA [TODO: confirm results once all the local tests pass, but from the first 100 tests I ran locally, all the failing tests were also failing on #153662 alone]
Note: When #153662 lands, turning on TMA support by default, it should be checking specifically for stable TMA API support (commented on PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155771
Approved by: https://github.com/mandroid6, https://github.com/nmacchioni
Previous to this PR, the exporter does not support dynamic dim with traced inputs containing 0/1. But after https://github.com/pytorch/pytorch/pull/148696, this is supported by torch.export.export. This PR adds the patch to torch.onnx.export.
However, a known pitfall still exists because of the difference between eager and export: the compiler needs to decide the exported shape ahead of time, and whether the "hidden broadcasting" is applied results in a different export.
For example,
```python
import torch
class Model(torch.nn.Module):
    def forward(self, x, y, z):
        return torch.cat((x, y), axis=1) + z
model = Model()
x = torch.randn(2, 3)
y = torch.randn(2, 5)
z = torch.randn(1, 8)
model(x, y, z)
DYN = torch.export.Dim.DYNAMIC
ds = {0: DYN, 1: DYN}
with torch.fx.experimental._config.patch(backed_size_oblivious=True):
    ep = torch.export.export(model, (x, y, z), dynamic_shapes=(ds, ds, ds))
print(ep)
"""
ExportedProgram:
class GraphModule(torch.nn.Module):
def forward(self, x: "f32[s7, s16]", y: "f32[s7, s43]", z: "f32[s7, s16 + s43]"):
#
sym_size_int: "Sym(s7)" = torch.ops.aten.sym_size.int(x, 0)
sym_size_int_1: "Sym(s16)" = torch.ops.aten.sym_size.int(x, 1)
sym_size_int_2: "Sym(s7)" = torch.ops.aten.sym_size.int(y, 0)
sym_size_int_3: "Sym(s43)" = torch.ops.aten.sym_size.int(y, 1)
sym_size_int_4: "Sym(s7)" = torch.ops.aten.sym_size.int(z, 0)
sym_size_int_5: "Sym(s16 + s43)" = torch.ops.aten.sym_size.int(z, 1)
# File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
cat: "f32[s7, s16 + s43]" = torch.ops.aten.cat.default([x, y], 1); x = y = None
#
eq: "Sym(True)" = sym_size_int_2 == sym_size_int; sym_size_int_2 = None
_assert_scalar_default = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s58, s35) on node 'eq'"); eq = _assert_scalar_default = None
add_1: "Sym(s16 + s43)" = sym_size_int_1 + sym_size_int_3; sym_size_int_1 = sym_size_int_3 = None
eq_1: "Sym(True)" = add_1 == sym_size_int_5; add_1 = sym_size_int_5 = None
_assert_scalar_default_1 = torch.ops.aten._assert_scalar.default(eq_1, "Runtime assertion failed for expression Eq(s16 + s43, s23) on node 'eq_1'"); eq_1 = _assert_scalar_default_1 = None
eq_2: "Sym(True)" = sym_size_int == sym_size_int_4; sym_size_int = sym_size_int_4 = None
_assert_scalar_default_2 = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(s35, s7) on node 'eq_2'"); eq_2 = _assert_scalar_default_2 = None
# File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
add: "f32[s7, s16 + s43]" = torch.ops.aten.add.Tensor(cat, z); cat = z = None
return (add,)
Graph signature:
# inputs
x: USER_INPUT
y: USER_INPUT
z: USER_INPUT
# outputs
add: USER_OUTPUT
Range constraints: {s7: VR[0, int_oo], s16: VR[0, int_oo], s43: VR[0, int_oo], s16 + s43: VR[0, int_oo]}
"""
ep.module()(x, y, z)
"""
Traceback (most recent call last):
File "/home/titaiwang/pytorch/test_export.py", line 20, in <module>
ep.module()(x, y, z)
File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 840, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 416, in __call__
raise e
File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 403, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1873, in _call_impl
return inner()
^^^^^^^
File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1800, in inner
args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/_dynamo/eval_frame.py", line 895, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/export/_unlift.py", line 83, in _check_input_constraints_pre_hook
_check_input_constraints_for_graph(
File "/home/titaiwang/pytorch/torch/_export/utils.py", line 426, in _check_input_constraints_for_graph
_check_symint(
File "/home/titaiwang/pytorch/torch/_export/utils.py", line 338, in _check_symint
raise RuntimeError(
RuntimeError: Expected input at *args[2].shape[0] to be equal to 2, but got 1
"""
```
The explanation (from @pianpwk):
In the model we have `return torch.cat((x, y), axis=1) + z`.
Before this add is executed, the LHS has shape `[s7, s16 + s43]`, while z has shape, say, `[s8, s16 + s43]` (we don't yet know that `s7 == s8`). When we execute this add, the compiler makes a decision: does broadcasting apply or not? The choices are:
1) Yes -> then we must specialize `s8` to 1
2) No -> then this element-wise op is only valid if the shapes match up, and we assume `s7 == s8`.
Unfortunately export can only follow one of these options, and in avoiding 0/1 specialization (because a dynamic dimension was requested), it assumed case 2).
For an operation like a + b, in eager semantics it's possible to have all options (either a == 1 OR b == 1 OR a == b), but with export we need to make a decision on what the output shape of this operation is, and keeping all branches alive requires expressing the output shape with a conditional (e.g. output shape = `a if b == 1 else b`), which is pretty hard for the compiler to reason about.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155717
Approved by: https://github.com/justinchuby
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
Fixes #155028
This pull request updates the documentation by transitioning from .rst to .md format. It introduces new Markdown files for the documentation of named_tensor, nested, nn.attention.bias, nn.attention.experimental, and nn.attention.flex_attention
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155696
Approved by: https://github.com/svekars
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
`use_max_autotune()` is likely not what people expect it to be;
Originally, `use_max_autotune()` was set up to decide when we should include Triton templates as choices in GEMM autotuning. As expected, `use_max_autotune()` is True if `max_autotune=True` or `max_autotune_gemm=True`. However, with the addition of the offline GEMM autotuning cache two years ago, `use_max_autotune()` is also True when `search_autotune_cache=True`; in that case, though, `search_autotune_cache=True` should never trigger autotuning.
Over time, people have used `use_max_autotune()` likely without realizing that this gives unexpected behavior if `search_autotune_cache=True`. We could rename the method to be clearer, but we prefer to phase it out entirely for maximal clarity.
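Roughly, the surprise boils down to something like the following (a sketch following the description above, not the verbatim inductor code):
```python
def use_max_autotune(config) -> bool:
    return (
        config.max_autotune
        or config.max_autotune_gemm
        or config.search_autotune_cache  # cache lookup only -- should not trigger autotuning
    )
```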
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155847
Approved by: https://github.com/jingsh, https://github.com/masnesral
Summary: The definition of `ExternKernelNode` and `ExternKernelNodes` schema in `torch/_export/serde/aoti_schema.py` is a complete duplicate of the ones in `torch/_export/serde/schema.py`.
Test Plan:
CI
Rollback Plan:
Differential Revision: D76558294
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155867
Approved by: https://github.com/jingsh
Fixes #148950.
During graph construction, when running the add node under the [interpreter](/github.com/pytorch/pytorch/blob/d68d4d31f4824f1d1e0d1d6899e9879ad19b0754/torch/fx/interpreter.py#L301), the functional argument of the conjugated complex tensor gets cloned. This results in *.is_conj()* always evaluating to false in the decomposition function.
The proposed fix calls resolve_conj() in the decomposition of complex-tensor add.
Test as below:
`python test/dynamo/test_repros.py ReproTests.test_add_complex_conj`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153945
Approved by: https://github.com/jansel
Fixes #155036
This pull request updates the documentation for several modules by transitioning from .rst to .md format, improving readability and usability. It introduces new Markdown files for the documentation of torch.ao.ns._numeric_suite, torch.ao.ns._numeric_suite_fx, AOTInductor, AOTInductor Minifier, and the torch.compiler API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155377
Approved by: https://github.com/svekars
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Raise `NotImplementedError` when an op is unimplemented for a specific dtype, which makes more sense than a `RuntimeError`.
Example
```python
>>> import torch
>>> torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
NotImplementedError: "hardshrink_cpu" not implemented for 'Long'
```
release notes bc-breaking: After this release, a `NotImplementedError` exception will be raised when an ATen operation is called on a combination of input tensor dtypes it has not been implemented for.
Mark a few more unary ops as unimplemented to satisfy foreach testing error-reporting consistency between CPU and CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155470
Approved by: https://github.com/albanD, https://github.com/Skylion007
I guess this is more of an RFC
Goal:
Enable keep-going so that we can get information about failures immediately. We want to be aware of failures as soon as possible, especially on the main branch, so that reverts can happen quickly.
Proposal:
A job with `keep-going` will continue through errors in `python run_test.py`. If a test fails, before it runs the next test, it will upload a fake log that should have enough information in it that viewing the log tells you what failed and any stack traces/error logs, and that can be parsed by the log classifier to get a line.
I am getting the log by concatenating the test logs in test/test-reports, which is all the text outputted by pytest (unless someone runs with the `ci-verbose-test-logs` label). There are obviously many things this won't catch, e.g. output outside of run_test.py and some output inside of run_test.py, but it should be enough.
After a job finishes, its raw log is eventually uploaded to the ossci-raw-job-status s3 bucket and the log classifier reads it to do classification. This means we will have to change the log classifier to read from this bucket as well.
I'm thinking just add an input parameter to log classifier like https://github.com/pytorch/test-infra/pull/6723/files
Also upload the temp results to a temp attribute instead of the real one
To overwrite the conclusion on HUD, I'm thinking a lambda that is s3 put trigger on the fake log being put into s3, that does something similar to log classifier where it just mutates the entry 13a990b678/aws/lambda/log-classifier/src/network.rs (L85) to add a new field like "will_fail": true, and also triggers the log classifier to run
Then we change HUD/ClickHouse to point the raw log url to the alternate place, the new "will_fail" field as the conclusion, and the temp log classifier result if needed
Why always write to a temp attribute/column? I am unsure about overwriting the real results with fake ones.
Pros:
Not many changes outside of HUD/UI
Cons:
Lots of moving parts, lots of temp fields that will require adjustment for queries, temp fields never really get deleted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155371
Approved by: https://github.com/malfet
Addressing #154890
Not really a proper fix but at least it's more informative than the current crash.
For a more long term solution I'm testing if we can use the TopK API released in MacOS14 as it does not have the same MPSScan op issue that the Sort and ArgSort are hitting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155475
Approved by: https://github.com/kulinseth
Fixes https://github.com/pytorch/pytorch/issues/155023
Related PR: #155781
Description:
As discussed, this PR is a follow-up update for `jit_language_reference_v2.md` by deleting the code chunk indentation.
Checklist:
- [x] The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.
@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label "module: docs"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155937
Approved by: https://github.com/jingsh, https://github.com/svekars
Summary: Change the HF classes to not have an underscore, thereby making them public; we will add documentation to them following this.
Test Plan:
ensure existing tests pass
Rollback Plan:
Differential Revision: D76364024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155837
Approved by: https://github.com/saumishr
By extracting both the monitor thread and the watchdog thread into a separate class, we can learn what dependencies each thread has, and it simplifies the consolidation work for each thread (consolidating from one thread per PG instance to one per PG class).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155831
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
Fixes https://github.com/pytorch/pytorch/issues/155023
Description:
converted `jit_language_reference_v2.rst` to `jit_language_reference_v2.md`
**I indented the code blocks to minimize the file difference to pass the sanity check for no more than 2000 lines of change. I will submit another PR to fix the indentation after this PR is merged.**
Checklist:
- [x] The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.
@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155781
Approved by: https://github.com/svekars
Introduce `c10::metal::remainder` and call it from both the inductor and eager implementations, with integer specialization, which should make it much faster than before while staying compliant with the Python way of rounding negative numbers.
This allows one to remove the complex type-detection logic from the MPS codegen and rely on the Metal (C++) type system to figure out input and output types.
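For reference, the "Python way" of rounding means the remainder follows the sign of the divisor (floored division), unlike C-style truncation; a quick illustration:
```python
def python_style_remainder(a: int, b: int) -> int:
    # Equivalent to a - b * floor(a / b), which is what Python's % computes.
    return a - b * (a // b)

assert python_style_remainder(-3, 5) == 2       # C-style truncation would give -3
assert python_style_remainder(-3, 5) == -3 % 5
```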
This fixes compilation of something like
```python
@torch.compile
def f(x, y):
    return x[y % 5]
```
which beforehand failed to compile with
```
torch._inductor.exc.InductorError: SyntaxError: failed to compile
#include <c10/metal/utils.h>
kernel void generated_kernel(
device float* out_ptr0,
constant long* in_ptr0,
constant float* in_ptr1,
uint xindex [[thread_position_in_grid]]
) {
int x0 = xindex;
auto tmp0 = in_ptr0[x0];
auto tmp1 = 12;
auto tmp2 = static_cast<float>(tmp0) - static_cast<float>(tmp1) * metal::floor(static_cast<float>(tmp0) / static_cast<float>(tmp1));
auto tmp3 = 1024;
auto tmp4 = static_cast<long>(tmp3);
auto tmp5 = tmp2 + tmp4;
auto tmp6 = tmp2 < 0;
auto tmp7 = tmp6 ? tmp5 : tmp2;
if ((tmp7 < 0) && (tmp7 > 1024)) return;
auto tmp9 = in_ptr1[tmp7];
out_ptr0[x0] = static_cast<float>(tmp9);
}
with program_source:372:28: error: array subscript is not an integer
auto tmp9 = in_ptr1[tmp7];
^~~~~
```
This fixes fail_to_compile for GPT2ForSequenceClassification Huggingface model using `transformers==4.44.2`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155891
Approved by: https://github.com/manuelcandales
Summary: guard_size_oblivious has side effects that result in invalid strides when slice nodes take a negative index on dynamic input shapes.
This caused an overflow error with a huge number, "9223372036854776048".
Test Plan: CIs should pass.
Differential Revision: D74354663
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153131
Approved by: https://github.com/laithsakka
Meta:
`fbsource//xplat/caffe2:gen_torch_vulkan_spv_cpp` takes on average 2 min to build and is one of the slowest targets in fbandroid.
See: https://fb.workplace.com/groups/2840058936242210/posts/4067730240141734
This target had to run locally because it uses the manifold backend for dotslash. This diff moves `glslc` to the cas backend so that it can run on RE.
Here are commands executed:
```
% manifold get dotslash_glslc/flat/glslc-linux-x86_64.tar.gz
% manifold get dotslash_glslc/flat/glslc-macos-v2024_4.tar.gz
% manifold get dotslash_glslc/flat/glslc-windows-v2024_3.tar
% ls
-rw-r--r-- 1 navidq staff 2.0M Jun 12 10:02 glslc-linux-x86_64.tar.gz
-rw-r--r-- 1 navidq staff 4.7M Jun 12 10:03 glslc-macos-v2024_4.tar.gz
-rw-r--r-- 1 navidq staff 4.4M Jun 12 10:03 glslc-windows-v2024_3.tar
% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-linux-x86_64.tar.gz
ea5d674e0e7e9782be3f5c309e3484732e5b3a331cbe3258f3e929002811627b:2072937
% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-macos-v2024_4.tar.gz
1331dc691835e4676832b7c21ef669083a3acc8856981583d0698192f466c51a:4898649
% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-windows-v2024_3.tar
76181fbb1ce5c62d0c905db26df3a64e999d0baff2e93270775921daa91e3a1a:4585984
```
Differential Revision: [D76513735](https://our.internmc.facebook.com/intern/diff/D76513735/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155832
Approved by: https://github.com/GregoryComer
This PR implements a basic interface and test for PrecompileContext, a special CacheArtifactManager specifically designed for precompile. The job of a PrecompileContext is to record things precompile needs as torch is compiling, dump it all into bytes, and then stitch it back together into a cache of callables.
## Why use CacheArtifactManager?
Precompile needs a way to record various serializable data as torch is compiling. CacheArtifactManager already does this today pretty well, handling a lot of serialization and cache information. So we're reusing a bunch of that infrastructure directly.
## How is it different from CacheArtifactManager?
Unlike regular CacheArtifactManager, PrecompileContext needs to be able to take the recorded artifacts and stitch them together after deserialization, to create a single working callable.
Since PrecompileContext doesn't need the cache keys, the "key" field of PrecompileArtifacts can be used for metadata relating to how to stitch the individual functions being compiled together into a full callable. For example, on a given dynamo compile, if there are multiple functions (via graph breaks or recompiles) being compiled, MegaCache would organize it like so:

Whereas we'd visualize PrecompileContext's result like so:

For now, we just handle eager mode; in the diff above, I'll hook up the other backend artifacts from PrecompileContext.
After this PR, precompile consists of three main interfaces:
### CompilePackage
- Everything needed to run one torch.compile'd function (including graph breaks)
- `__init__(fn, cache_entry)` Initializes with a DynamoCacheEntry
- `install(backends)` load precompile artifacts into function's dynamo state with a dictionary of backends
- `cache_entry()` return a serializable cache entry to save
### DynamoStore
- Responsible for tracking CompilePackages on disk (and/or in memory)
- `load_package(path)`: load a package given a torch compiled function and a path to the cache artifact
- `save_package(package, path)`: Save a CompilePackage to a path. Calls PrecompileContext to grab backend data.
- `record_package(package)`: Record a package to PrecompileContext (for global serialization/deserialization)
### PrecompileContext
- Overarching context for serializing and deserializing precompile artifacts. Supports **global** and **local** setups.
- `serialize()`: (Global) serializes all artifacts in PrecompileContext into bytes
- `populate_caches(bytes)`: (Global) takes serialized bytes and puts them into DynamoStore (TODO)
- `serialize_artifact_by_key(key)`: (Local) serialize a single artifact by its cache key
<img width="1455" alt="image" src="https://github.com/user-attachments/assets/99b61330-7607-4763-bdbc-85b366e82cdd" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154415
Approved by: https://github.com/zhxchen17
ghstack dependencies: #155118
Adding a per torch.compile() object CompilePackage which tracks dynamo artifact. CompilePackage is considered a low level component and should not be directly exposed to end users. It has the following interface:
1. `CompilePackage.__init__()` which optionally takes previously serialized dynamo states.
a. when the `dynamo` argument is None, it will construct a brand new CompilePackage object.
b. when the `dynamo` argument is not None, it will load a pre-compiled dynamo state.
2. `package.save()` which dumps the dynamo states into _DynamoCacheEntry.
3. `package.install(backends)` which will handle all the side-effectful global scope updates with compiled functions and resume functions.
This diff focuses on the low-level mechanism for precompile. It is left to upper-level interfaces to use these APIs to build a more user-facing frontend.
Differential Revision: [D75956538](https://our.internmc.facebook.com/intern/diff/D75956538/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155118
Approved by: https://github.com/jamesjwu
Co-authored-by: James Wu <jjwu@meta.com>
The op test__weight_int4pack_mm_with_scales_and_zeros is for Intel GPU. It is functionally equivalent to the CUDA/CPU op test__weight_int4pack_mm (with the constraint that oneDNN only supports integer zero points, which is why we need this API). Since test__weight_int4pack_mm is already included in AOTI's fallback list, this PR adds support for XPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155780
Approved by: https://github.com/jansel
Previously specialization error messages would render sources that were pretty far from source-code names. E.g., given args named `x, y, zs`, the source for `y.size()[0]` would be rendered as `args[0][1].size()[0]`.
This is because we created artificial local names following `(args, kwargs)` structure instead of reusing signatures. This PR fixes that situation.
Basically we map prefixes of key paths that correspond to original arg names to root sources corresponding to those names; the rest of the key paths hang from these root sources.
Differential Revision: [D76461391](https://our.internmc.facebook.com/intern/diff/D76461391/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155738
Approved by: https://github.com/bobrenjc93
Fixes #136849
## Test Result
```python
>>> import torch
>>> device = torch.cuda.device_count() + 1
>>> torch.cuda.current_stream(device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1083, in current_stream
streamdata = torch._C._cuda_getCurrentStream(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)
>>> torch.cuda.default_stream(device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1101, in default_stream
streamdata = torch._C._cuda_getDefaultStream(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)
>>> torch.cuda.set_per_process_memory_fraction(0.5, device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/memory.py", line 193, in set_per_process_memory_fraction
torch._C._cuda_setMemoryFraction(fraction, device)
RuntimeError: Allocator not initialized for device : did you call init?
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155318
Approved by: https://github.com/albanD
This PR adds support for XPU devices to the distributed MemoryTracker tool, including unit test for XPU.
Specifically, this code adds tracking for a few alloc-related statistics for XPUCachingAllocator. It also adapts the existing memory tracker tool to be device agnostic, by getting the device module and recording the necessary memory stats. (I get the device module instead of using `torch.accelerator` methods, as that API is still in-progress.)
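A minimal sketch of the device-agnostic pattern described above, assuming the device module exposes `memory_stats()` the way `torch.cuda` does:
```python
import torch

def get_memory_stats(device_type: str) -> dict:
    # Look up the device module by name (e.g. torch.cuda or torch.xpu) instead
    # of hardcoding CUDA, then query its allocator statistics.
    dev_module = getattr(torch, device_type)
    return dev_module.memory_stats()
```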
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150703
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/d4l3k
grouped_mm is used in torchtitan, this adds just enough support in compile to allow inductor to lower it as a fallback kernel. I imagine that at some point in the future it may be valuable to get inductor to support templating grouped_mm, although this PR just provides basic support. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ngimel @eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153384
Approved by: https://github.com/eellison
Motivation:
PyTorch maintain a `default_device_backend_map` https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L269 , which indicates the default distributed backend if no backend name is specified in user frontend (like `init_process_group`).
Currently, `"xpu": XCCL` is also in this `default_device_backend_map`. However, if another process group name is registered as XPU distributed backend, it immediately replaces XCCL in this default map, which is not what we want.
Therefore, we would like to skip updating the default distributed backend if one is already registered in the map.
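A minimal sketch of that rule (illustrative helper name; the real map is `default_device_backend_map` in `torch.distributed.distributed_c10d`):
```python
def maybe_set_default_backend(default_device_backend_map: dict,
                              device: str, backend_name: str) -> None:
    # Only record the backend as the device's default if no default exists yet,
    # so a later registration does not displace e.g. the "xpu": XCCL entry.
    if device not in default_device_backend_map:
        default_device_backend_map[device] = backend_name
```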
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155320
Approved by: https://github.com/guangyey, https://github.com/d4l3k
9df2e8020f/18e8d4b13b0 (43980565863-box)
started OOMing after switching to cuda 12.8
Maybe because I made some changes to fix the per-process memory fraction, so each proc has less memory.
```
2025-06-12T15:29:50.4998758Z FAILED [0.0124s] test_linalg.py::TestLinalgCUDA::test_svd_memory_allocation_cuda_complex128 - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.10 GiB. GPU 0 has a total capacity of 7.43 GiB of which 6.85 GiB is free. Process 80272 has 68.75 MiB memory in use. Process 83346 has 68.75 MiB memory in use. Process 83365 has 374.75 MiB memory in use. Process 83384 has 70.75 MiB memory in use. 2.90 GiB allowed; Of the allocated memory 240.00 MiB is allocated by PyTorch, and 2.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155811
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman, https://github.com/eqy